DataSailr internals
Overview of DataSailr internals
DataSailr is an R package for numerical calculation and string manipulation. It is implemented on top of a C/C++ library via Rcpp. Data frames are passed from the R world to the C++ world, or are accessed from the C++ world via Rcpp. The C++ code extracts each record (each row) and passes it to the core engine, called libsailr. Libsailr receives the data for each record as a table of pairs of a variable name and a pointer to an object; internally this table is called ptr_table.
With that ptr_table, libsailr performs numerical calculation and string manipulation following the DataSailr script. More precisely, the DataSailr script is not executed directly. It is parsed and converted to an AST (abstract syntax tree), which libsailr then converts to Sailr VM instructions. The libsailr VM is a virtual stack machine that operates on the ptr_table. When the virtual machine finishes executing all the VM instructions, the results are in the ptr_table.
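To make the idea of a stack machine working against a table of name/pointer pairs concrete, here is a minimal sketch in C. Everything in it (the demo_ types, the command names, the fixed-size stack, the linear lookup) is a hypothetical simplification for illustration and is not libsailr's actual data structure or instruction set.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; these are not libsailr's real types. */
typedef struct {
    const char *name;  /* variable name */
    double *ptr;       /* pointer to the variable's storage */
} demo_entry;

typedef enum {
    DEMO_PUSH_VAR, DEMO_PUSH_NUM, DEMO_ADD, DEMO_MUL, DEMO_STORE, DEMO_END
} demo_cmd;

typedef struct {
    demo_cmd cmd;
    const char *name;  /* used by DEMO_PUSH_VAR and DEMO_STORE */
    double num;        /* used by DEMO_PUSH_NUM */
} demo_inst;

/* Linear search by name; returns NULL if the variable is not in the table. */
static demo_entry *demo_find(demo_entry *table, size_t n, const char *name) {
    for (size_t i = 0; i < n; i++)
        if (strcmp(table[i].name, name) == 0) return &table[i];
    return NULL;
}

/* Runs instructions until DEMO_END. Returns 0 on success, -1 on an unknown variable. */
static int demo_run(const demo_inst *code, demo_entry *table, size_t n) {
    double stack[16];
    size_t sp = 0;  /* number of values currently on the stack */
    for (const demo_inst *pc = code; pc->cmd != DEMO_END; pc++) {
        switch (pc->cmd) {
        case DEMO_PUSH_VAR: {
            demo_entry *e = demo_find(table, n, pc->name);
            if (e == NULL) return -1;
            assert(sp < 16);
            stack[sp++] = *e->ptr;  /* read the value through the pointer */
            break;
        }
        case DEMO_PUSH_NUM:
            assert(sp < 16);
            stack[sp++] = pc->num;
            break;
        case DEMO_ADD:
        case DEMO_MUL: {
            assert(sp >= 2);
            double rhs = stack[--sp];
            double lhs = stack[--sp];
            stack[sp++] = (pc->cmd == DEMO_ADD) ? lhs + rhs : lhs * rhs;
            break;
        }
        case DEMO_STORE: {
            demo_entry *e = demo_find(table, n, pc->name);
            if (e == NULL) return -1;
            assert(sp >= 1);
            *e->ptr = stack[--sp];  /* the result ends up in the table's storage */
            break;
        }
        case DEMO_END:
            break;
        }
    }
    return 0;
}
```

With a table holding x, y and z, the instruction sequence PUSH_VAR x, PUSH_NUM 2, MUL, PUSH_VAR y, ADD, STORE z corresponds to a script line such as z = x * 2 + y. After demo_run returns, the caller reads the result from the storage that the table entry for z points to.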
- Chart: Internal from user's view
The results in the ptr_table are copied back to an Rcpp data frame, which is finally returned to R.
- Chart: Internal from developer's view
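The row-by-row flow described above (bind one row to the table, run the computation, copy the results back) can be sketched in plain C as follows. This is only an illustration under stated assumptions: the demo_ names are hypothetical, columns are restricted to doubles, and a callback stands in for the VM run. The real package does this through Rcpp and libsailr, and may bind or copy values differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical column of doubles, standing in for one data frame column. */
typedef struct {
    const char *name;
    double *values;  /* one value per row */
} demo_column;

/* Hypothetical table slot: a variable name and a pointer to the current row's value. */
typedef struct {
    const char *name;
    double *ptr;
} demo_slot;

/* Per-row computation; in this sketch it stands in for running the VM. */
typedef void (*demo_row_fn)(demo_slot *slots, size_t n_slots);

/* For each row: copy the row's values into scratch storage, point the slots
   at that storage, run fn, then copy the (possibly updated) values back. */
static void demo_process(demo_column *cols, size_t n_cols, size_t n_rows,
                         demo_row_fn fn) {
    enum { DEMO_MAX_COLS = 8 };
    double row[DEMO_MAX_COLS];
    demo_slot slots[DEMO_MAX_COLS];
    assert(n_cols <= DEMO_MAX_COLS);
    for (size_t r = 0; r < n_rows; r++) {
        for (size_t c = 0; c < n_cols; c++) {
            row[c] = cols[c].values[r];
            slots[c].name = cols[c].name;
            slots[c].ptr = &row[c];
        }
        fn(slots, n_cols);
        for (size_t c = 0; c < n_cols; c++)
            cols[c].values[r] = *slots[c].ptr;
    }
}

/* Example per-row computation: third variable = first + second. */
static void demo_sum(demo_slot *slots, size_t n_slots) {
    assert(n_slots >= 3);
    *slots[2].ptr = *slots[0].ptr + *slots[1].ptr;
}
```

Calling demo_process with columns a, b and s and the demo_sum callback fills s with a + b for every row while leaving a and b unchanged.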
How libsailr works
Variable sources
Variables can come from three different sources.
- LHS of an assignment in the DataSailr script
- RHS of an assignment (or used as a value) in the DataSailr script
- Preexisting variables on ptr_table
These three sources can overlap. For example, a variable may preexist before execution, be used as an RHS value in the DataSailr script, and also be redefined on the LHS in the same script.
Some combinations must not occur. For example, a variable that appears on the RHS but neither preexists nor appears on the LHS must cause an error, because it is used without ever being defined.
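The rule above can be expressed as a small check over three name lists. The following is a sketch only: the function names are hypothetical, the lists are plain string arrays, and statement order is ignored (a variable counts as defined if it appears on any LHS). It does not show how libsailr actually detects this error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if name appears in the list of names. */
static bool demo_contains(const char *const *names, size_t n, const char *name) {
    for (size_t i = 0; i < n; i++)
        if (strcmp(names[i], name) == 0) return true;
    return false;
}

/* Returns the first RHS variable that neither preexists nor appears on an
   LHS, or NULL if every RHS variable is defined somewhere. */
static const char *demo_first_undefined(const char *const *rhs, size_t n_rhs,
                                        const char *const *lhs, size_t n_lhs,
                                        const char *const *pre, size_t n_pre) {
    assert(rhs != NULL || n_rhs == 0);
    for (size_t i = 0; i < n_rhs; i++) {
        if (!demo_contains(lhs, n_lhs, rhs[i]) &&
            !demo_contains(pre, n_pre, rhs[i]))
            return rhs[i];
    }
    return NULL;
}
```

For example, with preexisting variables weight and height and a script that assigns bmi, an RHS use of age would be reported as undefined, while RHS uses of weight, height and bmi would pass.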
Available types and instructions at each component in libsailr
Roughly speaking, only integer, double and string are the available types in libsailr. Integer and double values can be handled both as values themselves and through pointers. Names beginning with "PTR_" indicate a pointer, and names beginning with "PP_" indicate a pointer to a pointer.
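The difference between holding a number as a value and holding it through a pointer can be illustrated with a tagged union. The enumerator and type names below are hypothetical and only mimic the "PTR_" / "PP_" naming convention; pairing "PP_" with strings is an assumption made for the example, not something taken from the libsailr source.

```c
#include <assert.h>

/* Hypothetical type tags that follow the PTR_ / PP_ naming convention. */
typedef enum {
    DEMO_IVAL,     /* integer stored as a value */
    DEMO_DVAL,     /* double stored as a value */
    DEMO_PTR_INT,  /* pointer to an integer */
    DEMO_PTR_DBL,  /* pointer to a double */
    DEMO_PP_STR    /* pointer to a pointer to a string (assumed pairing) */
} demo_type;

typedef struct {
    demo_type type;
    union {
        int ival;
        double dval;
        int *ptr_int;
        double *ptr_dbl;
        const char **pp_str;
    } u;
} demo_item;

/* Reads a numeric item as a double, following the pointer when the item
   holds one. String items are not numeric, so they trip the assertion. */
static double demo_as_double(const demo_item *item) {
    switch (item->type) {
    case DEMO_IVAL:    return (double)item->u.ival;
    case DEMO_DVAL:    return item->u.dval;
    case DEMO_PTR_INT: return (double)*item->u.ptr_int;
    case DEMO_PTR_DBL: return *item->u.ptr_dbl;
    case DEMO_PP_STR:  break;
    }
    assert(!"not a numeric item");
    return 0.0;
}

/* Rebinds a string variable by changing what the inner pointer refers to. */
static void demo_rebind_string(demo_item *item, const char *new_str) {
    assert(item->type == DEMO_PP_STR);
    *item->u.pp_str = new_str;
}
```

A pointer item always reflects the current content of the variable it points to, whereas a value item is a snapshot. The extra level of indirection in the pointer-to-pointer case lets a variable be rebound to a different object without touching the item itself.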
Source file description
If you find bugs or inconsistent behavior and want to fix them yourself, this section shows which files you may need to look at. I also hope this guide is helpful if you happen to be interested in datasailr/libsailr.
- <datasailr package root>/src/data_sailr_cpp_main.cpp
- The main C++ function receives an R data frame and maps the elements of each row onto libsailr's ptr_table. It runs the calculation following the DataSailr script and returns the results for each row to R as a data frame.
- <libsailr root>/sailr.c(.h)
- The libsailr C interface. Library users include sailr.h.
- <libsailr root>/parse.y
- bison file for DataSailr Script
- <libsailr root>/lex.l
- flex file to tokenize DataSailr Script
- <libsailr root>/node.c(.h)
- Functions to construct abstract syntax tree (AST).
- The parser generated by bison calls these functions to create the AST.
- <libsailr root>/string directory
- Be sure to manipulate strings through the common_string.h interface.
- Internally, the string object is implemented using C++ std::string.
- <libsailr root>/simple_re directory
- Be sure to use regular expressions through the simple_re.h interface.
- Internally, the regular expression object is an Onigmo object.
- <libsailr root>/gen_code.c(.h)
- gen_code() recursively calls functions to generate VM instructions based on the AST. This file does not generate VM instructions itself; it only calls the functions that do.
- The VM instruction generating functions are in gen_code_util.c.
- <libsailr root>/gen_code_util.c(.h)
- Functions that generate the corresponding VM instructions.
- First, the VM instructions are stored as a linked list.
- Finally, vm_inst_list_to_code converts the linked list into array form, which is referred to as vm_code.
- <libsailr root>/ptr_table.c(.h)
- ptr_table is a key component of libsailr. This table manages all the variables and the pointers to their values. Anonymous objects created by literals in the DataSailr script are also managed by it.
- The data structure is built using UT_HASH.
- <libsailr root>/vm directory
- VM instructions run on the virtual machine. The source files for this virtual machine are in this directory.
- <libsailr root>/vm/vm.c(.h)
- vm_exec_code and vm_run_code run the VM instructions.
- <libsailr root>/vm/vm_cmd.h
- VM commands are defined here.
- <libsailr root>/vm/vm_code.c(.h)
- The structure of a VM instruction is defined here.
- Roughly speaking, a VM instruction consists of a VM command and its options.
- <libsailr root>/vm/vm_stack.c(.h)
- This file defines the VM stack. The VM stack consists of elements called items.
- <libsailr root>/vm/vm_assign.c(.h), vm_calc.c(.h), vm_rexp.c(.h)
- For the VM commands of assignment, calculation and regular expression matching, how they manipulate the VM stack is defined in the following files.
- vm_assign.c : assignment operation
- vm_calc.c : arithmetic calculation
- vm_rexp.c : regular expression matching
- <libsailr root>/vm/vm_call_func.c(.h)
- Sailr functions call the corresponding functions defined in C.
- <libsailr root>/vm/func/c_func/c_func.c(.h)
- How DataSailr built-in functions manage the VM stack is defined here.
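The two-stage storage mentioned for gen_code_util.c (instructions first collected in a linked list, then converted to an array) is a common pattern: the list is convenient while the number of instructions is still unknown, and the array is convenient for the VM to step through. Below is a minimal sketch of that pattern. The demo_ types and functions are hypothetical and do not reproduce the signature or behavior of vm_inst_list_to_code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical instruction: a command and a single option. */
typedef struct {
    int cmd;
    int arg;
} demo_inst;

typedef struct demo_node {
    demo_inst inst;
    struct demo_node *next;
} demo_node;

/* Appends inst to the list. Returns the new node, or NULL if allocation fails. */
static demo_node *demo_append(demo_node **head, demo_node **tail, demo_inst inst) {
    demo_node *node = malloc(sizeof *node);
    if (node == NULL) return NULL;
    node->inst = inst;
    node->next = NULL;
    if (*tail != NULL) (*tail)->next = node;
    else *head = node;
    *tail = node;
    return node;
}

/* Copies the list into a newly allocated array and stores its length in
   *out_len. Returns NULL for an empty list or on allocation failure.
   The caller frees the array. */
static demo_inst *demo_list_to_array(const demo_node *head, size_t *out_len) {
    size_t n = 0;
    for (const demo_node *p = head; p != NULL; p = p->next) n++;
    *out_len = n;
    if (n == 0) return NULL;
    demo_inst *code = malloc(n * sizeof *code);
    if (code == NULL) { *out_len = 0; return NULL; }
    size_t i = 0;
    for (const demo_node *p = head; p != NULL; p = p->next) code[i++] = p->inst;
    assert(i == n);
    return code;
}

/* Frees every node of the list. */
static void demo_free_list(demo_node *head) {
    while (head != NULL) {
        demo_node *next = head->next;
        free(head);
        head = next;
    }
}
```

Once the array exists, the list can be freed, and the VM can address instructions by index, which also makes jump targets straightforward to express.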