About DataSailr

Introduction

The DataSailr package brings intuitive and fast row-by-row data manipulation to R. Data manipulation instructions are written in Sailr script, a language designed specifically for data manipulation. DataSailr’s main function, sail(), takes a dataframe and a Sailr script as its arguments and processes each row (i.e. record) according to the script.

DataSailr input & output

Features

Variable names in a Sailr script correspond to column names of the dataframe. Assigning a value to a variable updates the corresponding column, or creates a new column if no column with that name exists yet. Variables that appear on the right-hand side of the assignment operator or in function arguments refer to the value of the corresponding column. Supported variable types are integer, double, string, and boolean, and dataframe columns of these types can be referenced from a Sailr script. Regular expressions can also be used, but they are available only as regular expression literals within the script.
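The assignment semantics described above can be illustrated with a minimal sketch. It is written in plain Python rather than Sailr script, and it does not use DataSailr's API: the helper apply_row_by_row, the script function, and the column names price, quantity, and total are all hypothetical names chosen for this illustration. Each row is modelled as a dictionary keyed by column name.

```python
# Conceptual sketch only: plain Python, not DataSailr's API or Sailr syntax.
def apply_row_by_row(rows, assign):
    """Run `assign` on a copy of every row and return the new rows.

    `assign` receives one row (a dict keyed by column name) and mutates it:
    writing an existing key updates that column, writing a new key adds one.
    """
    result = []
    for row in rows:
        new_row = dict(row)  # work on a copy so the input rows stay unchanged
        assign(new_row)
        result.append(new_row)
    return result


def script(row):
    # Names on the right-hand side read the current row's column values.
    row["total"] = row["price"] * row["quantity"]  # "total" is a new column
    row["quantity"] = row["quantity"] + 1          # existing column is updated


orders = [
    {"price": 2.0, "quantity": 3},
    {"price": 5.0, "quantity": 1},
]
updated = apply_row_by_row(orders, script)
```

Every output row gains a total column and carries an updated quantity, while the input rows are left as they were.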

Arithmetic operators for addition, subtraction, multiplication, division, and exponentiation are supported. Built-in functions are also provided; for example, strings can be stripped, concatenated, and subset using these functions.

For flow control, the if-else statement is available. It is frequently used to assign different values depending on the values of other columns or on a calculation result, for example when adding flags to a dataframe.
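As a rough illustration of flagging with if-else, the following sketch uses plain Python in place of a Sailr if-else block. The column names hp and powerful, the threshold of 145, and the function add_power_flag are assumptions made up for this example.

```python
# Conceptual sketch only: plain Python standing in for a Sailr if-else block.
HP_THRESHOLD = 145  # hypothetical cut-off chosen for this example


def add_power_flag(row):
    """Return a copy of `row` with a `powerful` flag derived from `hp`."""
    flagged = dict(row)
    if flagged["hp"] > HP_THRESHOLD:
        flagged["powerful"] = 1
    else:
        flagged["powerful"] = 0
    return flagged


cars = [{"name": "A", "hp": 110}, {"name": "B", "hp": 245}]
flagged_cars = [add_power_flag(car) for car in cars]
```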

Another interesting feature is regular expression support. Sailr script allows you to write regular expression literals. These regular expressions can be matched against strings and used within if-else conditions, which enables flexible value assignment. For example, you can add flags to rows (i.e. records) whose values begin with certain characters, end with certain characters, or include a specific sequence of characters. It is also possible to extract a specific pattern of characters from a string, which is useful, for example, when extracting date information from a string.
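The two uses of regular expressions mentioned above, flagging rows by pattern and extracting a substring such as a date, might look like the following sketch. It relies on Python's re module, not on Sailr's regular expression literals, and the column names (name, note, is_merc, date), the patterns, and the sample records are hypothetical.

```python
import re

# Conceptual sketch only: Python's `re` module, not Sailr's regex literal syntax.
STARTS_WITH_MERC = re.compile(r"^Merc")
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def annotate(row):
    """Return a copy of `row` with a prefix flag and an extracted date."""
    out = dict(row)
    # Flag rows whose name begins with a given prefix.
    if STARTS_WITH_MERC.search(out["name"]):
        out["is_merc"] = 1
    else:
        out["is_merc"] = 0
    # Pull the first YYYY-MM-DD pattern out of free text, if there is one.
    match = ISO_DATE.search(out["note"])
    out["date"] = match.group(0) if match else None
    return out


records = [
    {"name": "Merc 240D", "note": "serviced on 2020-03-15 at the depot"},
    {"name": "Fiat 128", "note": "no service record"},
]
annotated = [annotate(record) for record in records]
```

Rows whose note contains no date receive None here; how a missing match is represented in DataSailr itself is not covered by this sketch.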

Internally, DataSailr is implemented on top of the libsailr C/C++ engine, which interprets Sailr scripts, performs arithmetic operations and string manipulation, and applies built-in functions. Sailr scripts are parsed, converted into virtual machine (VM) code, and executed on the VM. DataSailr passes the values of each row to the libsailr engine and returns the results for all rows as a dataframe.
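To make the parse, compile, and execute pipeline more concrete, here is a toy sketch of the general technique: an assignment is parsed once, compiled into instructions for a small stack machine, and the same instructions are then run against every row. This is not libsailr's actual design or instruction set. The opcodes PUSH, LOAD, and APPLY are invented for this illustration, and Python's own ast module stands in for the Sailr parser.

```python
import ast
import operator

# Conceptual sketch only: a toy stack VM, not libsailr's actual instruction set.
BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}


def compile_assignment(source):
    """Parse `target = expression` once and return (target, instructions)."""
    statement = ast.parse(source).body[0]
    if not isinstance(statement, ast.Assign) or len(statement.targets) != 1:
        raise ValueError("expected a single assignment")
    target = statement.targets[0]
    if not isinstance(target, ast.Name):
        raise ValueError("assignment target must be a plain name")
    code = []

    def emit(node):
        if isinstance(node, ast.Constant):
            code.append(("PUSH", node.value))   # literal value
        elif isinstance(node, ast.Name):
            code.append(("LOAD", node.id))      # read a column of the row
        elif isinstance(node, ast.BinOp) and type(node.op) in BINARY_OPS:
            emit(node.left)
            emit(node.right)
            code.append(("APPLY", BINARY_OPS[type(node.op)]))
        else:
            raise ValueError("unsupported syntax")

    emit(statement.value)
    return target.id, code


def run(target, code, row):
    """Execute the instructions against one row and return the updated row."""
    stack = []
    for opcode, arg in code:
        if opcode == "PUSH":
            stack.append(arg)
        elif opcode == "LOAD":
            stack.append(row[arg])
        else:  # APPLY: pop two operands, push the result
            right = stack.pop()
            left = stack.pop()
            stack.append(arg(left, right))
    return {**row, target: stack.pop()}


# Compile once, then execute the same VM code for each row.
target, code = compile_assignment("bmi = weight / height ** 2")
rows = [{"weight": 60.0, "height": 2.0}, {"weight": 90.0, "height": 3.0}]
results = [run(target, code, row) for row in rows]
```

Compiling once and reusing the instructions for every row keeps the parsing cost independent of the number of rows, which is one common motivation for a VM-based design.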

Although base R and other packages already provide ample functionality for data manipulation, I believe the DataSailr package offers another intuitive and flexible way to manipulate data.