The DataSailr package brings intuitive row-by-row data manipulation to R. Data manipulation instructions are written in the Sailr scripting language, which is designed specifically for data manipulation. In contrast to vanilla R, in which a data frame is manipulated using column vectors and vector operations, row-wise data manipulation feels more natural.
For example, when calculating body mass index (BMI) from body weight and height, the calculation needs to be done for each row. Categorizing each person based on their BMI is also done for each row.
# Pass the following script to datasailr::sail() function.
# Example of Sailr script
code = '
// When calculating BMI, multiplication by 703 is required in the U.S. (using lbs and inches)
// In other countries using meters and kilograms, 703 should be omitted.
if( us == 1 ){
  bmi = weight / (height * height) * 703
}else{
  bmi = weight / (height * height)
}
if(bmi >= 40){ weight_level = . }
else if( bmi >= 35 ){ weight_level = . }
else if( bmi >= 30 ){ weight_level = 3 }
else if( bmi >= 25 ){ weight_level = 2 }
else if( bmi >= 20 ){ weight_level = 1 }
else { weight_level = . }
'
DataSailr's main function, sail(), takes two arguments: a dataset and a Sailr script. It processes each row of the dataset following the Sailr script.
library(datasailr)
df = data.frame(us=c(1,1,1,1,1,0,0,0,0), weight=c(150,120,175,160,180,80,60,50,90), height=c(70,60,60,70,65,1.7,1.6,1.7,1.9))
sail(df , code)
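For comparison, the same calculation in vanilla R with column vectors could be sketched as follows. This is only an illustration of the column-oriented style mentioned at the top, not part of DataSailr. It assumes that `.` in the Sailr script denotes a missing value, which is mapped to `NA` here.

```r
# Same data as in the sail() example above.
df <- data.frame(
  us     = c(1, 1, 1, 1, 1, 0, 0, 0, 0),
  weight = c(150, 120, 175, 160, 180, 80, 60, 50, 90),
  height = c(70, 60, 60, 70, 65, 1.7, 1.6, 1.7, 1.9)
)

# Whole-column arithmetic: rows with us == 1 get the 703 conversion factor.
df$bmi <- df$weight / (df$height * df$height) * ifelse(df$us == 1, 703, 1)

# Nested ifelse() stands in for the if / else-if chain of the Sailr script.
# The bmi >= 40 and bmi >= 35 branches both assign `.`, so they are merged.
# Assumption: `.` means "missing", represented here as NA.
df$weight_level <- ifelse(df$bmi >= 35, NA,
                   ifelse(df$bmi >= 30, 3,
                   ifelse(df$bmi >= 25, 2,
                   ifelse(df$bmi >= 20, 1, NA))))

print(df)
```

Every condition has to be written as an operation on a whole column, which is the style that the row-wise Sailr script avoids.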
Examples are shown in documents.
From my personal experience of data analysis in the field of epidemiology, I wanted a way to manipulate data row by row. R did not have this kind of functionality, so I started to develop DataSailr.
The Sailr scripting language is designed for data analysis and statistics, and should feel natural to people in these fields. Compared with general-purpose programming languages, Sailr's functionality is very limited, but this limitation makes it a good fit for data manipulation.
When DataSailr is available on CRAN, use the following code to install it. (While there are problems to fix, the package may be archived and unavailable on CRAN.)
# R interpreter
install.packages("datasailr")
Another way is to download a binary package and install it.
# Download the binary package appropriate for your environment.
# Linux 64bit
R CMD INSTALL datasailr_0.8.6_R_x86_64-pc-linux-gnu.tar.gz
# Windows 64bit
R CMD INSTALL datasailr_0.8.6.zip
This package was accepted as a regular presentation at useR! 2020 (originally planned to take place in St. Louis, and ultimately held online). In the presentation, I introduced the functionality of DataSailr and how to write Sailr scripts. At the time of the presentation, the DataSailr version was 0.8.5. More features and bug fixes have been added since then.
Link to YouTube video (This link opens YouTube video.)
A well-known R package, dplyr, addresses the same kind of problem. It enables data manipulation without much thought about column vectors. The pipe operator, %>% from the magrittr package, and the dplyr functions provide an intuitive data manipulation flow. The DataSailr package enables the same kind of thing with a single Sailr script. The two packages do not compete, and I intend to implement DataSailr so that it can also work with dplyr.
| | DataSailr | dplyr |
|---|---|---|
| How to manipulate data | Apply a single Sailr code (datasailr::sail()) | Apply multiple functions using (%>%) |
| Create new column | Assign value to new variable | mutate() |
| Keep some columns | (Not for this purpose) | select() |
| Keep some rows | discard!() drops rows | filter() |
| Summarize columns | (Not for this purpose) | summarize() |
| Sort rows | (Not for this purpose) | arrange() |
| Regular expression | Built-in | Partially available with another R package |
| Available functions | Only Sailr built-in functions are available | Can call R functions |
| Convert wide to long format | push!() function | (use reshape2 package instead) |
| Convert long to wide format | (Not implemented yet) | (use reshape2 package instead) |
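
As a rough illustration of the dplyr column of the table, the BMI example could be written as a pipeline like the one below. This is a sketch that assumes dplyr is installed. As before, it assumes that `.` in the Sailr script means a missing value (`NA_real_` here). The filtering, selecting and sorting steps are added only to show the corresponding dplyr verbs and are not part of the original Sailr script.

```r
library(dplyr)

df <- data.frame(
  us     = c(1, 1, 1, 1, 1, 0, 0, 0, 0),
  weight = c(150, 120, 175, 160, 180, 80, 60, 50, 90),
  height = c(70, 60, 60, 70, 65, 1.7, 1.6, 1.7, 1.9)
)

result <- df %>%
  # Create new columns: mutate()
  mutate(bmi = weight / (height * height) * if_else(us == 1, 703, 1)) %>%
  mutate(weight_level = case_when(
    bmi >= 35 ~ NA_real_,  # assumption: `.` in Sailr is a missing value
    bmi >= 30 ~ 3,
    bmi >= 25 ~ 2,
    bmi >= 20 ~ 1,
    TRUE      ~ NA_real_
  )) %>%
  # Keep some rows: filter()
  filter(!is.na(weight_level)) %>%
  # Keep some columns: select()
  select(us, bmi, weight_level) %>%
  # Sort rows: arrange()
  arrange(bmi)

print(result)
```

Each step is a separate function call joined by `%>%`, whereas DataSailr expresses the column creation in a single Sailr script passed to `sail()`.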