The DataSailr package brings intuitive row-by-row data manipulation to R. Data manipulation instructions are written in DataSailr script, a language designed specifically for data manipulation. In vanilla R, a data frame is manipulated through column vectors and vector operations; for many tasks, row-wise manipulation is more natural.
For example, calculating body mass index (BMI) from body weight and height is done for each row. Categorizing each person by BMI is also done row by row.
    # Pass the following script to the datasailr::sail() function.
    # Example of DataSailr script
    code = '
    // When calculating BMI, multiplication by 703 is required in the U.S. (using lbs and inches).
    // In other countries, which use meters and kilograms, 703 should be omitted.
    if( us == 1){
      bmi = weight / (height * height) * 703
    }else{
      bmi = weight / (height * height)
    }

    if(bmi >= 40){
      weight_level = .
    } else if( bmi >= 35 ){
      weight_level = .
    } else if( bmi >= 30 ){
      weight_level = 3
    } else if( bmi >= 25 ){
      weight_level = 2
    } else if( bmi >= 20 ){
      weight_level = 1
    } else {
      weight_level = .
    }
    '
DataSailr's main function, sail(), takes two arguments: a dataset and a DataSailr script. It processes each row of the dataset following the script.
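    library(datasailr)

    # Rows with us == 1 use lbs and inches; rows with us == 0 use kilograms and meters.
    df = data.frame(
      us     = c(1, 1, 1, 1, 1, 0, 0, 0, 0),
      weight = c(150, 120, 175, 160, 180, 80, 60, 50, 90),
      height = c(70, 60, 60, 70, 65, 1.7, 1.6, 1.7, 1.9)
    )

    sail(df, code)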
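For contrast with the row-wise script, the column-vector style of vanilla R mentioned at the start can be sketched roughly as follows. This is only an illustration in base R and is not part of DataSailr. It assumes that `.` in DataSailr script stands for a missing value, which is mapped to `NA` here, and it rebuilds the example data frame so that it runs on its own.

```r
# Same example data as in the sail() call
df <- data.frame(
  us     = c(1, 1, 1, 1, 1, 0, 0, 0, 0),
  weight = c(150, 120, 175, 160, 180, 80, 60, 50, 90),
  height = c(70, 60, 60, 70, 65, 1.7, 1.6, 1.7, 1.9)
)

# ifelse() chooses the formula element-wise across the whole column
df$bmi <- ifelse(df$us == 1,
                 df$weight / (df$height * df$height) * 703,
                 df$weight / (df$height * df$height))

# Start with missing values (assumption: "." in DataSailr script means missing),
# then fill in levels 1 to 3 using logical index vectors
df$weight_level <- NA_real_
df$weight_level[df$bmi >= 20 & df$bmi < 25] <- 1
df$weight_level[df$bmi >= 25 & df$bmi < 30] <- 2
df$weight_level[df$bmi >= 30 & df$bmi < 35] <- 3

print(df)
```

Every step here operates on whole columns, so the conditions have to be rewritten as logical vectors instead of a single if/else chain per row.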
More examples are shown in the package documents.
From my experience of data analysis in the field of epidemiology, I wanted a way to manipulate data row by row. R did not have this kind of functionality, so I started developing DataSailr.
DataSailr script is designed for data analysis and statistics, and should feel natural to people in these fields. Compared with general-purpose programming languages, its functionality is very limited, but this limitation makes it a good fit for data manipulation.
Once DataSailr is available on CRAN, install it with the following code.
    # R interpreter
    install.packages("datasailr")
Another way is to download a binary package and install it.
    # Download the binary package appropriate for your environment.
    # Linux 64bit
    R CMD INSTALL datasailr_0.8.7_R_x86_64-pc-linux-gnu.tar.gz
    # Windows 64bit
    R CMD INSTALL datasailr_0.8.7.zip
This package was accepted as a regular presentation at useR! 2020 (originally planned to take place in St. Louis, and ultimately held online). In the presentation, I introduced the functionality of DataSailr and how to write DataSailr script. At the time of the presentation, the DataSailr version was 0.8.5. More features and bug fixes have been added since then. Also, DataSailr script was then called just Sailr script; it is now called DataSailr script.
Link to YouTube video (this link opens a YouTube video). Presentation materials can be obtained from the useR! 2020 website: https://user2020.r-project.org/program/contributed/
This document introduces the datasailr package and shows the potential benefits of using a domain-specific language for data processing.
A well-known R package, dplyr, addresses the same kind of problem. It enables data manipulation without much thought about column vectors. The pipe operator %>% from the magrittr package, together with dplyr functions, gives an intuitive data manipulation flow. The DataSailr package enables the same kind of thing with a single DataSailr script. The two packages do not compete, and I intend to implement DataSailr so that it can also work with dplyr.
| | DataSailr | dplyr |
|---|---|---|
| How to manipulate data | Apply a single DataSailr script (datasailr::sail()) | Apply multiple functions using %>% |
| Create new column | Assign value to new variable | mutate() |
| Keep some columns | (Not for this purpose) | select() |
| Keep some rows | discard!() drops rows | filter() |
| Summarize columns | (Not for this purpose) | summarize() |
| Sort rows | (Not for this purpose) | arrange() |
| Regular expression | Built-in | Partially available with another R package |
| Available functions | Only DataSailr built-in functions are available | Can call R functions |
| Convert wide to long format | push!() function | (use reshape2 package instead) |
| Convert long to wide format | (Not implemented yet) | (use reshape2 package instead) |
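As a rough illustration of the dplyr side of this comparison, the BMI example could be written with mutate() and the pipe operator as below. This is a sketch, not code from the DataSailr documentation. It assumes dplyr is installed and that `.` in DataSailr script denotes a missing value, represented here as `NA`. The name `result` is only illustrative.

```r
library(dplyr)

# Same example data as in the sail() call
df <- data.frame(
  us     = c(1, 1, 1, 1, 1, 0, 0, 0, 0),
  weight = c(150, 120, 175, 160, 180, 80, 60, 50, 90),
  height = c(70, 60, 60, 70, 65, 1.7, 1.6, 1.7, 1.9)
)

result <- df %>%
  mutate(
    # if_else() picks the formula element-wise, based on the unit system
    bmi = if_else(us == 1,
                  weight / (height * height) * 703,
                  weight / (height * height)),
    # case_when() evaluates conditions in order, like an if / else if chain
    # (assumption: "." in DataSailr script is treated as NA)
    weight_level = case_when(
      bmi >= 35 ~ NA_real_,
      bmi >= 30 ~ 3,
      bmi >= 25 ~ 2,
      bmi >= 20 ~ 1,
      TRUE      ~ NA_real_
    )
  )

print(result)
```

Here the new columns are created through mutate(), whereas in DataSailr script they are created by plain assignment inside a single script.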
Please report issues or problems on GitHub.
When you need to cite this package, please use the following BibTeX citation.
    @Article{,
      title = {datasailr - An R Package for Row by Row Data Processing, Using DataSailr Script},
      author = {Toshihiro Umehara},
      year = {2021},
      journal = {Journal of Open Source Software},
      volume = {6},
      number = {61},
      pages = {3166},
      doi = {10.21105/joss.03166},
      url = {https://doi.org/10.21105/joss.03166},
    }