Regular expression

In DataSailr (Sailr script), a regular expression is written by enclosing the pattern in slashes (/pattern/); the example at the end of this section writes its literals with an re prefix (re/pattern/). Sailr's regular expression engine is different from R's: it is Onigmo, the engine used by Ruby. Currently, the syntax follows Ruby's, which is largely compatible with Perl-style (PCRE) syntax. This setting may change in the future, but I intend to keep a PCRE-style syntax.

Besides the pattern syntax itself, three more things are important:

  1. How to match
    • Use the =~ operator, which performs regular expression matching.
    • or =~ also works :)
  2. How to backreference (extract matched substrings)
    • Regular expressions have capture groups, denoted by parentheses, which remember the substrings they matched.
      • e.g. /(\w+)\s+(\w+)/ has two capture groups.
    • rexp_matched(num) extracts a captured substring from the most recent match. (NOTE: Match results are not carried over to the processing of the next row.)
    • rexp_matched(1) returns the first captured substring, rexp_matched(2) the second, and so on.
  3. Assigning a regular expression to a variable does not create a new column.
    • In contrast to assigning numbers or strings to variables, assigning a regular expression does not create a new column.
    • Variables holding regular expressions can be used for matching, which saves you from writing the same regular expression more than once.
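
The three points above can be sketched together in one short script. This is only an illustration: the data frame people, its column fullname, and the Sailr variable names (name_pat, first, last) are hypothetical and chosen for this sketch. The pattern deliberately uses bracket classes and a literal space so that no backslash escaping is needed inside the R string.

```r
library(datasailr)

# Hypothetical toy data; the column name `fullname` is an assumption for this sketch.
people <- data.frame(
  fullname = c("Ada Lovelace", "Grace Hopper", "Plato"),
  stringsAsFactors = FALSE
)

code <- '
name_pat = re/^([A-Za-z]+) ([A-Za-z]+)$/

if ( fullname =~ name_pat ) {
  first = rexp_matched(1)
  last = rexp_matched(2)
} else {
  first = "unknown"
  last = "unknown"
}
'

# fullData = FALSE returns only the columns created by the script.
result <- sail(people, code, fullData = FALSE)
print(result)
```

If the behavior described above holds, result should contain the columns first and last but no name_pat column, because name_pat only holds a regular expression.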

Regular expression syntax

[abcd]  One character that is a,b,c or d
[^a-z]  ^ means negation. In this case, one character that is not a lowercase letter.
.   Any character except newline
\w  A word character ([a-zA-Z0-9_])
\W  A non-word character ([^a-zA-Z0-9_])
\s  A whitespace character: /[ \t\r\n\f\v]/
\S  A non-whitespace character: /[^ \t\r\n\f\v]/

and so on.

Currently, regular expression options, such as the multi-line option (/pattern/m), are not supported.
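
One practical point when using shorthand classes such as \w and \s: Sailr code is usually passed as an R string, and inside an R string literal a backslash has to be doubled. The sketch below assumes that Sailr hands the pattern to the engine unchanged once R has processed the string; the variable names word_pair, maker and model are hypothetical.

```r
library(datasailr)
data(mtcars)

# In the R source each backslash is doubled; Sailr receives a single one.
code <- '
word_pair = re/^(\\w+)\\s+(\\S+)/

if ( _rowname_ =~ word_pair ) {
  maker = rexp_matched(1)
  model = rexp_matched(2)
}
'

# Shows the script exactly as Sailr will see it (single backslashes).
cat(code)

result <- sail(mtcars, code, fullData = FALSE)
print(head(result))
```

Row names made of a single word, such as "Valiant", contain no whitespace and therefore should not match this pattern.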

See Ruby-lang's regular expression manual for more details.

You can also use https://rubular.com/ to check whether a regular expression works as intended.

Example: Categorize car data and assign country and company information using regular expressions

data(mtcars)
code = '
germany = re/(^Merc|^Porsche|^Volvo)/
usa = re/(^Hornet|^Cadillac|^Lincoln|^Chrysler|^Dodge|^AMC|^Camaro|^Chevrolet|^Pontiac|^Ford)/
japan = re/(^Mazda|^Datsun|^Honda|^Toyota)/

if ( _rowname_ =~ germany ) { country = "Germany" ; type = rexp_matched(1); }
else if( _rowname_ =~ usa ) { country = "USA"  ; type = rexp_matched(1);  }
else if( _rowname_ =~ japan ) { country = "Japan"  ; type = rexp_matched(1); }
else { country = "Other" }
'
library(datasailr)
sail(mtcars, code, fullData = FALSE)
##    country     type
## 1    Japan    Mazda
## 2    Japan    Mazda
## 3    Japan   Datsun
## 4      USA   Hornet
## 5      USA   Hornet
## 6    Other         
## 7    Other         
## 8  Germany     Merc
## 9  Germany     Merc
## 10 Germany     Merc
## 11 Germany     Merc
## 12 Germany     Merc
## 13 Germany     Merc
## 14 Germany     Merc
## 15     USA Cadillac
## 16     USA  Lincoln
## 17     USA Chrysler
## 18   Other         
## 19   Japan    Honda
## 20   Japan   Toyota
## 21   Japan   Toyota
## 22     USA    Dodge
## 23     USA      AMC
## 24     USA   Camaro
## 25     USA  Pontiac
## 26   Other         
## 27 Germany  Porsche
## 28   Other         
## 29     USA     Ford
## 30   Other         
## 31   Other         
## 32 Germany    Volvo