Hi readers,
Sorry for the delay in posting; I’ve been hard at work getting my talk ready for the ESA annual meeting in Sacramento next week. I’ll definitely post a recap and link to my slides after the conference.
I’m on a workacation on Lake of Bays in Ontario (my home province) at the moment, enjoying the lake-side life on my breaks.
I’m continuing work on my Build Your Bar project, dabbling in natural language processing. I find myself using regular expressions (regex) frequently enough to think it warrants a post. The R help on regex (?regex) is long and technical and makes no attempt to convince you of its importance. So here is a quick primer on using regex in R; the focus will be on Perl-like regular expressions.
Regular expressions have been a part of computing since long before home computers. When the primary input-output interface to a computer was text, an efficient means of finding the exact text you were interested in was necessary, so “simple” shorthands for complicated series of characters were developed. The idea for regular expressions originated in the work of Stephen Kleene (1956) on regular sets. Just over a decade later, Ken Thompson built the idea into the text editor QED, and from there it found its way into Unix. For a more detailed history of regex I suggest reading Staffan Noteberg’s brief review.
R supports two, or imprecisely three, regex schemes. The two true regex schemes are Extended, which follows the POSIX 1003.2 standard, and Perl-like, my personal favourite, which adapts the incredibly powerful regex syntax of the Perl programming language. The last scheme is Fixed, which is less a regex style than a lack thereof: fixed mode takes characters absolutely literally, except for R’s own escape characters.
By default R uses extended regex. The unfortunate consequence is that if you choose to follow my lead and use Perl-style regex, you have to be a bit more verbose in your code and specify perl = TRUE each time you use a regex function.[1] I feel, however, that it’s a small price to pay for getting to use Perl-like regex.
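To make the three modes concrete, here is a minimal sketch in base R. The sample strings are made up, and the lookahead pattern "a(?=b)" is my own pick of a feature that only Perl-like mode understands; it is not something covered elsewhere in this post.

```r
x <- c("abc", "a.c")

# Default (extended regex): "." is a wild-card, so both strings match.
extended_hits <- grepl("a.c", x)

# fixed = TRUE: the pattern is taken literally, so "." only matches a dot.
fixed_hits <- grepl("a.c", x, fixed = TRUE)

# perl = TRUE: Perl-like regex. "a(?=b)" is a lookahead that matches an
# "a" only when it is immediately followed by a "b".
perl_hits <- grepl("a(?=b)", x, perl = TRUE)

print(extended_hits)  # TRUE TRUE
print(fixed_hits)     # FALSE TRUE
print(perl_hits)      # TRUE FALSE
```

The same perl = TRUE argument is accepted by the other base regex functions such as sub, gsub, regexpr and gregexpr.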
The help file for regex offers the advice to head to the PCRE web page, but you will likely want to go straight to the source: http://perldoc.perl.org/perlre.html. The Perl documentation page on the current state of its regex syntax is a handy reference for anyone who plans on using Perl-like regex.
Unfortunately for us R users, many of Perl’s really handy regex features aren’t available to us. Native Perl regex allows for concurrent search and accession, meaning you can search for multiple tokens within a string and access the returned tokens without storing the matches in an intermediary object first. More simply, in Perl, searching and editing a string is essentially the same action. In R, searching and accession are handled by completely separate functions, and regular expressions aren’t tokenized to allow multiple operations on the same search criterion.
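Here is a rough sketch of what that separation looks like in practice, using only base R functions (gregexpr, regmatches and gsub). The drink string is a made-up example, not data from my project.

```r
drink <- "Add 2 oz gin and 4 oz tonic"

# Step 1: search. gregexpr() only reports WHERE the matches are
# (start positions and lengths), not the matched text itself.
hits <- gregexpr("[0-9]+ oz", drink, perl = TRUE)

# Step 2: accession. regmatches() takes the string plus the stored
# positions and pulls out the matching text.
amounts <- regmatches(drink, hits)[[1]]
print(amounts)  # "2 oz" "4 oz"

# Editing is yet another call, which runs the search all over again.
edited <- gsub("[0-9]+ oz", "a splash of", drink, perl = TRUE)
print(edited)   # "Add a splash of gin and a splash of tonic"
```

Note the intermediary object hits: that is exactly the extra step Perl lets you skip.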
These differences exist because Perl was designed for text manipulation. It is self-described as the “Swiss Army knife” of scripting languages, allowing you to combine functionality from multiple languages with Perl as the middle-man. The lingua franca of all programming languages is source code, which Perl can elegantly cut, trim, and glue together. R was built for statistics, so string manipulation is more of an acquired skill. However, let it never be said that you can’t do powerful string manipulation in R; it’s just not true. String manipulation in R is clunky, maybe, but powerful enough to do anything you need.
I love string manipulation because it appeals to my love of puzzle solving. Finding just the right regex to get what you’re looking for really gives me the thrill of the hunt.
Hopefully your interest is piqued and you now want to learn all about using regex like a wizard. If so, great, because I’m going to show you how! But first, a key R idiosyncrasy needs to be addressed.
R by default recognizes a backslash “\” as the beginning of an escape sequence. Perl also uses a backslash to indicate a metacharacter, so there is a clash. In native Perl “\d” is a metacharacter meaning any digit, but when R sees “\d” in a string it tries to read it as one of its own escape sequences, doesn’t recognize it, and throws an error. To get the behaviour you’d expect, you need to escape the backslash itself so that R passes a single backslash on to the regex engine. You do this by adding an extra backslash: “\\d” behaves exactly as you’d expect “\d” to.
This situation gets particularly comical when you want to use a literal backslash in a regular expression. In native Perl “\\” gets you a literal backslash, but in R you need to escape BOTH backslashes to get the desired behaviour: “\\\\” gets you a literal backslash. (As I am typing this in R Markdown, which by default follows R escaping rules, I just typed 8 backslashes to show you four, ugh.)
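A small sketch of the double escaping at work, in base R. The strings "route 66" and the Windows-style path are made-up examples.

```r
# In R source, "\\d" is a two-character string: a backslash and a "d".
# The regex engine therefore receives \d, the digit metacharacter.
n_chars   <- nchar("\\d")
has_digit <- grepl("\\d", "route 66", perl = TRUE)
print(n_chars)    # 2
print(has_digit)  # TRUE

# A literal backslash takes four backslashes in R source: R reduces
# "\\\\" to two backslash characters, and the regex engine reads those
# two as one escaped (literal) backslash.
path  <- "C:\\temp"   # the seven-character string C:\temp
fixed <- gsub("\\\\", "/", path, perl = TRUE)
print(fixed)          # "C:/temp"
```

Printing a string with print() shows the escaped form, which can be confusing; cat() shows the characters that are really in the string.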
The first thing you’re going to need to learn to become a Perl-like regex master is how to use wild-cards and quantifiers. You may have run into wild-cards in any number of different places; Google used to support them, for example. Wild-cards are regex speak for “just give me anything”.
Unless you’ve worked with regex before, you’ve probably never worked closely with quantifiers. Quantifiers are regex speak for “give me some number of these”. You may have seen the wild-cards “*” and “?” in DOS: there, “?” is a wild-card meaning any one character, and “*” is a wild-card with an indefinite quantifier built in, meaning any number of any character. In regex the two jobs are split: the wild-card is “.” and the quantifiers are separate symbols.
Character | Meaning |
---|---|
. | Match any 1 character |
* | Match ZERO or more of the previous character |
+ | Match ONE or more of the previous character |
? | Match ZERO or ONE of the previous character |
{N,M} | Match at least N and at most M of the previous character. Leaving M empty ({N,}) means N or more |
With these tools at your disposal you can begin your odyssey into the world of pattern matching. Let’s see some examples:
Regex | Sample Matches |
---|---|
"Al.*" | "Alfred", "Albert", "Alphonso", "Alex likes going to the movies" |
"5? ?Apples" | "5 Apples", "5Apples", " Apples", "Apples" |
"(Na )+Batman!" | "Na Na Na Na Na Na Na Na Batman!" but not "Batman!" |
As you may have noticed, the special quantifiers (“*”, “+”, and “?”) are all just convenient shorthands for (“{0,}”, “{1,}”, and “{0,1}”). You’ll find you use the special quantifiers more frequently.
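If you’d like to try the quantifiers yourself, here is a quick sketch using grepl from base R. The test strings (including the "5 Pears" non-match and the run of a’s) are my own additions for illustration.

```r
apples <- c("5 Apples", "5Apples", " Apples", "Apples", "5 Pears")
# "?" makes both the 5 and the space optional.
apple_hits <- grepl("5? ?Apples", apples, perl = TRUE)
print(apple_hits)   # TRUE TRUE TRUE TRUE FALSE

songs <- c("Na Na Na Batman!", "Batman!")
plus_hits  <- grepl("(Na )+Batman!", songs, perl = TRUE)     # one or more
brace_hits <- grepl("(Na ){1,}Batman!", songs, perl = TRUE)  # same thing, spelled out
star_hits  <- grepl("(Na )*Batman!", songs, perl = TRUE)     # zero or more
print(plus_hits)    # TRUE FALSE
print(brace_hits)   # TRUE FALSE
print(star_hits)    # TRUE TRUE

# {2,3} takes at least two and at most three of the previous character,
# so out of five a's the first match grabs three.
bounded <- regmatches("aaaaa", regexpr("a{2,3}", "aaaaa", perl = TRUE))
print(bounded)      # "aaa"
```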
I’ve hinted at this next group of behaviours in the previous set of examples. The “(Na )+” I used to match the Batman theme song is an example of grouping. Using parentheses groups a series of characters so that they can be quantified like a single character. The above regex states that I want one or more repetitions of the sequence “Na ” (including the trailing space).
Suppose you weren’t satisfied that you found all the ways people transcribe the Batman theme; for example, you have a friend who uses the syllable “Da” to represent the notes in the song. Never fear, bracketed character classes to the rescue. You can allow the regular expression to choose one of several characters. Bracketed character classes are denoted with square brackets “[]”.
Lastly, maybe the vowel is all wrong too. Maybe you think people might be using phonemes like “Duh” or “Nah”. You can specify these specific combinations within a group, and let the regex match any group with an alternation. Alternations are specified with “|”.
Regex | Sample Matches |
---|---|
"(Na )+Batman!" | "Na Na Na Na Na Na Na Na Batman!" |
"([ND]a )+Batman!" | "Na Na Na Na Na Na Na Na Batman!" and "Da Na Na Na Da Na Na Na Batman!" |
"[ND]((a)|(uh)|(ah))" | "Duh", "Na", "Dah" |
Now you should be able to match just about any set of characters, in whatever amount you could possibly want.
Another useful tool is the ability to specify where in a string you’d like to match. Two handy positioning modifiers are allowed in a regular expression. You can specify the beginning of a string with “^” and the end of a string with “$”.
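Here is a sketch of grouping, bracketed classes and alternation working together. The helper first_match is a hypothetical convenience wrapper I’m defining just for this example (it is not a base R function), and the song transcriptions are made up.

```r
# Hypothetical helper: return the text of the first match in a single string.
first_match <- function(pattern, x) {
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

song <- "Da Na Na Na Batman!"

# Grouping alone can't get past the leading "Da ", so the match starts late.
na_only <- first_match("(Na )+Batman!", song)
print(na_only)   # "Na Na Na Batman!"

# A bracketed class lets each syllable start with N or D.
n_or_d <- first_match("([ND]a )+Batman!", song)
print(n_or_d)    # "Da Na Na Na Batman!"

# Alternation lets the rest of the syllable be "a", "uh" or "ah".
songs <- c("Na Na Na Na Batman!",
           "Duh Nah Nah Nah Batman!",
           "La La La La Batman!")
alt_hits <- grepl("([ND](a|uh|ah) )+Batman!", songs, perl = TRUE)
print(alt_hits)  # TRUE TRUE FALSE
```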
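A short sketch of the two anchors using grepl from base R; the greeting and farewell strings are just sample sentences.

```r
greetings <- c("Hi there reader", "You there reader, Hi")
farewells <- c("Dear reader, Goodbye", "Goodbye dear reader")

# Without an anchor, "Hi" is found anywhere in the string.
anywhere <- grepl("Hi", greetings, perl = TRUE)
print(anywhere)   # TRUE TRUE

# "^" pins the match to the start of the string.
at_start <- grepl("^Hi", greetings, perl = TRUE)
print(at_start)   # TRUE FALSE

# "$" pins the match to the end of the string.
at_end <- grepl("Goodbye$", farewells, perl = TRUE)
print(at_end)     # TRUE FALSE
```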
Regex | Sample Matches |
---|---|
"^Hi" | "Hi there reader" but not "You there reader, Hi" |
"Goodbye$" | "Dear reader, Goodbye" but not "Goodbye dear reader" |
Quantifier modification is a topic I’ve touched on in a previous post, where I demonstrated how to process HTML to extract tag values. The question mark has a double meaning depending on where it is found in a regular expression. Normally it is treated as a quantifier meaning {0,1}; however, when it appears immediately after another quantifier, it instructs that quantifier to behave non-greedily.
Greedy is a reference to a quantifier’s habit of trying to capture as many characters as possible. If given a choice between matching one or seven characters, it will choose seven. We can modify this behaviour to just take one.
Take for example the body of a very simple HTML document:
<body>
<p> Generic things on the <b>Internet</b> have grown <b> tiresome</b></p>
</body>
If I were curious about all the things the author thought worthy of bolding, I might first try the regex
"<b>.+</b>"
But I would get one match: "<b>Internet</b> have grown <b> tiresome</b>"
My indefinite quantifier would match all the way to the very last </b> in the document, which is obviously not what I wanted. So instead I should tell the “+” quantifier not to be so greedy: "<b>.+?</b>"
Then it would correctly match "<b>Internet</b>" and "<b> tiresome</b>".
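You can check the greedy and non-greedy behaviour yourself with a sketch like this one, which uses gregexpr and regmatches from base R. It assumes the paragraph is stored as a single string and that both bold tags are properly closed with </b>.

```r
html <- "<p> Generic things on the <b>Internet</b> have grown <b> tiresome</b></p>"

# Greedy: ".+" grabs as much as it can, so the match runs from the
# first <b> to the very last </b>.
greedy <- regmatches(html, gregexpr("<b>.+</b>", html, perl = TRUE))[[1]]
print(greedy)
# "<b>Internet</b> have grown <b> tiresome</b>"

# Non-greedy: ".+?" stops at the first </b> it can reach, giving one
# match per bold tag.
lazy <- regmatches(html, gregexpr("<b>.+?</b>", html, perl = TRUE))[[1]]
print(lazy)
# "<b>Internet</b>"  "<b> tiresome</b>"
```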
The last type of metacharacter I’d like to show you is the special character class, useful inside bracketed character classes, on its own, or combined with quantifiers. Some useful character class metacharacters are:
Character | Meaning |
---|---|
\w | any word character - all numbers, letters, and underscores |
\W | non-word character - anything except the above |
\d | any single digit (0 through 9) |
\s | any whitespace character (space, tab, newline, and the like) |
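Here is a brief sketch of these metacharacters in action, remembering that each backslash has to be doubled in R. The order string is invented for the example.

```r
x <- "Order #42: 3 limes,\t1 bottle_of_gin"   # "\t" is a tab character

# \d with a quantifier: runs of digits.
digits <- regmatches(x, gregexpr("\\d+", x, perl = TRUE))[[1]]
print(digits)   # "42" "3" "1"

# \w with a quantifier: runs of letters, digits and underscores.
words <- regmatches(x, gregexpr("\\w+", x, perl = TRUE))[[1]]
print(words)    # "Order" "42" "3" "limes" "1" "bottle_of_gin"

# \s covers the tab as well as the ordinary spaces.
squeezed <- gsub("\\s+", " ", x, perl = TRUE)
print(squeezed) # "Order #42: 3 limes, 1 bottle_of_gin"

# Inside a bracketed class: strip anything that is a digit or whitespace.
letters_only <- gsub("[\\d\\s]", "", "a1 b2", perl = TRUE)
print(letters_only)  # "ab"
```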
And with that you should be able to match just about any pattern in a string that your heart desires. Here are a few interesting examples.
stringr comes pre-loaded with a function to trim whitespace from a string (str_trim). With regex we can do the same thing by matching some number of spaces at the beginning or end of the string, like so:[2]
gsub(pattern = "(^ +)|( +$)", replacement = "", string)
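A runnable version of the trimming idea, as a sketch. The variable names and sample strings are my own; the second pattern swaps the plain space for \s so that tabs and newlines get trimmed too, and trimws is the base R function that does the same job without a hand-written regex.

```r
string <- "   a dash of bitters  "

# Strip runs of spaces anchored to the start or the end of the string.
trimmed <- gsub(pattern = "(^ +)|( +$)", replacement = "", x = string, perl = TRUE)
print(trimmed)        # "a dash of bitters"

# Using \s instead of a literal space also catches tabs and newlines.
messy <- "\t a dash of bitters \n"
trimmed_ws <- gsub("^\\s+|\\s+$", "", messy, perl = TRUE)
print(trimmed_ws)     # "a dash of bitters"

# Base R's trimws() gives the same result for the first string.
same_as_trimws <- identical(trimmed, trimws(string))
print(same_as_trimws) # TRUE
```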
Or, going back to my Build Your Bar project, maybe you want to find brand names in a set of ingredients. The ingredient set was comma delimited, and brands of note were tagged with &reg;, the HTML code for the registered trademark symbol (®). Brand names also have each word capitalized, tend to be between 1 and 4 words long, and may include hyphens or apostrophes. To pull out each brand in the ingredient list you could do this:[3]
matchPull(pattern = "([A-Z][\\w'-]*? ){0,3}[A-Z][\\w'-]+?&reg;")
The one thing you may not recognize is the bracketed character class "[A-Z]", which means exactly what it looks like: all the capitals from A through Z.
I hope from this post you’ve learned a little bit about the application of regular expressions in R and can see some opportunities to use them in your own work. The ability to precisely and effectively specify exactly the string you’re interested in makes performing any text mining task orders of magnitude easier. Also, if you’re just interested in organizing and curating data sets, for thesis work for example, the ability to accurately edit information stored in strings is a massive advantage over doing manual edits in software such as Excel.
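If you don’t have matchPull handy, a rough stand-in can be sketched with base R’s gregexpr and regmatches. This is not the matchPull implementation, just the same search done by hand. It assumes the brand marker appears in the text as the literal characters &reg;, the ingredient string is invented, and I’ve put the hyphen last inside the brackets so it can’t be misread as a range.

```r
ingredients <- "2 oz Grand Marnier&reg;, 1 oz lime juice, 1 dash Angostura&reg; bitters"

# Up to three capitalized words followed by a space, then one final
# capitalized word that runs straight into the &reg; tag.
brand_pattern <- "([A-Z][\\w'-]*? ){0,3}[A-Z][\\w'-]+?&reg;"

brands <- regmatches(ingredients,
                     gregexpr(brand_pattern, ingredients, perl = TRUE))[[1]]
print(brands)   # "Grand Marnier&reg;" "Angostura&reg;"

# Drop the tag if only the names are wanted.
brand_names <- sub("&reg;", "", brands, fixed = TRUE)
print(brand_names)  # "Grand Marnier" "Angostura"
```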
As always, thanks for reading!
-Chris
[1] I recently discovered the wonderful package stringr, another of Hadley Wickham’s packages. The syntax for specifying Perl-like mode with stringr functions is different: you just wrap your regex string in a call to the perl function, like so: stringr_foo("string", pattern = perl("pattern")). stringr is a dependency of several of Wickham’s other packages, so you likely already have it installed.
[2] The function gsub is R’s equivalent of find and replace all.
[3] The matchPull function is my convenience function for finding and extracting matched text from a string. It is similar in idea to str_extract from stringr, but I wrote matchPull before I knew about stringr, and I’m too lazy to learn someone else’s convenience function when I have my own. Plus, as a holdover from my days coding in Java, I think underscores in function names are yucky. Code for matchPull is available in my post here and also comes in my friendlyShiny package.