Learning About Regular Expressions

Most likely you have come across regular expressions at some point during your software development. Regular expressions are one of those things that tend to make people take sides and form strong opinions. Some swear by their use, while others have a deep disdain for them. Either way, they are a necessary evil: we need to learn about them in order to round out our skills, so to speak. In this episode, we’ll take a quick look at some of the basics of getting up and running with regular expressions. That sets us up for the next episode, where we’ll build a fun one-page PHP application to test our own regular expressions with. Let’s jump right in.


Matching Literal Text

The simplest kind of regular expression matches literal characters. Of course, since it is the easiest, it is also the least powerful and useful. It does show us how these things work, however, so let’s take a look. Here is our first regular expression:

/yaba daba/

There are a few things to note here. First, you’ll notice forward slashes at the beginning and end of the pattern. These are delimiters that signify where the pattern starts and stops. When you have a pattern, you need a subject to match against. Let’s use the example sentence “I have no idea why I chose yaba daba in the regular expression pattern.” Our regular expression will match the string “yaba daba” in the subject, and nothing else. The pattern /Yaba Daba/ would not match at all, due to case sensitivity.
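To try the literal match outside of a regex tester, here is a minimal sketch in Python. Python is my choice for illustration, not something this episode uses. Its re module takes the pattern without the surrounding slashes, and the re.IGNORECASE flag is one way around the case sensitivity mentioned above.

```python
import re

subject = "I have no idea why I chose yaba daba in the regular expression pattern"

# Literal match: the pattern is simply the text we are looking for.
literal = re.search(r"yaba daba", subject)
print(literal.group())

# Case matters, so this finds nothing and re.search returns None.
wrong_case = re.search(r"Yaba Daba", subject)
print(wrong_case)

# Opting out of case sensitivity with a flag.
relaxed = re.search(r"Yaba Daba", subject, re.IGNORECASE)
print(relaxed.group())
```

Note that the relaxed match still returns the text as it appears in the subject, not as it was typed in the pattern.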

Metacharacters

At this point it’s good to mention the special characters in regular expressions. Here is a table of them.

Metacharacter   Meaning
\               the escape character, used for several things
^               refers to the start of the string
$               refers to the end of the string
.               powerful: match any character except newlines
[               begin a character class
]               end a character class
|               the pipe is for alternation, basically an ‘or’
(               start a capture group
)               end a capture group
?               several uses: makes the preceding token optional, makes a quantifier lazy, and appears in special group syntax
*               match zero or more
+               match 1 or more
{               begin a number range match
}               end a number range match

Fantastic. We have a nice little overview of what all these characters do. Fear not, if they mean nothing right now, they will make more sense as we move on.

Since the characters above have special meaning, if you are trying to match them literally in a string, you must escape them in your regular expression. Consider the string Do you know what 2 + 2 is equal to? {we will soon find out} [haha]. If we apply the regular expression /2 + 2/ it fails. If we try /2 \+ 2/ it now works. Again, this is because the plus sign is a special character, so it must be escaped with the backslash character. What if we need to match the text in between the curly braces? You might think that /{[a-z ]*}/ would do it, but unescaped braces are not reliable, since they normally belong to a quantifier and some engines reject a stray one outright. The safe move is to escape the special characters like so /\{[a-z ]*\}/ and now it works.
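The escaping rules above are easy to check for yourself. Here is a small sketch, assuming Python's re module rather than the tools used in this episode. The subject string is the one from the text, and re.escape is a standard-library helper that adds the backslashes for you when a piece of text is meant literally.

```python
import re

subject = "Do you know what 2 + 2 is equal to? {we will soon find out} [haha]"

# Unescaped, the + is a quantifier on the space before it, so this fails.
unescaped = re.search(r"2 + 2", subject)
print(unescaped)

# Escaped, the + is a literal plus sign.
escaped = re.search(r"2 \+ 2", subject)
print(escaped.group())

# Escaped braces around a class of lowercase letters and spaces.
braces = re.search(r"\{[a-z ]*\}", subject)
print(braces.group())

# re.escape() builds a pattern that matches the given text literally.
auto_escaped = re.escape("2 + 2")
print(re.search(auto_escaped, subject).group())
```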


Character Classes In Regular Expressions

This makes a nice segue into character classes. In fact, that very last regular expression we used contained a character class. We can start by reviewing the prior example to see how it works, and we’ll make use of an incredible online tool over at regexr. What’s really cool about regexr is that when you paste in your regular expression, you can hover over each character for its meaning, which is really slick. Let’s examine the prior example.

/\{[a-z ]*\}/

Within this pattern is a character class. It is this piece of the pattern: [a-z ] By hovering over each character at regexr we can find the meaning. So in order, [ opens the character set, a-z says to match any lowercase letter from a to z, then we have a space character, then the ] closes the character set. Here is what it looks like at regexr.

[Image: regular expression character class]

Slick! The best thing to do is to simply paste in various subjects and patterns and have a play for yourself. If you need all the nitty gritty on character classes head right here.
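If you would rather have a play from a script, here is a rough sketch in Python (my assumption for illustration, since the episode itself works in regexr). It shows why the space inside [a-z ] matters, and that square brackets need escaping when you want them literally.

```python
import re

subject = "Do you know what 2 + 2 is equal to? {we will soon find out} [haha]"

# With the space inside the class, the whole braced phrase matches.
with_space = re.findall(r"\{[a-z ]*\}", subject)
print(with_space)      # ['{we will soon find out}']

# Without the space, the class cannot cross the gaps between words.
without_space = re.findall(r"\{[a-z]*\}", subject)
print(without_space)   # []

# Square brackets are metacharacters too, so literal ones are escaped.
bracketed = re.findall(r"\[[a-z]*\]", subject)
print(bracketed)       # ['[haha]']
```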


Digits, Word Characters, and Whitespace Characters

There are some convenient shorthand character classes for digits, word characters, and whitespace characters. These are handled by \d, \w, and \s respectively. Easily see these in action right here.

[Image: regular expression digit shorthand character class]
[Image: regular expression word shorthand character class]
[Image: regular expression space shorthand character class]
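The three shorthand classes can also be sketched in a few lines of Python, assuming the same test string as before. Python is my pick for the example, and the + after \w is the "1 or more" quantifier from the metacharacter table, used here so that whole words come back instead of single characters.

```python
import re

subject = "Do you know what 2 + 2 is equal to? {we will soon find out} [haha]"

# \d matches a single digit.
digits = re.findall(r"\d", subject)
print(digits)      # ['2', '2']

# \w matches a word character (letter, digit or underscore).
words = re.findall(r"\w+", subject)
print(words[:4])   # ['Do', 'you', 'know', 'what']

# \s matches a whitespace character, here the spaces between tokens.
spaces = re.findall(r"\s", subject)
print(len(spaces))
```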


Regular Expression Alternation

As ever, a simple concept is given a fancy term. As your mind explodes from the myriad of technical terms and acronyms that come with working in high tech, regular expression jargon keeps delivering, this time with alternation. What is alternation? Choose this or that. Done. Next lesson.

A bit tongue in cheek of course, but the idea is simple. It works much like the logic you might find in an if statement when programming where you say if this condition or that condition, or that condition, and so on. For example, continuing with the simple string of text we’ve been working with so far, we will apply a pattern that includes alternation in it. Here is the pattern /(to|we|ha)/ and it simply means to match to or we or ha in the subject. Regexr shows us this result.

[Image: regular expression alternation]

Awesome! Prior to regexr one had to make use of something like RegexBuddy, which is a great piece of software, but it is not online and it is not free. Thanks http://gskinner.com/ for making this free tool for all of us to use! Here is what the PHP manual has to say about alternation.
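The alternation pattern /(to|we|ha)/ can be tried from a script as well. This is a minimal sketch in Python, which is my assumption for illustration. With a single capture group, re.findall returns what the group captured for each match, in the order the matches appear in the subject.

```python
import re

subject = "Do you know what 2 + 2 is equal to? {we will soon find out} [haha]"

# Match "to" or "we" or "ha", wherever they appear, even inside longer words.
matches = re.findall(r"(to|we|ha)", subject)
print(matches)   # ['ha', 'to', 'we', 'ha', 'ha']
```

The first ha comes from the middle of the word what, which is a good reminder that alternation on its own does not care about word boundaries.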


The Dot Character and the Asterisk Character

The de facto bazooka of regular expressions comes in the form of this simple combination: the dot followed by the asterisk, or .* Look at how unassuming that little combination of two characters is. Just a simple dot and asterisk. What this means, however, is to match any character (except a newline), any number of times, including zero. It can be used in certain situations, but most teachings will advise using this combination only as a last resort. In fact, if we plug this pattern into regexr, it gives us an infinite warning, because the pattern is also happy to match nothing at all.

[Image: regular expression dot and asterisk]

Of course like a good hacker, this is the only pattern of regex I use when hacking in a playground environment, but seriously folks, careful with the big guns.
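To get a feel for how big this gun is, here is a small sketch in Python (an assumption on my part, the episode only shows the pattern in regexr). It shows .* swallowing a whole line, matching an empty string, and stopping at a newline.

```python
import re

subject = "Do you know what 2 + 2 is equal to? {we will soon find out} [haha]"

# .* happily consumes the entire line.
whole = re.search(r".*", subject).group()
print(whole == subject)   # True

# It also matches when there is nothing to match at all.
empty = re.search(r".*", "").group()
print(repr(empty))        # ''

# By default the dot stops at a newline.
first_line = re.search(r".*", "line one\nline two").group()
print(first_line)         # line one
```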


Greed is Good

Greed is good, or so said some nutjob on Wall Street many moons ago. Speaking of Wall Street, did you see that movie with that dude from the Titanic? Great flick, those guys were crazy. Anyhoo… In regular expressions we have this concept of greediness. When something is greedy in the regex sense, it will repeat the match as many times as it possibly can before it stops matching. If there is anything that can trip you up when working with regular expressions, it is definitely greediness vs laziness. When is a particular token greedy? When is it lazy? How does this affect the pattern I am trying to use? You’ll need to take all of these things into consideration when building your own patterns. Beyond rote memorization, which does have its merits, it is usually a matter of visiting a tool like regexr and simply testing out the various quantifiers to see what works. Let’s see an example of greediness in action. Greediness is a property of quantifiers such as * and +, so we’ll make use of the pattern /\w+/, where \w is a word character and the + quantifier is greedy. This is the result on our test string; note that every single word character in the string ends up in a match, because each match runs as far as it can.

[Image: regular expression greedy]

Turn Down for What?

How can we turn off that greediness? Simply add the question mark character after a quantifier, so that the greedy /\w+/ becomes the lazy /\w+?/, which stops each match as early as it can. See the result.

[Image: regular expression lazy]
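Greedy versus lazy is easiest to see side by side. Here is a minimal sketch in Python, where both the language and the sample string with two braced groups are my own assumptions for illustration. Greediness belongs to quantifiers such as * and +, and a question mark placed right after the quantifier makes it lazy.

```python
import re

# Hypothetical sample with two braced groups, chosen to show the difference.
sample = "{one} and {two}"

# Greedy: .* runs to the LAST closing brace it can find.
greedy = re.findall(r"\{.*\}", sample)
print(greedy)        # ['{one} and {two}']

# Lazy: .*? stops at the FIRST closing brace it can find.
lazy = re.findall(r"\{.*?\}", sample)
print(lazy)          # ['{one}', '{two}']

# The same idea with word characters.
greedy_word = re.findall(r"\w+", "haha")
lazy_word = re.findall(r"\w+?", "haha")
print(greedy_word)   # ['haha']
print(lazy_word)     # ['h', 'a', 'h', 'a']
```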


Lookaheads and Lookbehinds

So far we have only scratched the surface of regular expressions. If you’re new to them, your head is probably spinning. If you’re already familiar with them, this episode is nothing more than a refresher for you. In any event, we can now take a look at a regex favorite: the ability to look ahead of or behind the pattern match to determine the delimiters, so to speak, of what characters we will capture. These delimiters are just that; they are not included in the actual match. Let’s check out how these guys work.

Positive Lookahead

First up, we’ll take a look at the positive lookahead. This tells the pattern to look ahead, that is, to look after the main pattern for a secondary pattern, and only match if that secondary pattern exists after the main pattern. It sounds strange, so let’s just look at how this works. First, let’s find any sequence of 3 word characters like so.

[Image: regular expression no look ahead]

Ok, pretty neat. You can see we find any sequence of 3 characters in a row and it works pretty well. Now, let’s change it up. Let’s find three characters *only* if they are followed by a curly brace like this }. How can we do such a thing? We can do it with the positive lookahead just like this.

[Image: regular expression positive look ahead]

You see that, partner? Very cool: a match is found and the delimiter, or the specified character of the positive lookahead, is not contained in the match. This is immensely useful! Once again, (?=) is the syntax for a positive lookahead, where the thing that must follow goes after the equals sign.
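The two steps described above can be sketched in Python. Both the language and the exact patterns are my assumptions: the patterns from the screenshots are not spelled out in the text, so \w{3} and \w{3}(?=\}) are a plausible reading of "3 word characters" and "only if followed by }".

```python
import re

subject = "Do you know what 2 + 2 is equal to? {we will soon find out} [haha]"

# Step 1: any run of three word characters (assumed pattern).
triples = re.findall(r"\w{3}", subject)
print(triples)

# Step 2: the same three characters, but only when a } comes right after.
# The lookahead (?=\}) checks for the brace without consuming it.
before_brace = re.findall(r"\w{3}(?=\})", subject)
print(before_brace)   # ['out']
```

The closing brace is required for the match to succeed, yet it is not part of the returned text.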

Positive Lookbehind

We can do the same thing for matching patterns in instances where we would like to look *before* the main pattern for a specific piece of text or a specific character. This is the positive lookbehind. We’ll modify our regex to match only 2 lowercase letters in a row *only* if they are preceded by a curly brace like this {. This can be accomplished with the pattern of (?<={)[a-z]{2} which will match we in the string Do you know what 2 + 2 is equal to? {we will soon find out} [haha]. Note that (?<=) is the syntax for the positive lookbehind.
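Here is the lookbehind as a minimal sketch in Python (my choice of language for illustration). The pattern is the one from the text, with the brace escaped to be on the safe side.

```python
import re

subject = "Do you know what 2 + 2 is equal to? {we will soon find out} [haha]"

# Two lowercase letters, but only when a { sits right before them.
match = re.search(r"(?<=\{)[a-z]{2}", subject)
print(match.group())                # we

# The brace is checked but not captured: it sits just before the match.
print(subject[match.start() - 1])   # {
```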

Negative Lookahead

The inverse of the positive lookahead and lookbehind are the negative lookahead and lookbehind. These are the exact opposite of the prior examples. Basically, only match the given pattern if it is *not* followed by or preceded by a given character or string. For example, if we want to find all groups of 3 characters *only* if they are not followed by a curly brace like this }, we can do that.

[Image: negative lookahead]

Ah, yes. Look at that. Ain't she a thing of beauty? All of those nice three character combinations, but wait, look at the word out right before the curly brace. It is not highlighted as a match. Yes that's right, that's due to the negative lookahead. Note that the syntax for the negative lookahead is (?!) where the thing that you do not want to match comes after the exclamation point.
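The negative lookahead can be checked the same way. This is a sketch in Python, and the pattern \w{3}(?!\}) is my assumption about what the screenshot showed, based on the description of three characters not followed by a curly brace.

```python
import re

subject = "Do you know what 2 + 2 is equal to? {we will soon find out} [haha]"

# Three word characters, but only when a } does NOT come right after them.
not_before_brace = re.findall(r"\w{3}(?!\})", subject)
print(not_before_brace)
```

Every three character run from the earlier example is still there except out, which is the one that sits directly in front of the closing brace.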

Negative Lookbehind

Just like we have a negative lookahead, we have a negative lookbehind, and the syntax for this is (?<!) where the thing that you do not want to match comes after the exclamation point.
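No example pattern is given for the negative lookbehind, so here is a hypothetical one as a Python sketch: two letter lowercase words, skipping any that come right after an opening curly brace. The \b word boundary token is my addition to keep the matches to whole words.

```python
import re

subject = "Do you know what 2 + 2 is equal to? {we will soon find out} [haha]"

# Without the lookbehind: every standalone two letter lowercase word.
two_letter = re.findall(r"\b[a-z]{2}\b", subject)
print(two_letter)      # ['is', 'to', 'we']

# With (?<!\{): drop the one that is preceded by an opening brace.
not_after_brace = re.findall(r"(?<!\{)\b[a-z]{2}\b", subject)
print(not_after_brace) # ['is', 'to']
```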

Conclusion

Regular expressions are dry stuff, folks, much like the Mojave. In fact, the more outlandish the writing, the more boring the topic being covered. At about 1900 words, that's just about all I can muster for regular expressions today. I think we covered some good starting points for regular expressions in this episode. They may be a bit dry, but they are immensely powerful, and when you need them, they just might be the only way to solve a difficult string or character related problem. Tune in again when we build our very own regular expression application in our very next episode!