Stop Being Afraid Of Regular Expressions: 9 Essential Patterns You Should Learn

Learn the Essential Patterns of Regular Expressions – photo courtesy Annie Spratt, Unsplash

Regular expressions have a somewhat daunting reputation. Arguably one of the most useful inventions in computing history, they are nonetheless regarded with equal measures of awe and dread even by some experienced programmers.

Why is this? Regular expressions are actually straightforward to understand and use. But their reputation has not been helped by people copying large and complex patterns from websites such as Stack Overflow without understanding the basics first. It’s true that they can be terse and obfuscated, and they lend themselves to the kind of “code golf” some people like to show off with, but that’s simply a by-product of their concise syntax.

In this article I’ll concentrate on the essentials you need to get up and running with regular expressions. It’s a big topic and it can be hard to know where to start. There are a lot of resources to help, but it’s easy to dive too deep too quickly.

One great thing about regular expressions is that you can start to use them immediately, trying out new patterns as you go. The best way to learn is by doing, so I encourage you to do this, using a text editor that supports regular expression searches.

Why Should You Learn Regular Expressions?

So why should you learn regular expressions? Here are just a few reasons:-

  • You’ll be able to use command-line tools like grep and sed to their full power
  • You’ll be able to quickly clean up and reformat arbitrary data files
  • Your ability to search and replace in text editors will be vastly improved: you’ll be able to find generic syntax elements and rearrange code layout with ease.
  • Unlike learning a programming language, regular expressions transcend languages, so they’re a long-term investment
  • Most modern programming languages have regular expression libraries, so you’ll be able to greatly improve your string-parsing code whatever language you use

Regular Expression, regex or regexp?

Let’s get this out of the way right now.

Regular expression is a bit of a mouthful and you’ll often hear it shortened to regex, or sometimes regexp, which to my ears, especially when pluralised, is actually harder to say than the original.

You may occasionally hear purists arguing that a regex is not the same as a regular expression, because regular expressions are a pure concept based on the work of mathematician Stephen Kleene in the 1950s, whereas regexes are an impure syntactical construct influenced (or is it polluted?) by Larry Wall’s Perl programming language.

Frankly, those purists can stay in their ivory towers. Whatever the right and wrong of the argument, I’m calling them regexes.

Walk Faster Than You Can Run

Like learning any new skill, with regular expressions you really have to start with the basics. I like to think of them as akin to musical notation: a child learning to sight read music would never start with a piece by Rachmaninov; they would start with individual notes and move on to something like Twinkle, Twinkle Little Star.

But unlike musical notation, the great thing about regexes is that with even the basics you can perform some powerful tasks that would be very, very hard to do in any other way. Countless times, I have met programmers writing reams of fragile string-parsing code to do a job that a few characters of regular expression can achieve.

So with regexes, even though you must learn to walk before you can run, your walking speed will soon leave the non-regex sprinters standing!

Regular Expression Concepts

Before diving in, there are a couple of concepts to get familiar with.

Metacharacters

Regexes are like wildcards on steroids. They use special characters (known as metacharacters) to represent not only sets of characters but also positions (called anchors) in the search text.

There are fourteen such metacharacters. I’ll lay them out here for you straight up. This is so you don’t use them in a pattern before you’re familiar with their special meaning:-

\ | ( ) [ ] { } ^ $ * + ? .

In reality, they won’t all cause you problems all the time, because some of them only take on special meanings in certain contexts within a regex pattern. But if you see strange behaviour and you’re using any of these characters without knowing what their special meaning is, that should be a warning sign.

Escaping a Metacharacter’s Special Meaning

If you need to ignore a character’s special meaning, there’s no big secret: the backslash character \ is called an escape character and when you put it in front of another metacharacter it removes its special meaning. So to search for a literal question mark, use \? in the pattern. And to search for a backslash, use \\.
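As a small illustration, here is how escaping might look in code. This is only a sketch: Python and its standard re module are my choice for the example, and the sample strings are made up.

```python
import re

# A raw string (r"...") stops Python itself from interpreting the backslash,
# so the regex engine receives the characters \ and ? unchanged.
literal_question = re.compile(r"\?")
literal_backslash = re.compile(r"\\")

print(bool(literal_question.search("Is this a dog?")))   # True
print(bool(literal_question.search("This is a dog")))    # False
print(bool(literal_backslash.search(r"C:\temp")))        # True

# Without escaping, + and ? keep their special meanings and the search fails.
print(bool(re.search("1+1=2?", "Is 1+1=2? Yes.")))       # False

# re.escape adds the backslashes for you, which is handy when the text
# to search for comes from somewhere else.
escaped = re.escape("1+1=2?")
print(bool(re.search(escaped, "Is 1+1=2? Yes.")))        # True
```

The raw-string prefix is a Python detail rather than part of the regex itself; other languages have their own ways of getting a backslash through to the regex engine.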

What Even Are Regular Expressions? How Do I Use One?

Regexes suffer from being a slightly abstract concept. The regex engine is a piece of code that can be embedded in many other programs.

The most common places you’ll find regex support are:-

  • inside programming languages
  • in text editor search and replace functions
  • in some command-line tools such as grep and sed

In some contexts you’ll see regular expression patterns delimited with slashes, for instance /pattern/, m/pattern/ or s/pattern/replacement/. This is because some environments need a way of delimiting a regex pattern. Perl in particular uses this syntax, and some other programming languages require delimiters, but not all. The command-line utility sed uses the s/pattern/replacement/ substitution syntax. grep on the other hand requires no delimiters.

What many people don’t realise is that the slash delimiters have nothing to do with the regex pattern. In Perl, for example, the delimiters don’t even need to be slashes (you can use just about any other character instead). They’re just a way of delimiting the regex from the Perl language syntax; many other languages use quotes instead.

In this article I’ll leave out the delimiters since I consider them an implementation detail. Instead I’ll concentrate on the regex pattern itself.

Learning Regexes With a Text Editor

A good way to start is to learn to use regular expressions in a textual search context. Many common text editors such as Sublime, vim, emacs, and most programmer IDEs support a regex search mode (sometimes called “grep” mode).

When you use a regex in a text editor search, the regex pattern is compared against the target text and it may or may not match against the text. There are two kinds of match: partial and entire (or complete).

A partial match means the regex pattern matches against part of the target text. For example, the search pattern a partially matches the text cat. An entire match is what it sounds like: the pattern must match against the entire text, just as the search pattern cat entirely matches the text cat.

The great thing about learning with a text editor is that it shows you these matches by highlighting the matched text. This is a convenience you lose when you start using regexes in other environments such as programming languages.
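To give a feel for what you lose (and what you get instead) outside a text editor, here is a rough sketch of the partial versus entire distinction in Python. I'm assuming the standard re module, where search looks for a partial match and fullmatch insists on an entire one.

```python
import re

text = "cat"

# re.search looks for the pattern anywhere in the text: a partial match.
partial = re.search(r"a", text)
print(partial.group())   # a
print(partial.span())    # (1, 2), i.e. where in the text the match was found

# re.fullmatch only succeeds if the pattern accounts for the entire text.
print(re.fullmatch(r"a", text))             # None
print(re.fullmatch(r"cat", text).group())   # cat
```

Instead of highlighting, the match object reports what was matched and where, which is the nearest equivalent to the editor's visual feedback.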

The Essential Basic Regex Patterns

OK, time to dive in. If you only learn nine regular expression patterns, learn these…

0. Matching Literal Characters

The simplest regular expression matches a single character. For instance, the regex:-

a

entirely matches the string a. But it also partially matches the strings aa, abc, sam, mamma mia, and infinitely more. The regex a on its own just means “is the character “a” in the text?” It doesn’t care where or how many times, it’s simply looking for the character, and if it’s in the search text then the regex is said to match against the text. Often the word “partial” is implicit, so a regex match can often be assumed to be a partial match.

Of course, any combination of characters can be used in a regex. For example:-

dog

will match against the text dog, doggy, it’s a dog’s life, and so on. The regex engine will search in the text for occurrences of the pattern, and if one is found then the result is a match.

A regex is always looking for the simplest possible match. The trouble arises when what you think is simple does not align with what the regex engine thinks is simple.

OK, so I’ve numbered this pattern as zero because a literal match is technically a regex pattern but it doesn’t really feel like one. Let’s get into the interesting stuff…

1. Match Any Character, the Dot (or Period) .

The first regular expression special character is . (the dot or period) which matches any single character, alphanumeric or symbolic, including whitespace (although in most regex engines it does not match a newline by default). For example:-

.

on its own will match a single character, and

..

would match two characters. The characters it matches do not have to be the same. They can be aa, !z, or a space and a tab.

That’s not very interesting, so something more useful would be:-

d.g

matches real words dog, dig, dug, as well as nonsense text such as d0g, d!g, etc. Because the regex matches anywhere in the text, it will also partially match longer strings that include the pattern, such as: doggy, adage, and the doge that dug.

It would not match dawg, doug, or dg, because . only matches a single character between d and g.
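If you'd like to check these claims outside a text editor, a minimal sketch in Python (my choice of language, using the standard re module) might look like this. The word list simply repeats the examples above.

```python
import re

pattern = re.compile(r"d.g")

# Each word is searched for a partial match; None means no match at all.
results = {}
for text in ["dog", "dig", "d0g", "doggy", "adage", "dawg", "doug", "dg"]:
    match = pattern.search(text)
    results[text] = match.group() if match else None
    print(text, "->", results[text])

# In Python the dot does not match a newline unless you ask for it
# with the re.DOTALL flag.
print(re.search(r"d.g", "d\ng"))                    # None
print(bool(re.search(r"d.g", "d\ng", re.DOTALL)))   # True
```

Note how doggy and adage report only the three characters that actually matched (dog and dag), not the whole word.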

2. Optional Matching: the Question Mark ?

Sometimes you want to match a character optionally.

For example, if you were searching for a word and you wanted to count plurals too, you could match both singular and plural forms of the word dog like this:-

dogs?

That might look slightly confusing at first, but it’s really very simple.

The ? metacharacter is called a quantifier, and it’s associated only with the entity immediately preceding it, so in this case the s. It means match the preceding entity zero or one times. So the pattern will match either dog (where the s matches zero times), or dogs (where the s matches once).

Later on you’ll learn that the real power of the ? quantifier is in matching more than single characters.
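Here is a short sketch of the optional s in action, assuming Python's standard re module. The sample sentence is invented for the example.

```python
import re

text = "One dog, two dogs, three doggies."

# findall returns every non-overlapping match, scanning left to right.
matches = re.findall(r"dogs?", text)
print(matches)        # ['dog', 'dogs', 'dog']
print(len(matches))   # 3
```

The third match comes from doggies: the s is optional, so the pattern is satisfied by the dog at the start of that word.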

3. Matching One Or More Times: the Plus +

Matching a single character is of limited use, and so there is more than one way to match multiple characters. The most common is by using the special quantifier character +.

The + character means “match one or more times.” The following:-

a+

matches aa, aaa, and so on, as well as aapl, Maastricht and an infinity of other strings with sequences of at least one “a”.

Again, this can be combined with literal characters. As with the ? you learned above, the quantifier only refers to the preceding character, so:-

Ab+a

looks for a capital A, followed by one or more bs, followed by a lowercase a. So it can match Swedish pop groups as well as Addis Ababa.

The + quantifier is associated with its preceding character (although in more advanced patterns, things get a little more complicated).

You can have as many quantifiers in a pattern as you like, so you can look for ice cream using:-

ha+gen da+sz

This is for illustration purposes of course. The only benefit this regex offers over using the literal string “haagen dasz” is that it will also match for people who don’t know how to spell it correctly, so it will match hagen daasz, haagen daasz, haaagen dasz, and so forth.

If the example above seemed somewhat artificial, it’s because the real power of + only becomes evident when it’s used with other metacharacters. One such character is one you’ve already met: the dot. When combined with the + quantifier, the dot begins to become very useful indeed.

It’s important to understand that when using the + quantifier with ., the characters are heterogeneous, that is, they don’t all have to be the same. So:-

.+

matches one or more characters: that is, any text at all containing at least one character.

To be useful, you’d generally want to use this pattern in combination with literal characters, e.g.:-

d.+g

The above pattern would match dig, dog, dawg and doug but not dg because it needs at least one character between the d and g to match.
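The + examples in this section can be tried out with a few lines of Python. This is just a sketch using the standard re module, with test strings borrowed from the text above.

```python
import re

# Every run of one or more "a" characters.
runs = re.findall(r"a+", "Maastricht aapl banana")
print(runs)   # ['aa', 'aa', 'a', 'a', 'a']

# The quantifier applies only to the b, and matching is case-sensitive.
print(re.search(r"Ab+a", "Abba").group())          # Abba
print(re.search(r"Ab+a", "Addis Ababa").group())   # Aba
print(re.search(r"Ab+a", "ABBA"))                  # None

# Several quantifiers in one pattern.
print(bool(re.search(r"ha+gen da+sz", "haaagen dasz")))   # True

# d.+g needs at least one character between the d and the g.
dplus = {w: bool(re.fullmatch(r"d.+g", w)) for w in ["dig", "dog", "dawg", "doug", "dg"]}
print(dplus)
```

Only dg fails the last test, because .+ has nothing to match between the d and the g.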

4. Matching Zero Or More Times: the Star *

Closely related to + is the * character, which has a subtle difference in meaning: it means “match zero or more times”. So:-

.*

will match zero or more characters, or any text including an empty string.

The subtle difference between + and * is important because the regex:-

z*

means match zero or more occurrences of the character z in the search text, which seems fairly innocuous until you realise that there are always zero or more of any character in any text!

So just like the unconstrained .*, the regex above will match anything at all, from an empty string to War and Peace!

This is another area where regex novices can potentially come unstuck. Because a regex pattern is so eager to match against the search text, and * allows matching against nothing at all, it’s all too easy to match strings you didn’t want to if you over-use the * character.

Again, when constrained by using it in conjunction with literal characters, the * pattern comes into its own:-

d.*g

would match dig, dog, dawg, doug and dg.
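To see the “matches nothing at all” behaviour for yourself, here is a small sketch in Python (standard re module assumed). The word cat is my own addition, included as a string that should not match.

```python
import re

# z* is satisfied by zero occurrences, so it "matches" text with no z in it.
empty = re.search(r"z*", "War and Peace")
print(repr(empty.group()))   # ''
print(empty.span())          # (0, 0): an empty match at the very start

# Constrained by literal characters, * becomes useful.
dstar = {w: bool(re.fullmatch(r"d.*g", w)) for w in ["dig", "dog", "dawg", "doug", "dg", "cat"]}
print(dstar)
```

The search for z* reports success even though it matched no characters, which is exactly the trap described above.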

5. Minimal Matching: One More Question ?

One of the difficulties regular expression novices encounter is that of “greedy” matching. The regex engine will always aggressively try to match every part (or “atom”) of the pattern to as much of the text as it can.

This can cause people to think that regexes are too hard or just don’t work. Take the following pattern:-

d.+g

This will match dog of course, but what about the text diggy dog? Will it match the dig part or the dog part? The answer is neither: it will match the entire string.

The reason is that .+ will greedily match every character between the initial d and the final g, including the space. This promiscuous matching behaviour can be confusing when you’re not expecting it. The way humans read, we tend to find the shortest patterns first, whereas the regex engine sees the longest.

Luckily there’s a way to calm it down, and this is one of the most important techniques you can learn in regexes, which will immediately set you apart from those who have just dabbled in regexes a little.

The ? character has a secondary meaning when combined with another quantifier, and the common combinations:-

*?

and

+?

are known as minimal matching. Going back to the example, if you were to use the following minimal matching pattern:-

d.+?g

then the pattern would partially match twice in the text diggy dog. The first match would be on dig and the second on dog. The +? quantifier means match at least one, but as few characters as possible.

You can also add the ? to the * quantifier to perform a minimal match zero or more times.

Think about what this means for a moment. It might sound like a contradiction, because the minimum number of times you can match something zero or more times is surely zero? This is the kind of brain-aching stuff that has earned regexes their fearsome reputation, but it makes sense when you think about it from the regex engine’s perspective.

The raw * quantifier is a greedy quantifier: despite its ability to successfully match zero times, its instinct is to match as much as possible, so the pattern d.*g will match entirely against all the following strings:-

dg

daaaaaaawg

doggety dog

The first string matches because * can match zero or more characters. The last string is matched entirely because it is, in effect, no different to the second string.

When you add the ? to *, all you’re saying is “don’t be greedy.” You’re not saying “don’t match anything.” So the pattern:-

d.*?g

will match in all the three examples above, but in the last example it will match twice, both times partially on the substring dog, because that’s the minimum valid match it can make.
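The difference between greedy and minimal matching is easiest to see side by side. Here is a sketch in Python, assuming the standard re module, whose findall function lists every match it finds.

```python
import re

# Greedy: one long match that swallows everything up to the last g.
greedy_plus = re.findall(r"d.+g", "diggy dog")
print(greedy_plus)   # ['diggy dog']

# Minimal: two short matches.
lazy_plus = re.findall(r"d.+?g", "diggy dog")
print(lazy_plus)     # ['dig', 'dog']

greedy_star = re.findall(r"d.*g", "doggety dog")
print(greedy_star)   # ['doggety dog']

lazy_star = re.findall(r"d.*?g", "doggety dog")
print(lazy_star)     # ['dog', 'dog']

# The minimal version can still match zero characters when it has to.
print(re.findall(r"d.*?g", "dg"))   # ['dg']
```

In each pair only the ? has changed, yet the number of matches and their lengths are quite different.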

6. Character Classes: the Square Brackets [ ]

Matching literal or wildcard characters is quite powerful, but regexes give you the ability to do far more. Once you begin to understand character classes, you’ll find your ability to get creative with regexes increases dramatically.

Going back to the dig, dog, dug example, we can match only those three words, and not nonsense such as d0g or d!g, with the pattern:-

d[iou]g

Here the square brackets [ and ] are metacharacters used to define a character class. The pattern [iou] matches any single character i, o or u.

You can also specify character ranges using the hyphen metacharacter -:-

[a-z]

will match any lowercase letter, and

[A-Z]

will match any uppercase letter. Combining the two sequentially, and using a quantifier from earlier, enables you to match a simple capitalised word such as a name or proper noun like this:-

[A-Z][a-z]*

This pattern looks for an uppercase letter followed by zero or more lowercase letters, which will match most proper nouns and names.

You may have noticed that the * quantifier seems to be behaving differently here than it did in the simpler examples above. You learnt that the quantifier characters act on the immediately preceding character or entity. But in this case the quantifier is preceded by a metacharacter, ], and the regex engine is smart enough to realise that the quantifier must refer to the entire character class, not the metacharacter itself, so it applies the * to the entire […] character class. In this sense, character classes behave like single characters in a pattern.

Ranges are based on Unicode character values, so be careful and don’t try to be too clever. For instance, if you wanted to match all alphabetic characters you might try:-

[A-z]+

but that’s incorrect because it would not only match dig, Dog, and dUG but also d_g which doesn’t look like what you intended. It’s because the _ character appears between the uppercase and lowercase alphabetic characters in the ASCII/Unicode encoding.

That’s not exactly obvious and you might spend some time trying to work out what went wrong. Worse still, your regex may well appear to work perfectly well on some test data, only to fail catastrophically later on when you run real-world input through it.

Luckily, there’s a safer and more intuitive way to specify the pattern:-

[A-Za-z]+

Although not as succinct or “clever”, this pattern is more readable and – importantly – correct.

You can specify multiple ranges and individual characters in the class. For example, to match hex numbers that may include a leading 0x or 0X, you could use this pattern:-

[xX0-9a-fA-F]+

That might look a little complicated but the rule to remember is this: the hyphen only applies to the single characters immediately to its left and right. So this pattern is looking for characters matching one of: x, X, 0-9, a-f or A-F.

Space characters are valid inside character classes. For instance:-

[ ]+

is a common way to match any number of spaces.
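Character classes are worth experimenting with, so here is a sketch of the examples from this section in Python (standard re module assumed). The sample sentences are my own.

```python
import re

vowels = re.findall(r"d[iou]g", "dig dog dug dag")
print(vowels)   # ['dig', 'dog', 'dug']

names = re.findall(r"[A-Z][a-z]*", "Lee met Larry Wall in Perl")
print(names)    # ['Lee', 'Larry', 'Wall', 'Perl']

# The [A-z] trap: the underscore sits between Z and a in ASCII.
print(bool(re.fullmatch(r"[A-z]+", "d_g")))      # True (probably not intended)
print(bool(re.fullmatch(r"[A-Za-z]+", "d_g")))   # False

# The hex class accepts 0x1F, but a class says nothing about position,
# so a string such as "xx" is accepted too.
hex_class = re.compile(r"[xX0-9a-fA-F]+")
print(bool(hex_class.fullmatch("0x1F")))   # True
print(bool(hex_class.fullmatch("xx")))     # True

# Runs of one or more spaces.
print(re.findall(r"[ ]+", "a  b   c"))     # ['  ', '   ']
```

The last hex result is a reminder that a character class only describes which characters are allowed, not the order they must appear in.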

7. Negated Character Classes: the Caret ^

You can also use character classes to perform an inverse match, in other words to match text that does not contain some specified characters.

For this you must begin the character class with the caret metacharacter ^, which you can read as “not”. It must appear immediately after the opening bracket like this:-

[^…]

In all other ways, negated character classes behave exactly like normal character classes.

For example, to match any character that is not the letter a, use:-

[^a]

You can also use ranges, so to match a run of one or more characters that are not digits, you can use:-

[^0-9]+

Another common use of a negated character class is the pattern to match any number of non-space characters:-

[^ ]+
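A quick sketch of negated classes in Python follows, again assuming the standard re module and some made-up sample text.

```python
import re

not_a = re.findall(r"[^a]", "banana")
print(not_a)        # ['b', 'n', 'n']

non_digits = re.findall(r"[^0-9]+", "abc123def")
print(non_digits)   # ['abc', 'def']

# Splitting text into "words" by matching runs of non-space characters.
words = re.findall(r"[^ ]+", "the doge that dug")
print(words)        # ['the', 'doge', 'that', 'dug']

# To test that a whole string is free of digits, the match must be entire.
print(bool(re.fullmatch(r"[^0-9]+", "abcdef")))      # True
print(bool(re.fullmatch(r"[^0-9]+", "abc123def")))   # False
```

Notice that [^0-9]+ on its own still finds partial matches in text that contains digits; it is the entire match that rules them out.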

8. From Beginning to End: ^ and $

The next metacharacters are so-called anchors, because they anchor the match to a particular location in the text.

For instance, the ^ character means the beginning of a line. If you wanted to match the letter a at the start of a line you’d use the expression:-

^a

This would match any string that started with the letter a.

It would also match if the letter a appeared in the middle of a string but after a newline, but newlines are quite an advanced topic in regexes so that’s beyond the scope of this discussion. For the purposes of this lesson imagine we’re only dealing with single line strings.

It’s fairly common when using regular expressions to need to match something at the start of a line. A common example is whitespace. To match a varying amount of whitespace at the start of a line, use the following pattern:-

^[ \t]*

Note that the \t sequence means the tab character. This allows the character class to match both spaces and tabs.

Similarly, the $ character matches the end of a line. The pattern:-

z$

would match a letter z at the end of the string (or line, in a multiline string).

Combining the beginning and end anchors is a very common way to force an entire regex match and reject partial matches:-

^dog$

will only match against the string dog and not against doggy or dog's life.
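Here is how the anchors might be exercised in Python. This is a sketch with the standard re module; note that Python treats ^ and $ as the start and end of the whole string unless you pass the re.MULTILINE flag, which is a detail of that library rather than of regexes in general.

```python
import re

print(bool(re.search(r"^a", "apple")))    # True
print(bool(re.search(r"^a", "banana")))   # False
print(bool(re.search(r"z$", "jazz")))     # True

# Removing leading spaces and tabs with a substitution.
stripped = re.sub(r"^[ \t]*", "", "  \tindented")
print(repr(stripped))                     # 'indented'

# Both anchors together force an entire match.
anchored = {t: bool(re.search(r"^dog$", t)) for t in ["dog", "doggy", "dog's life"]}
print(anchored)

# Line-by-line anchoring has to be requested explicitly in Python.
print(re.search(r"^a", "banana\napple"))                       # None
print(bool(re.search(r"^a", "banana\napple", re.MULTILINE)))   # True
```

Many text editors behave like the MULTILINE case by default, so the same pattern can give different results in an editor and in code.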

9. Alternating and Capturing: ( ) and |

So far you’ve learned a lot about matching characters. Before wrapping up, I want to show you that the regex engine understands more than just characters, and can match substrings (you can think of them as words, even though they’re not).

As a simple example, you can use alternation to match one of a choice of patterns like this:-

cat|dog|horse

will match either cat, dog or horse. You can have as many choices as you like. This is often one of the first regular expression patterns people learn, since the pipe delimiter looks like the logical-OR in many programming languages, and it’s a simple concept.

However, alternation is not as straightforward as it might seem. When you try to combine anchors or quantifiers with alternation patterns, things get more complicated, fast. Imagine you wanted to match the above cat, dog or horse alternation but only at the start of a line. You might try:-

^cat|dog|horse

But this would only restrict the matching of cat to the start of a line: it would happily match dog or horse anywhere in the text. That’s because the ^ anchor metacharacter associates more strongly with the surrounding text than the | character does, so the above pattern matches cat (at the start of the string), or dog (anywhere) or horse (anywhere).

One way to fix this is by repeating the caret:-

^cat|^dog|^horse

It’s valid as a one-off solution, but ultimately that way madness lies.

You need to be able to define the scope of the alternation. The way to do it is with parentheses:-

^(cat|dog|horse)

This has the advantage of being easy to read. The only disadvantage is that the parentheses now form a regex capture group and if you didn’t want to capture anything you’d need additional syntax to avoid it. However, that’s not a problem you need to care about just yet.
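The scoping problem is easy to reproduce. Here is a sketch in Python (standard re module assumed) using hotdog as a made-up test string that contains dog, but not at the start.

```python
import re

text = "hotdog"

# Without parentheses the caret only applies to cat.
print(bool(re.search(r"^cat|dog|horse", text)))      # True: dog matched anywhere

# Repeating the caret, or scoping the alternation, fixes it.
print(bool(re.search(r"^cat|^dog|^horse", text)))    # False
print(bool(re.search(r"^(cat|dog|horse)", text)))    # False

# The parentheses also capture whichever alternative matched.
match = re.search(r"^(cat|dog|horse)", "horse and cart")
print(match.group(1))   # horse
```

The captured text in group 1 is the side effect mentioned above: useful when you want to know which alternative matched, and harmless when you don't.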

Beyond the Basics

Regular expressions are a big topic, and an entire language that has a lot more to offer than these basic constructs, but what you’ve learnt so far will cover the vast majority of things you’ll need to do, and what’s more it will empower you to do things you would find hard to achieve without them.

Languages such as Perl have regular expressions integrated deep within them, to the point that Perl regex patterns can actually include code inside the regex, which can make for truly mind-bending complexity!

Yes, going beyond the basics will improve your kudos as a power user, but remember that it is likely to yield diminishing returns, because the more obscure features are rarely used and the same results can often be achieved in simpler ways. In most cases it is better to use several simpler regexes than one complex one. The kind of “code golf” that drives people to solve a problem using the most cryptic regex possible is good for showing off but of limited practical use.

If you want to learn more about regular expressions, there are a couple of good books out there. I learned pretty much everything I know from a single (very long) chapter in Larry Wall’s book Programming Perl. Jeff Friedl’s book Mastering Regular Expressions is also very worthwhile. And for very detailed information, especially concerning the differences between different regex flavours, I recommend visiting www.regular-expressions.info.

Good luck with your journey into regular expressions. Please join my mailing list for updates, and contact me or comment below if you have any questions or feedback on this article. Thanks for reading!

Lee

A veteran programmer, evolved from the primordial soup of 1980s 8-bit game development. I started coding in hex because I couldn't afford an assembler. Later, my software helped drill the Channel Tunnel, and I worked on some of the earliest digital mobile phones. I was making mobile apps before they were a thing, and I still am.
