by Emily St.

Regular expressions are short pieces of text (often I’ll call a single piece of text a “string,” interchangeably) which describe patterns in text. These patterns can be used to identify parts of a larger text which conform to them. When this happens, the identified part is said to match the pattern. In this way, unknown text can be scanned for patterns, ranging from very simple (a letter or number) to quite complex (URLs, e-mail addresses, phone numbers, and so on).

The patterns shine in situations where you’re not precisely sure what you’re looking for or where to find it. For this reason, regular expressions are a feature common to many technical programs which focus on using lots of text. Most programming languages also incorporate them as a feature.

One common application of regular expressions is to move through a body of text to the first part which matches a pattern—in other words, to find something. It’s then possible to build on this search capability then to replace a pattern automatically. Another use is to validate text, determining whether it conforms to a pattern and acting accordingly. Finally, you (or your program) may only care about text which matches a pattern, and all other text is irrelevant noise. With regular expressions, you can cull a large text down to something easier to use, more meaningful, or suitable for further manipulation.

A Simple First Regular Expression

A regular expression, like I said, is itself a short piece of text. Often, it’s written in a special way to set it apart as a regular expression as opposed to normal text, usually by surrounding it with slashes. Whenever I write a regular expression in this post, I will also surround it with slashes on both sides. For example, /a/ is a valid regular expression which matches the string a. That particular expression could be used to find the first occurrence of the letter a in a longer string of text, such as, Where is the cat?. If the pattern /a/ were applied against that sentence, it would match the a in the middle of cat.

There’s a clear benefit to using regular expressions to do pattern matching in text. They let you ask for what you want rather than specifying how to find it. To be technical, we’d say that regular expressions are a kind of declarative syntax; contrast that with an imperative method of asking for the same thing. In this case, to do this in an imperative way, you’d have to write instructions to loop through each letter in the text, comparing it to the letter a. In the case of regular expressions, the how isn’t our problem. We’re left simply stating the pattern and letting the computer figure it out.

Regular expressions are rather rigid and will only do what you say, sometimes with surprising results. For example, /a/ only matches a single occurrence of a, never A, nor à, and will only match the first one. If it were applied to the phrase “At the car wash”, it would match against the first a in car. It would skip over the A at the beginning, and it would stop looking before even seeing the word wash.

As rigid as regular expressions are, they have an elaborate syntax which can describe vast varieties of patterns. It’s possible to create patterns which can look for entire words, multiple occurrences of words, words which only happen in certain places, optional words, and so on. It’s a question of learning the syntax.

While I intend to touch on the various features which allow flexible and useful patterns, I won’t exhaust all the options here, and I recommend consulting a syntax reference once the idea feels solid. (Before getting into some of the common features of regular expression syntax, it’s important to note that regular expressions vary from implementation to implementation. The idea has been around a long time and has been incorporated into countless programs, each in slightly different ways, and there have been multiple attempts to standardize them. Despite the confusion, though, there is a lot of middle ground. I’m going to try to stay firmly on this middle ground.)

Metacharacters

Let’s elaborate a bit on our first pattern. Suppose we’re not sure what we’re looking for, only that we know it begins with a c and ends with a t. Let’s think about what kinds of words we might want to match, so we can talk intelligently about what patterns exist in those words. We know that /a/ matches cat. What if we want to match cut instead? We could just use /u/, but we know this also matches unrelated strings, like bun or ambiguous.

Now, /cat/ is a perfectly reasonable pattern, and so is /cut/, but we’d probably have an easier go if we create a single pattern that says we expect the letter c, some other letter we don’t care about, and then the letter t. Regular expressions let us use metacharacters to describe the kinds of letters, numbers, or other symbols we might expect to find without naming them directly. (“Character” is a useful word to encompass letters, numbers, spaces, punctuation, and other symbols—anything that makes up part of a string—so “metacharacter” is a character describing other characters.) In this case, we’ll use a .—a simple dot. In regular expression patterns, a dot metacharacter matches any individual character whatsoever. Our regular expression now looks like /c.t/ and matches cat, cut, and cot, among other things.

In fact, we might describe metacharacters as being any character which does not have its literal meaning, and so regular expressions may contain either characters and metacharacters. Occasionally, it can be confusing to know which is which. Mostly, it will be necessary to consult a reference for regular expressions which best suits your situation. Sometimes, even more confusingly, we want to use a metacharacter as a character, or vice versa. In that situation, we need to escape the character.

Escaping

We can see in the above example that a dot has a special meaning in a regular expression. Sometimes, though, we might wish to describe a literal dot in a pattern. For this reason, we need a way to describe literal characters which don’t carry their ordinary meaning, as well as employ ordinary characters for new meanings. In a regular expression pattern (as in many other programming languages), a backslash (\\) does this job. Specifically, it means that the character directly after it should not be interpreted as usual.

Most often, it can be used to define a pattern containing a special character as an ordinary one. In this context, the backslash is said to be an escape character, which lets us write a character while escaping its usual meaning.

For example, suppose we cared about situations where a sentence ends in the letter t. The easiest pattern to describe that situation might be the letter, followed by a period and a space, but we can’t type a literal dot for that period, or else we’d match words like to. Therefore, our pattern must escape the dot. The pattern we want is written as /t\. /.

Quantifiers

Metacharacters may do more than stand in for another kind of character. They may modify the meaning of characters after it (as we’ve already seen with the escape metacharacter) or those before it. They may also stand in for more abstract concepts, such as word boundaries.

Let’s first consider a new situation, using a metacharacter to modify the preceding character. Think back to earlier, when we said we know we want something that begins with a c and ends with a t. Using the pattern /c.t/, we already know that we can match words like cut and cat.

We need a few more special metacharacters, though, before our expression meets our requirements. /c.t/ won’t match, for example, carrot, but it will match concatenate and subcutaneous.

First of all, we need to be able to describe a pattern that basically leaves the number of characters in the middle flexible. Quantifiers allow us to describe how many occurrences of the preceding character we may match. We can say if we expect zero or more, one or more, or even a very particular count of a character or larger expression.

Such patterns become far more versatile in practice. Take, for example, the quantifier +. It lets us specify that the character just before it may occur one or more times, but it doesn’t name an upper limit.

Remember the pattern we wrote to match sentences ending in t? What if we wanted to make sure we matched all the spaces which may come after the sentence? Some writers like to space twice between sentences, after all. In that case, our pattern could look like /t\. +/. This pattern describes a situation in which the letter t is followed by a literal dot and then any number of spaces.

Quantifiers may also modify metacharacters, which make them truly powerful and very useful. Using the + again, let’s insert it into our /c.t/ pattern to modify the dot metacharacter, giving us /c.+t/. Now we can match “carrot”! In fact, this pattern matches a c followed by any number of any character at all, as long as a t occurs sometime later on.

There are a few other quantifiers needed to cover all the bases. The following three quantifiers cover the vast majority of circumstances, in which you’re not particularly sure what number of characters you intend to match:

  • * matches zero or more times
  • + matches one or more times
  • ? matches exactly once or zero times

On the other hand, you may have a better idea about the minimum or maximum number of times you need to match, and the following expressions can be used as quantifiers as well.

  • {n} matches exactly n times
  • {n,} matches at least n or more times
  • {n,m} matches at least n but not more than m times

Anchors

We still have “concatenate” and “subcutaneous” to deal with, though. /c.+t/ matches those because it doesn’t care about what comes before or after the match. One strategy we can use is to anchor the beginning or end of the pattern to stipulate we want the text to begin or end there. This is a case where a metacharacter matches a more abstract concept.

Anchors, in this case, let us match the concept of the beginning or the end of a string. (Anchors really refer to the beginning and ends of lines, most of the time, but it comes to the same thing in this case. See a reference guide for more information on this point.) The ^ anchor, which may only begin a pattern, matches the beginning of a string. Likewise, a $ at the end means the text being matched must end there. Using both of these, our pattern becomes /^c.+t$/.

To break this pattern down, we’re matching a string which begins with a c, followed by some indeterminate number of characters, and finally ends with a t. As ^ and $ represent the very beginning and end of the string, we know that we won’t match any string containing anything at all on the line other than the pattern.

Character Classes

Using anchors, though, may not be the best solution. It assumes the string we’re searching within may only contain the pattern we’re looking for, and so often, this is not the case.

The dot is a very powerful metacharacter. Its biggest flaw is that it is too flexible. For example, /^c.+t$/ would match a string such as cat butt. Patterns try to match as much as possible. Some regular expression implementations allow you to specify a non-greedy pattern (which I won’t cover here—see a reference), but a better approach is to revisit our requirements and reword them slightly to be more explicit.

We want to match a single word (some combination of letters, unbroken by anything that’s not a letter) which begins with c and ends with t. Considering this in terms of the kinds of characters which may come before, during, and after the match, we want to match something which contains not-alphabetical characters before it, followed by the letter c, then some other alphabetical letters, then the letter t, and then something else that’s not alphabetical.

In the /^c.+t$/ pattern, we need to replace both of the anchors and the middle metacharacter .. Assuming words come surrounded by spaces, we can replace each anchor with just a space. Our pattern now looks like / c.+t /.

Now, as for the dot, we can use a character class instead. Character classes begin and end with a bracket. Anything between is treated as a list of possibilities for the character it may match. For example, /[abc]/ matches a single character which may be either a, b, or c. Ranges are also acceptable. /[0-9]/ matches any single-digit number.

We can use a range which captures the whole alphabet, and luckily, a character class is considered a single character in the context of a pattern, so the quantifier after refers to any character in the class. Putting all this together, we end up with the pattern / c[a-z]+t /.

If we want to mix up upper- and lower-case letters, character classes help in this situation, too: / [Cc][a-z]+t /. Now we can match on names like Curt.

Our assumption that words will be surrounded by spaces is a fragile one. It falls apart if the word we want to match is at the very beginning or end, or if it’s surrounded by quotation marks or other punctuation. Luckily, character classes may also list what they do not include by beginning the list with a ^. When ^ comes within brackets, instead of at the beginning of a pattern, instead of serving as an anchor, it inverts the meaning of the character class.

If we consider a word to be a grouping of alphabetical characters, then anything that’s around the word would be anything that’s not alphabetical. Let’s adjust our pattern accordingly: /[^A-Za-z0-9][Cc][a-z]+t[^A-Za-z0-9]/. We’re using the same pattern as before, but the beginning and ending space have become [^A-Za-z0-9].

Escape Sequences

If our pattern is starting to look cumbersome and odd to you, you’re not alone in thinking that. There’s absolutely nothing wrong with the pattern we just wrote, but it has gotten a bit long-winded. This makes it difficult to read, write, and later update.

In fact, many character classes get used so often (and can otherwise be so annoying to write repeatedly) that they’re usually also available as backslashed sequences, such as \b or \w. (This is escaping, again, as I mentioned before, but instead of escaping a special character’s meaning, we’re escaping these letters’ literal meaning. In other words, we’re imbuing them with a new meaning.)

The availability and specific meaning of these escape sequences vary a bit from situation to situation, so it’s important to consult a reference. That said, in our case, we only need a couple which tend to be very common to find.

One of the very most common such escape sequences is the \w which stands in for any “word” character. For our purposes, it matches any alphanumeric character. This is good enough for the inside of a word, so we can revisit our pattern and turn it into /[^\w][Cc]\w+t[^\w]/. Our pattern reads a little more logically now: We’re searching for one not-word character (like punctuation or whitespace) followed by an upper- or lower-case c, some indefinite count of word characters, the letter t, and then finally one not-word character.

Notice how I used the escape sequence inside the character classes at the beginning and end of the word. This is perfectly valid and sometimes desirable. For example, it would allow us to combine escape sequences for which there’s no single suitable one.

It also lets us invert their meaning, as you saw in the most recent example, but many escape sequences can be modified in the same way by capitalizing them, such as \W. As a mnemonic to remember this trick, think of it as shifting the escape sequence (using shift to type it). In cases where a character class may be inverted in meaning, often a capitalized counterpart exists.

Using \W, now we can pare down the pattern back to something a little more readable: /\W[Cc]\w+t\W/.

More Reading

For today, I’m satisfied with our pattern. In a string like I would like some carrot cake., it matches carrot with no trouble, but it doesn’t match subcutaneous tissue.

There are many more ways to improve it, though. We’ve only laid the groundwork for understanding more of the advanced concepts of regular expressions, many of which could help us make our expression even more powerful and readable, such as pattern qualifiers and zero-width assertions.

Concepts like grouping allow you to break up and manipulate matches in fine-grained ways. Backtracking and extended patterns allow patterns to make decisions based on what they’ve already seen or will see. Some programmers have even written entire programs based on regular expressions, only using patterns!

In short, regular expressions are a deep and powerful topic that very few programmers completely master every corner of. Don’t be afraid to keep a reference close at hand—hopefully it will now empower you instead of daunt you, now that you have a grasp of how to get started composing patterns.

Emily St.: “I’m a backend engineer for Simple and am interested in how things work.”