Regex: Processing patterns in text

Many programming languages support regular expressions, or “regex” for short, which are used to find patterns in strings of text. Regex is a mini-language for describing patterns, and a regex library pairs it with utilities to extract and work with the patterns found in your text. This article introduces you to using regular expressions in your programs.

How regular expressions work

Regular expression syntax is sometimes described as a domain-specific language, or DSL; essentially, a mini-programming language. A full-blown programming language like Java or Python can do many things, but regex does one thing only: match text against patterns.

An individual regular expression is expressed as a string of characters. It describes a template for a pattern of characters to search for, or match against, in a string.

Regular expressions can be difficult to read, as every character in a regex potentially has a special significance. This is why regex has a reputation as a “write-once, read-never” language: the syntax is terse and cryptic. But with the right tools, you can easily develop your own regular expressions and make sense of those written by others.

Regex syntax

Here’s a simple example of a regular expression, which looks in a string for the sequence Hello world:


Hello world

If you’re just matching plain old letters and numbers and spaces, then all you need for the regular expression is the text you are matching against. The real power of regex, though, is that you can define conditions in the regex to capture patterns. For this, you will use certain reserved characters that have special meanings.

Capturing all characters

The simplest example of a character with special meaning is the dot (.). In regex, a dot means “any character.” So a regular expression that would match any three characters in a row would be ... (or .{3}, since a number in a {} means “match the last thing that many times”).

If you want to match an actual period, you’d use \. (a backslash followed by a period). A backslash before a special character in a regex means “match the following character literally.”
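As a rough illustration, here is how a literal pattern, the dot, and an escaped dot behave in Python's built-in re module. The sample strings are invented for this sketch, and other regex libraries expose similar calls under different names.

```python
import re

# A plain-text pattern matches itself wherever it appears in the string.
literal = re.search(r"Hello world", "Say Hello world to everyone")

# An unescaped dot matches any single character (except a newline, by default).
any_three = re.findall(r".{3}", "abcdefgh")

# An escaped dot matches only a literal period.
literal_dots = re.findall(r"\.", "a.b.c")

print(literal.group(0))  # Hello world
print(any_three)         # ['abc', 'def']
print(literal_dots)      # ['.', '.']
```

Note that the trailing "gh" in the second example is left unmatched, because .{3} needs exactly three characters.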

Capturing quantities of characters

Another special character is the question mark (?). It has different meanings based on the context, but generally, this character is used to indicate the previous thing is an optional match. This type of character is called a quantifier, as it tells regex how many times to match something.

Recall that we use {} to specify an exact number of matches. Now, let’s look at the shorthand for finding one match, no matches, or as many matches as possible in your text.

Consider this regex: The End\.? The ? indicates we want to capture either one period, or none. (The period is escaped with a backslash so that it matches a literal period rather than any character.) The plus symbol (+) and the asterisk (*) have similar meanings:

+ means “match the previous thing one or more times.”
* means “match the previous thing zero times, or any number of times.”

Here are some examples of possible variations in the regex and what each one would capture:

The End\.+ would match The End., The End.., The End... and so on, but not The End with no period.
The End\.* would match The End, The End., The End.., and so on.
The End\.? would match only The End and The End.
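The three quantifiers can be compared side by side with a small Python sketch. The helper full_matches is a hypothetical name introduced here, and the period is written as \. so that it matches a literal period rather than any character.

```python
import re

samples = ["The End", "The End.", "The End..", "The End..."]

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to finish."""
    return [text for text in candidates if re.fullmatch(pattern, text)]

plus_matches = full_matches(r"The End\.+", samples)      # one or more periods
star_matches = full_matches(r"The End\.*", samples)      # zero or more periods
optional_matches = full_matches(r"The End\.?", samples)  # zero or one period

print(plus_matches)      # ['The End.', 'The End..', 'The End...']
print(star_matches)      # ['The End', 'The End.', 'The End..', 'The End...']
print(optional_matches)  # ['The End', 'The End.']
```

The sketch uses re.fullmatch so that each sample must match in its entirety. A plain search would also report a match inside a longer string.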

Classes of characters

If you wanted to match against one of a set of possible characters, you would use [] in a character class. For instance, if you wanted to match all possible vowels, you could use [AEIOUaeiou].

Note that a character class by default only matches one character in a position. If we used [AEIOUaeiou] on Skypeia, it would match only one vowel at a time in that string, not the three in a row at the end. For that, we’d want to use one of the above quantifiers—[AEIOUaeiou]{3}, for instance—to match three vowels in a row.

You can also use a negated character class, which means “capture everything except these characters.” A negated class starts with [^, so [^AEIOUaeiou] would mean “Capture everything that’s not a vowel.” This is a handy way to do things like capture whatever is delimited in quotes, for example, "[^"]*". After the opening quote, the [^"]* portion matches every character that isn’t a quote and keeps going until it encounters the closing one.
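A short Python sketch of the character classes described above, using the Skypeia example and a made-up sentence containing quoted text:

```python
import re

vowels = r"[AEIOUaeiou]"

# A class on its own matches one character at a time.
single_vowels = re.findall(vowels, "Skypeia")

# Adding a quantifier matches a run of three vowels in a row.
vowel_run = re.findall(vowels + r"{3}", "Skypeia")

# A negated class between two quotes pulls out each quoted string.
quoted = re.findall(r'"[^"]*"', 'say "hello" and "goodbye"')

print(single_vowels)  # ['e', 'i', 'a']
print(vowel_run)      # ['eia']
print(quoted)         # ['"hello"', '"goodbye"']
```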

Capture groups

Data you capture with a regex doesn’t have to be all in a single lump. You can define parts of your regex that are meant to be broken out as their own captured elements. For this we use parentheses, (), to indicate capture groups.

For instance, if we say data:([0-9]+), that will look for the string data:, followed by one or more digits from 0 through 9. The digits, though, are saved into their own separate capture group, which can be accessed from the match object returned by your regex library.
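In Python's re module, for example, the match object exposes each capture group by number. The input string below is a made-up sample.

```python
import re

match = re.search(r"data:([0-9]+)", "header data:12345 trailer")

whole_match = match.group(0)  # the entire matched text
digits = match.group(1)       # only the first capture group

print(whole_match)  # data:12345
print(digits)       # 12345
```

Group 0 is always the full match, and numbered groups start at 1, counted by the position of their opening parenthesis.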

Capture groups and logic

Capture groups can also be used to indicate logical regions of a regular expression. If we use (hey)+ in a regex, that will match any number of occurrences of hey in a row (hey, heyhey, heyheyhey) as a single match. Note that in many regex engines, a repeated group itself retains only its last repetition.

We can also use this feature to capture one of a number of given things, by using the | character as an OR operator. The regex (hey|ho)+, for instance, will capture hey, ho, heyho, hoheyhoho, and so on.

Groups also can be marked to match, but not capture, by using (?:...) instead of (...). This is useful if you want to keep the number of capture groups down, and only capture one or two things from a larger, more complex match pattern.
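The following Python sketch shows alternation inside a repeated group, and a non-capturing group that leaves only one numbered capture. The sample strings and the second pattern are illustrative assumptions, not taken from any particular codebase.

```python
import re

# Alternation inside a repeated group matches any mix of "hey" and "ho".
mixed = re.fullmatch(r"(hey|ho)+", "hoheyhoho")
print(mixed.group(0))  # hoheyhoho
print(mixed.group(1))  # ho (in Python, a repeated group keeps its last repetition)

# A non-capturing group still groups, but does not create a numbered capture.
pattern = re.compile(r"(?:hey|ho)+ (\w+)")
found = pattern.search("heyho world")
print(pattern.groups)  # 1
print(found.group(1))  # world
```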

Other special regex characters

Some special characters in regex are used to capture common types of characters, so you don’t have to reinvent character classes for them:

\s|\S: Any whitespace (or non-whitespace) character: spaces, tabs, line breaks, etc.
\d|\D: Any digit (or non-digit) character.
\w|\W: Any word (or non-word) character. A useful way to capture characters normally surrounded by whitespace on both sides.
\b|\B: A word boundary (or non-word boundary). These match a position rather than a character: \b matches the point between a word character and a non-word character, such as the edge between a word and the whitespace or punctuation next to it.
\n: Newline or line break characters. (On Windows, line breaks are two characters, \r\n.)
^|$: Match the start (or end) of a given line or string.
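In Python these shorthands are written with a leading backslash (\d, \w, \s, \b), usually inside a raw string so the backslash reaches the regex engine unchanged. A minimal sketch with made-up sample text:

```python
import re

text = "Order 66 shipped\nto 221B Baker Street"

digits = re.findall(r"\d+", text)  # runs of digits
words = re.findall(r"\w+", text)   # runs of word characters
pieces = re.split(r"\s+", text)    # split on any whitespace, including the newline

# \b matches a position, so "to" is found only where it stands as a whole word.
whole_word = re.findall(r"\bto\b", "to and fro, into town")

print(digits)      # ['66', '221']
print(words)       # ['Order', '66', 'shipped', 'to', '221B', 'Baker', 'Street']
print(pieces)      # ['Order', '66', 'shipped', 'to', '221B', 'Baker', 'Street']
print(whole_word)  # ['to']
```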

Regex flags

When you execute a regular expression on a string, you can pass options, or “flags,” that modify how the expression executes. These often have major effects on a regular expression’s behavior—sometimes, a regex won’t work as you intend unless you use one of them.

Note that how these flags are set depends on the regex library in use. Also, these are only a few of the most common flags; the library you use may have many more.

Global: The regex should be applied to the entire string and not just stop at the first match. If you want to capture all the possible instances of a match in a string, you’ll need to enable this flag.
Multiline: When set, ^ and $ will match the beginning or ending of lines in a string, instead of the beginning or ending of the entire string. Use this flag if you’re looking for multiple matches on a pattern that has a line break as part of its structure.
Single line: This option allows the dot (.) to match newlines as well as other characters. This way, dot-captured text can span multiple line breaks if needed.
Case-insensitive: Matches are performed case-insensitively, so upper- and lowercase characters are considered the same. Useful if you have strings that haven’t been normalized to all-upper- or all-lowercase.
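As one example of how a library exposes these options, Python's re module uses constants such as re.IGNORECASE, re.MULTILINE, and re.DOTALL. It has no separate global flag, and functions such as findall return every match instead. The sample text is invented for this sketch.

```python
import re

text = "first line\nSecond line\nthird LINE"

# findall() returns every match; IGNORECASE makes "line" also match "LINE".
all_lines = re.findall(r"line", text, flags=re.IGNORECASE)

# Without MULTILINE, ^ anchors only at the start of the whole string.
start_default = re.findall(r"^\w+", text)
start_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)

# Without DOTALL (the "single line" option), the dot stops at a newline.
dot_default = re.search(r"first.*", text).group(0)
dot_dotall = re.search(r"first.*", text, flags=re.DOTALL).group(0)

print(all_lines)           # ['line', 'line', 'LINE']
print(start_default)       # ['first']
print(start_multiline)     # ['first', 'Second', 'third']
print(dot_default)         # first line
print(dot_dotall == text)  # True
```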

A simple regex example

Here’s a simple regex to capture URLs, which uses many of the details we’ve covered.


(https?)://([^/]+)/([^\s]+)

Let’s go through the regex element by element:

The https? means “capture http, and optionally an s.” The parentheses place this in its own capture group.
The :// captures the colon, and then the two forward slashes. (Note that in some implementations of regex, you’d have to escape these slashes, too.)
([^/]+) captures everything up to the first single slash, which would be the domain name and optionally the port.
([^\s]+) captures, in its own group, every character from that point forward that isn’t whitespace or a line break. Once the regex encounters a whitespace or line break, it stops. (Whitespace isn’t permitted in a valid URL.)

This gives us a capture with three groups in it: the protocol (http or https), the domain name, and the URL path. The resulting captures can then be processed further—either with other regular expressions or with other libraries for specific tasks, such as verifying whether or not a given domain exists.
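Here is a minimal sketch of the URL pattern applied in Python, where the last group is written as [^\s]+ ("one or more non-whitespace characters"). The sample sentence and the example.com address are hypothetical.

```python
import re

# The URL pattern: protocol, then domain (and optional port), then path.
url_pattern = re.compile(r"(https?)://([^/]+)/([^\s]+)")

# Hypothetical sample text for illustration.
text = "See https://example.com:8080/docs/index.html for details."

match = url_pattern.search(text)
protocol, domain, path = match.groups()

print(protocol)  # https
print(domain)    # example.com:8080
print(path)      # docs/index.html
```

Note that the slash between the domain and the path is matched but not captured, so the path group does not begin with a slash.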

The sample regex doesn’t try to cover all the possible permutations of a URL, just the most basic patterns. But regular expressions shouldn’t try to capture every possible variant of a pattern. They’re best when used to capture the most general version of a pattern, and for providing a convenient way to break that pattern into the parts you need the most.
