Advanced Wildcard Patterns Most People Don’t Know


If you’re a macOS Terminal or Linux shell user you may be familiar with simple wildcard patterns. But most people don’t know the more advanced patterns, which occupy a kind of netherworld between simple wildcards and their more complex cousins, regular expressions. In this article you’ll learn about these patterns and their hidden power.

In my article Think You Understand Wildcards? Think Again, I explored the hidden complexity behind some of the apparently simple wildcard patterns many people are familiar with. It showed that a deeper understanding of how the shell interprets wildcards can help explain sometimes confusing behaviour. If you have not read the article I strongly suggest taking a look at it now to make sure you understand how wildcard expansion works before continuing.

This article takes that understanding a step further and examines some of the lesser-used wildcard — formally, globbing — patterns available in the Unix shell.

The Simple Patterns: `*` and `?`

You should already be familiar with the simple wildcard patterns ? and *. To recap, ? matches any single character, and * matches any number of characters (including zero). When the shell sees either of these characters unquoted and unescaped in a command line argument, it attempts to expand the argument by interpreting it as a path and matching the wildcard to all possible files in the path. The resulting set of file paths is then sent to the target command as a list of arguments.
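As a quick refresher, here is a minimal sketch of both patterns in action. It assumes bash and a throwaway directory made with mktemp; the file names are just sample data chosen for illustration, and LC_ALL=C is set only to keep the sort order predictable.

```shell
#!/usr/bin/env bash
# Sketch only: the scratch directory and file names are illustrative assumptions.
LC_ALL=C                        # pin the sort order so the output is predictable
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch cat d0g dawg dg dig dog doug dug

one_char=$(echo d?g)            # ? must match exactly one character
any_run=$(echo d*g)             # * matches any run of characters, including none

echo "d?g -> $one_char"         # d?g -> d0g dig dog dug
echo "d*g -> $any_run"          # d*g -> d0g dawg dg dig dog doug dug

cd "$OLDPWD" && rm -rf "$workdir"
```

Notice that dg only appears for d*g, because * is allowed to match nothing at all while ? is not.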

Now let’s go beyond this and explore some of the more advanced patterns.

Match Character Set: `[...]`

Square brackets can contain a character set or range. This pattern matches a single character from the specified set. If you were to specify [aeiou] for instance then that would match any single vowel.

Here’s an example. Suppose we have these files in the current directory:-

$ ls
cat d0g dawg dg dig dog doug dug

Match a single character from the set [aeiou]:-

$ echo d[aeiou]g
dig dog dug

As well as sets of characters, you can also specify contiguous ranges using the form [start-end]. The range is inclusive, as shown below:-

$ echo d[a-o]g
dig dog

And say we had a directory containing some numbered files like this:-

$ ls
index.txt report.txt report1.txt report2.txt report3.txt report4.txt report5.txt

Since numbers are just characters, for single-digit ranges you can also do this:-

$ echo report[0-9].txt
report1.txt report2.txt report3.txt report4.txt report5.txt

You cannot match more (or fewer) than one character using the square bracket notation. To match files with two digits, you’d have to specify the range twice, as in report[0-9][0-9].txt. This is one of the limitations of this pattern. Wildcards are nowhere near as powerful as their bigger cousins, regular expressions, since they are designed for matching similar-looking filenames, not general text pattern matching.
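Here is a small sketch of the "specify the range twice" approach. It assumes bash and a scratch directory; the particular report numbers are made up for the example.

```shell
#!/usr/bin/env bash
# Sketch only: the scratch directory and report numbers are illustrative assumptions.
LC_ALL=C                                  # pin the sort order
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch report.txt report3.txt report7.txt report10.txt report12.txt

one_digit=$(echo report[0-9].txt)         # exactly one digit before .txt
two_digits=$(echo report[0-9][0-9].txt)   # exactly two digits before .txt

echo "one digit:  $one_digit"             # one digit:  report3.txt report7.txt
echo "two digits: $two_digits"            # two digits: report10.txt report12.txt

cd "$OLDPWD" && rm -rf "$workdir"
```

Each bracket pair consumes exactly one character, so neither pattern picks up report.txt.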

Match Inverse Character Set: `[^...]` or `[!...]`

Closely related to the previous pattern, and equally familiar if you’re fluent in regexes, is the inverse character set match pattern.

Suppose you wanted to see all the files that do not start with d. You can do it easily using the inverse character set construct [!...] or [^...], which matches any single character except the ones in the set, so the pattern [!d]* would do the job. The same construct works anywhere in a pattern. For example, in the dogs directory:-

$ ls
cat d0g dawg dg dig dog doug dug

To find matches of the form d?g but with no vowel:-

$ echo d[^aeiou]g
d0g

The inverse form is otherwise syntactically identical to the normal range pattern. For example, in the reports directory you can exclude ranges like this:-

$ echo report[!1-3].txt
report4.txt report5.txt

Note that ^ and ! are semantically identical: they mean the same thing.
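If you want to convince yourself, the sketch below runs both spellings side by side. It assumes bash (the ! form is the more portable of the two across shells) and recreates the dogs directory in a scratch location.

```shell
#!/usr/bin/env bash
# Sketch only: recreates the dogs directory in a scratch location (an assumption).
LC_ALL=C                          # pin the sort order
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch cat d0g dawg dg dig dog doug dug

bang=$(echo d[!aeiou]g)           # inverse set written with !
caret=$(echo d[^aeiou]g)          # the same set written with ^
not_d=$(echo [!d]*)               # anything that does not start with d

echo "with !: $bang"              # with !: d0g
echo "with ^: $caret"             # with ^: d0g
echo "not starting with d: $not_d"   # not starting with d: cat

cd "$OLDPWD" && rm -rf "$workdir"
```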

More practically, if you wanted to match all text files that don’t end with a number, this pattern does the job:-

$ echo *[^0-9].txt
index.txt report.txt

The power here comes from combining the two patterns: the * matches any run of characters at the start of the name, so the only real constraint is that the character immediately before .txt must not be a digit.

Brace Expansion: `{...}`

Brace expansion can also be used in wildcard expansion, but it works in a fundamentally different way to the other patterns you’ve seen so far. It is probably the least-known of all the globbing patterns, and also the least understood.

On the face of it, it does look similar. Brace expansion — also called alternation — matches on any of a series of sub-patterns you specify. To take our vowel matching example from earlier:-

$ echo d{a,e,i,u,o}g
dag deg dig dug dog

At first glance it might seem like a long-winded way of using a character range, but look closely: two filenames we do not have in the directory, dag and deg, appear in the list. How is that?

To make things slightly clearer (hopefully), run the same pattern with ls instead of echo:-

$ ls d{a,e,i,u,o}g
ls: dag: No such file or directory
ls: deg: No such file or directory
dig dog dug

So ls has correctly reported no such file for dag and deg, but why did it even consider them? The answer is that brace expansion works differently to normal wildcards, in that the shell expands the braces before even looking for files: it actually generates all the permutations of the pattern you specify and then performs wildcard expansion on the results.
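The difference is easy to see if you capture both kinds of expansion and compare them. This is a rough sketch assuming bash and a scratch copy of the dogs directory; the array names are just for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: scratch directory and variable names are illustrative assumptions.
LC_ALL=C                          # pin the sort order
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch cat d0g dawg dg dig dog doug dug

generated=(d{a,e,i,o,u}g)         # brace expansion: pure text generation
matched=(d[aeiou]g)               # wildcard: only names that exist on disk

echo "braces gave ${#generated[@]} words: ${generated[*]}"   # 5 words: dag deg dig dog dug
echo "wildcard gave ${#matched[@]} files: ${matched[*]}"     # 3 files: dig dog dug

# Work out which generated words have no file behind them.
missing=()
for name in "${generated[@]}"; do
  [ -e "$name" ] || missing+=("$name")
done
echo "generated but not on disk: ${missing[*]}"              # dag deg

cd "$OLDPWD" && rm -rf "$workdir"
```

The two words with no file behind them are exactly the ones ls complained about.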

Here’s another pattern, that may at first seem fairly pointless:-

$ echo d{a..z}g
dag dbg dcg ddg deg dfg dgg dhg dig djg dkg dlg dmg dng dog dpg dqg drg dsg dtg dug dvg dwg dxg dyg dzg

When used inside the braces like this, the double-dot is a range operator just like the [a-z] we met earlier. But because it’s a brace expansion, all of its permutations are expanded by the shell. So the pattern {a..z} expands out to the complete lower-case alphabet, and does so regardless of whether it matches any files or not.

One useful thing this does allow you to do is specify more intelligent ranges, and use them in different ways. Going back to our reports directory from earlier, you saw there were potential problems creating wildcards for files with more than one digit in the name. Recall that the directory had the following files:-

$ ls
index.txt report.txt report1.txt report2.txt report3.txt report4.txt report5.txt

Say you needed to create some more report files, numbered from 6 up to 20. Using brace expansion it’s easy to generate the text:-

$ echo report{6..20}.txt
report6.txt report7.txt report8.txt report9.txt report10.txt report11.txt report12.txt report13.txt report14.txt report15.txt report16.txt report17.txt report18.txt report19.txt report20.txt

Because brace expansion generates text to fit the pattern, if you send its output to another command, say touch, it will be treated as a list of arguments and touch will create the files named report6.txt to report20.txt in the directory:-

$ touch report{6..20}.txt

Now if you run ls you’ll see the files report6.txt to report20.txt have been created.

You can also filter using ls, and the numeric ranges you can use are much more flexible than the single digit character classes from earlier:-

$ ls report{8..12}.txt
report10.txt report11.txt report12.txt report8.txt report9.txt

This level of flexibility is not possible using simple wildcard globbing patterns.

And best of all, now you’re finished with the new report files, you can delete them just as easily:-

$ rm report{6..20}.txt
$ ls
index.txt report.txt report1.txt report2.txt report3.txt report4.txt report5.txt

You’d be wise to always run echo on a brace expansion before passing its result to a destructive command like rm, just as a dry run to check for silly mistakes!
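One way to make the dry run a habit is to expand the pattern once into an array, inspect it, and then hand that same array to rm. The sketch below assumes bash and a scratch directory; the targets array is a name made up for this example.

```shell
#!/usr/bin/env bash
# Sketch only: scratch directory and the targets array are illustrative assumptions.
LC_ALL=C                                  # pin the sort order
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch index.txt report.txt report{1..20}.txt

targets=(report{6..20}.txt)               # expand once, reuse the same list
echo "would remove ${#targets[@]} files: ${targets[*]}"   # the dry run
rm -- "${targets[@]}"                     # delete exactly what was just shown

remaining=$(echo *)
echo "left behind: $remaining"

cd "$OLDPWD" && rm -rf "$workdir"
```

Because rm receives the array rather than a retyped pattern, what gets deleted is exactly what the echo reported.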

Alternation: `{a,b}`

One of the common uses for brace expansion is searching for multiple options (hence its other name alternation):-

$ echo {cat,dog}
cat dog

Where things can get a little complicated is when normal wildcards are mixed with alternation. What happens when a * wildcard is used with a brace expansion?

$ echo {cat,d*}
cat d0g dawg dg dig dog doug dug

You may ask how echo ended up showing only valid files if the brace expansion happens before looking for files? Why didn’t it show the infinite possibilities implied by d*?

The answer is that * is a regular wildcard pattern and so it was processed after the brace expansion, and therefore applied to the filenames in the current directory, before it got to the echo command. This is in contrast to the earlier examples which listed all permutations of the brace expansion.

To illustrate, what the above example shows is a brace expansion that produces two permutations:-

cat

and

d*

These are then processed by the shell, and the second pattern d* is handled as a regular wildcard, so it’s then matched against the files in the current working directory. The process is illustrated in Fig. 1.

Fig. 1: Combined brace expansion and asterisk wildcard expansion

If the wildcard expansion fails to match, the argument list would be:-

cat
d*

Again, this is because the brace expansion will always succeed, regardless of whether the resulting wildcards match anything or not.
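You can check the failed-match case for yourself in a directory that has nothing starting with d. This sketch assumes bash with its default options (in particular, nullglob and failglob left switched off) and a scratch directory containing a single file.

```shell
#!/usr/bin/env bash
# Sketch only: assumes bash defaults (nullglob and failglob off) and a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch cat                         # deliberately nothing that starts with d

args=({cat,d*})                   # braces expand first, then d* finds no match
printf 'argument: %s\n' "${args[@]}"
# argument: cat
# argument: d*

cd "$OLDPWD" && rm -rf "$workdir"
```

The unmatched d* is passed through literally, asterisk and all, which is why a command like ls would then complain that no file called d* exists.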

Practical Examples

A common use case for alternation is looking for different file types. Say you had the following files:-

$ ls
family1.jpg family2.jpeg family4.mov holiday1.png holiday3.m4v
family1.png family3.jpeg holiday1.jpg holiday2.jpg holiday4.mov

Notice there are a lot of different file extensions here. If you wanted to search for only the image files, you could use alternation as follows:-

$ ls {*.jpg,*.jpeg,*.png}
family1.jpg family2.jpeg holiday1.jpg holiday2.jpg
family1.png family3.jpeg holiday1.png

If you were trying to be clever, you could use the equally valid (but harder to understand) pattern:-

$ ls *.{j{p,pe}g,png}
family1.jpg family2.jpeg holiday1.jpg holiday2.jpg
family1.png family3.jpeg holiday1.png

This is an example of a nested alternation: the inner braces {p,pe} are evaluated first, and then the outer braces, and finally the wildcard match. Again, you can use echo to check your pattern and list the results of the brace expansion:-

$ echo {j{p,pe}g,png}
jpg jpeg png

You can use this kind of nested pattern to generate fairly complex patterns. In the example below, a relatively concise and readable pattern generates matches for a variety of media file extensions.

$ echo .{mp{3..4},m4{a,b,p,v}}
.mp3 .mp4 .m4a .m4b .m4p .m4v
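To tie the practical examples together, here is a hedged sketch of the nested image pattern driving a loop. It assumes bash and a scratch copy of the media directory from earlier; the images array is a name invented for the example.

```shell
#!/usr/bin/env bash
# Sketch only: scratch directory and the images array are illustrative assumptions.
LC_ALL=C                          # pin the sort order within each alternative
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch family1.jpg family2.jpeg family4.mov holiday1.png holiday3.m4v \
      family1.png family3.jpeg holiday1.jpg holiday2.jpg holiday4.mov

images=()
for f in *.{j{p,pe}g,png}; do     # becomes *.jpg *.jpeg *.png before matching
  [ -e "$f" ] || continue         # skip any alternative that matched nothing
  images+=("$f")
done

echo "${#images[@]} images: ${images[*]}"
# 7 images: family1.jpg holiday1.jpg holiday2.jpg family2.jpeg family3.jpeg family1.png holiday1.png

cd "$OLDPWD" && rm -rf "$workdir"
```

The results come out grouped by extension rather than fully sorted, because each alternative is a separate wildcard that the shell matches and sorts on its own.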

Wildcards vs. Regular Expressions

Readers familiar with regular expressions can’t help but have noticed some marked similarities between the syntax of wildcard patterns and regular expressions.

Wildcards were originally called globbing patterns and their origin dates back to the earliest version of Unix in 1971. Back then the shell did not expand globbing patterns but rather a command called glob did the job. At that time only the * and ? patterns were recognised.

Most obvious is the simple * asterisk pattern, which has similar behaviour in both syntaxes. Formally, it is known as a Kleene star after the mathematician Stephen Kleene who came up with the concept back in the 1950s. In regular expressions, a* means “match a zero or more times”, but in a wildcard it means “match any character zero or more times”. It seems the wildcard meaning must have been inspired by the regular expression.

Likewise, the square bracket character set [abc...] and character range [a-z] notations are even more similar to their regex counterparts. Even the use of the caret ^ as the inversion character in the inverted character match [^...] pattern seems to be taken directly from regular expressions.

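? has a totally different meaning between wildcard and regex syntax (in a regex it makes the preceding item optional, rather than matching any single character), and the { and } characters also mean something completely different: in a regex they specify a repetition count, as in a{2,3}.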
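The difference in the meaning of ? can be sketched with bash's [[ ]] test, which understands both syntaxes: == compares against a wildcard pattern and =~ against a regular expression. The compare function below is a hypothetical helper written for this illustration, and the regex is anchored with ^ and $ so that it has to match the whole word, as a wildcard does.

```shell
#!/usr/bin/env bash
# Sketch only: compare is a hypothetical helper, not a standard command.
glob='d?g'            # wildcard: d, then any one character, then g
regex='^d?g$'         # regex: an optional d, then g, and nothing else

compare() {
  local word=$1 as_glob=no as_regex=no
  [[ $word == $glob ]] && as_glob=yes     # unquoted right side is a wildcard pattern
  [[ $word =~ $regex ]] && as_regex=yes   # =~ treats the right side as a regex
  echo "$word: glob=$as_glob regex=$as_regex"
}

compare dog           # dog: glob=yes regex=no
compare dg            # dg: glob=no regex=yes
```

The same three characters accept completely different words depending on which syntax is reading them.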

Like passing similarities between some human languages, you know some of the syntax of wildcards and regular expressions must share a common origin but it’s more of a curiosity with little practical value.


Lee

A veteran programmer, evolved from the primordial soup of 1980s 8-bit game development. I started coding in hex because I couldn't afford an assembler. Later, my software helped drill the Channel Tunnel, and I worked on some of the earliest digital mobile phones. I was making mobile apps before they were a thing, and I still am.