Advanced Wildcard Patterns Most People Don’t Know
If you’re a macOS Terminal or Linux shell user you may be familiar with simple wildcard patterns. But most people don’t know the more advanced patterns, which occupy a kind of netherworld between simple wildcards and their more complex cousins, regular expressions. In this article you’ll learn about these patterns and their hidden power.
In my article Think You Understand Wildcards? Think Again, I explored the hidden complexity behind some of the apparently simple wildcard patterns many people are familiar with. It showed that a deeper understanding of how the shell interprets wildcards can help explain sometimes confusing behaviour. If you have not read the article I strongly suggest taking a look at it now to make sure you understand how wildcard expansion works before continuing.
This article takes that understanding a step further and examines some of the lesser-used wildcard — formally, globbing — patterns available in the Unix shell.
The Simple Patterns: * and ?
You should already be familiar with the simple wildcard patterns ? and *. To recap, ? matches any single character, and * matches any number of characters (including zero). When the shell sees either of these characters unquoted and unescaped in a command line argument, it attempts to expand the argument by interpreting it as a path and matching the wildcard to all possible files in the path. The resulting set of file paths is then sent to the target command as a list of arguments.
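To watch this expansion happen in isolation, here is a minimal sketch. It assumes bash and works in a throwaway directory created with mktemp; the file names are the ones used in the examples later in this article, and the variable names are purely illustrative.

```shell
#!/usr/bin/env bash
# Work in a scratch directory so the sketch never touches real files.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch cat d0g dawg dg dig dog doug dug

# The shell expands each pattern before any command runs; here the
# expanded names are captured in arrays rather than passed to echo.
three_letter=(d?g)    # ? must match exactly one character
any_length=(d*g)      # * matches any run of characters, even none

printf '%s\n' "${three_letter[*]}"
printf '%s\n' "${any_length[*]}"

# Clean up the scratch directory.
cd / && rm -rf "$scratch"
```

Note that dg appears only in the second list: * is happy to match nothing at all, whereas ? insists on exactly one character.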
Now let’s go beyond this and explore some of the more advanced patterns.
Match Character Set: [...]
Square brackets can contain a character set or range. This pattern matches a single character from the specified set. If you were to specify [aeiou], for instance, then that would match any single vowel.
Here’s an example. Suppose we have these files in the current directory:-
$ ls
cat d0g dawg dg dig dog doug dug
Match a single character from the set [aeiou]:-
$ echo d[aeiou]g
dig dog dug
As well as sets of characters, you can also specify contiguous ranges using the form [start-end]. The range is inclusive, as shown below:-
$ echo d[a-o]g
dig dog
And say we had a directory containing some numbered files like this:-
$ ls
index.txt report.txt report1.txt report2.txt report3.txt report4.txt report5.txt
Since numbers are just characters, for single-digit ranges you can also do this:-
$ echo report[0-9].txt
report1.txt report2.txt report3.txt report4.txt report5.txt
You cannot match more (or less) than one character using the square bracket notation. To match files with two digits, you’d have to specify the range twice. This is one of the limitations of this pattern. Wildcards are nowhere near as powerful as their bigger cousins, regular expressions, since they are designed for matching similar-looking filenames, not general text pattern matching.
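As a rough illustration of specifying the range twice, the sketch below assumes bash and a scratch directory. The two- and three-digit report files are hypothetical additions that do not exist in the reports directory above.

```shell
#!/usr/bin/env bash
scratch=$(mktemp -d)
cd "$scratch" || exit 1
# Hypothetical files: a mix of one-, two- and three-digit report numbers.
touch report.txt report1.txt report5.txt report10.txt report12.txt report123.txt

one_digit=(report[0-9].txt)        # exactly one digit before .txt
two_digits=(report[0-9][0-9].txt)  # the bracket repeated: exactly two digits

printf 'one digit:  %s\n' "${one_digit[*]}"
printf 'two digits: %s\n' "${two_digits[*]}"

cd / && rm -rf "$scratch"
```

Each bracket expression accounts for exactly one character, so report123.txt is matched by neither pattern.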
Match Inverse Character Set: [^...] or [!...]
Closely related to the previous pattern, and equally familiar if you’re fluent in regexes, is the inverse character set match pattern.
Suppose you wanted to see all the files that do not start with d? You can do it easily using the inverse character range construct [!...] or [^...], which matches any character except the ones in the set. For example, in the dogs directory:-
$ ls
cat d0g dawg dg dig dog doug dug
To find matches of the form d?g but with no vowel:-
$ echo d[^aeiou]g
d0g
The inverse form is otherwise syntactically identical to the normal range pattern. For example, in the reports directory you can exclude ranges like this:-
$ echo report[!1-3].txt
report4.txt report5.txt
Note that ^ and ! mean exactly the same thing here. Of the two, ! is the form specified by POSIX; ^ is accepted by bash and most other modern shells.
More practically, if you wanted to match all text files that don’t end with a number, this pattern does the job:-
$ echo *[^0-9].txt
index.txt report.txt
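This works because * matches any run of characters, including none, which leaves the bracket expression to test the single character immediately before .txt. In report.txt, for example, * matches repor and the bracket expression matches the final t.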
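You can experiment with bracket patterns without creating any files, because the shell's case statement applies the same pattern syntax to plain strings. The sketch below assumes bash; classify is a hypothetical helper written for this illustration, not a standard command.

```shell
#!/usr/bin/env bash
# classify is a hypothetical helper: it reports whether the character
# immediately before ".txt" is a digit, using wildcard patterns only.
classify() {
  case $1 in
    *[!0-9].txt) echo "no trailing digit" ;;
    *[0-9].txt)  echo "trailing digit" ;;
    *)           echo "not matched" ;;
  esac
}

classify report.txt
classify report4.txt
classify index.txt
classify notes.md
```

Since case tries its patterns in order and stops at the first match, this is a convenient way to check which pattern a given name would satisfy before using it on real files.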
Brace Expansion: {...}
Brace expansion can also be used in wildcard expansion, but it works in a fundamentally different way to the other patterns you’ve seen so far. It is probably the least-known of all the globbing patterns, and also the least understood.
On the face of it, it does look similar. Brace expansion — also called alternation — matches on any of a series of sub-patterns you specify. To take our vowel matching example from earlier:-
$ echo d{a,e,i,u,o}g
dag deg dig dug dog
At first glance it might seem like a long-winded way of using a character range, but look closely: two filenames we do not have in the directory, dag and deg, appear in the list. How is that?
To make things slightly clearer (hopefully), run the same pattern with ls instead of echo:-
$ ls d{a,e,i,u,o}g
ls: dag: No such file or directory
ls: deg: No such file or directory
dig dog dug
So ls has correctly reported no such file for dag and deg, but why did it even consider them? The answer is that brace expansion works differently to normal wildcards, in that the shell expands the braces before even looking for files: it actually generates all the permutations of the pattern you specify and then performs wildcard expansion on the results.
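The difference is easiest to see side by side. This is a small sketch, assuming bash and a scratch directory that contains only three of the five possible names; the array names are illustrative.

```shell
#!/usr/bin/env bash
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch dig dog dug   # note: no dag or deg

from_braces=(d{a,e,i,o,u}g)   # generated text: five words, files or not
from_brackets=(d[aeiou]g)     # filename match: only names that exist

echo "braces:   ${#from_braces[@]} words: ${from_braces[*]}"
echo "brackets: ${#from_brackets[@]} words: ${from_brackets[*]}"

cd / && rm -rf "$scratch"
```

The brace form always yields five words because it never consults the directory, while the bracket form yields only the three names that actually exist.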
Here’s another pattern, that may at first seem fairly pointless:-
$ echo d{a..z}g
dag dbg dcg ddg deg dfg dgg dhg dig djg dkg dlg dmg dng dog dpg dqg drg dsg dtg dug dvg dwg dxg dyg dzg
When used inside the braces like this, the double-dot is a range operator just like the [a-z] we met earlier. But because it’s a brace expansion, all of its permutations are expanded by the shell. So the pattern {a..z} expands out to the complete lower-case alphabet, and does so regardless of whether it matches any files or not.
One useful thing this does allow you to do is specify more intelligent ranges, and use them in different ways. Going back to our reports directory from earlier, you saw there were potential problems creating wildcards for files with more than one digit in the name. Recall that the directory had the following files:-
$ ls
index.txt report.txt report1.txt report2.txt report3.txt report4.txt report5.txt
Say you needed to create some more report files, numbered from 6 up to 20. Using brace expansion it’s easy to generate the text:-
$ echo report{6..20}.txt
report6.txt report7.txt report8.txt report9.txt report10.txt report11.txt report12.txt report13.txt report14.txt report15.txt report16.txt report17.txt report18.txt report19.txt report20.txt
Because brace expansion generates text to fit the pattern, you can use it in the arguments to another command, say touch. Each generated name is passed as a separate argument, so touch will create the files named report6.txt to report20.txt in the directory:-
$ touch report{6..20}.txt
Now if you run ls you’ll see the files report6.txt to report20.txt have been created.
You can also filter using ls, and the numeric ranges you can use are much more flexible than the single digit character classes from earlier:-
$ ls report{8..12}.txt
report10.txt report11.txt report12.txt report8.txt report9.txt
This level of flexibility is not possible using simple wildcard globbing patterns.
And best of all, now you’re finished with the new report files, you can delete them just as easily:-
$ rm report{6..20}.txt
$ ls
index.txt report.txt report1.txt report2.txt report3.txt report4.txt report5.txt
You’d be wise to always run echo on a brace expansion before passing its result to a destructive command like rm, just as a dry run to check for silly mistakes!
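One way to make the dry run a little more systematic is to expand the pattern into an array, inspect it, and only then hand the very same words to rm. The following is a sketch under the assumption that you are using bash; it works in a scratch directory, and the targets and remaining names are illustrative.

```shell
#!/usr/bin/env bash
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch report{1..20}.txt

# Dry run: expand the pattern once and look at what it produced.
targets=(report{6..20}.txt)
printf 'would remove %d files, from %s to %s\n' \
  "${#targets[@]}" "${targets[0]}" "${targets[${#targets[@]}-1]}"

# Only after checking the list, pass exactly those words to rm.
rm -- "${targets[@]}"

remaining=(report*.txt)
echo "left: ${remaining[*]}"

cd / && rm -rf "$scratch"
```

Because the array is expanded only once, what you inspected is guaranteed to be what gets deleted, which is not quite true if you retype the pattern after the echo.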
Alternation: {a,b}
One of the common uses for brace expansion is searching for multiple options (hence its other name alternation):-
$ echo {cat,dog}
cat dog
Where things can get a little complicated is when normal wildcards are mixed with alternation. What happens when a * wildcard is used with a brace expansion?
$ echo {cat,d*}
cat d0g dawg dg dig dog doug dug
You may ask how echo ended up showing only valid files if the brace expansion happens before looking for files? Why didn’t it show the infinite possibilities implied by d*?
The answer is that * is a regular wildcard pattern and so it was processed after the brace expansion, and therefore applied to the filenames in the current directory, before it got to the echo command. This is in contrast to the earlier examples which listed all permutations of the brace expansion.
To illustrate, what the above example shows is a brace expansion that produces two permutations:-
cat
and
d*
These are then processed by the shell, and the second pattern d* is handled as a regular wildcard, so it’s then matched against the files in the current working directory. The process is illustrated in Fig. 1.
Fig. 1: Combined brace expansion and asterisk wildcard expansion
If the wildcard expansion fails to match, the argument list would be:-
cat d*
Again, this is because the brace expansion will always succeed, regardless of whether the resulting wildcards match anything or not.
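The unmatched case can be reproduced with the sketch below. It assumes bash specifically: leaving an unmatched pattern in place is bash's default behaviour, and nullglob is a bash option (set with shopt) that drops such patterns instead. Other shells may behave differently, so treat this as an illustration rather than a portable recipe.

```shell
#!/usr/bin/env bash
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch cat          # nothing in this directory starts with "d"

shopt -u nullglob  # bash's default: an unmatched pattern is left as-is
default_args=({cat,d*})

shopt -s nullglob  # bash-specific: an unmatched pattern expands to nothing
nullglob_args=({cat,d*})
shopt -u nullglob

echo "default:  ${default_args[*]}"
echo "nullglob: ${nullglob_args[*]}"

cd / && rm -rf "$scratch"
```

In both cases the brace expansion produces the two words cat and d*; only the later wildcard step differs, which is why cat survives either way.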
Practical Examples
A common use case for alternation is looking for different file types. Say you had the following files:-
$ ls
family1.jpg family2.jpeg family4.mov holiday1.png holiday3.m4v
family1.png family3.jpeg holiday1.jpg holiday2.jpg holiday4.mov
Notice there are a lot of different file extensions here. If you wanted to search for only the image files, you could use alternation as follows:-
$ ls {*.jpg,*.jpeg,*.png}
family1.jpg family2.jpeg holiday1.jpg holiday2.jpg
family1.png family3.jpeg holiday1.png
If you were trying to be clever, you could use the equally valid (but harder to understand) pattern:-
$ ls *.{j{p,pe}g,png}
family1.jpg family2.jpeg holiday1.jpg holiday2.jpg
family1.png family3.jpeg holiday1.png
This is an example of a nested alternation: the inner braces {p,pe} are evaluated first, and then the outer braces, and finally the wildcard match. Again, you can use echo to check your pattern and list the results of the brace expansion:-
$ echo {j{p,pe}g,png}
jpg jpeg png
You can use this kind of nested pattern to generate fairly complex patterns. In the example below, a relatively concise and readable pattern generates matches for a variety of media file extensions.
$ echo .{mp{3..4},m4{a,b,p,v}}
.mp3 .mp4 .m4a .m4b .m4p .m4v
Wildcards vs. Regular Expressions
Readers familiar with regular expressions will no doubt have noticed some marked similarities between the syntax of wildcard patterns and regular expressions.
Wildcards were originally called globbing patterns and their origin dates back to the earliest version of Unix in 1971. Back then the shell did not expand globbing patterns but rather a command called glob did the job. At that time only the * and ? patterns were recognised.
Most obvious is the simple * asterisk pattern, which has similar behaviour in both syntaxes. Formally, it is known as a Kleene star after the mathematician Stephen Kleene who came up with the concept back in the 1950s. In regular expressions, a* means “match a zero or more times”, but in a wildcard it means “match any character zero or more times”. It seems the wildcard meaning must have been inspired by the regular expression.
Likewise, the square bracket character set [abc...] and character range [a-z] notations are even more similar to their regex counterparts. Even the use of the caret ^ as the inversion character in the inverted character match [^...] pattern seems to be taken directly from regular expressions.
? has a totally different meaning between wildcard and regex syntax, and the { and } characters also mean something completely different.
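The different meaning of ? can be demonstrated directly in bash, whose [[ ... ]] test supports both syntaxes: == compares against a wildcard pattern and =~ against an extended regular expression. This is only a sketch assuming bash; compare is a hypothetical helper and the two pattern variables are illustrative.

```shell
#!/usr/bin/env bash
glob_pattern='d?g'      # wildcard: d, then exactly one character, then g
regex_pattern='^d?g$'   # regex: an optional d, then g

# compare is a hypothetical helper that tests one word against both patterns.
compare() {
  local word=$1 glob_result="no match" regex_result="no match"
  [[ $word == $glob_pattern ]] && glob_result="match"
  [[ $word =~ $regex_pattern ]] && regex_result="match"
  echo "$word: wildcard=$glob_result regex=$regex_result"
}

compare dog
compare dg
```

The two syntaxes give opposite answers for both words: the wildcard accepts dog and rejects dg, while the regular expression does the reverse.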
Like passing similarities between some human languages, you know some of the syntax of wildcards and regular expressions must share a common origin but it’s more of a curiosity with little practical value.