A PowerShell users' guide to regular expressions: part 1 (of 3).
It’s not surprising that people can be afraid of Regular Expressions. Even
experts can be a little intimidated by reading through a script and meeting something like
^\{?[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}\}?$
So here are three pieces of good news.
- Monsters like that one make up less than 1% of all regular expressions you will meet;
"this"
is also a valid regular expression and a more typical one. - It only takes few minutes reading to start doing useful work with regular expressions.
- It won’t take much longer to understand that monster. Let’s start at the beginning.
What IS a Regular Expression?
Regular expressions describe patterns in text. If we can describe text of
interest, we can ask “Does that text - or part of it - match this pattern” - possibly with the
follow up of “how does it match?” and we can say “Do something where the text
matches”.
.NET has a [regex]
class and, built on that, PowerShell has a Select-String
command (like Unix’s popular grep
)
and operators -Match
, -Replace
and -Split
(with explicitly case sensitive
and insensitive versions prefixed with c
and i
respectively).
Can’t I just use wildcards?
Wildcards work with simple patterns – for example “list
files with names containing ‘temp’.” and work best when the input already
has some constraints - as file names do. There is little
difference between using PowerShell’s regex-based -match
operator like this:
$name -match 'temp'
and using the wildcard-based -like
operator:
$name -like '*temp*'
When I said simple patterns, I meant things like “temp
with any other stuff”; wildcards
can’t apply much qualification to the other stuff. PowerShell uses the [WildcardPattern]
.NET class which supports more metacharacters than ?
for “any single character” and *
for “any string of characters”. Linux users can use
ls [wxyz]*
to list files that begin with w, x, y or z, and this works in PowerShell too (but not in the the old cmd.exe shell ). These sets allow
$name -like '*temp[1-9]
for “temp
followed by a single digit”, but there is no way to write “temp
with 2,3 or 4 digits”.
The use of []
means that if you test something like:
'[MyText]' -like '[*]'
it will return false
. The -like
operator needs to see a backtick character `
to tell
it to escape [ , ] , *, ? and ` itself, and there is a trap:
"This is not a question!" -like '*`?'
Returns false
. The text does not end with the ‘?’ character. But
"This is not a question!" -like "*`?"
Returns true
.
In single quotes `?
is just those two characters, but inside double quotes
backtick becomes PowerShell’s escape character, for example `n
means “new
line”, and PowerShell sees "`?"
as “escaped-?” which is simply “?” and the backtick never
reaches the -like
operator. To get the backtick character itself inside double quotes, it must be doubled up.
Changing the quote type affects the meaning in more cases as we will see later.
Regular expressions do more than allow a change from “temp
anything” to “temp
followed by digts” for example:
- “Recognize a GUID as 32 hex digits in 5 dash-separated groups of variable length, with the whole thing optionally wrapped in {}”
- “split this text at semi-colons, removing leading/trailing spaces”
- “keep the
*
part of any filenames which matchtemp*
and replacetemp
part with this” - “Is this proposed password at least 8 characters long with 3 from upper case, lower case, digit and symbol”
At least as early as the 1950s people working on formal language theory were creating notations to write things like “a semi colon, with or without spaces” or “3 groups of 4 hex digits joined with a dash”. The current syntax crystalized in the 1980s, and although there are some implementation-specific extras, the core functionality is used in many places: from search boxes in editors to instructing telephone systems how to route calls. Over the years regexes have been shared to solve many common problems: when I needed to recognize GUIDs in a recent script I didn’t devise my own regex, I found the ‘monster’ that appeared at the start of this post.
Moving to regex, starting with the beginning, and the end.
Wildcard patterns try to match the whole text, in PowerShell:
'my temp file' -like 'temp'
returns false
. To get -like
to return true
it needs '*temp*'
to specify “Anything
at the start, then ‘temp’, then anything at the end”.
By contrast, regular expressions, look for a match anywhere so
'my temp file' -match 'temp'
returns true
.
In both of these cases temp
means “the letters ‘t’, ‘e’, ‘m’, and ‘p’, in that order,
once each, with nothing between them”, and in both wildcard and regex, […]
defines a set so [temp]
would mean “a letter ‘t’ or ‘e’ or ‘m’ or
‘p’”.
[…] are the only metacharacters with the same meaning in both worlds. Regex
has more metacharacters and has different meanings for *
and ?
. Backtick
isn’t a metacharacter in regex: \
is used for escape. The table below shows that where wildcards need a *
regular expressions don’t
need anything, and conversely when regex uses “^
” for “start” and “$
” for
“end”, wildcard assumes them.
Wildcard version | Regex version | |
---|---|---|
“Temp” at the start | Temp* |
^Temp |
“Temp” at the end | *Temp |
Temp$ |
“Temp” anywhere | *Temp* |
Temp |
“Temp” exactly | Temp |
^Temp$ |
“Te” anything “mp” | Te*mp |
Te.*mp |
“Temp” anything “.txt” | Temp*.txt |
Temp.*\.txt |
Specifying repetition – the different usage of * and ?
Where Wildcards use “?” for “one of any character”, regular expressions use
“.
” To allow regular expression engines to work on large amounts of text
efficiently they process line by line, and the new-line character is not
included in “any character” by default, so strictly speaking, “.
” is “any character
except new-line”. An expression can be told expressly to look for new-line or
to treat it like other characters, as we will see in part 2.
In regex *
and ?
describe repetition.
‘temp’ specifies one of each letter - we could write t{1,1}e{1,1}m{1,1}p{1,1}
for “t at least once and at most once” and so on, {} describes the number of
occurrences and instead of writing {0,}
for “At least zero, no upper limit”
*
is a shortcut for “any number of”.
So we can convert wildcard to regex by changing begin / end and replacing ?
with .
and replacing *
with .*
. The next table shows other repetition shortcuts:
Explicit form | Concise writing | |
---|---|---|
‘t’ once | t{1,1} |
t |
‘t’ optionally - zero or once | t{0,1} |
t? |
‘t’ repeating at least once | t{1,} |
t+ |
‘t’ repeating even zero times | t{0,} |
t* |
‘t’ repeating at least twice | t{2,} |
|
‘t’ exactly four times | t{4} |
|
‘t’ between 2 and 4 times | t{2,4 } |
Greedy and Lazy
So, +
, *
and ?
specify how many of the previous item – in regex temp*
means ‘t’,’e’,’m’ and maybe no instances of ‘p’, or one instance or a hundred.
Regexes use “.
” for “any character” and “.*
” for “any character(s),
any number of times - even zero” or “.+
” for “any character(s) at least once”.
If what matched matters, then “any number of times” is imprecise. ThePowerShell operation
'This matches' -match 'h.*s'
returns true
but there are three possible ways to satisfy the match: his
, hes
and his matches
.
PowerShell’s -match and -notmatch operators put what matched into the automatic
variable $matches
. Looking at the variable shows the .*
in the last example selects
the text from the first ‘h’ to the last ‘s’ - known as greedy operation, taking minimal
text is known as lazy operation. Most regex engines are greedy by default and ?
specifies lazy behaviour.
In the table above, t?
meant “take a ‘t’, if necessary to make a match” and .*?
means
“only grow the sequence as far as necessary”; using that instead will leave the first
match that can be found in $matches
- this isn’t necessarily the shortest. Sometimes we need to
use the [regex]
class and tell the engine to work from right to left or to look in
the remaining text for more matches. For example
([regex]"h.*?s").Matches("this matches").value
returns “his” and “hes”. Characters can’t be re-used once they have been consumed
in a match, so this example won’t return “his matches” after returning “his”.
Vague definitions can lead to bugs; With filenames, using a wildcard-style temp*
where
a regex-style ^temp.*
was needed will still return the right files, but it may include
some wrong ones. And “.txt” is usually a file extension, an unescaped “.
”, in '.txt'
will match any character before txt
but we might only see extra files when we meet a
filename like “My txt processor.ps1”. The same can happen when we use -replace
and
a greedy .* selects too much text; often debugging is a process of finding out why an
expression includes or excludes things we did not intend.
Specifying classes of characters
“\” is the escape character in regex and an escaped metacharacter – like
\.
matches the character itself. As with double quoted strings in PowerShell,
an escaped “t” is tab, an escaped “n” is new-line and so on, but regex goes a
stage further: \d
is “any digit” \s
is “any space” \w
is “any word
character” (letters, numbers and underscore). Case is critical: uppercase
reverses the meaning so \D
means any non-digit and so on. The same ease of
reversing also applies to sets; [AEIOU]
defines vowels in a set, ^
reverses the
meaning so [^AEIOU]
means any non-vowel including non-letters; we can specify a
set as a range, or “range-excluding”, so [A-Z]
is the 26 letters and [A-Z-[AEIOU]]
is only the consonants.
The handling of unescaped letters in .NET’s [Regex]
class is case-sensitive by default,
but the PowerShell’s operators request insensitivity unless you use their “c” variants; so [A-Z]
may include or exclude lower case letters in different situations.
Sets and classes allow precise statements of which characters should match:
te?mp\d{3,4}\.txt
specifies a match which starts with either “temp” or
“tmp”, (e?
is an optional ‘e’) then has digit (\d
) repeated 3 or 4 times and ends “.txt”. The following table shows common match terms.
Regex syntax | Meaning |
---|---|
a-z 0-9 < > # & " ' ` ! % ; , : @ ~ |
Normal characters |
\\ \. \$ \^ \{ \} \[ \] \( \) \* \? \+ |
Escaped-metacharacters |
\d |
Digits [0-9] |
\D |
Any non-digits (spaces, punctation, letters) |
\w |
Word characters: letters, digits and underscore (see note below) |
\W |
Non word characters, spaces, punctuation and “special” characters |
\b |
The break between sequences of word and non-word characters |
\a |
Start of the text - usually the same as ^ (more in part 2) |
\z |
End of the text - usually the same as $ (more in part 2) |
\1 \k |
Reference back to part of the match (more in part 2) |
\s |
Space characters including tab, carriage-return and new-line |
\S |
Non-spaces |
\p{name} |
A named group defined in unicode -see note below |
\P{name} |
Everything except the unicode named group |
\n \r \t \v \f \e |
Special characters: new-line, return, tab, vertical tab, form-feed and escape |
\x00 |
Characters by their their hex code - x0d is return, x20 is space |
[{}.*()?$] |
A set of characters; note only \ , ^ and ] need to be escaped inside [] |
[^\^] |
The reverse set – for example everything except ^ |
[x-z] |
A range |
[a-cijp-z-[x]] |
Multiple ranges – with exclusions and individual characters |
\a
and \z
are not the first and last characters, but the point before the first or after the last, and
\b
(for “break”) also doesn’t match characters but the place between at least one
non-word character and at least one word character: \btmp
matches “myfile.tmp”
or “my tmp file” but not “my_tmp_file”. Operations like “insert a comma before
the last 3 digits” extend this idea of a zero-length place in the text as we’ll see in part 2.
Finer grain groupings
The table above includes \p{...}
for unicode groups: an identifier appears inside the {}
it can be a named block or a short code. The codes are an upper case letter with an optional additional letter,for example \p{P} is all punctuation and \P{Pd} is just dashes: the codes are summarized as follows.
- C=Control characters. Cc=control, Cf=format, Cs=surrogate, Co=PrivateUse, Cn=NotAssigned
- L=Letter characters. Lu=Uppercase , Ll=lowercase, Lt=titleCase, Lm=Modifier, Lo=Other
- M=Diacritic marks. Mn=Nonspacing, Mc=spacingCombining, Me=enclosing
- N=Numbers. Nd=DecimalDigit, Nl=letter, No=other
- P=Punctuation. Pc=Connector, Pd=Dash, Ps=start, Pe=end, Pi=initialQuote, Pf=finalquote, Po=other
- S=Symbols. Sm=Math, Sc=Currency, Sk=modifier, So=other
- Z=Separator characters. Zs=space, Zl=line, Zp=paragraph
The named blocks mean a test like "Ω" -match "\p{IsGreek}"
will return true
.
Internally the word / non-word, digit / non-digit and space / non-space classes are defined as combinations of the p{...}
groups. “Letters” includes more than [A-Za-z]
- taking in accented letters and characters from non-Roman scripts. Word characters are not simply 0-9, A-Z and _ but all the unicode letters, digits and connector punctuation; connectors include more than just underscore, and digits include not only [0-9]
but their equivalents in other scripts.
The distinction is worth remembering, checking for characters we don’t want in an email address might use a more restrictive set than validating a person’s display name.
Let’s review what we have seen in this part.
- Without
^
for “start” and/or$
for “end” the pattern described by a regular expression can be anywhere in the text. This contrasts with Wildcards which try to match the whole text. \
is the escape character. As well as defining special characters (like\n
for new-line) it is used to define classes of characters, so\d
means “any digit”\s
“any space” and\w
“any word character”; uppercase makes these non-digit, non-space and non-word,- Characters to match can be :
- Normal characters (non-metacharacters),
- Escaped metacharacters (like
\$
), and control characters (like\n
or\x0a
) - A character class (like
\d
) or unicode block (like\p{N}
) - One of a set of characters wrapped in []
- “.” for “any character except new-line”
- PowerShell’s
-match
,-replace
and-split
operators are case-insensitive when they consider normal characters. They have case-sensitive counterparts, and most regex engines – including the one provided by .NET - are case-sensitive by default. - RegEx uses
?
,+
and*
for optional and repeated terms and{}
to specify exact numbers of occurrences. - Default “greedy” behaviour means
*
and+
gather as much as possible.?
specifies “laziness” stopping as soon as a match is successful. - There may be more than one match in a block of text, but each character can only belong to a single match.
- PowerShell puts what matched in
-match
or-notmatch
operations into the automatic variable$matches
.
I haven’t explained ( ) - the grouping function it performs covers most of part 2,
but the syntax of Regular expressions uses 3 kinds of brackets and we have seen 7 other metacharacters (\
.
^
, $
, *
, ?
and +
)
and we will see another, “|
”, at the start of part 2. We’ll also see that #
is a normal character but can become a meta character in some situations.
That’s enough to break down the monster I introduced at the start; it read:
^\{?[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}\}?$
. This translates as
^ # Start
\{? # A { sign, optional
[0-9a-f]{8} # A hex digit [0-9 a-f] repeated 8 times
- # A - sign
([0-9a-f]{4}-){3} # The sub expression([hex digit]x4 - ) repeated 3 times
[0-9a-f]{12} # Hex digit repeated 12 times
\}? # A } sign, optional
$ # End
Part 2 will look at grouping and other ways we can control how the expression is processed. But I hope the last example has shown that the knowledge we’ve covered so far is enough to understand even quite complicated regular expressions, even if more practice is needed to be able to sight-read them.