A PowerShell users' guide to regular expressions: part 2 (of 3).
Breaking up text
Part one explained that regular expressions describe patterns in text, and that if we
can describe a pattern, we can test whether some text matches it (for example
using PowerShell’s -match operator), replace the matching parts (with the -replace
operator) and split wherever we find a match (with the -split operator).
I find a lot of benefit in replacing complex string manipulation with a single
regex operation. For example, a series of commands to slice up “2021-2-2” and
turn it into a [datetime] object becomes a single statement:
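As a minimal sketch of what such a statement could look like (the variable names are illustrative, it assumes PowerShell 5 or later for ::new(), and the split is kept on its own line for readability), -split '\D+' picks out the groups of digits whatever separates them and the results go straight to the [datetime] constructor:

```powershell
# Assumed input; '2021/2/2' or '2021.2.2' would be handled the same way.
$text = '2021-2-2'

# Split on runs of non-digits, convert the pieces to integers and
# pass them to the DateTime(year, month, day) constructor.
$year, $month, $day = [int[]]($text -split '\D+')
$date = [datetime]::new($year, $month, $day)

$date.Year    # 2021
$date.Month   # 2
```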
A string slicing implementation which looks for “-” will fail if the input uses
“/” between the digits, but the regex version can be more robust by looking for groups of digits
separated by non-digits. Often these operations to break things up will use the
-split operator, for example:
-split ' ' will break at (and discard) space characters,
-split '[\r\n]+' splits multi-line text on any combination of carriage-return and/or new-line, removing any blank lines,
-split '\D+' splits into groups of digits wherever there are non-digits.
The -split operator discards any text which is included in the match,
and usually a complicated expression defines something we want to keep, so
most uses of -split are with quite simple patterns.
Using the first of the examples above to split ‘This, a sequence of words’
will leave the comma attached to “This,”.
We might use -split '\W' instead, but splitting at each non-word character
splits at both space and comma, putting an extra zero-length item between
“This” and “a”. Changing to -split '\W+' will split every time it
finds at least one non-word character.
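The three variations can be compared side by side. This is only a sketch, and the variable names are invented for the illustration:

```powershell
$text = 'This, a sequence of words'

$bySpace   = $text -split ' '     # the comma stays attached to the first item
$byNonWord = $text -split '\W'    # comma and space each cause a split
$byRun     = $text -split '\W+'   # a run of non-word characters is one split

$bySpace[0]         # This,
$byNonWord.Count    # 6 (the second item is an empty string)
$byRun -join '|'    # This|a|sequence|of|words
```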
This is a reminder to be careful with *; \W+ specifies at least one, \W*
matches zero-length sequences. How many non-word characters follow each letter?
None - and that’s a match! But in other places we can get cleaner output by
using * to pull optional characters into the match, for example:
-split '\s*,\s*'
splits on (and discards) commas with any spaces around
them - without this we might need to trim the results as an extra step.
We can use the same expression with the -replace operator,
-replace '\s*,\s*', ', '
will take text with any combination of spaces around commas and change to commas
followed by exactly one space and preceded by none.
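A short sketch of both uses on the same untidy input (the sample text is made up):

```powershell
$text = 'red ,green,  blue , yellow'

# Splitting: the spaces are part of the match, so they are discarded with the comma
$items = $text -split '\s*,\s*'
$items -join '|'     # red|green|blue|yellow

# Replacing: every comma ends up with no space before it and exactly one after
$tidy = $text -replace '\s*,\s*', ', '
$tidy                # red, green, blue, yellow
```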
PowerShell’s -split operator breaks at each match, and the
-replace operator replaces each match. .NET’s [Regex] object can be
told to replace only a certain number, and to ignore text at the start. When
there are multiple possible matches it is important to remember something from part
1: a character can only be included in one match or, if you prefer, matches
cannot overlap. For example, in the operation
'-----' -replace '---','x'
there are 5 ‘-’ characters and a sequence of 3 could begin at position 1, 2 or
3, but we get one match, not three. Once a character has been “consumed” it can’t
be used again (later we will see look-arounds where we can say “Look for this, but
don’t include it in the match”). The regex engine reads left to right and finds a
match with the first 3 characters; they will be replaced with x, but the next
two can’t make further matches because they cannot combine with characters that have already been used.
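A sketch of both points: the single non-overlapping match, and the overloads of the [regex] Replace() method which take a maximum number of replacements and a starting offset. The sample strings are invented:

```powershell
$once = '-----' -replace '---', 'x'
$once                                               # x--

$dash = [regex]'-'
$all        = $dash.Replace('a-b-c-d', '+')         # every match
$firstTwo   = $dash.Replace('a-b-c-d', '+', 2)      # at most two replacements
$fromOffset = $dash.Replace('a-b-c-d', '+', 1, 2)   # one replacement, searching from offset 2

$all          # a+b+c+d
$firstTwo     # a+b+c-d
$fromOffset   # a-b+c-d
```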
One use of the -split operator is finding the first, last or nth part of
some text. If we take a filename like “My.regex.demo.txt”, using
-split '\.' will give us an array @(“My”, “regex”, “demo”, “txt”) and we can
get the last item like this
($FileName -split '\.')[-1]
Sometimes there is a better way, and to see that we need to dive into the use of ().
Grouping and references
In part one we saw that () can be used for repeating groups, so
([0-9a-f]{4}-){3}
means 3 repetitions of the group “hex-digit repeated 4 times followed by ‘-’”.
In regex the | sign means “or”, so
'^file://|^ftp://|^http://|^https://'
can be used to test for a protocol at the
beginning of a string. Grouping part of this would let us write it more tidily as:
'^(file|ftp|https?)://'
Doing this has an interesting side effect.
"http://jhoneill.github.io" -match '^(file|ftp|https?)://'
returns True, but examining $Matches shows we have more than one match:
Name Value
---- -----
1 http
0 http://
Group 0 is the whole match and group 1 is the result of “capturing” the first
group or “sub-match”. We can use these captured groups in lots of creative ways. One
is to move away from isolating one of the elements from a -split - as the file extension example
above did - and switch to explicitly specifying and capturing the part(s) of interest.
We can describe a file extension as '\.\w+$' for “a dot, then at least one word
character followed by the end of the string” and use () to make that a group.
Outside the group we can specify “all other text”:
"My.regex.demo.txt" -match '^.*(\.\w+)$'
Now it specifies “start, then anything, then the group, and end after the group”. Examining
$Matches shows this:
Name Value
---- -----
1 .txt
0 My.regex.demo.txt
The -replace operator replaces the whole match (group 0), and any other groups can be used in the replacement text:
"My.regex.demo.txt" -replace '^.*(\.\w+)$' ,'$1'
returns .txt
This requires care with quotes: PowerShell will read "$1" as a string with
an embedded variable to be expanded, so the command needs to use either '$1' or "`$1".
There is a secondary caution. We can write:
$extension = $fullName -replace '^.*(\.\w+)$' ,'$1'
but if $fullName doesn’t end with “.something” there will be no
replacement so $extension will contain the full name. It might be better to write:
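One possible shape for this, offered as a sketch rather than the only option: test for the extension with -match and return an empty string when there is none. Get-Extension is a hypothetical helper name used only for this illustration.

```powershell
# Hypothetical helper: returns the extension, or an empty string if there is none.
function Get-Extension {
    param([string]$FullName)
    # -match fills $Matches only when it succeeds, so it doubles as the test
    if ($FullName -match '\.\w+$') { $Matches[0] } else { '' }
}

Get-Extension 'My.regex.demo.txt'   # .txt
Get-Extension 'README'              # (empty string, not 'README')
```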
Sometimes we want to know what kinds of characters are in the text, for example
“does this password contain upper case, lower case, digits, and symbols?” and groups can help with that.
We can look at $Matches after the following test, which uses the -cmatch
operator to put [a-z] and [A-Z] into different groups:
> "Password" -cmatch "((\W)|([0-9])|([a-z])|([A-Z])){8,}"
True
> $Matches
Name Value
---- -----
5 P
4 d
1 d
0 Password
The whole match goes into 0, the last item from the outer grouping is in 1 and this grouping must be repeated at least 8 times. Non-word characters would be in group 2 and digits in 3 - here both are missing - lower-case letters go into group 4 and upper-case into 5. So $Matches holds one entry for each type of character present and two additional ones (0 and 1), which means a test for at least 8 characters of 3 types needs the match to succeed and $Matches to contain at least 5 entries.
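That test can be sketched as follows. Test-PasswordMix is a hypothetical name, and the expression is the one used above without any changes:

```powershell
# Hypothetical helper: at least 8 characters in a row, drawn from at least 3 of the 4 types.
function Test-PasswordMix {
    param([string]$Candidate)
    # $Matches holds 0, 1 and one entry per character type that appeared,
    # so 3 types means at least 5 entries. -and short-circuits, so a stale
    # $Matches is never counted when the match itself fails.
    ($Candidate -cmatch '((\W)|([0-9])|([a-z])|([A-Z])){8,}') -and ($Matches.Count -ge 5)
}

Test-PasswordMix 'Password'   # False: only upper and lower case
Test-PasswordMix 'Passw0rd'   # True:  upper case, lower case and a digit
Test-PasswordMix 'Pa5s!'      # False: fewer than 8 characters
```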
It’s unusual to care only that groups exist and not what is in them; the more common pattern is “Match everything, isolate some part(s), and use the part(s) to make something new” and having recognized it, it reveals itself in many places. For another example, let’s transform a display name – possibly with a middle name or initials, but possibly not - into Last, First format.
"James H. O'Neill" -replace "^([\w'-]+)\s.*?([\w'-]+)$",'$2, $1'
Here, the regular expression defines two identical groups, one at the start and one at the end,
and there is a space and possibly other characters between groups. This is another place
to remember greedy and lazy: if the \s.*? in the middle did not have
the ? it would consume as much as possible, provided enough is left for the rest of the
expression to match, so only the final “l” of “O’Neill” would be left for the second
group. With the ? it takes the smallest amount which still gives an unbroken
run of [\w'-] characters all the way to the end.
I use [\w'-] to ensure names with hyphens and apostrophes are processed correctly, and this will provide another example later.
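To see the lazy and greedy versions side by side, here is a small sketch. The extra sample names are invented for the illustration:

```powershell
$lazy   = "^([\w'-]+)\s.*?([\w'-]+)$"
$greedy = "^([\w'-]+)\s.*([\w'-]+)$"

$withInitial = "James H. O'Neill" -replace $lazy, '$2, $1'
$noInitial   = "James O'Neill"    -replace $lazy, '$2, $1'
$hyphenated  = 'Anne-Marie Smith' -replace $lazy, '$2, $1'
$tooGreedy   = "James H. O'Neill" -replace $greedy, '$2, $1'

$withInitial   # O'Neill, James
$noInitial     # O'Neill, James
$hyphenated    # Smith, Anne-Marie
$tooGreedy     # l, James
```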
And one more example: I mentioned splitting text and making it into a date, it could be done like this.
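A sketch of how that could look with captured groups, assuming the text is in year, month, day order. The pattern, variable names and warning text are illustrative only:

```powershell
$text = '2021/2/2'   # assumed input; any non-digit separator will do

if ($text -notmatch '^\D*(\d{4})\D+(\d{1,2})\D+(\d{1,2})\D*$') {
    Write-Warning "'$text' does not look like a year-month-day date."
}
else {
    # Groups 1 to 3 hold the year, month and day as strings
    $date = [datetime]::new([int]$Matches[1], [int]$Matches[2], [int]$Matches[3])
}

$date.Month   # 2
```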
A side point on PowerShell style here: when I have a piece of code which goes:
I find that when I have read through the logic and reached the else, I must scroll
back up to remind myself of the condition - I find code reads more fluently like this:
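As a hypothetical illustration of the two shapes being contrasted (the file name, pattern and messages are invented, and the comment lines stand in for a much longer block):

```powershell
$fileName = 'My.regex.demo.txt'   # hypothetical input

# Shape 1: by the time we reach the else, the condition is a long way back.
if ($fileName -match '^(.*)(\.\w+)$') {
    $baseName  = $Matches[1]
    $extension = $Matches[2]
    # ... imagine many more lines of processing here ...
}
else {
    Write-Warning "'$fileName' has no extension."
}

# Shape 2: deal with the short case first, while the condition is still in view.
# -notmatch still fills $Matches when the pattern does match.
if ($fileName -notmatch '^(.*)(\.\w+)$') {
    Write-Warning "'$fileName' has no extension."
}
else {
    $baseName  = $Matches[1]
    $extension = $Matches[2]
    # ... imagine many more lines of processing here ...
}
```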
As well as referring to groups in replace operations, we can refer back to them in
the expression where they are created. So, we can test for any two vowels like this
$text -match '[aeiou].*[aeiou]'
But to test for the same vowel twice needs a back reference:
$text -match '([aeiou]).*\1'
returns true when $text holds “Alpha” - \1 is a reference to group 1, so it holds A and matches a;
\2 would be group 2 and so on. This expression returns false when the text is
“extra”; \1 initially holds ‘e’ which doesn’t match anything, and then a second attempt will try
with \1 as ‘a’ which also doesn’t match. Setting $text to “extraordinary”
gives true - you can see this would be difficult without regex as ‘a’ isn’t the first
vowel, and the two ‘a’ characters are not consecutive vowels either, but as a regex it
tries ‘e’, fails and then tries ‘a’.
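The cases just described can be checked with a short sketch (the variable names are illustrative):

```powershell
$anyTwoVowels  = '[aeiou].*[aeiou]'
$sameVowelPair = '([aeiou]).*\1'

'extra' -match $anyTwoVowels            # True:  e ... a
'Alpha' -match $sameVowelPair           # True:  A ... a (-match ignores case)
'extra' -match $sameVowelPair           # False: no vowel appears twice
'extraordinary' -match $sameVowelPair   # True:  a ... a
$Matches[1]                             # a
```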
Using () for more than groups
If the first character inside the () is ? it says the group is doing
something special - the main uses of the (?) syntax are listed in the table below:
Description | Syntax |
---|---|
Group with numbered back-reference | (xxxx) |
Group with Name as its back-reference | (?'Name'xxx) or (?<Name>xxx) |
Group without back-reference | (?:xxx) |
Directive: no unnamed back references | (?n) |
Directive: include line break in “any character” | (?s) |
Directive: $ & ^ match “start/end of line” instead of “start/end of text” | (?m) |
Directive: case insensitive processing of letters | (?i) |
Directive: treat group as atomic (see part 3) | (?>) |
Comments | (?# a comment ) |
Exclude white space and comments unless escaped | (?x) |
Look ahead and look behind (see later) | (?=), (?!), (?<=) or (?<!) |
Conditional processing if an earlier group matched | (?(name)xxx) or (?(1)xxx) |
Named groups
The first way of using this syntax is for named groups: <name> or 'name'
after the (? specifies a name, which can then be used externally or in a back
reference. Re-working an earlier example, this:
"http://jhoneill.github.io" -match '^((?<file>file)|(?<ftp>ftp)|(?<http>https?))://'
still returns true and still has a group named 1, but that sub-divides into one of
3 named groups:
Name Value
---- -----
http http
1 http
0 http://
This allows us to write code like this:
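For instance, a sketch along these lines branches on whichever named group was captured. The descriptions assigned to $kind are invented for the example:

```powershell
$url = 'http://jhoneill.github.io'

if ($url -match '^((?<file>file)|(?<ftp>ftp)|(?<http>https?))://') {
    # Only the alternative that matched appears as a key in $Matches
    if     ($Matches.ContainsKey('http')) { $kind = 'web page' }
    elseif ($Matches.ContainsKey('ftp'))  { $kind = 'FTP site' }
    else                                  { $kind = 'local file' }
}

$kind   # web page
```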
Unnamed groups use \1, \2, \3 and so on as back-references in the same expression
and $1, $2, $3 to refer to the groups in a replace operation. For a named back reference
we use \k<Name> inside the expression and ${name} when replacing.
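A brief sketch of both forms, with made-up group names and sample text:

```powershell
# \k<word> refers back to the named group inside the expression
$repeated = 'it was was fine' -match '\b(?<word>\w+)\s+\k<word>\b'
$repeated           # True
$Matches['word']    # was

# ${name} refers to the named group in the replacement text; the single
# quotes stop PowerShell trying to expand it as a variable
$swapped = '2021-02-03' -replace '(?<year>\d{4})-(?<month>\d\d)-(?<day>\d\d)', '${day}/${month}/${year}'
$swapped            # 03/02/2021
```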
Comments and formatting
Specifying the (?x) option at the start of the expression would let us include the monster regex
from Part one in a piece of PowerShell which looks like this:
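As a stand-in, here is a sketch of the layout using a GUID-style pattern. This exact pattern is an assumption made for the illustration and may differ from the expression in part one; only the name $GuidRegex comes from the text that follows.

```powershell
$GuidRegex = @'
(?x)                 # ignore white space in the pattern and allow comments
^ \{?                # start of text, optional opening brace
[0-9a-f]{8} -        # 8 hex digits and a hyphen
([0-9a-f]{4} -){3}   # 3 groups of 4 hex digits, each followed by a hyphen
[0-9a-f]{12}         # 12 hex digits
\}? $                # optional closing brace, end of text
'@

'12345678-abcd-ef01-2345-6789abcdef01' -match $GuidRegex   # True
'not-a-guid' -match $GuidRegex                             # False
```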
Even without comments the layout is easier to follow: anything between a # sign and the end of a line
is a comment and spaces and line-breaks are all ignored, so any white-space that
is part of the pattern needs to use \s. Any space before the (?x)
forms part of the expression, so be sure to put (?x)
at the very start of the text without any stray indents.
If we prefer using a [regex] object to the PowerShell operators we can cast the string to a regex, with
$regex = [regex]$GuidRegex
or we can create an object with either:
$regex = New-Object -TypeName regex -ArgumentList $guidregex
or
$regex = [regex]::new($guidregex)
With the last two forms, we can specify options for the regex behaviour - like rightToLeft processing, and in this case we could use
$regex = [regex]::new($guidregex,'IgnorePatternWhitespace')
and omit the (?x) inside the expression. IgnorePatternWhitespace
is a value from the enum [System.Text.RegularExpressions.RegexOptions]
so it could be written in full as
[System.Text.RegularExpressions.RegexOptions]::IgnorePatternWhitespace
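A sketch of options being combined; the spaced-out pattern is an invented example. One thing to keep in mind is that the methods of a [regex] object are case-sensitive unless IgnoreCase is requested, unlike the -match operator.

```powershell
# Spaces in the pattern are only for readability; the option makes the engine ignore them
$spacedPattern = '^ [0-9a-f]{8} - ([0-9a-f]{4} -){3} [0-9a-f]{12} $'
$options = [System.Text.RegularExpressions.RegexOptions]'IgnoreCase, IgnorePatternWhitespace'
$regex   = [regex]::new($spacedPattern, $options)

$regex.IsMatch('12345678-ABCD-ef01-2345-6789abcdef01')   # True

# RightToLeft starts searching from the end of the text
$lastNumber = [regex]::new('\d+', 'RightToLeft').Match('10 20 30').Value
$lastNumber                                              # 30
```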
Single line and multiline mode
Ignoring pattern white space changes how line breaks in the expression are interpreted, but what about line breaks - i.e. new-line characters - in the text being examined? If $text holds several lines, with one line ending “on” and the next beginning “five”, then
$text -match 'on\s*five'
returns True because new-line counts as a space, so it matches \s. But testing with an expression of 'on.*five'
returns false because new-line is excluded from “.”. If we use the [regex]
object we can request the singleline RegexOption.
$regex =[regex]::new('is.*on','singleline')
$regex.IsMatch($text)
Or we can specify single-line mode by starting the expression with (?s)
$text -match '(?s)is.*five'
returns true.
In some multi-line cases it would be useful to consider each line as a
string in its own right and have ^ and $ act as “start and end of line”
not “start and end of text” (\A and \z are always start-of-text
and end-of-text). Requesting the multiline regex option or putting (?m)
into the expression does this.
$text -match '(?m)^is'
returns true, but multiline mode changes only ^ and $:
'(?m)on.*five'
will still give false because “.” goes on excluding the line break after “on”.
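The two modes can be tried out with a sketch like this one, where $text is an assumed sample of six short words joined with new-line characters:

```powershell
# Assumed sample text: one word per line
$text = 'This', 'is', 'text', 'on', 'five', 'lines' -join "`n"

$text -match 'on\s*five'      # True:  \s matches the new-line
$text -match 'on.*five'       # False: "." stops at the new-line
$text -match '(?s)on.*five'   # True:  single-line mode lets "." cross it
[regex]::new('is.*on', 'Singleline').IsMatch($text)   # True

$text -match '^is'            # False: ^ only matches at the start of the text
$text -match '(?m)^is'        # True:  multiline mode lets ^ match at the start of a line
$text -match '(?m)^text$'     # True:  and lets $ match at the end of a line
```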
Look arounds.
Sometimes we want to match on something only if it is preceded or followed by something
else. For example when we want to format a string of digits as 12,345,678.
This is something so familiar that it takes a moment to think about
the process and convert that into a regex description. We insert a comma after a
digit if the remaining digits group into blocks of 3.
What we have seen so far lets the -replace operator say “If there are 3 digits
make them ,123 or 123,” but not “only if after those 3 there are 0, 3, 6 or any
multiple of 3 more” and “only when one digit has been passed”.
When digits have been used in part of a match, another part of a match can’t re-use them, so
we need a way to peek back at the ones which have been consumed, or forward without
consuming any that will be consumed by a later part of the expression. That peeking is properly known as looking behind and looking ahead.
A look behind allows us to say that before any place where a comma should go there
must be one digit, and a look ahead says after that place there must be groups
of 3 running up to the end, which is what the following expression says:
(?<=\d)(?=(\d{3})+$)
If the expression didn’t explicitly include the end, the engine would ignore the final
digit(s) and make extra matches. This is an important point about look-arounds:
characters can be included in a look-ahead or look-behind even if they have
already been used in a match or look-around - the last 3 digits are part of every match.
This example uses a look ahead and look behind to identify a point in the text which extends for zero characters. There is no operator for insert - it is done here by replacing the zero characters before 345 and before 678 with “,”.
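Used with -replace, the expression behaves like this (a sketch; the variable names are illustrative):

```powershell
$addCommas = '(?<=\d)(?=(\d{3})+$)'

$eightDigits = '12345678' -replace $addCommas, ','
$fourDigits  = '1234'     -replace $addCommas, ','
$threeDigits = '123'      -replace $addCommas, ','

$eightDigits   # 12,345,678
$fourDigits    # 1,234
$threeDigits   # 123 (no digit before the only candidate position)
```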
We can negate a look around by replacing the = sign with a ! so this
$text -replace '(?<!(\s{2}|\n))\r?\n(?!\r?\n)', ' '
Replaces new-line (alone or with return) with a space, but only if it is neither
followed by another one, nor preceded by 2 spaces or a new-line.
A hint on reading regexes. When I was editing this post I came back to the expression above and had to look at it for a moment to see what it did - in a script I would have a comment to explain it. Mentally I did this:
- Something is going to be replaced with space. Use brackets to break what is being replaced into groups:
(?<!(\s{2}|\n))
\r?\n
(?!\r?\n)
- The outer () are look ahead / behind which apply a “but only if” to the rest, so put them to one side, leaving \r?\n. That’s carriage-return (optional) new-line (required) - a line break. So “Replace a line break with a space, but only if … what?”
- (?! … ) “Looking forward we don’t see” the same expression - so “but not when this is the first of two line breaks”.
- (?<! something in brackets ) - the something in brackets is things separated with | so an “or”: \s{2} 2 spaces or \n new-line. So “nor when this is the second of two breaks, nor when preceded by spaces.”
Aha! That’s describing when two lines get merged into one in markdown.
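A sketch with an invented piece of text shows the three cases: an ordinary line break is joined, a blank line is kept, and a break after two spaces is kept.

```powershell
# Assumed sample: a soft break, then a blank line, then a break after two spaces
$text = "first line`nsecond line`n`nnew paragraph  `nafter a hard break"

$joined = $text -replace '(?<!(\s{2}|\n))\r?\n(?!\r?\n)', ' '

# Only the first line break has been turned into a space
$joined -eq "first line second line`n`nnew paragraph  `nafter a hard break"   # True
```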
I talked above about testing letters in a name: having a surname like O’Neill, any code which uses
-match '^[a-z]+$', or -match '^\w+$' rejects me. The latter will let me turn O'Neill
into 0Ne111 - it’s doubly poor work.
Replacing the test with -match "^[a-z'-]+$" allows O’Neill but also makes it legal to begin or
end with ' or - and still doesn’t allow accented letters. One way to work round this would be
to use \w to allow all letters (and numbers) and add lookarounds to rule out illegal
combinations - digits anywhere and non-word characters at either end, like this:
"^(?=\w)[\w'-]+(?<!\W|\d.*)$"
Using the way of reading I showed a moment ago, in the middle of this expression [\w'-]+
allows everything we want, but also allows things we don’t: digits and misplaced ' and - characters.
Wrapped around it on the left the expression says “Peeking at the first character we must see a word
character (not ' or -).” and on the right it says “looking back we must NOT see a non-word character
in the last position OR see a digit followed by anything”.
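Trying the expression against a few names, some of them invented for the purpose, gives a feel for what it accepts and rejects:

```powershell
$namePattern = "^(?=\w)[\w'-]+(?<!\W|\d.*)$"

"O'Neill"    -match $namePattern   # True
'Anne-Marie' -match $namePattern   # True
"O'Neill'"   -match $namePattern   # False: ends with an apostrophe
'-Neill'     -match $namePattern   # False: starts with a hyphen
'0Ne111'     -match $namePattern   # False: contains digits
```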
Another place I use look aheads and behinds is setting my PowerShell prompt; when the
current path is too long, I cut the middle part out with this:
$pwd -replace "(?<=^.{4,})\\.*\\(?=.{20,}$)","\...\"
which says “find a block of any text between \ characters with at least 20
characters following it before the end of text, and at least 4 characters preceding it
after the beginning of the text, and replace it with \...\” so
C:\Users\James\Documents\WindowsPowerShell\Modules\Microsoft.Graph.PlusPlus
becomes C:\Users\...\Microsoft.Graph.PlusPlus
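Wrapped in a function, the idea might look like the sketch below. Get-ShortPath is a hypothetical helper name, and the prompt function is only one possible way of using it:

```powershell
# Hypothetical helper wrapping the expression above
function Get-ShortPath {
    param([string]$Path)
    $Path -replace '(?<=^.{4,})\\.*\\(?=.{20,}$)', '\...\'
}

Get-ShortPath 'C:\Users\James\Documents\WindowsPowerShell\Modules\Microsoft.Graph.PlusPlus'
# C:\Users\...\Microsoft.Graph.PlusPlus

Get-ShortPath 'C:\Temp'   # C:\Temp (nothing to cut, so unchanged)

# One way it could be used when setting the prompt
function prompt { "PS $(Get-ShortPath $PWD.Path)> " }
```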
Greedy and lazy matter here, and that way of reading helps again: the match part tries to be greedy and if it consumes too many characters it will surrender some until the lookarounds are satisfied. If the parts were processed in sequence, more would go into the look arounds and the minimum amount would be replaced. Matches get to be greedy before look arounds do, and we will see why that can be important in the third and final part of this series.