A PowerShell users' guide to regular expressions: part 2 (of 3).

11 Apr 2021

Breaking up text

Part one explained that Regular Expressions describe patterns in text, and if we can describe a pattern, we can test whether some text matches it (for example using PowerShell’s -match operator), replace matching parts (with the -replace operator) and split where we find a match (with the -split operator).

I find a lot of benefit replacing complex string manipulation with a single regex operation. For example, a series of commands to slice up “2021-2-2” and turn it into a [datetime] object becomes a single statement:

    if ($string -notmatch $dateRegex) {"Not a date!"}   
    else {Get-Date <<with $matches>>}

A string slicing implementation which looks for “-” will fail if the input uses “/” between the digits, but the regex version can be more robust by looking for groups of digits separated by non-digits. Often these operations to break things up will use the -split operator, for example:

-split ' ' will break at (and discard) space characters,
-split '[\r\n]+' splits multi-line text on any combination of carriage-return and/or new-line – removing any blank lines.
-split '\D+' splits into groups of digits wherever there are non-digits

The -split operator discards any text which is included in the match, and usually a complicated expression defines something we want to keep, so most uses of -split are with quite simple patterns.
Using the first of the examples above to split ‘This, a sequence of words’ will leave the comma attached to “This,”.
We might use -split '\W' instead, but splitting at each non-word character splits at both space and comma, putting an extra zero-length item between “This” and “a”. Changing to -split '\W+' will split every time it finds at least one non-word character.
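
The difference between these three patterns is easy to check at the prompt. This is only a quick sketch using the sample sentence from above; the variable names are mine.

```powershell
$sentence = 'This, a sequence of words'

# Splitting on a single space keeps the comma attached to "This,"
$bySpace = $sentence -split ' '

# \W splits at the comma AND at the space after it, leaving an empty item between them
$byNonWord = $sentence -split '\W'

# \W+ treats the comma and the space together as one separator
$byNonWordRun = $sentence -split '\W+'

$bySpace.Count        # 5
$byNonWord.Count      # 6
$byNonWordRun.Count   # 5
```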

This is a reminder to be careful with *; \W+ specifies at least one, \W* matches zero-length sequences. How many non-word characters follow each letter? None – and that’s a match! But in other places we can get cleaner output by using * to pull optional characters into the match, for example:
-split '\s*,\s*' splits on (and discards) commas with any spaces around them – without this we might need to trim the results as an extra step. We can use the same expression with the -replace operator,
-replace '\s*,\s*', ', '
will take text with any combination of spaces around commas and change to commas followed by exactly one space and preceded by none.
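
As a small illustration of both operators sharing one pattern (the fruit list is just made-up sample data):

```powershell
$list = 'apples ,pears,  plums , cherries'

# Split on the comma plus any white space either side, so no trimming is needed
$items = $list -split '\s*,\s*'

# The same pattern tidies the spacing in place
$tidy = $list -replace '\s*,\s*', ', '

$items -join '|'   # apples|pears|plums|cherries
$tidy              # apples, pears, plums, cherries
```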

PowerShell’s -split operator breaks at each match, and the -replace operator replaces each match. .NET’s [Regex] object can be told to replace only a certain number, and to ignore text at the start. When there are multiple possible matches it is important to remember something from part 1: a character can only be included in one match or, if you prefer, matches cannot overlap. For example, in the operation
'-----' -replace '---','x'
there are 5 ‘-’ characters and a sequence of 3 could begin at position 1, 2 or 3, but we get one match, not three. Once a character has been “consumed” it can’t be used again (later we will see look-arounds where we can say “Look for this, but don’t include it in the match”). The regex engine reads left to right and finds a match with the first 3 characters, which will be replaced with x, but the next two can’t make further matches because they cannot combine with characters that have already been used.
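
A rough sketch of both points: the non-overlapping match with the operator, and the count and start position that the [regex] object's Replace method accepts. The variable names are only for illustration.

```powershell
# Matches cannot overlap: the first three characters are consumed, leaving two that cannot match
$nonOverlapping = '-----' -replace '---', 'x'     # x--

# Replace(input, replacement, count) and Replace(input, replacement, count, startat)
$dash      = [regex]::new('-')
$firstTwo  = $dash.Replace('-----', 'x', 2)       # xx---
$skipFirst = $dash.Replace('-----', 'x', 2, 1)    # -xx--
```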

One use of the -split operator is finding the first, last or nth part of some text. If we take a filename like “My.regex.demo.txt”, using
-split '\.' will give us an array @(“My”, “regex”, “demo”, “txt”) and we can get the last item like this
($FileName -split '\.')[-1]

Sometimes there is a better way – and to see that we need to dive into the use of ().

Grouping and references

In part one we saw that () can be used for repeating groups, so ([0-9a-f]{4}-){3} means 3 repetitions of the group “hex-digit repeated 4 times followed by ‘-‘”

Doing this has an interesting side effect.
"http://jhoneill.github.io" -match '^(file|ftp|https?)://'
returns True, but examining $Matches shows we have more than one entry:

Name Value
---- -----
1    http
0    http://

Group 0 is the whole match and group 1 is the result of “capturing” the first group or “sub-match”. We can use these captured groups in lots of creative ways. One is to move away from isolating one of the elements from a -split - as the file extension example above did - and switch to explicitly specifying and capturing the part(s) of interest. We can describe a file extension as '\.\w+$' - for “a dot, then at least one word character followed by the end of the string” - and use () to make that a group. Outside the group we can specify “all other text”:
"My.regex.demo.txt" -match '^.*(\.\w+)$'
Now it specifies “start, then anything, then the group, and end after the group”. Examining $Matches shows this:

Name Value
---- -----
1    .txt
0    My.regex.demo.txt

The -replace operator replaces the whole match (group 0), and any other groups can be used in the replacement text:
"My.regex.demo.txt" -replace '^.*(\.\w+)$' ,'$1' returns
.txt

This requires care with quotes: PowerShell will read "$1" as a string with an embedded variable to be expanded, so the command needs to use either '$1' or "`$1"

There is a secondary caution. We can write:
$extension = $fullName -replace '^.*(\.\w+)$' ,'$1'
but if $fullName doesn’t end with “.something” there will be no replacement so $extension will contain the full name. It might be better to write :

    if ($fullName -match '^.*(\.\w+)$') {$ext = $Matches[1]}
    else { <# handle the 'no extension' condition #> }
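
To see the three approaches side by side, here is a sketch with a hypothetical helper, Get-FileExtension (my name, not a built-in command), which returns $null when there is no extension:

```powershell
# Hypothetical helper: returns the extension, or $null when there is none
function Get-FileExtension {
    param([string]$FullName)
    if ($FullName -match '^.*(\.\w+)$') { $Matches[1] }
    else { $null }
}

$viaSplit   = ('My.regex.demo.txt' -split '\.')[-1]     # txt    (no dot)
$viaReplace = 'README' -replace '^.*(\.\w+)$', '$1'     # README (no match, so nothing is replaced)
$viaMatch   = Get-FileExtension 'My.regex.demo.txt'     # .txt
$noExt      = Get-FileExtension 'README'                # $null
```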

Sometimes we want to know what kinds of characters are in the text, for example “does this password contain upper case, lower case, digits, and symbols?” and groups can help with that. We can look at $Matches after the following test, which uses the -cmatch operator to put [a-z] and [A-Z] into different groups

> "Password" -cmatch "((\W)|([0-9])|([a-z])|([A-Z])){8,}"
True
> $Matches

Name                           Value
----                           -----
5                              P
4                              d
1                              d
0                              Password

The whole match goes into 0, the last item from the outer grouping is in 1 and this grouping must be repeated at least 8 times. Non-word characters would be in group 2 and digits in 3 - here both are missing - lower-case letters go into group 4 and upper-case into 5. So we have 1 match for each type of character and two additional ones, so a test for 8 letters of 3 types could look like this.

    if ($newPassword -cnotmatch "((\W)|([0-9])|([a-z])|([A-Z])){8,}"  -or $Matches.Count -lt 5) {
        throw "bad password" 
    }
    else {set-password $newPassword}
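
Wrapped up as a function, the same test might look like the sketch below. Test-PasswordMix is a made-up name, and the thresholds (8 characters, 3 types) are the ones from the example above, not a recommendation.

```powershell
# Hypothetical helper; thresholds are taken from the example in the text
function Test-PasswordMix {
    param([string]$Password)
    $pattern = '((\W)|([0-9])|([a-z])|([A-Z])){8,}'
    # -cnotmatch is case sensitive, so [a-z] and [A-Z] fill different groups
    if ($Password -cnotmatch $pattern) { return $false }
    # $Matches holds groups 0 and 1 plus one entry per character type that was found
    return ($Matches.Count -ge 5)
}

Test-PasswordMix 'Password'   # False: upper and lower case only
Test-PasswordMix 'Passw0rd'   # True:  upper, lower and a digit
Test-PasswordMix 'Pw0!'       # False: fewer than 8 characters
```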

It’s unusual to care only that groups exist and not what is in them; the more common pattern is “Match everything, isolate some part(s), and use the part(s) to make something new” and having recognized it, it reveals itself in many places. For another example, let’s transform a display name – possibly with a middle name or initials, but possibly not - into Last, First format.

"James H. O'Neill" -replace "^([\w'-]+)\s.*?([\w'-]+)$",'$2, $1'

Here, the regular expression defines two identical groups, one at the start and one at the end, and there is a space and possibly other characters between groups. This is another place to remember greedy and lazy, if the \s.*? in the middle did not have the ? it would consume as much as possible, provided enough is left for the rest of expression to match, so only the final “l” of “O’Neill” would be left for the second group. With the ? it takes the smallest amount which still gives an unbroken run of [\w'-] characters all the way to the end.
I use [\w'-] to ensure names with hyphens and apostrophes are processed correctly, and this will provide another example later.
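
Trying the expression on a few names shows how it behaves with and without a middle part. The extra names are sample data of my own, and I have written the pattern in single quotes here (doubling the apostrophe) purely as a matter of taste.

```powershell
# Single quotes keep PowerShell away from the $; the apostrophe in the class is doubled to escape it
$pattern = '^([\w''-]+)\s.*?([\w''-]+)$'

$names = "James H. O'Neill", "James O'Neill", 'Anne-Marie Smith-Jones', 'Cher'
$lastFirst = $names -replace $pattern, '$2, $1'
$lastFirst
# O'Neill, James
# O'Neill, James
# Smith-Jones, Anne-Marie
# Cher     <- no space, so no match, and the text comes back unchanged
```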

And one more example: I mentioned splitting text and making it into a date, it could be done like this.

    if ($string -notmatch '(\d+)\D+(\d+)\D+(\d+)') { "Not a date!" }
    else {Get-Date -Year $Matches[1] -Month $Matches[2] -Day $Matches[3] }
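
As a function this could be sketched as follows. ConvertTo-DateFromText is a hypothetical name, and it assumes the digits arrive in year, month, day order.

```powershell
# Hypothetical helper: assumes year, month, day order in the input text
function ConvertTo-DateFromText {
    param([string]$Text)
    if ($Text -notmatch '(\d+)\D+(\d+)\D+(\d+)') { Write-Warning "Not a date: $Text" }
    else { Get-Date -Year $Matches[1] -Month $Matches[2] -Day $Matches[3] }
}

$d1 = ConvertTo-DateFromText '2021-2-2'
$d2 = ConvertTo-DateFromText '2021/02/02'
$d1.ToString('yyyy-MM-dd')   # 2021-02-02
$d2.ToString('yyyy-MM-dd')   # 2021-02-02
```

Get-Date fills in the current time of day for the parts we don't specify, so the two results agree on the date but are not midnight values.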

A side point on PowerShell style here: when I have a piece of code which goes:

    if ($x -match $y) { 
        many
        lines
        of_logic 
    }
    else {Warning_or_error_message}

I find that when I have read through the logic and reached the else, I must scroll back up to remind myself of the condition - I find code reads more fluently like this:

    if ($x -notmatch $y) {Warning_or_error}  
    else {
        all
        the
        logic
    }

As well as referring to groups in replace operations, we can refer back to them in the expression where they are created. So, we can test for any two vowels like this
$text -match '[aeiou].*[aeiou]'
But to test for the same vowel twice needs a back reference:
$text -match '([aeiou]).*\1'
returns true when $text holds “Alpha” - \1 is a reference to group 1, so it holds A and matches a; \2 would be group 2 and so on. This expression returns false when the text is “extra”; \1 initially holds ‘e’ which doesn’t match anything, and then a second attempt will try with \1 as ‘a’ which also doesn’t match. Setting $text to “extraordinary” gives true – you can see this would be difficult without regex as ‘a’ isn’t the first vowel, and the two ‘a’ characters are not consecutive vowels either, but as a regex it tries ‘e’, fails and then tries ‘a’.
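
The three words can be run through the same test in one go; this is just the expression above in a loop.

```powershell
$sameVowelTwice = '([aeiou]).*\1'

$results = foreach ($word in 'Alpha', 'extra', 'extraordinary') {
    '{0} : {1}' -f $word, ($word -match $sameVowelTwice)
}
$results
# Alpha : True
# extra : False
# extraordinary : True
```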

Using () for more than groups

If the first character inside the () is ? it says the group is doing something special - the main uses of the (?) syntax are listed in the table below:

Description	Syntax
Group with numbered back-reference	`(xxxx)`
Group with `Name` as its back-reference	`(?'Name'xxx)` or `(?<Name>xxx)`
Group without back-reference	`(?:xxx)`
Directive: no unnamed back references.	`(?n)`
Directive: include line break in “any character”	`(?s)`
Directive: `$` & `^` match “start/end of line” instead of “start/end of text”	`(?m)`
Directive: case insensitive processing of letters	`(?i)`
Directive: treat group as atomic (see part 3)	`(?>)`
Comments	`(?# a comment )`
Exclude white space and comments unless escaped	`(?x)`
Look ahead and look behind (see later)	`(?=)`, `(?!)`,`(?<=)` or `(?<!)`
Conditional processing if an earlier group matched	`(?(name)xxx)` or `(?(1)xxx)`

Named groups

The first way of using this syntax is for named groups, <name> or 'name' after the (? specifies a name, which can then be used externally or in a back reference; re-working an earlier example, this:
"http://jhoneill.github.io" -match '^((?<file>file)|(?<ftp>ftp)|(?<http>https?))://'
still returns true and still has a group named 1 but that sub-divides into one of 3 named groups:

Name Value
---- -----
http http
1    http
0    http://

This allows us to write code like this:

    if ($path -notmatch '^((?<file>file)|(?<ftp>ftp)|(?<http>https?))://') {
        Write-Warning 'No Protocol'
    }
    elseIf ($matches.file) {do_something}  
    elseIf ($matches.ftp)  {do_something}  
    elseIf ($matches.http) {do_something}

Unnamed groups use \1, \2, \3 and so on as back-references in the same expression and $1, $2, $3 to refer to the groups in a replace operation. For a named back reference we use \k<Name> inside the expression and ${name} when replacing.
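
Here are both named forms in use, re-working two earlier examples. The group names vowel, first and last are my own choices.

```powershell
# \k<vowel> refers back to the named group inside the expression
$found    = 'extraordinary' -match '(?<vowel>[aeiou]).*\k<vowel>'
$repeated = $Matches.vowel       # a

# ${name} refers to a named group in the replacement text;
# single quotes stop PowerShell treating it as a variable
$lastFirst = "James O'Neill" -replace '^(?<first>\S+)\s+(?<last>\S+)$', '${last}, ${first}'
$lastFirst                       # O'Neill, James
```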

Comments and formatting

Specifying the (?x) option at the start of the expression would let us include the monster regex from Part one in a piece of PowerShell which looks like this:

    $GuidRegex = @'
    (?x)
    ^                  # start,
    \{?                # a { sign, optional,
    [0-9a-f]{8}        # a hex digit [0-9 a-f] repeated 8 times ,
    -                  # a - sign,
    ([0-9a-f]{4}-){3}  # the sub expression([hex digit]x4 - ) repeated 3 times,
    [0-9a-f]{12}       # hex digit repeated 12 times,
    \}?                # a } sign, optional,
    $                  # end.  
    '@

Even without comments the layout is easier to follow: anything between a # sign and the end of a line is a comment and spaces and line-breaks are all ignored, so any white-space that is part of the pattern needs to use \s. Any space before the (?x) forms part of the expression, so be sure to put (?x) at the very start of the text without any stray indents.

If we prefer using a [regex] object to the PowerShell operators we can cast the string to a regex, with
$regex = [regex]$GuidRegex
or we can create an object with either:
$regex = New-Object -TypeName regex -ArgumentList $guidregex
or
$regex = [regex]::new($guidregex)

With the last two forms, we can specify options for the regex behaviour - like rightToLeft processing, and in this case we could use
$regex = [regex]::new($guidregex,'IgnorePatternWhitespace')
and omit the (?x) inside the expression. IgnorePatternWhitespace is a value from the enum [System.Text.RegularExpressions.RegexOptions] so it could be written in full as
[System.Text.RegularExpressions.RegexOptions]::IgnorePatternWhitespace
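
A sketch of the option-based version follows. One thing to watch: a [regex] object is case sensitive by default, unlike the -match operator, so I have added IgnoreCase here; that addition is mine. The variable names and sample GUID are also just for illustration.

```powershell
# No (?x) at the start: the option is passed to the constructor instead
$guidPattern = @'
^                  # start
\{?                # optional {
[0-9a-f]{8}        # 8 hex digits
-                  # a - sign
([0-9a-f]{4}-){3}  # 4 hex digits and a - sign, 3 times
[0-9a-f]{12}       # 12 hex digits
\}?                # optional }
$                  # end
'@

# Option names can be combined in one comma-separated string
$guidRegex = [regex]::new($guidPattern, 'IgnorePatternWhitespace, IgnoreCase')

$isGuid  = $guidRegex.IsMatch('{12345678-ABCD-ef01-2345-6789abcdef01}')   # True
$notGuid = $guidRegex.IsMatch('12345678-abcd')                            # False
```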

Single line and multiline mode

Ignoring Pattern-Whitespace changes how line breaks in the expression are interpreted, but what about line breaks – i.e. new-line characters – in the text being examined?

    $text = @'
    this
    is
    on
    five 
    lines
    '@
    $text -match 'on\s*five'

returns True because new-line counts as a space, so it matches \s. But testing with an expression of 'on.*five'
returns false because new-line is excluded from “.”. If we use the [regex] object we can request the singleline RegexOption.
$regex =[regex]::new('is.*on','singleline')
$regex.IsMatch($text)

Or we can specify single-line mode by starting the expression with (?s)
$text -match '(?s)is.*five'
returns true.
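
The same tests gathered in one place, as a quick sketch. The text is built with `n rather than a here-string so that the result does not depend on how the script file stores its line endings.

```powershell
$text = "this`nis`non`nfive`nlines"

$withWhitespace = $text -match 'on\s*five'      # True:  \s matches the new-line
$withDot        = $text -match 'on.*five'       # False: . stops at the new-line
$withInline     = $text -match '(?s)on.*five'   # True:  single-line mode lets . cross it

# The same thing requested as a RegexOption
$withOption = [regex]::new('on.*five', 'Singleline').IsMatch($text)   # True
```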

In some multi-line cases it would be useful to consider each line as a string in its own right and have ^ and $ act as “start and end of line” not “start and end of text”. Requesting the multiline regex option or putting (?m) into the expression does this (\A and \z are always start-of-text and end-of-text).
$text -match '(?m)^is'
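returns true, although without (?m) the same test gives false because the text starts with “this”. Multiline mode changes only ^ and $, so
'(?m)on\s*five'
still gives true: the line break after “on” is white space and \s matches it.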
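
A short sketch of what (?m) does to the anchors, again with the text built using `n so the line endings are known:

```powershell
$text = "this`nis`non`nfive`nlines"

$plainStart  = $text -match '^is'          # False: the text starts with "this"
$lineStart   = $text -match '(?m)^is'      # True:  ^ now matches at the start of any line
$wholeLine   = $text -match '(?m)^five$'   # True:  $ now matches at the end of any line
$startOfText = $text -match '(?m)\Ais'     # False: \A is still the start of the whole text

# One match per line
$lineCount = [regex]::Matches($text, '(?m)^\w+$').Count   # 5
```

With Windows-style line endings the $ still matches just before the new-line character, so a carriage-return is left in front of it and a pattern like '(?m)^five$' may need \r? before the $.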

Look arounds

Sometimes we want to match on something only if it is preceded or followed by something else. For example when we want to format a string of digits as 12,345,678.
This is something so familiar that it takes a moment to think about the process and convert that into a regex description. We insert a comma after a digit if the remaining digits group into blocks of 3.

What we have seen so far lets the -replace operator say “If there are 3 digits make them ,123 or 123,” but not “only if after those 3 there are 0, 3, 6 or any multiple of 3 more” and “only when one digit has been passed”. When digits have been used in part of a match, another part of the match can’t re-use them, so we need a way to peek back at the ones which have been consumed, or forward without consuming any that will be needed by a later part of the expression. That peeking is properly known as looking behind and looking ahead.

A look behind allows us to say that before any place where a comma should go there must be one digit and a look ahead says after that place there must be groups of 3 running up to the end, which is what the following expression says:
(?<=\d)(?=(\d{3})+$)
If the expression didn’t explicitly include the end the engine would ignore the final digit(s) and make extra matches. This is an important point about look-arounds: characters can be included in a look-ahead or look-behind even if they have already been used in a match or look-around - the last 3 digits are part of every match.

This example uses a look ahead and look behind to identify a point in the text which extends for zero characters. There is no operator for insert - it is done here by replacing zero characters before 678 with “,”
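
Here is the expression at work, along with what happens when the $ is left out. The variable names are mine.

```powershell
$addCommas = '(?<=\d)(?=(\d{3})+$)'

$eightDigits = '12345678' -replace $addCommas, ','   # 12,345,678
$fourDigits  = '1234'     -replace $addCommas, ','   # 1,234
$threeDigits = '123'      -replace $addCommas, ','   # 123

# Without the $ anchor, every position with a digit before it and three or more after it matches
$unanchored = '12345678' -replace '(?<=\d)(?=(\d{3})+)', ','   # 1,2,3,4,5,678
```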

We can negate a look around by replacing the = sign with a ! so this
$text -replace '(?<!(\s{2}|\n))\r?\n(?!\r?\n)', ' '
replaces new-line (alone or with return) with a space, but only if it is neither followed by another one, nor preceded by 2 spaces or a new-line.

A hint on reading regexes: when I was editing this post I came back to the expression above and had to look at it for a moment to see what it did - in a script I would have a comment to explain it. Mentally I did this:

Something is going to be replaced with space. Use brackets to break what is being replaced into groups:
(?<!(\s{2}|\n))
\r?\n
(?!\r?\n)
The () are look ahead / behind which apply a “but only if” to the rest, so put them to one side leaving \r?\n. That’s carriage-return (optional) new-line (required) - a linebreak.
So “Replace a line break with a space, but only if … what?”
(?! … ) “Looking forward we don’t see” the same expression - so “but not when this is the first of two line breaks”.
(?<! something in brackets ) the something in brackets is things separated with | so an “or”:
\s{2} 2 spaces or \n new-line. So “nor when this is the second of two breaks, nor when preceded by spaces.”

Aha! That’s describing when two lines get merged into one in markdown.
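
A quick check of that reading, using two made-up pieces of text: one with a soft break followed by a paragraph break, and one with the two trailing spaces that markdown treats as a hard break.

```powershell
$softBreaks = '(?<!(\s{2}|\n))\r?\n(?!\r?\n)'

# The single break is merged; the blank line between paragraphs is left alone
$text   = "line one`nline two`n`nnew paragraph"
$merged = $text -replace $softBreaks, ' '

# Two spaces before the break mark a hard break, so nothing changes
$hardBreak = "ends with two spaces  `nnext line"
$kept      = $hardBreak -replace $softBreaks, ' '
```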

I talked above about testing letters in a name: having a surname like O’Neill, any code which uses -match '^[a-z]+$', or -match '^\w+$' rejects me. The latter will let me turn O'Neill into 0Ne111 - it’s doubly poor work.
Replacing the test with -match "^[a-z'-]+$" allows O’Neill but also makes it legal to begin or end with ‘ or - and still doesn’t allow accented letters. One way to work round this would be to use \w to allow all letters (and numbers) and add lookarounds to rule out illegal combinations - digits anywhere and non word characters at either end, like this:
"^(?=\w)[\w'-]+(?<!\W|\d.*)$"
Using the way of reading I showed a moment ago, in the middle of this expression [\w'-]+ allows everything we want, but also allows things we don’t: digits and misplaced ‘ and - characters. Wrapped around it on the left the expression says “Peeking at the first character we must see a word character (not ' or -).” and on the right it says “looking back we must NOT see a non-word character in the last position OR see a digit followed by anything”.
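
To try the expression against a handful of names, here is a sketch; Test-Name is a hypothetical helper and the sample names are mine. The accented name is built with [char] only to avoid depending on the script file's encoding.

```powershell
# Hypothetical helper wrapping the expression above (single-quoted, apostrophe doubled)
function Test-Name {
    param([string]$Name)
    $Name -match '^(?=\w)[\w''-]+(?<!\W|\d.*)$'
}

$umlaut = "M$([char]0xFC)ller"   # M, u with umlaut, ller

Test-Name "O'Neill"       # True
Test-Name 'Smith-Jones'   # True
Test-Name $umlaut         # True:  \w covers accented letters
Test-Name '0Ne111'        # False: digits
Test-Name '-Neill'        # False: starts with -
Test-Name "Neill'"        # False: ends with '
```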

Another place I use look aheads and behinds is setting my PowerShell prompt; when the current path is too long, I cut the middle part out with this:
$pwd -replace "(?<=^.{4,})\\.*\\(?=.{20,}$)","\...\"
which says “find a block of any text between \ characters with at least 20 characters following it before the end of text, and at least 4 characters preceding it after the beginning of the text, and replace it with \...\ ” so C:\Users\James\Documents\WindowsPowerShell\Modules\Microsoft.Graph.PlusPlus becomes C:\Users\...\Microsoft.Graph.PlusPlus
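
Pulled out into a function so it can be tried on any path, it might look like this. Get-ShortPath is a made-up name; the pattern is the same one, written in single quotes.

```powershell
# Hypothetical helper: cuts the middle out of a long path
function Get-ShortPath {
    param([string]$Path)
    $Path -replace '(?<=^.{4,})\\.*\\(?=.{20,}$)', '\...\'
}

$long  = Get-ShortPath 'C:\Users\James\Documents\WindowsPowerShell\Modules\Microsoft.Graph.PlusPlus'
$short = Get-ShortPath 'C:\Temp'
$long    # C:\Users\...\Microsoft.Graph.PlusPlus
$short   # C:\Temp
```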

Greedy and lazy matter here, and that way of reading helps again: the match part tries to be greedy and if it consumes too many characters it will surrender some until the lookarounds are satisfied. If the parts were processed in sequence, more would go into the look arounds and the minimum amount would be replaced. Matches get to be greedy before look arounds do. And we will see why that can be important in the third and final part of this series.