A PowerShell users' guide to regular expressions: part 3 (of 3).

05 Jun 2021

Rounding up some regex tricks

Part one of this series looked at the terms which went into a regular expression to describe text of interest. and Part two was mostly about groups; I said that the instead of string methods, I often use a regex group to get the required part of string: here are some examples from files in my PowerShell folder.

... $path -replace '^(.*)/.+?$','$1'
$employeeID = $_.name -replace "^(\d+)\.jpg$",'$1'
$ExtensionNo = [int]($adUser.OfficePhone -replace "^.*(\d{4})$",'$1')
$Identity = ($DisplayName.toupper() -replace "^(\w)(.*\s+)([\w'-]*)$",'$1$3' ) -replace "\W",""
$groupName = $g -replace "^CN=DEPT_(.*?),ou=HR.*$",'$1'
$clusterResType = $s.where( {$_ -match "^\s*ResourceType ="} ) -replace '^.*"(.*)".*$','$1'

I try to explain any regex in a script with a comment - 30 seconds to type a comment can save 5 minutes unpicking the expression- and these expression are:
(1) Get the parent part of a path - the part before the final /something.
(2) in a folder of employee photos extract the employee number from each file-name to add the photo to the correct user in Active Directory.
(3) Convert DDI phone numbers or “Ext 1234” into a 4 digit extension number,
(4) Convert a display name like “James O’Neill” first to uppercase, then to JO’NEILL and finally to JONEILL
(5) Isolate the departmental names like “Sales” from an Active directory distinguished name like CN=DEPT_Sales=ou=HR.... (all the departmental groups were owned by HR).
And (6) find a line in a file which reads ResourceType = "something" and isolate the something.

Although the majority of work with regex in PowerShell is done with the -match and replace operators, they are not the only ways to use expressions. I used the Select-String command to find the ones above; Select-String doesn’t have a -recurse option and going back at least 10 years my PowerShell profile has contained a function “Whathas” , which looks like this:

    function whathas {
        param   (
            [Parameter(ValueFromPipeLine=$true,Mandatory=$true)]
            $Pattern,
            [parameter(ValueFromPipelineByPropertyName = $true)]
            [Alias('Fullname','Path')]
            $Include=@("*.ps1","*.js","*.sql"),
            [Switch]$Recurse,
            [Switch]$List,
            [Switch]$BareMatch
        )
        process {
            $matches = Get-ChildItem -Recurse:$recurse -Include $include | 
                           Select-String -Pattern $pattern  -List:$list
            if ($BareMatch) { $matches | ForEach-Object {$_.matches} | ForEach-Object {$_.value} }
            else            { $matches }
        }
    }

Select-String takes file objects OR text as input, so I select the files (recursively if necessary) and pipe them in. It returns [MatchInfo] objects which hold the path and file name, the line of text containing the match and its number, and details of what matched. Sometimes I just want the matching text so I have a -BareMatch switch; sometimes I want to know which files have a match without worrying about where matches occur, which is what the -List switch returns, so
Whathas something -list | whathas anotherthing -list
will find files with both strings - even if far appart (if you wonder “why not use a multi-line regex with something.*anotherthing - read on) . The command
Whathas something "list|anotherthing" -list
finds files with either or both strings. Occasionally I want to know the lines on either side of the match and use Select-String natively with its -Context parameter when I need it. About 1 in 20 or 30 of all the commands I issue use whathas, and most cases use very simple regexes without classes or escaped characters, but something like
whathas replace -r
At the start of Part one I made the point that a word, “replace” in this case, is valid as a regex just as much as a tangle of metacharacters, and is more likely to be used in the real world. ‘ Select-String found an old script where I would now use Get-CimInstance to get an object with the last boot-up time as a [datetime] object. The script uses Get-WMIObject which returns data in the format 20210324184546.261638+000 and the script to process it looks like this:

    $win32OS = Get-WmiObject win32_operatingsystem

    $lastbootTime = New-Object -TypeName DateTime -ArgumentList (
        $win32OS.lastbootuptime -replace "^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2}).*",
                                            '$1/$2/$3/$4/$5/$6' -split "/" )

Which turns 20210324184546.261638+000 into 2021/03/24/18/45/46 and then splits it into @(“2021”, “03”, “24”, “18”, “45”, “46”) to give a set of arguments. It’s an interesting technique, but today I’d probably use the style I showed in Part two

    if ($win32OS.lastbootuptime -match "^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2}).*") {
        $lastbootTime = New-Object -TypeName DateTime -ArgumentList $matches[1..6] 
    }

It’s worth saying here that $Matches contains matches which are found by the -Match or NotMatch operators, but if none is found, $matches is not cleared. Generally the construction
if (something -match regex) {do_something_with $matches}
means this is not a problem.

All the groups or sub-matches above consume characters (unlike lookarounds which just peek at them) and we can collect the text from automatically named variables 1,2,3 etc. or set names for the variables, or specify that the group shouldn’t be preserved. But that description glosses over something interesting.

Stacks of matches

One interesting question when working with groups, is “What happens if we have multiple matches for the same group?”. For example,
"ABCD" –match "^(\w)+$"
Returns true, and $matches shows:

Name Value  
---- -----  
1     D  
0     ABCD 
    

Match 1 looks like the last matching item, but things are more subtle, which we can see by using the .NET regex object instead of the PowerShell commands and operators (another way to use expressions), like this:
$r = [regex]"^(\w)+$"
$r.Match("ABCDE")
which returns

Groups :   {0, 1}  
Success :  True  
Name :     0  
Captures : {0}  
Index :    0  
Length :   5  
Value :    ABCDE

In this example, Groups corresponds to the two groups that were visible in $matches. And if we examine group 1 using
$r.Match("ABCDE").groups[1] we get

Success :  True  
Name :     1  
Captures : {A, B, C, D…}  
Index :    3  
Length :   4  
Value :    E  

This is a stack (each match sits on-top-of earlier matches, it doesn’t displace values from a simple variable) and it stores A,B,C,D simply as their position (index), length and value, E also has a name - “1” in this case.

I often write “end-to-end” matches like the one above, starting with ^ and ending with $; if the regex were declared as
$r = [regex]"[AEIOU]" then
$m = $r.Match("ABCDE") would put the following into $m

Groups   : {0}
Success  : True
Name     : 0
Captures : {0}
Index    : 0
Length   : 1
Value    : A

The next match (if there is one) is available with $m.NextMatch() - when we want to know what the matches are not simply that at least one exists calling NextMatch until the Success property is false is helpful, but my “end-to-end” version didn’t allow gaps between the matches and this does: $m.NextMatch() returns a value of “E” and “B”,“C” and “D” are skipped. If we don’t want that to happen, one more special escape character acts as an anchor for this (it dawned on me after publishing parts 1 & 2 that I’d never introduced the term anchor for ^ and $, \a and \z and \b which are places in the string, instead of characters.) \G says the match must begin at the first available characeter, in other words the first one in the string or the next character after a match, so
$r = [regex]"\G[AEIOU]" would return the same first match, but calling NextMatch() is unsuccessful.

Until learning it was a stack, in an operation like "ABCDE" -match "^(\w)+\1$" we might have thought of \1 as “The matching item with the [default] name of ‘1’” but really it is the top item on the stack named “1”. and we can manipulate the stack.

"ABCDA" -match "(?<vowel>[AEIOU])\w+(?<second>\k<vowel>)"
returns True, and $matches is as expected:

Name   Value
----   -----
vowel  A
second A
0      ABCDA
    

But if we replace <second> we can make something new happen.
"ABCDA" -match "(?<vowel>[AEIOU])\w+(?<second-vowel>\k<vowel>)"
still returns true, but now $Matches looks like this:

Name   Value
----   -----
second BCD
0      ABCDA
    

<second-vowel> has pushed the text after the top item in the <vowel> stack and before the match onto the <second> stack and popped the top item off the <vowel> stack leaving it empty – if that stack were already empty, trying to access a non-existant top item would cause the expression to fail.

If we just wrote <-vowel> it would pop the item off the top of the stack (or fail if the stack is empty) without storing any result.

The next example brings these parts together:

"ABCDD" -match @'  
(?x)  
^(\w)+     # from the start of text, push single word characters onto stack 1  
( \1       # end by matching characters with the top item on     
  (?<-1>)  # stack 1 & popping the matched items off the stack    
)+         # repeating as many times as possible  
'@

As shown, with ABCDD, it matches the final D with the previous D, and returns true. If we add some more letters, ABCDDCB returns true, and pops D, C and B in turn off stack <1>, leaving only A. If we put an X after the final B , the expression would ignore it to make a match – to specify that each character to the end must have a balancing item on stack <1> needs a $ after the + sign at the end.

Now, if we wanted to test for a palindrome - that is everything before the middle having a balancing item after it - we could say “Fail as unbalanced if there is anything left in stack <1>” - these patterns are often called balanced or balancing matches. We can see anything that was left in <1> by looking at $Matches but instead of letting the match operation succeed and then failing based on what is in the automatic variable, there is a way to do this inside the regex:
(?(1)xxx) says “evaluate xxx if there is anything on stack <1>” and we can create something which will always fail: (?!) says “Fail if looking ahead you see at least… nothing”. Since nothing or more will always be there that always fails, so (?(1)(?!)) will force a failure if there is anything on stack <1> at the end of processing.

    $palindrome = @'  
    (?x)  
    ^(\w)+     # start and push single word characters onto stack 1  
    .?         # there may be one extra character in the middle  
    (  
      \1       # end by matching items with the top item on stack 1,  
      (?<-1>)  # and popping the matching items off stack 1  
    )+$        # Do this at least once up to the end.  
    $(?(1)(?!) # Fail if there is anything left on stack 1  
    '@

    "ABCDDCBA" -match $palindrome

The script above will return true. If we test whether ABCDDCB matches, the first A remains on the stack causing a failure. If we add something and evaluate ABCDDCBAX, when X is tested the stack will be empty, causing a failure, and if the sequence is wrong, for example evaluating ABCDDCAB then \1 will fail to match, here D and C will match, but the first B will be the top item on the stack when the final A is evaulated. To get a successful match, each letter must have a balancing one in the right place. For a good palindrome test, the expression allows an extra letter in the middle, so ABCDEDCBA will also match.

This logic can be adapted to test whether we have the correct open and closing brackets.

    $openWithClose = @'  
    (?x)  
    ^(?>              # Start and make the repeating group ‘atomic’  
        [^()]+        # Non brackets are a match  
      |(?<open>  \( ) # Open is a match – push onto the “open” stack  
      |(?<-open> \) ) # Close is a match – pop off the “open” stack  
                      # Too many closes will cause a failure here  
    )+$               # Repeat up to the end.  
    (?(open)(?!))     # Fail if there is anything left on the “open” stack  
    '@
     
    "A(BCDE)DCBA" -match $openWithClose

By good fortune $openWithClose -match $openWithClose also returns true but, because the expression is not designed to look for escaped characters, valid regular expressions or other “code” may not match. When the language rules get complicated a proper parser will do better job, partly for reasons we will see in a moment.

Many variations of the last example can be found both on-line and in print and I don’t know where or when it was first published or even where I first saw a version; I can’t claim it as my own, and if I had produced it I might not have thought to use the (?>) construct to stop backtracking.

Backtracking

Going back to the Palindrome example, how does the regex engine know how far along the text it should stop pushing things onto the stack and start popping them off it? Nothing in the expression says so.
"ABABA" -match ^(\w)+.?(\1(?<-1>))+$(?(1)(?!) Finds a match by trying the possibilities like this:

Push A,B,A,B and A onto the stack. This will fail when it attempts to to match \1 because there are no characters left.
Back up and try with A,B,A and B on the stack and matching A against the .? There is still nothing to match \1.
Backtrack again,and try A,B,A and B on the stack and .? empty. The final A doesn’t match the B on the top of stack.
Backtrack again, with A,B and A, on the stack and .? matching B. The final A matches the A on top of the stack, we reach the end but now the whole match fails because the first A and B are left on the stack.
Backtrack yet again, with A,B and A, on the stack and .? empty. The final B doesn’t match the A on top of the stack: fail!.
Backtrack one more time, with A and B on the stack and .? matching the second A. Now the final B matches the B on top of the stack, B is popped off the stack; the expression hasn’t reached the end, so it tests the final A which matches the A on top of the stack, and it reaches the end with an empty stack. Success!

With a long piece of text, and multiple variable-length parts it can take a lot of backtracking to try every possiblity and prove there is no possible match. Some constructions are a recipe for vast amounts of backtracking and one of these is "(.*)*x". Stripped back to “Sequence, squared” the problem is more visible, but the brackets example above hides it in “A sequence or something else, repeated, followed by a condition”. If two consecutive sequences can match part(s) of an expression when they are considered separately and can also match when they are concatenated together, we have the start of a problem.
If we call the two parts “A” and “B” the regex engine will try “AB” and if that fails “A” , “B”.
Adding part “C” we can either attach C to B or keep C as a separate item - doubling the number of options , “ABC”, “AB”, “C” and “A”, “BC” , “A”,“B”,“C”
When we add “D” we have those four, either with D attached to C or D as a separate item… each extra letter doubles the number of possibilities to test! This little scrap of PowerShell illustrates it.

    $r = [regex]::new("(.*)+x", "Compiled")  
    $s = ""  
    foreach ($i in (1..24)) {  
        $s+= "A" ;  
        "$i = " + (measure-command {$r.IsMatch($s)}).TotalMilliseconds.tostring("N1")
    }

Using the [regex] object with the compiled option means we can evaluate the execution time without any setup, and on my computer even the 1024 permutations of 11 letters processed in under 1ms. With a 21-character string (a million permutations) the execution time is ~200ms. The time goes up a by factor of 2 with each character, or 2^10 (roughly 1000) with every 10 characters, so a 31-character string takes something over 3 minutes (200 seconds). 41 characters or 200,000 seconds will run for days.
Using this expression to test for valid lines of 20 characters or so which end with a “)” might work with good data and anticipated bad data, but if it the bad data is two lines that have merged to from a 40-character bad line, a script will hang when it tries to validate it.

It is possible to set a matchtimeout value on the [regex] object and the IsMatch method will throw an error if this time is exceeded. And it’s also possible to specify the sequence differently so that two sub-sequences won’t match; but a bit of human intelligence says that in this case anything that fails as whole will still fail however it might be sub-divided, and (?>) tells the regex engine “don’t backtrack over this part”. Changing the regex in the declaration above to
$r = [regex]::new("(?>.*)+x" ,"Compiled")
and running the test again shows it handles 20 or 200 or 2000 characters as quickly as 2.

This is exactly what the brackets example needs; changing the grouping(s) of the characters which matched [^()]+ will not change the result, and all the alternatives in the repeating block are also atomic, so we can tell the engine the whole block should be atomic – that makes the difference between usability and what one (useful) web site about regex describes as “Catastrophic backtracking”. Potentially very long blocks of text for .* also require many tests. At the start I mentioned avoiding multi-line regexes with something.*AnotherThing when the source text is long. To support “something” and “anotherthing” in either order it needs to be something.*AnotherThing|AnotherThing.*something. If only one part is present, everywhere else in the file is a potential starting point for the second and must be checked, and if the one found term is present multiple times the failing search for the missing one is repeated. This scales up with linearly with the size of the text, so it’s inefficient but not Catastrophic.

I also mentioned language parsers above because they will handle logic like “inside double quote marks, open brackets need to be closed only if they are proceeded by $” which make for very awkward regular expressions. Parsers can also avoid a lot of backtracking – which might make regex unworkable. Some regex engines support recursive expressions, but the one in .NET does not, so it is difficult to create a regex for something like a PowerShell pipeline which is recursively defined as “either an Expression or a Command or a pipeline piped into a Command”.

Combining the stack with look-arounds

The brackets example above is designed for a single opening character and a different closing one; but what about testing for multiple types of brackets and perhaps open and close quotation marks as well? I found something near to what I wanted and developed it into the following:

    $opensAndCloses = @"
    (?x)
    ^(?>                          # Start and make the repeating group ‘atomic’
          ""                      # "" (open/close) is a match.
        | ".*[^\\]"               # "xx", "x(x" & "x\"x" are matches; don't end on \".
        | [^(){}\[\]"]+           # A string of non-open/close chars is a match.
        | (?<Open> [(\[{])        # Any open bracket is a match – push onto the “open” stack.
        | (?<Block-Open> [)\]}])  # For a close bracket, pop the last open off the stack...
          (?<=                    # and push text from open to close onto the "Block" stack,
              \( \k<Block> \)     # but only count it as a sub-match if, looking back
            | \[ \k<Block> \]     # shows <open> Block <matching-close>
            | \{ \k<Block> \}
          )
    )+$                           # Repeat up to the end.
    (?(Open)(?!))                 # Fail if there any opens left on the stack.
    "@

This is mostly things we have seen before, but the important difference is that there are multiple opening characters, so checking we have the right number of closures isn’t enough we need to ensure that each closure balances with an opening, and in something like “{(1+2) / 3} the closest open to “}” is not the balancing “{“. The logic goes.

Consume any text enclosed in double quotes, including open and close characters.
Push the position of any unquoted open onto a stack (and consume the opening character). Don’t worry at this stage what kind of open it is.
Consume all text up to the next quoted block, unquoted open or unquoted close.
At any unquoted close, consume the closing character (don’t worry about the type yet). Fail if there is no open on the stack (there are more closures than opens) - otherwise pop the top open item off the stack and note the block of text between its position and the close in question.
Look back at the consumed text for one of “{<block>}”, “(<block>)” or “[<block>]”) if you don’t find one of them, the opens and closes are mismatched which means a failure.
Repeat 1,2,3 and 4 until the end of the text is reached.
If there is anything left on the stack after processing all the text, there are too many opens, and that’s a failure. Otherwise, if processing reached the end without failing, the whole match is successful.
In the event of a failure DO NOT split blocks of text found at 3 as part of a retry.

Stop Press.

As sometimes happens when working on a post, I had completed the text up to this point and was about to write a conclusion, when someone (@markwragg) threw out a releated question.

Does anyone have a fool proof way to turn a string of #PowerShell parameters into a hashtable? For the most part I can split on “ -“ but if I have a parameter that is being fed another cmdlet then that doesn’t work (e.g: -pass (convertto-securestring "pass" -asplaintext -force)

Well… the last regular expression could be adapted to not just look for quoted text, and match opens and closes, but also to recognize PowerShell parameters. The expression above fails if anything is left on the open stack, but leaves things on the block stack, which could be useful, and parameters could go onto another stack. If we test a string and its quotes or brackets aren’t all matched we’ll get a failure, otherwise we can have one group with all the parameters, even the ones inside bracketted blocks, and a second group with all the blocks - which we can use to thin out the parameters to just those which aren’t inside any block. Once we know where each parameter starts and ends, it’s a string slicing operation (and one which isn’t easier as regex) to get the values being passed.

You can see below the code to implement this looks for strings wrapped in single or double quotes - and is told that string matches as little as possible up to a closing quote - but to be a closing quote " can’t be preceded by ` or " and ' cannot be preceded by '. This means the code needs to test for "" and '' as empty strings outside of quote marks. The - sign denotes a parameter if followed by a word character but not proceeded by one (which is not always true - in -1 it doesn’t need a space to function as a subtraction operator or to make 1 a a negative number, but for our purpose here it will work). Other - signs need to be consumed. I made specific opens for $( , ${, $[, @(, @{, and @[ so that the @ or $ become part of the block. In the future I might address multi-line commands with here strings running @"..."@

function SplitEmUp {
param (
    $paramString
)
    #This regex will isolate quoted blocks e.g.  " -example )( $quote "

    # and will identify matching open and close () {} and [] even if nested.

    $opensAndCloses = @'
(?x)                              # Allow comments and ignore spaces and line breaks in the layaout.
    ^(?>                          # Start and make the repeating group ‘atomic’.
          ""                      # Consume "" and '' (open/close). Next, deal with quoted stuff inc escapes.
        | ''
        | (?<Quoted>".*?[^"`]")   # Match (and stack) "xx", "x(x" or "x`"x"; don't end on "" or `".
        | (?<Quoted>'.*?[^']')    # Match (and stack) 'xx', 'x(x' & 'x''x'; don't end on ''.
        | (?<!\w)(?<Param>-\w+)   # Match (and stack) pwsh style -ParameterName.
        | -(?!\w)|(?<=\w)-        # Consume a - when it is not starting a parameter,
        | [^-$@(){}\[\]"']+       #     and any string of non-open/close chars,
        | [$@](?![({\[])          #     and $ or @ when not followed by an open bracket.
        | (?<Open> [$@]?[(\[{])   # Match ( { [ or @( or $( etc. & push onto the "open" stack.
        | (?<Block-Open> [)\]}])  # For a close bracket, pop the last open off the stack...
          (?<=                    #   and push text from open to close onto the "Block" stack; but,
              \( \k<Block> \)     # IT ONLY MATCHES IF looking back shows <open> Block <matching-close>
            | \[ \k<Block> \]
            | \{ \k<Block> \}
          )
    )+$                           # Repeat up to the end.
    (?(Open)(?!))                 # Fail if anything reamins on"open", (we want "Block", "Param" & "Quoted").
'@

    $r       = [regex]$opensAndCloses
    $g       = $r.Matches($paramString).Groups

    $ignore  = $g.where({$_.name -eq 'Block'}).captures
    $params  = @() + $g.where({$_.name -eq 'param'}).captures |
            Where-Object {$i = $_.index; -not $ignore.where({$_.index -le $i -and ($_.index +$_.length) -ge $i})}
    $params += [pscustomobject]@{index=$paramString.Length }

    $i       = 0
    $hash = [ordered]@{}
    while ($params[$i].value) { 
        $pValueStart  = $params[$i].index + $params[$i].Length
        $pValueLength =  $params[$i+1].index - $pvalueStart
        $hash[$params[$i].value -replace "^-",''] = $paramString.Substring($pValueStart, $pValueLength).Trim()
        $i ++
    }

    $hash
}

SplitEmUp '-prefix $(prefix) -nt0applicationStartPort $(ntappIicationStartPort) -nt0applicationEndPort $(ntapplicationEndPort)    
     -_artifactsLocation $(StorageContainerUri) -_artifactsLocationSasToken (ConvertTo-SecureString ''$(StarageCantainerSAS)''    
    -AsPlainText -Force) -clusterName'

Name                           Value
----                           -----
prefix                         $(prefix)
nt0applicationStartPort        $(ntappIicationStartPort)
nt0applicationEndPort          $(ntapplicationEndPort)
_artifactsLocation             $(StorageContainerUri)
_artifactsLocationSasToken     (ConvertTo-SecureString '$(StarageCantainerSAS)' -AsPlainText -Force)
clusterName
    

After the regex is parocessed $g holds groups, one should be named BLOCK and hold everything wrapped in {} () or [], it doesn’t matter that (a + (b +c) ) puts the inner and outer blocks into the stack, all we care about is ignoring any parametrs which start inside any block. If they start in a nested block we’ll ignore them twice. The lines begining with $ignore and $params address this.
The second $params line puts an “end stop” into the array of parameters and the while loop cycles through $params until it reaches that endstop (which can be recognised by having no value property) slicing the parameter names and values out of the string and using them to build an ordered hash table which is output at the end.
The example code came from Mark and was the first case I tested.

I started part one with what I keep calling a monster regex, those leading up to this example have grown worse: manipulating stacks of results like this means the regex is really a program rather than a description of text, and the results of running is a set of complex objects, which form the input to more code.