A PowerShell users' guide to regular expressions: part 3 (of 3).
Rounding up some regex tricks
Part one of this series looked at the terms which went into a regular expression to describe text of interest. and Part two was mostly about groups; I said that the instead of string methods, I often use a regex group to get the required part of string: here are some examples from files in my PowerShell folder.
... $path -replace '^(.*)/.+?$','$1'
$employeeID = $_.name -replace "^(\d+)\.jpg$",'$1'
$ExtensionNo = [int]($adUser.OfficePhone -replace "^.*(\d{4})$",'$1')
$Identity = ($DisplayName.toupper() -replace "^(\w)(.*\s+)([\w'-]*)$",'$1$3' ) -replace "\W",""
$groupName = $g -replace "^CN=DEPT_(.*?),ou=HR.*$",'$1'
$clusterResType = $s.where( {$_ -match "^\s*ResourceType ="} ) -replace '^.*"(.*)".*$','$1'
I try to explain any regex in a script with a comment - 30 seconds to type a comment can save 5 minutes unpicking the expression- and these expression are:
(1) Get the parent part of a path - the part before the final /something.
(2) in a folder of employee photos extract the employee number from each file-name to add the photo to the correct user in Active Directory.
(3) Convert DDI phone numbers or “Ext 1234” into a 4 digit extension number,
(4) Convert a display name like “James O’Neill” first to uppercase, then to JO’NEILL and finally to JONEILL
(5) Isolate the departmental names like “Sales” from an Active directory distinguished name like CN=DEPT_Sales=ou=HR....
(all the departmental groups were owned by HR).
And (6) find a line in a file which reads ResourceType = "something"
and isolate the something.
Although the majority of work with regex in PowerShell is done with the -match
and replace
operators, they are not the only ways to use expressions. I used the Select-String
command to find the ones above; Select-String
doesn’t have a -recurse
option and going back at least 10 years my PowerShell profile has contained a function “Whathas” , which looks like this:
Select-String
takes file objects OR text as input, so I select the files (recursively if necessary) and
pipe them in. It returns [MatchInfo]
objects which hold the path and file name, the line of text
containing the match and its number, and details of what matched. Sometimes I just want the
matching text so I have a -BareMatch
switch; sometimes I want to know which files have a match
without worrying about where matches occur, which is what the -List
switch returns, so
Whathas something -list | whathas anotherthing -list
will find files with both strings - even if far appart (if you wonder “why not use a multi-line regex with something.*anotherthing
- read on) . The command
Whathas something "list|anotherthing" -list
finds files with either or both strings. Occasionally I want to know the lines on either side of the match and use Select-String
natively with its -Context
parameter when I need it. About 1 in 20 or 30 of all the commands I issue use whathas
, and most cases use very simple regexes without classes or escaped characters, but something like
whathas replace -r
At the start of Part one I made the point that a word, “replace” in this case,
is valid as a regex just as much as a tangle of metacharacters, and is more likely to be used in the real world.
‘
Select-String
found an old script where I would now use Get-CimInstance
to get an object with the last boot-up time
as a [datetime]
object. The script uses Get-WMIObject
which returns data in the format 20210324184546.261638+000
and
the script to process it looks like this:
Which turns 20210324184546.261638+000
into 2021/03/24/18/45/46
and then splits
it into @(“2021”, “03”, “24”, “18”, “45”, “46”)
to give a set of arguments. It’s an interesting technique, but
today I’d probably use the style I showed in Part two
It’s worth saying here that $Matches
contains matches which are found by the -Match
or NotMatch
operators, but if none is found, $matches
is not cleared. Generally the construction
if (something -match regex) {do_something_with $matches}
means this is not a problem.
All the groups or sub-matches above consume characters (unlike lookarounds which just peek at them) and we can collect the text from automatically named variables 1,2,3 etc. or set names for the variables, or specify that the group shouldn’t be preserved. But that description glosses over something interesting.
Stacks of matches
One interesting question when working with groups, is “What happens if we have multiple matches for the same group?”. For example,
"ABCD" –match "^(\w)+$"
Returns true
, and $matches
shows:
Name Value
---- -----
1 D
0 ABCD
Match 1
looks like the last matching item, but things are more subtle, which we can see by using the .NET regex object instead of the PowerShell commands and operators (another way to use expressions), like this:
$r = [regex]"^(\w)+$"
$r.Match("ABCDE")
which returns
Groups : {0, 1}
Success : True
Name : 0
Captures : {0}
Index : 0
Length : 5
Value : ABCDE
In this example, Groups
corresponds to the two groups that were visible in $matches
. And if we examine group 1
using
$r.Match("ABCDE").groups[1]
we get
Success : True
Name : 1
Captures : {A, B, C, D…}
Index : 3
Length : 4
Value : E
This is a stack (each match sits on-top-of earlier matches, it doesn’t displace values from a simple variable) and it stores A,B,C,D simply as their position (index), length and value, E also has a name - “1” in this case.
I often write “end-to-end” matches like the one above, starting with ^
and ending with $
; if the regex were declared as
$r = [regex]"[AEIOU]"
then
$m = $r.Match("ABCDE")
would put the following into $m
Groups : {0}
Success : True
Name : 0
Captures : {0}
Index : 0
Length : 1
Value : A
The next match (if there is one) is available with $m.NextMatch()
- when we want to know what the matches are not simply that at least one exists calling NextMatch
until the Success
property is false
is helpful, but my “end-to-end” version didn’t allow gaps between the matches and this does: $m.NextMatch()
returns a value of “E” and “B”,“C” and “D” are skipped. If we don’t want that to happen, one more special escape character acts as an anchor for this (it dawned on me after publishing parts 1 & 2 that I’d never introduced the term anchor for ^
and $
, \a
and \z
and \b
which are places in the string, instead of characters.) \G
says the match must begin at the first available characeter, in other words the first one in the string or the next character after a match, so
$r = [regex]"\G[AEIOU]"
would return the same first match, but calling NextMatch()
is unsuccessful.
Until learning it was a stack, in an operation like "ABCDE" -match "^(\w)+\1$"
we might have thought of \1
as
“The matching item with the [default] name of ‘1’” but really it is the top item
on the stack named “1”. and we can manipulate the stack.
"ABCDA" -match "(?<vowel>[AEIOU])\w+(?<second>\k<vowel>)"
returns True
, and $matches
is as expected:
Name Value
---- -----
vowel A
second A
0 ABCDA
But if we replace <second>
we can make something new happen.
"ABCDA" -match "(?<vowel>[AEIOU])\w+(?<second-vowel>\k<vowel>)"
still returns true
, but now $Matches
looks like this:
Name Value
---- -----
second BCD
0 ABCDA
<second-vowel>
has pushed the text after the top item in the <vowel>
stack and
before the match onto the <second>
stack and popped the top item off the
<vowel>
stack leaving it empty – if that stack were already empty, trying to
access a non-existant top item would cause the expression to fail.
If we just wrote <-vowel>
it would pop the item off the top of the stack (or
fail if the stack is empty) without storing any result.
The next example brings these parts together:
"ABCDD" -match @'
(?x)
^(\w)+ # from the start of text, push single word characters onto stack 1
( \1 # end by matching characters with the top item on
(?<-1>) # stack 1 & popping the matched items off the stack
)+ # repeating as many times as possible
'@
As shown, with ABCDD
, it matches the final D with the previous D, and returns true
. If
we add some more letters, ABCDDCB
returns true, and pops D, C and B in turn off stack
<1>
, leaving only A. If we put an X after the final B , the expression would
ignore it to make a match – to specify that each character to the end must have a
balancing item on stack <1>
needs a $
after the +
sign at the end.
Now, if we wanted to test for a palindrome - that is everything before the middle having
a balancing item after it - we could say “Fail as unbalanced if there is
anything left in stack <1>
” - these patterns are often called balanced or balancing matches.
We can see anything that was left in <1>
by looking at $Matches
but instead of letting the
match operation succeed and then failing based on what is in the automatic
variable, there is a way to do this inside the regex:
(?(1)xxx)
says “evaluate xxx if there is anything on stack <1>
” and we can create
something which will always fail: (?!)
says “Fail if looking ahead you see
at least… nothing”. Since nothing or more will always be there that always
fails, so (?(1)(?!))
will force a failure if there is anything on stack <1>
at the end
of processing.
The script above will return true
. If we test whether ABCDDCB
matches,
the first A remains on the stack causing a failure.
If we add something and evaluate ABCDDCBAX
, when X is tested the stack will be empty, causing
a failure, and if the sequence is wrong, for example evaluating ABCDDCAB
then \1
will fail to match, here D and C will match, but the first B
will be the top item on the stack when the final A is evaulated. To get a successful match, each
letter must have a balancing one in the right place. For a good palindrome test, the expression allows
an extra letter in the middle, so ABCDEDCBA
will also match.
This logic can be adapted to test whether we have the correct open and closing brackets.
By good fortune $openWithClose -match $openWithClose
also returns true
but,
because the expression is not designed to look for escaped characters, valid
regular expressions or other “code” may not match. When the language rules get
complicated a proper parser will do better job, partly for reasons we
will see in a moment.
Many variations of the last example can be found both on-line and in print and I don’t know
where or when it was first published or even where I first saw a version; I can’t claim it as my own, and
if I had produced it I might not have thought to use the (?>)
construct to stop
backtracking.
Backtracking
Going back to the Palindrome example, how does the regex engine know how far along the text it
should stop pushing things onto the stack and start popping them off it? Nothing in the expression says so.
"ABABA" -match ^(\w)+.?(\1(?<-1>))+$(?(1)(?!)
Finds a match by trying the possibilities like this:
- Push A,B,A,B and A onto the stack. This will fail when it attempts to to match
\1
because there are no characters left. - Back up and try with A,B,A and B on the stack and matching A against the
.?
There is still nothing to match\1
. - Backtrack again,and try A,B,A and B on the stack and
.?
empty. The final A doesn’t match the B on the top of stack. - Backtrack again, with A,B and A, on the stack and
.?
matching B. The final A matches the A on top of the stack, we reach the end but now the whole match fails because the first A and B are left on the stack. - Backtrack yet again, with A,B and A, on the stack and
.?
empty. The final B doesn’t match the A on top of the stack: fail!. - Backtrack one more time, with A and B on the stack and
.?
matching the second A. Now the final B matches the B on top of the stack, B is popped off the stack; the expression hasn’t reached the end, so it tests the final A which matches the A on top of the stack, and it reaches the end with an empty stack. Success!
With a long piece of text, and multiple variable-length parts it can take a lot of backtracking to try every possiblity
and prove there is no possible match. Some constructions are a recipe for vast amounts of backtracking and one of
these is "(.*)*x"
. Stripped back to “Sequence, squared” the
problem is more visible, but the brackets example above hides it in “A sequence or
something else, repeated, followed by a condition”. If two consecutive sequences
can match part(s) of an expression when they are considered separately and can also match
when they are concatenated
together, we have the start of a problem.
If we call the two parts “A” and “B” the regex engine will try “AB” and if that fails “A” , “B”.
Adding part “C” we can either attach C to B or keep C
as a separate item - doubling the number of options , “ABC”, “AB”, “C” and “A”, “BC” , “A”,“B”,“C”
When we add “D” we have those four, either with D attached to C or D as a separate
item… each extra letter doubles the number of possibilities to test! This little scrap of PowerShell illustrates it.
Using the [regex]
object with the compiled
option means we can evaluate the
execution time without any setup, and on my computer even the 1024 permutations
of 11 letters processed in under 1ms. With a 21-character string (a million permutations) the execution
time is ~200ms. The time goes up a by factor of 2 with each character, or 2^10 (roughly 1000)
with every 10 characters, so a 31-character string takes something over 3 minutes (200 seconds).
41 characters or 200,000 seconds will run for days.
Using this expression to test for valid lines of 20 characters or so which end with a “)” might
work with good data and anticipated bad data, but if it the bad data is two lines that have
merged to from a 40-character bad line, a script will hang when it tries to validate it.
It is possible to set a matchtimeout
value on the [regex]
object and the IsMatch
method will
throw an error if this time is exceeded. And it’s also possible to specify the sequence differently
so that two sub-sequences won’t match; but a bit of human intelligence says that in this case anything that fails as
whole will still fail however it might be sub-divided, and (?>)
tells the
regex engine “don’t backtrack over this part”. Changing the regex in the declaration above to
$r = [regex]::new("(?>.*)+x" ,"Compiled")
and running the test again shows it handles 20 or 200 or 2000 characters as quickly as 2.
This is exactly what the brackets example needs; changing the grouping(s) of the
characters which matched [^()]+
will not change the result, and all the
alternatives in the repeating block are also atomic, so we can tell the engine
the whole block should be atomic – that makes the difference
between usability and what one (useful) web site about regex describes as
“Catastrophic backtracking”. Potentially very long blocks of text for .* also require many tests. At the start I mentioned avoiding multi-line regexes with something.*AnotherThing
when the source text is long. To support “something” and “anotherthing” in either order it needs to be something.*AnotherThing|AnotherThing.*something
. If only one part is present, everywhere else in the file is a potential starting point for the second and must be checked, and if the one found term is present multiple times the failing search for the missing one is repeated. This scales up with linearly with the size of the text, so it’s inefficient but not Catastrophic.
I also mentioned language parsers above because they will handle logic like “inside double quote
marks, open brackets need to be closed only if they are proceeded by $
” which make for very awkward
regular expressions. Parsers can also avoid a lot of backtracking – which might
make regex unworkable. Some regex engines support recursive expressions, but the
one in .NET does not, so it is difficult to create a regex for something like a
PowerShell pipeline which is recursively defined as “either an Expression or a
Command or a pipeline piped into a Command”.
Combining the stack with look-arounds
The brackets example above is designed for a single opening character and a different closing one; but what about testing for multiple types of brackets and perhaps open and close quotation marks as well? I found something near to what I wanted and developed it into the following:
This is mostly things we have seen before, but the important difference is that there are multiple opening characters, so checking we have the right number of closures isn’t enough we need to ensure that each closure balances with an opening, and in something like “{(1+2) / 3} the closest open to “}” is not the balancing “{“. The logic goes.
- Consume any text enclosed in double quotes, including open and close characters.
- Push the position of any unquoted open onto a stack (and consume the opening character). Don’t worry at this stage what kind of open it is.
- Consume all text up to the next quoted block, unquoted open or unquoted close.
- At any unquoted close, consume the closing character (don’t worry about the
type yet). Fail if there is no open on the stack (there are more
closures than opens) - otherwise pop the top open item off the stack and
note the block of text between its position and the close in question.
Look back at the consumed text for one of “{<block>}”, “(<block>)” or “[<block>]”) if you don’t find one of them, the opens and closes are mismatched which means a failure. - Repeat 1,2,3 and 4 until the end of the text is reached.
- If there is anything left on the stack after processing all the text, there are too many opens, and that’s a failure. Otherwise, if processing reached the end without failing, the whole match is successful.
- In the event of a failure DO NOT split blocks of text found at 3 as part of a retry.
Stop Press.
As sometimes happens when working on a post, I had completed the text up to this point and was about to write a conclusion, when someone (@markwragg) threw out a releated question.
Does anyone have a fool proof way to turn a string of #PowerShell parameters into a hashtable? For the most part I can split on “ -“ but if I have a parameter that is being fed another cmdlet then that doesn’t work (e.g:
-pass (convertto-securestring "pass" -asplaintext -force)
Well… the last regular expression could be adapted to not just look for quoted text, and match opens and closes, but also to recognize PowerShell parameters. The expression above fails if anything is left on the open
stack, but leaves things on the block
stack, which could be useful, and parameters could go onto another stack. If we test a string and its quotes or brackets aren’t all matched we’ll get a failure, otherwise we can have one group with all the parameters, even the ones inside bracketted blocks, and a second group with all the blocks - which we can use to thin out the parameters to just those which aren’t inside any block. Once we know where each parameter starts and ends, it’s a string slicing operation (and one which isn’t easier as regex) to get the values being passed.
You can see below the code to implement this looks for strings wrapped in single or double quotes - and is told that string matches as little as possible up to a closing quote - but to be a closing quote "
can’t be preceded by `
or "
and '
cannot be preceded by '
. This means the code needs to test for ""
and ''
as empty strings outside of quote marks. The -
sign denotes a parameter if followed by a word character but not proceeded by one (which is not always true - in -1
it doesn’t need a space to function as a subtraction operator or to make 1
a a negative number, but for our purpose here it will work). Other -
signs need to be consumed. I made specific opens for $(
, ${
, $[
, @(
, @{
, and @[
so that the @
or $
become part of the block. In the future I might address multi-line commands with here strings running @"..."@
Name Value
---- -----
prefix $(prefix)
nt0applicationStartPort $(ntappIicationStartPort)
nt0applicationEndPort $(ntapplicationEndPort)
_artifactsLocation $(StorageContainerUri)
_artifactsLocationSasToken (ConvertTo-SecureString '$(StarageCantainerSAS)' -AsPlainText -Force)
clusterName
After the regex is parocessed $g
holds groups, one should be named BLOCK and hold everything wrapped in {} () or [], it doesn’t matter that (a + (b +c) )
puts the inner and outer blocks into the stack, all we care about is ignoring any parametrs which start inside any block. If they start in a nested block we’ll ignore them twice. The lines begining with $ignore
and $params
address this.
The second $params
line puts an “end stop” into the array of parameters and the while
loop cycles through $params
until it reaches that endstop (which can be recognised by having no value
property) slicing the parameter names and values out of the string and using them to build an ordered hash table which is output at the end.
The example code came from Mark and was the first case I tested.
I started part one with what I keep calling a monster regex, those leading up to this example have grown worse: manipulating stacks of results like this means the regex is really a program rather than a description of text, and the results of running is a set of complex objects, which form the input to more code.