I had an email last week from a mate about using regular expressions to replace matched occurrences in a string.
Starting with the string "xabcy and then stuff xwefy", they wanted to find any words beginning with x and ending in y, remove the x and y, and capitalise what remains of the word, resulting in "ABC and then stuff WEF".
I find that if I don’t use regular expressions regularly (sorry) it’s a syntax that I forget easily. One of the best tools that I’ve come across for building expressions is Roy Osherove’s The Regulator. It’s easy to use, with just enough IntelliSense to get me back into the swing of using expressions. It is also linked to the RegexLib.com online expression database.
This is the expression that I came up with:
"(?:x)(?<stuff>\w*)(?:y)(?:\b)";
I find it’s important to group parts of the expression together.
Starting from the beginning of the expression:
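(?:x) - Finds the letter x. The (?: ) makes this a non-capturing group: the x still has to be present and is still part of the overall match, but it is not stored as a group of its own in the matched results.
(?<stuff>\w*) - \w* matches any number of word characters (letters, digits and underscore), so it stops at whitespace or punctuation; I’ve called this group “stuff”. I will use this later in the example.
(?:y) - Like the group that matches the x, this finds the letter y. This group is also non-capturing.
(?:\b) - This matches a word boundary: the zero-width position between a word character and a non-word character such as a space, or the start or end of the string. Again this is non-capturing, and it consumes no characters in any case.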
If you run this through The Regulator using the string above, you will get two matches, xabcy and xwefy, with the stuff group holding abc and wef.
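As a rough check outside The Regulator, the same pattern can be run from C#. This is only a minimal sketch: the class and method names (StuffFinder, FindStuff, FindDemo) are made up for illustration, and all it does is collect what the stuff group captured for each match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class StuffFinder
{
    // Hypothetical helper: returns the text captured by the "stuff" group for every match.
    public static List<string> FindStuff(string input)
    {
        const string pattern = @"(?:x)(?<stuff>\w*)(?:y)(?:\b)";
        var found = new List<string>();
        foreach (Match match in Regex.Matches(input, pattern))
        {
            found.Add(match.Groups["stuff"].Value);
        }
        return found;
    }
}

class FindDemo
{
    static void Main()
    {
        foreach (string stuff in StuffFinder.FindStuff("xabcy and then stuff xwefy"))
        {
            Console.WriteLine(stuff); // abc, then wef
        }
    }
}
```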
I know that there is a way of using ECMAScript to do the manipulation, but to be honest it’s not something that I’ve got my head around.
The .NET Regex.Replace method has an overload that takes a MatchEvaluator delegate: each match found in the input string is passed to your function, and whatever string the function returns is used as the replacement.
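A minimal sketch of that overload follows. The class and method names (XyWords, Capitalise, UpperStuff, ReplaceDemo) are hypothetical. The input string is also an assumption: it has been reconstructed so that the output agrees with the result quoted just below, and is not taken from any original listing.

```csharp
using System;
using System.Text.RegularExpressions;

static class XyWords
{
    const string Pattern = @"(?:x)(?<stuff>\w*)(?:y)(?:\b)";

    // Called once per match; the returned string replaces the whole match (x and y included).
    static string UpperStuff(Match match)
    {
        return match.Groups["stuff"].Value.ToUpperInvariant();
    }

    public static string Capitalise(string input)
    {
        return Regex.Replace(input, Pattern, new MatchEvaluator(UpperStuff));
    }
}

class ReplaceDemo
{
    static void Main()
    {
        // Assumed input, reconstructed from the quoted output.
        string input = "xabcy and stuff xwefy and more stuff xcvfyhgf more xdfry longer string xdffghjjy";
        Console.WriteLine(XyWords.Capitalise(input));
    }
}
```

Note that xcvfyhgf is left alone: its only y is followed by more word characters, so the \b after the y never matches.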
The above example will return
"ABC and stuff WEF and more stuff xcvfyhgf more DFR longer string DFFGHJJ"
I prefer using the group name rather than the index of the matched groups, for the same reason that I prefer using named fields when referencing recordset columns: the index number might change if you change the SQL statement or expression.
And this gets around a gotcha. If you just reference the match object, then even if you ignore parts of the string it will return the whole of the matched result, i.e. xabcy. The other thing is that the first object in the Match.Groups collection (match.Groups[0]) is also the whole matched result. My stuff group is actually match.Groups[1].
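The difference is easy to see by printing each view of a single match. A small sketch, with the class name GroupViews and its Describe method invented for the example:

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupViews
{
    // Hypothetical helper: returns the different views of the first match in the input.
    public static string[] Describe(string input)
    {
        Match match = Regex.Match(input, @"(?:x)(?<stuff>\w*)(?:y)(?:\b)");
        return new[]
        {
            match.Value,                 // the whole match, x and y included
            match.Groups[0].Value,       // group 0 is also the whole match
            match.Groups[1].Value,       // the stuff group, by index
            match.Groups["stuff"].Value  // the stuff group, by name
        };
    }
}

class GroupDemo
{
    static void Main()
    {
        // Prints xabcy, xabcy, abc, abc (one per line).
        foreach (string view in GroupViews.Describe("xabcy and then stuff xwefy"))
        {
            Console.WriteLine(view);
        }
    }
}
```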
There is no reason why I couldn’t capture the X, Y and word boundary and then deal with removing these using the MatchEvaluator, but I prefer to ignore these as soon as possible.
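For comparison, here is roughly what that alternative might look like: a plain pattern with no groups at all, where the evaluator trims the first and last characters of each match itself. The names (TrimInEvaluator, Capitalise, TrimDemo) are hypothetical, and this is only a sketch of the idea, not the approach I actually used.

```csharp
using System;
using System.Text.RegularExpressions;

static class TrimInEvaluator
{
    public static string Capitalise(string input)
    {
        // Each match is x + word + y, so drop one character from each end before upper-casing.
        return Regex.Replace(input, @"x\w*y\b",
            m => m.Value.Substring(1, m.Value.Length - 2).ToUpperInvariant());
    }
}

class TrimDemo
{
    static void Main()
    {
        Console.WriteLine(TrimInEvaluator.Capitalise("xabcy and then stuff xwefy")); // ABC and then stuff WEF
    }
}
```

It works, but the evaluator now has to know the shape of the pattern, which is why I would rather let the named group do the work.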