Regular expressions are incredibly powerful, yet most developers don't
really understand when or how to use them. This tutorial will take you
from regular expression noob to master. (Well OK, maybe not master, but
you will gain quite a bit of knowledge!)

What is Regular Expression Syntax?

It's basically a small language used to find things in strings. Think
of something complex, like the HTML for a web page. Suppose you wanted
to get only part of the page and return it. Sound simple? Trust me, it's
harder than you think once you start looking at real-world examples,
but this is an area where regular expressions shine. (More on that
later.)

Testing your Expressions

.NET, for some strange reason, doesn't come with a regular expression
test client. Since regular expression syntax can be tricky to get right,
you should always test your expressions before you put them in your
code. As a matter of fact, I prefer to test mine incrementally: I write
part of an expression and test that part, and when I'm satisfied it
returns what I want I add more and more until I have the whole
expression. Basically it's just like coding anything else. You don't
write the whole thing, press run and expect it to work, do ya?

There are many applications on the web you can download that let you
test regular expressions, but I prefer an extremely small test
application where little can go wrong. Here is what I do.

  1. Create a new Windows Forms application.
  2. Add a TextBox to your form called "txtExpression".
  3. Add two more text boxes to your form and set the Multiline property
    on both of them to true. Name one "txtData" and the other "txtResults".
  4. Add this line to the top of your code:

    using System.Text.RegularExpressions;

  5. Then add a button and put this code in the button's Click handler:

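    // Clear the output from the previous run.
    txtResults.Clear();

    // Singleline lets "." match line breaks; IgnoreCase ignores letter case.
    Regex r = new Regex(txtExpression.Text, RegexOptions.Singleline | RegexOptions.IgnoreCase);

    MatchCollection col = r.Matches(txtData.Text);

    // Print every group of every match, prefixed with the group index.
    foreach (Match m in col)
    {
        for (int i = 0; i < m.Groups.Count; i++)
        {
            txtResults.Text += "(" + i + ")" + m.Groups[i].Value + "\r\n";
        }
    }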

Congrats, you now have a very quick and dirty regular expression test
client! Paste your data in the txtData TextBox, put your regular
expression in the txtExpression TextBox, and when you press the button
the result of the operation will go in the txtResults TextBox. Easy,
isn't it?
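
If you'd rather not build a form, the same loop can be sketched as a
small console program. This is only a sketch, assuming a C# 9 or later
compiler (it uses top-level statements). RunExpression is a made-up
helper name for this example, not something that ships with .NET.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: formats matches the same way the button handler does.
static string RunExpression(string pattern, string data)
{
    var output = new StringBuilder();
    var r = new Regex(pattern, RegexOptions.Singleline | RegexOptions.IgnoreCase);

    foreach (Match m in r.Matches(data))
    {
        for (int i = 0; i < m.Groups.Count; i++)
        {
            output.Append("(" + i + ")" + m.Groups[i].Value + "\n");
        }
    }
    return output.ToString();
}

// A literal pattern; IgnoreCase makes it match both "Cat" and "cat".
Console.Write(RunExpression("cat", "Cat and cat"));
// prints:
// (0)Cat
// (0)cat
```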

Making your first expression

OK, let's get on with it. Suppose you have a string like this:

The date of that transaction is 2006-12-02 MST.

And you want to get the date out of it. We can write an expression such as this:

\d{4}-\d{2}-\d{2}

When you put that in our test client and smack the button, you get the
date back with a "(0)" at the front. The test client does this because
a regular expression can return more than one thing per match, but for
now just ignore the (0) at the front. I'll go into this more later.

So what is that junk? Let's take a peek. "\d" means digit, so it
matches a single digit. The next part, "{4}", tells the previous part
how many times to match, so together they say: look for 4 digits. That
matches 2006 in our example. The next part is "-", which is just a
literal and matches a "-" in the target data. You can see the digit
part being used again for two more digits, and again for another two.
So this says: find 4 digits, a "-", 2 digits, a "-" and 2 more digits.
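
In code, the same check might look something like the sketch below. It
uses the static Regex.Match method instead of our test client, and the
second input string is one I made up to show a non-match. It assumes a
C# 9 or later compiler (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"\d{4}-\d{2}-\d{2}";

Match found = Regex.Match("The date of that transaction is 2006-12-02 MST.", pattern);
Console.WriteLine(found.Success); // True
Console.WriteLine(found.Value);   // 2006-12-02

// Only two digits for the year, so the "\d{4}" part cannot match.
Match missing = Regex.Match("The date of that transaction is 06-12-02 MST.", pattern);
Console.WriteLine(missing.Success); // False
```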

Try the same expression against this data:

The date of that transaction is 2006-12-02 MST.
The date of that transaction is 2006-12-15 MST.

You will notice that two results are returned, one for each date. It's
easy to use regular expressions to pull any number of items like that
out of a list. Very helpful indeed.
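
Here is a rough sketch of pulling both dates out in code with
Regex.Matches, which returns every match rather than just the first.
It assumes a C# 9 or later compiler (top-level statements); the
variable names are mine.

```csharp
using System;
using System.Text.RegularExpressions;

string data = "The date of that transaction is 2006-12-02 MST.\n" +
              "The date of that transaction is 2006-12-15 MST.";

// Matches() keeps scanning after each hit, so both dates come back.
MatchCollection dates = Regex.Matches(data, @"\d{4}-\d{2}-\d{2}");

foreach (Match m in dates)
{
    Console.WriteLine(m.Value);
}
// prints:
// 2006-12-02
// 2006-12-15
```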

Wildcards and other fun stuff

OK, so you've got digits. What else can it do, you wonder? Well, let's
look at a more complex example. Take this HTML snippet, for instance:

<h1>Welcome to My Page of Goodness</h1>
Thank you for visiting my page. Pick a topic:<br>
<ul>
<li>Goodness</li>
<li>Extra Goodness</li>
<li>Ultra Goodness</li>
<li>Scooby Goodness and snacks too!</li>
</ul>

Suppose you want to get each of the items in the list. How would you do it?

Well, we can use the <li> tags as markers and get the stuff
between them. As an expression, that looks like this:

<li>.*?</li>

The "." matches any single character (except a line break, unless the
Singleline option is on, as it is in our test client), and the "*" is a
quantifier that means 0 or more of the previous item. (Yeah, I know, I
could have used "+" instead for 1 or more.) Now the "?" is more
interesting. Try this without the "?" and then with it and see what you
get. You will notice that without it the expression matches from the
first <li> tag all the way to the last </li> tag, instead of stopping at
the first </li> it comes across. The "?" forces minimal (lazy) matching,
so the match ends at the first </li> it finds. Be careful: a "?" without
a *, + or another ? in front of it has a different meaning. There it
means zero or one of the previous item.
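
To see the difference in code, here is a small sketch that runs both
versions over a shortened copy of the list. I pass Singleline so that
"." crosses line breaks the way it does in our test client. It assumes
a C# 9 or later compiler (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

string html = "<ul>\n<li>Goodness</li>\n<li>Extra Goodness</li>\n</ul>";

// Greedy: ".*" grabs as much as it can, so one match spans both items.
MatchCollection greedy = Regex.Matches(html, "<li>.*</li>", RegexOptions.Singleline);

// Lazy: ".*?" grabs as little as it can, so each item is its own match.
MatchCollection lazy = Regex.Matches(html, "<li>.*?</li>", RegexOptions.Singleline);

Console.WriteLine(greedy.Count);  // 1
Console.WriteLine(lazy.Count);    // 2
Console.WriteLine(lazy[0].Value); // <li>Goodness</li>
```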

Using the same example, we can write other expressions that return the exact same results, such as this one:

<li>[^<]*</li>

Some new syntax in there. The "[]" define a character class, a set of
characters of which any one may match. A "^" right after the opening
bracket works as a NOT symbol. So "[^<]*" just says: match 0 or more
characters that are not "<". The match goes until it hits the next tag
and stops.
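
A quick sketch of this version in code, run against the list from the
snippet above. No options are needed here, because "[^<]" happily
matches line breaks too. It assumes a C# 9 or later compiler
(top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

string html = "<ul>\n" +
              "<li>Goodness</li>\n" +
              "<li>Extra Goodness</li>\n" +
              "<li>Ultra Goodness</li>\n" +
              "<li>Scooby Goodness and snacks too!</li>\n" +
              "</ul>";

// "[^<]*" stops at the next "<", so a match never runs into the next tag.
MatchCollection items = Regex.Matches(html, "<li>[^<]*</li>");

foreach (Match m in items)
{
    Console.WriteLine(m.Value);
}
```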

Capture groups are your friends!

Suppose you have an anchor tag like the one shown below and you want
to get both the URL and the title of the link. How would you do it?

<a href="http://goodness.is.me.com">My site of wonderful goodness</a>

Obviously you could use the techniques shown previously and run two
expressions, but that is very wasteful. Regular expressions have a way
to do this exact thing with ease: capture groups. A capture group is
just a pair of round brackets "( )", and it allows you to pull more than
one piece of data out of a single match. Let's look at an example of how
you could get the URL and title from the link above.

<a href="(.*?)">(.*?)</a>

When you run this, you will notice that three elements are returned.
That's what the (0), (1), (2) at the beginning represent: the array
index (or capture group number) of the item being returned. You should
get the following output:

(0)<a href="http://goodness.is.me.com">My site of wonderful goodness</a>
(1)http://goodness.is.me.com
(2)My site of wonderful goodness

The syntax of that expression should all be familiar, as we just
covered the wildcards. Notice that we now match the whole tag and have
brackets around parts of it? The brackets in the expression define the
capture groups. The Match object returned by the Regex class has a
Groups collection that holds each of the capture groups. The text
matched by the whole expression goes in element 0, the first capture
group goes in 1 and the second capture group goes in 2. It's an easy
and very powerful way to get multiple values from a single match!
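
Outside the test client, reading the groups might look like this
sketch. It assumes a C# 9 or later compiler (top-level statements);
the variable names are mine.

```csharp
using System;
using System.Text.RegularExpressions;

string link = "<a href=\"http://goodness.is.me.com\">My site of wonderful goodness</a>";

Match m = Regex.Match(link, "<a href=\"(.*?)\">(.*?)</a>");

// Groups[0] is the whole match; Groups[1] and Groups[2] are the brackets, left to right.
string url = m.Groups[1].Value;
string title = m.Groups[2].Value;

Console.WriteLine(m.Groups.Count); // 3
Console.WriteLine(url);            // http://goodness.is.me.com
Console.WriteLine(title);          // My site of wonderful goodness
```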

Just a quick note: you can give your capture groups names if you
like. Then in the code you can pull them out of the Groups collection
by name instead of by index. You can do names like so:

<a href="(?<url>.*?)">(?<title>.*?)</a>

Note that each capture group now has a ?<name> in it. This
defines the capture group's name. (Another note: whoever created this
language loves the ? character!)
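
A short sketch of reading named groups in code, using the string
indexer on the Groups collection. It assumes a C# 9 or later compiler
(top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

string link = "<a href=\"http://goodness.is.me.com\">My site of wonderful goodness</a>";

Match m = Regex.Match(link, "<a href=\"(?<url>.*?)\">(?<title>.*?)</a>");

// Look the groups up by the names given in the expression.
string url = m.Groups["url"].Value;
string title = m.Groups["title"].Value;

Console.WriteLine(url);   // http://goodness.is.me.com
Console.WriteLine(title); // My site of wonderful goodness
```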

Alternation

Suppose you want to match two different things such as this:

grey
gray

Yes, I know you could wildcard it like gr\wy to allow any letter, digit
or underscore in that spot, but let's be a bit more specific so that it
only matches "e" and "a". This is where alternation comes in. Basically
it's just like a logical OR in C#.

gr(?:e|a)y

I know, I know, wtf is the (?: stuff all about? Well, the brackets
normally define a capture group, and the ?: part tells the engine not
to capture it. So here the brackets work more like brackets in math:
they just group part of the expression instead of capturing it.
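
The sketch below tries the expression on a few words and compares the
group count with and without the ?: part. The word "groy" is just a
made-up non-match. It assumes a C# 9 or later compiler (top-level
statements).

```csharp
using System;
using System.Text.RegularExpressions;

var grouped = new Regex("gr(?:e|a)y");

Console.WriteLine(grouped.IsMatch("grey")); // True
Console.WriteLine(grouped.IsMatch("gray")); // True
Console.WriteLine(grouped.IsMatch("groy")); // False

// Only group 0 (the whole match) exists, because (?: ) does not capture.
Console.WriteLine(grouped.Match("grey").Groups.Count); // 1

// With plain brackets the letter is captured as group 1.
Match captured = Regex.Match("gray", "gr(e|a)y");
Console.WriteLine(captured.Groups.Count);    // 2
Console.WriteLine(captured.Groups[1].Value); // a
```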

A bit about options

You may have noticed that the Regex constructor accepts some options. We used these when we created our test client:

Regex r = new Regex(txtExpression.Text,RegexOptions.Singleline | RegexOptions.IgnoreCase);

The options are very important, as they make a huge difference to
what gets matched and what does not. Let's look at the two we used
here.

The Singleline option is described in the help as: "Specifies
single-line mode. Changes the meaning of the period character (.) so
that it matches every character (instead of every character except \n)."
In short it means that a "." can match across line breaks; without it,
a ".*" stops at the end of the current line. I'm surprised that this
isn't the default, as it has bitten me more than once, so make sure you
set this option to get the behavior you want.
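
A tiny sketch of the difference, using a two-line string I made up for
the purpose. It assumes a C# 9 or later compiler (top-level
statements).

```csharp
using System;
using System.Text.RegularExpressions;

string text = "<p>first line\nsecond line</p>";

// Without Singleline, "." will not cross the "\n", so the closing tag is never reached.
bool withoutOption = Regex.IsMatch(text, "<p>.*</p>");

// With Singleline, "." matches the "\n" too and the whole paragraph matches.
bool withOption = Regex.IsMatch(text, "<p>.*</p>", RegexOptions.Singleline);

Console.WriteLine(withoutOption); // False
Console.WriteLine(withOption);    // True
```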

Next up is the IgnoreCase option. This one should be
self-explanatory: it doesn't care about the case of the letters and
matches regardless.
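
For example, something like this sketch, where the upper-case input is
one I made up. It assumes a C# 9 or later compiler (top-level
statements).

```csharp
using System;
using System.Text.RegularExpressions;

string shouting = "<LI>GOODNESS</LI>";

// By default matching is case-sensitive, so "<li>" does not match "<LI>".
bool caseSensitive = Regex.IsMatch(shouting, "<li>.*?</li>");

// IgnoreCase lets the lower-case expression match the upper-case tags.
bool caseInsensitive = Regex.IsMatch(shouting, "<li>.*?</li>", RegexOptions.IgnoreCase);

Console.WriteLine(caseSensitive);   // False
Console.WriteLine(caseInsensitive); // True
```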

The only other one I find I use often is the Compiled option. This
compiles the regular expression to IL when the Regex object is
constructed, instead of interpreting it, so matching goes faster. The
price is that constructing the object takes longer, so it only pays off
when you create the Regex once and reuse it for many matches. The
expression does not need to be known at build time: one loaded from a
configuration file can be compiled this way too, because the
compilation happens at run time.
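
A minimal sketch of the usual pattern: build the Regex once with the
Compiled option and reuse it for every input. The sample lines are
made up, and whether Compiled is actually faster for you depends on
your workload, so measure before relying on it. It assumes a C# 9 or
later compiler (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// Build the Regex once and reuse it, so the setup cost is paid a single time.
var dateRegex = new Regex(@"\d{4}-\d{2}-\d{2}", RegexOptions.Compiled);

string[] lines =
{
    "The date of that transaction is 2006-12-02 MST.",
    "No date on this line.",
    "The date of that transaction is 2006-12-15 MST."
};

int hits = 0;
foreach (string line in lines)
{
    Match m = dateRegex.Match(line);
    if (m.Success)
    {
        hits++;
        Console.WriteLine(m.Value);
    }
}
Console.WriteLine(hits); // 2
```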

A word in parting

Feel like a Regex master now? Yeah, me neither, but hey, you're a few
steps closer. Hopefully this tutorial has helped you see when a regular
expression can help you get the job done faster.