RegEx Those Dates

This is a series on SQL Data Generator, covering some interesting scenarios I’ve run into. If you’ve never tried it, SQL Data Generator is a part of the SQL Toolbelt. Give it a try today with an evaluation today.

Recently I was working on transforming some dates, and wanted to generate a large n number of dates for testing. I decided to use SQL Data Generator, and a little RegEx to meet my needs.

The format I needed was CYYMMDD, which is the century as a 0 or 1 (1900 or 2000) and then the yymmdd format. While there are some pre-made expressions to build dates, there wasn’t an easy one to handle the century like this. I could have used a date expression in T-SQL and randomly allocated a century, but I decided to play around with RegEx.

I know that the brackets allow a choice of values to be used. The regular expression can choose any of the values to match. An example is for the first part of my date, the century. It can be zero or one, so I can do this:

[01]

When Data Generator runs, it will randomly build expressions that match this pattern, which in my case results in

regex01

That makes it easy for me to pick numbers, and I could do something like this for the years:

[01][0-0][0-9]

That works, as any number from 00, as in 2000, up to 99, as in 1999, is valid. That gets me this:

regex02

Months

However that causes issues when I get to the month. I need a two digit month, but I can’t have some combinations of two digits. If I were to write [01][1-9], I’d get months like 18, which aren’t valid. Instead, I need a pattern that only matches a 0 with 1 to 9, and only allows a 1 with a 1 or 2.

To do that, I’ll use an OR. That’s a pipe (|) in regular expressions. I’ll say (in pseudocode), give me a (01 to 09) OR a (10-12). The easy way to build that is like this:

(0[1-9]|1[0-2])

This says that if we match the first half (before the pipe), then we literally have a 0 there, with a second character in the range 1-9. That gives us 01 to 09. The second half, after the pipe, does the same thing, but it matches a literal “1”, and then a 0, 1, or 2. As you can see, I have random months (only showing this expression).

regex03

Days

Now the hard part: days.

Days are strange in the calendar because the possible days depend on the months. Years and months are consistently in ranges, but the days are not. Let’s start with the most common days: 31.

I have 31 days in months 1, 3, 5, 7, 8, 11, 12. In order to match these up, I’ll need to combine the month and day items. Let’s first change our months to be just those particular months. That gives me:

(0[13578]|1[02])

With these months, I am going to allow up to 31 days. The patterns for the first 29 days of the month are the same. A 0, 1, or 2, with any combination of 1-9. Putting that together gives me:

(0[13578]|1[02])([012][1-9])

This handles the first 29. The next two, 30 and 31, are an OR expression like the months. I’ll use a literal 3 and a choice of zero or one. That gives me:

regex04

Whew! This is a lot of work, but it matches things up well. Now I need to handle 30 days. I’ll do that the same way, but I’ll now OR both expressions together. The expression is:

((0[13578]|1[02])([012][1-9]|3[01]))|((0[2469]|11)([012][1-9]|30))

And the data:

regex05

That gets me almost all the months. The last part of February, the hardest. Now I could worry about leap years, but I’m not going to. Proper handling here means verifying the year (and century here) and doing math to ensure a leap year is viable. Instead, I’m going to just ignore the 30s and manage days 1 to 29.

((0[13578]|1[02])([012][1-9]|3[01]))|((0[2469]|11)([012][1-9]|30))|((02([012][1-9])))

Now I have a nice set of random dates if I put everything together.

[01][0-0][0-9](((0[13578]|1[02])([012][1-9]|3[01]))|((0[2469]|11)([012][1-9]|30))|((02([012][1-9]))))

regex06

References

I leaned on a few examples to decode a few of the expressions and also to check that I wasn’t messing up.

About way0utwest

Editor, SQLServerCentral
This entry was posted in Blog and tagged , , , . Bookmark the permalink.