Friday, April 26, 2013

mdq.regex CLR Functions -- Part 1: mdq.RegexIsMatch


Updated 12/7/2013

I updated this article on 12/7/2013 to include a link to instructions for enabling CLR Integration as and an updated link for the instructions to create the [Microsoft.MasterDataServices.DataQuality] assembly.


Intro

T-SQL is very good at dealing with strings but it does have some shortcomings. The Regular Expressions (Regex) language is elegant, powerful and can be a viable alternative for solving complex string problems. You do not need to use regular expressions to be a great SQL Developer but some basic regex knowledge and a well-written regex CLR are great tools to have in your coding toolbox. SQL Server R2 ships with Master Data Services which includes a number of exciting new Regex CLR functions. You don’t need Master Data Services to use them, just an instance of SQL Server with CLR integration enabled. You also don’t need to be a .NET programmer, have any knowledge of C# or VB or even know much about regular expressions to use these tools.

The CLR function I am covering today performs very well and produces a simple query plan. Setup is fast, easy and, best of all, the assembly and each of the mdq CLRs were developed, tested, tuned and QA’d by Microsoft. My only complaint is the lack of documentation. Some of these functions are listed at http://msdn.microsoft.com/en-us/library/ee633712.aspx but for most of the ones that are listed, when you click “details”, you get this:


This topic is available here

In this series, mdq.regex CLR Functions I will provide examples of how to use:

  1. mdq.RegexIsMatch
  2. mdq.RegexMatches
  3. mdq.RegexReplace
  4. mdq.RegexSplit
  5. mdq.RegexExtract
  6. mdq.RegexIsValid
  7. mdq.RegexMask

I started with mdq.RegexIsMatch because it is simple yet very powerful. I am intentionally avoiding the topic of regex programming except to say that it is not that difficult to get the basics down and that there are many great books and websites about the topic. This CLR is easier to learn by example so let’s quickly run through setup and look at some.


Setup

To create this function you will need:

  1. A schema named, “mdq” in your target DB
  2. CLR integration enabled
  3. The [Microsoft.MasterDataServices.DataQuality] assembly

See this article for details on enabling CLR integration and this article for the steps to create the assembly.


Run the following DDL to create mdq.RegexIsMatch:

CREATE FUNCTION mdq.[RegexIsMatch](@input [nvarchar](4000), @pattern [nvarchar](4000), @mask [tinyint])

RETURNS [bit] WITH EXECUTE AS CALLER, RETURNS NULL ON NULL INPUT

AS

EXTERNAL NAME [Microsoft.MasterDataServices.DataQuality].[Microsoft.MasterDataServices.DataQuality.SqlClr].[RegexIsMatch]

GO



mdq.RegexIsMatch Code Samples

mdq.RegexIsMatch takes two parameters: a string and a regular expression. It returns a bit based on if the string matches the expression. The examples below are mostly validation. These are all good tools for things like ETL data cleansing tasks or where you are asked to develop a new view, function or stored procedure that produces an Email list with only valid Email addresses.

/************************************************

RegexMatch (Scalar UDF) returns a bit

************************************************/

/**** Complex input validation ****/

--valid US phone number?

DECLARE @phone varchar(20)='312-958-8855';

DECLARE @valid_us_phone varchar(100)='^\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$';

SELECT @phone AS phone_nbr,

              mdq.RegexIsMatch(@phone,@valid_us_phone,0) AS is_valid;


-----------------------------------

--Is this likely to be a valid credit card?

DECLARE @cc varchar(20)='4953129482924435';

SELECT

 @cc AS CC,

 mdq.RegexIsMatch

 (@cc,'^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6011[0-9]{12}|3(?:0[0-5]|[68][0-9])[0-9]{11}|3[47][0-9]{13}|(?:2131|1800)\d{11})$',0)

       AS is_valid;


-----------------------------------

--Is this a valid ZIP code?

DECLARE @zip varchar(20)='02115-4653',

              @valid_zip varchar(30)='^[0-9]{5,5}([- ]?[0-9]{4,4})?$';

SELECT @zip AS zip,

              mdq.RegexIsMatch('02115-4653','^[0-9]{5,5}([- ]?[0-9]{4,4})?$',0) AS is_valid;


-----------------------------------

--Is this a valid Postcode?

DECLARE @pc varchar(20)='RG35 2AQ';

SELECT

 @pc AS PostCode,

 mdq.RegexIsMatch(@pc,

       '^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) {0,1}[0-9][A-Za-z]{2})$',0)

              AS is_valid;


-----------------------------------

--Is this a valid currency value (dollar)?

DECLARE @currency varchar(40)='$33,000,999,888,678.00',

              @is_valid varchar(50)='^\$(\d{1,3}(\,\d{3})*|(\d+))(\.\d{2})?$';

SELECT  @currency AS currency,

              mdq.RegexIsMatch(@currency,@is_valid,0) AS [valid currency?];


-----------------------------------

--is this a valid currency value (Sterling)

SELECT @currency='£34,000.00',

              @is_valid ='^\£(\d{1,3}(\,\d{3})*|(\d+))(\.\d{2})?$';

SELECT  @currency AS currency,

              mdq.RegexIsMatch(@currency, @is_valid,0) AS [valid currency?];


-----------------------------------

--Is there a repeating word?

DECLARE @cockney_slang1 varchar(25)='Apples and stairs stairs',

              @cockney_slang2 varchar(25)='Apples and pears stairs',

              @repeated_words varchar(20)='\b(\w+)\s+\1\b';

SELECT @cockney_slang1 AS txt,

              mdq.RegexIsMatch(@cockney_slang1,@repeated_words,0)    AS [Repeated Words?]; --YES(1)

SELECT @cockney_slang2 AS txt,

              mdq.RegexIsMatch(@cockney_slang2,@repeated_words,0)    AS [Repeated Words?]; --NO(0)


-----------------------------------

--find a word near another word (in this case 'for' and 'last' 1 or 2 words apart)

--I grabbed this from Robyn Page and Phil Factor (simple-talk.com)

SELECT mdq.RegexIsMatch('You have failed me for the last time, Admiral',

                                         '\bfor(?:\W+\w+){1,2}?\W+last\b',0) AS [Word near Another Word?];--1

SELECT mdq.RegexIsMatch('You have failed me for what could be the last time, Admiral',

                                         '\bfor(?:\W+\w+){1,2}?\W+last\b',0) AS [Word near Another Word?];--0



Email Validation

Lastly, here's and example of how to query a table for valid Email addresses:

IF OBJECT_ID('tempdb..#customer_email') IS NOT NULL

        DROP TABLE #customer_email;

 

CREATE TABLE #customer_email

(       email_id int identity primary key,

        cust_id int NOT NULL,

        email nvarchar(100) unique NOT NULL);

 

INSERT INTO #customer_email (cust_id, email)

        SELECT  1,'john.doe@a+++cme. com'         UNION ALL

        SELECT  1,'JohnDoe@somewhere.com'       UNION ALL

        SELECT  2,'Sally.Smith@Microsoft'       UNION ALL

        SELECT  3,'AlChurch@turning.net'        UNION ALL

        SELECT  4,'JaneDoe@aol.com';

GO

 

DECLARE @validEmail nvarchar(1000);

SET @validEmail='^(([a-zA-Z0-9!#\$%\^&\*\{\}''`\+=-_\|/\?]+(\.[a-zA-Z0-9!#\$%\^&\*\{\}''`\+=-_\|/\?]+)*){1,64}@(([A-Za-z0-9]+[A-Za-z0-9-_]*){1,63}\.)*(([A-Za-z0-9]+[A-Za-z0-9-_]*){3,63}\.)+([A-Za-z0-9]{2,4}\.?)+){1,255}$';

 

WITH emails AS

(       SELECT ce.email_id, ce.cust_id, email  

        FROM #customer_email ce)

SELECT  email_id,

                cust_id,

                email,

                mdq.RegexIsMatch(email, @validEmail,0) AS [is_valid]

FROM emails;

 

DROP TABLE #customer_email

... and the results:



Conclusion

Today we reviewed the steps for setting up and using mdq.RegexIsMatches. It is a powerful CLR that has helped me a great deal in my BI and ETL work. Thanks for reading!


--ab

The new xmlsqlninja logo ('X-M-L'-'See-Quel' 'ninja')

Last Updated: 12/27/2013 (minor updates, repaired broken links)

2 comments:

  1. Seems the MS documentation can be found here:
    http://msdn.microsoft.com/en-us/library/ee633712%28SQL.105%29.aspx

    The content is the same, but the links aren't dead. But, like often (IMHO), those are not always the most helpful....

    ReplyDelete
  2. Thanks so much for posting that link. That is excellent stuff! You will notice that, if you do a google search for: "Master Data Services Functions", "Master Data Services CLR Functions" or "mdq.Regex functions" -- nothing comes up. Not in any search engine that I used. 

    And, yes, they don't appear to be too helpful (e.g. all the "0 out 3 found this helpful" votes.) I noticed that mdq.xmltransform is not included in the list ;)

    Thanks again Sanjay!

    ReplyDelete