Tuesday, May 28, 2013

mdq.regex CLR Functions -- Part 2: mdq.RegexMatches


Updated 12/7/2013

I updated this article on 12/7/2013 to include a link to instructions for enabling CLR Integration as and an updated link for the instructions to create the [Microsoft.MasterDataServices.DataQuality] assembly.


Intro

This is the second article in my series, mdq.regex CLR Functions. In my previous article about mdq.regex functions, mdq.regex CLR Functions -- Part 1 I introduced mdq.RegexIsMatch, a scalar CLR function that returns a bit based on if a regex pattern is matched. Today we will review mdq.RegexMatches.

mdq.RegexMatches is a Table Valued CLR function. Like mdq.RegexIsMatch, setup is fast, easy and this CLR was developed, tested, tuned and QA’d by Microsoft.

Setup

To create this function you will need:

  1. A schema named, “mdq” in your target DB
  2. CLR integration enabled
  3. The [Microsoft.MasterDataServices.DataQuality] assembly

See this article for details on enabling CLR integration and this article for the steps to create the assembly.


Run the following DDL to create mdq.RegexMatches:

CREATE FUNCTION mdq.[RegexMatches](@input [nvarchar](4000), @pattern [nvarchar](4000), @mask [tinyint] = 0)

RETURNS TABLE

(      [Sequence] [int] NULL,

       [Token] [nvarchar](4000) NULL)

WITH EXECUTE AS CALLER

AS

EXTERNAL NAME [Microsoft.MasterDataServices.DataQuality].[Microsoft.MasterDataServices.DataQuality.SqlClr].[RegexMatches]

GO



Using mdq.RegexMatches

mdq.RegexMatches searches an input string for all occurrences of a regular expression and returns all the successful matches. This function uses the regular expression functionality of the Matches method in the Microsoft .NET Framework.


Syntax

mdq.RegexMatches (input,pattern,mask)


Arguments

input

Is the input string for which to find matches. input is nvarchar(4000) with no default.

pattern

Is the regular expression pattern to use for matching. pattern is nvarchar(4000) with no default.

mask

Is the RegexOptions mask that specifies the behavior of the regular expression. For more information, see mdq.RegexMask (Transact-SQL).


mdq.RegexMatches Code Samples

In the first example below I am searching for numeric patterns. For the second example I will demonstrate how to search for records based on a word pattern. We will conclude with a technique for querying a table for valid and invalid EMail addresses.

mdq.regexMatches to find numeric patterns

-- (1) Let's find this pattern: (one or more numbers)-(one or more numbers)-(one or more numbers) [e.g. 222-111-9999]

SELECT Token AS pattern

FROM clr.RegexMatches(N'XYXYXYXYYX11-222-333XYXYXYXYXY', N'\d+-\d+-\d+', 0);

GO

 

-- (2) Let's find patterns of one or more numbers

DECLARE @text varchar(1000)='1 is one, 2 is two and 3 makes three',

        @find_cons_digits varchar(20)='\d+';

SELECT Sequence, Token AS value

FROM clr.RegexMatches(@text, @find_cons_digits, 0);

GO

 

 

-- (3) Let's find sequenses of two consecutive numbers

DECLARE @text varchar(1000)='1 is one, 2 is two and 3 makes three',

        @find_cons_digits varchar(20)='\d+';

 

SELECT  @text='1 is zero, 11 is one, 22 is two... 345678 makes three, four and five',

        @find_cons_digits='\d{2}';

 

SELECT Sequence, Token AS value

FROM clr.RegexMatches(@text, @find_cons_digits, 0);

GO


mdq.regexMatches to find word patterns

-- (4) Let's perform a word count

DECLARE @quote varchar(1000)='The world is a dangerous place to live; not because of the people who are evil, but because of the people who dont do anything about it. --Albert Einstein',

                @quote2 varchar(100)='apple orange orange pear',

                @word_count varchar(20)='\b(\w+)',

                @ab_count varchar(20)='\b([ab]\S+)',

                @find_duplicates varchar(50)='\b(\w+?)\s\1\b';

 

SELECT COUNT(*) AS [word count]

FROM clr.RegexMatches(@quote,@word_count,0);

-------------------------------------------------------

 

-- (5) Let's find how many words begin with A or B

SELECT COUNT(*) AS [words that begin with A or B]

FROM clr.RegexMatches(@quote, @ab_count,0);

 

-- (6) Let's find repeated words

SELECT @quote2 AS quote, Token AS [repeated words]

FROM clr.RegexMatches(@quote2,@find_duplicates,0);

 

-- (7) Let's normalize a string, a series of numbers or both:

SELECT Sequence, Token AS val

FROM clr.RegexMatches(N'NormalizeMe', N'([a-zA-z])', ''); --string

 

SELECT Sequence, Token AS val

FROM clr.RegexMatches(N'54321', N'\d', ''); --series of numbers

 

SELECT Sequence, Token AS val

FROM clr.RegexMatches(N'text1234', N'([a-zA-z0-9])', ''); --text and numbers


Query that returns valid Email addresses and one that only returns invalid addresses:

IF OBJECT_ID('tempdb..#customer_email') IS NOT NULL

        DROP TABLE #customer_email;

 

CREATE TABLE #customer_email

(       email_id int identity primary key,

        cust_id int NOT NULL,

        email nvarchar(100) unique NOT NULL);

 

INSERT INTO #customer_email (cust_id, email)

        SELECT  1,'john.doe@a+++cme. com'       UNION ALL

        SELECT  1,'JohnDoe@somewhere.com'   UNION ALL

        SELECT  2,'Sally.Smith@Microsoft'   UNION ALL

        SELECT  3,'AlChurch@turning.net'    UNION ALL

        SELECT  4,'JaneDoe@aol.com';

GO

 

DECLARE @validEmail nvarchar(1000);

SET @validEmail='^(([a-zA-Z0-9!#\$%\^&\*\{\}''`\+=-_\|/\?]+(\.[a-zA-Z0-9!#\$%\^&\*\{\}''`\+=-_\|/\?]+)*){1,64}@(([A-Za-z0-9]+[A-Za-z0-9-_]*){1,63}\.)*(([A-Za-z0-9]+[A-Za-z0-9-_]*){3,63}\.)+([A-Za-z0-9]{2,4}\.?)+){1,255}$';

 

-- (1) This will return records with only valid EMail addresses

SELECT e. *

FROM #customer_email e

CROSS APPLY clr.RegexMatches(e.email,@validEmail, 0);

 

-- (2) This will return records with only invalid EMail addresses

SELECT e.email_id, e.cust_id, e.email

FROM #customer_email e

EXCEPT

SELECT ex.email_id, ex.cust_id, ex.email

FROM #customer_email ex

CROSS APPLY clr.RegexMatches(ex.email, @validEmail, 0);

 

DROP TABLE #customer_email;

 


... EMail Query results:



Conclusion

Today we reviewed the steps for setting up and using mdq.RegexMatches. Thanks for reading!


--ab

The new xmlsqlninja logo ('X-M-L'-'See-Quel' 'ninja')

Last Updated: 12/7/2013 (minor updates, repaired broken links)

No comments:

Post a Comment