Updated 12/7/2013
I updated this article on 12/7/2013 to include a link to instructions for enabling CLR Integration as and an updated link for the instructions to create the [Microsoft.MasterDataServices.DataQuality] assembly.
Intro
This is the second article in my series, mdq.regex CLR Functions. In my previous article about mdq.regex functions, mdq.regex CLR Functions -- Part 1 I introduced mdq.RegexIsMatch, a scalar CLR function that returns a bit based on if a regex pattern is matched. Today we will review mdq.RegexMatches.
mdq.RegexMatches is a Table Valued CLR function. Like mdq.RegexIsMatch, setup is fast, easy and this CLR was developed, tested, tuned and QA’d by Microsoft.
Setup
To create this function you will need:
- A schema named, “mdq” in your target DB
- CLR integration enabled
- The [Microsoft.MasterDataServices.DataQuality] assembly
See this article for details on enabling CLR integration and this article for the steps to create the assembly.
Run the following DDL to create mdq.RegexMatches:
CREATE FUNCTION mdq.[RegexMatches](@input [nvarchar](4000), @pattern [nvarchar](4000), @mask [tinyint] = 0)
RETURNS TABLE
( [Sequence] [int] NULL,
[Token] [nvarchar](4000) NULL)
WITH EXECUTE AS CALLER
AS
EXTERNAL NAME [Microsoft.MasterDataServices.DataQuality].[Microsoft.MasterDataServices.DataQuality.SqlClr].[RegexMatches]
GO
Using mdq.RegexMatches
mdq.RegexMatches searches an input string for all occurrences of a regular expression and returns all the successful matches. This function uses the regular expression functionality of the Matches method in the Microsoft .NET Framework.
Syntax
mdq.RegexMatches (input,pattern,mask)
Arguments
input
Is the input string for which to find matches. input is nvarchar(4000) with no default.
pattern
Is the regular expression pattern to use for matching. pattern is nvarchar(4000) with no default.
mask
Is the RegexOptions mask that specifies the behavior of the regular expression. For more information, see mdq.RegexMask (Transact-SQL).
mdq.RegexMatches Code Samples
In the first example below I am searching for numeric patterns. For the second example I will demonstrate how to search for records based on a word pattern. We will conclude with a technique for querying a table for valid and invalid EMail addresses.
mdq.regexMatches to find numeric patterns
-- (1) Let's find this pattern: (one or
more numbers)-(one or more numbers)-(one or more numbers) [e.g. 222-111-9999]
SELECT Token
AS pattern
FROM clr.RegexMatches(N'XYXYXYXYYX11-222-333XYXYXYXYXY', N'\d+-\d+-\d+', 0);
GO
-- (2) Let's find patterns of one or more numbers
DECLARE @text
varchar(1000)='1 is one, 2 is two and 3 makes three',
@find_cons_digits varchar(20)='\d+';
SELECT Sequence, Token AS value
FROM clr.RegexMatches(@text, @find_cons_digits, 0);
GO
-- (3) Let's find sequenses of two
consecutive numbers
DECLARE @text
varchar(1000)='1 is one, 2 is two and 3 makes three',
@find_cons_digits varchar(20)='\d+';
SELECT @text='1 is zero, 11 is one, 22 is two... 345678 makes three, four
and five',
@find_cons_digits='\d{2}';
SELECT Sequence, Token AS value
FROM clr.RegexMatches(@text, @find_cons_digits, 0);
GO
mdq.regexMatches to find word patterns
-- (4) Let's perform a word count
DECLARE @quote varchar(1000)='The world is a dangerous place to live; not because of the people who are evil, but because of the people who dont do anything about it. --Albert Einstein',
@quote2 varchar(100)='apple orange orange pear',
@word_count varchar(20)='\b(\w+)',
@ab_count varchar(20)='\b([ab]\S+)',
@find_duplicates varchar(50)='\b(\w+?)\s\1\b';
SELECT COUNT(*) AS [word count]
FROM clr.RegexMatches(@quote,@word_count,0);
-------------------------------------------------------
-- (5) Let's find how many words begin with A or B
SELECT COUNT(*) AS [words that begin with A or B]
FROM clr.RegexMatches(@quote, @ab_count,0);
-- (6) Let's find repeated words
SELECT @quote2 AS quote, Token AS [repeated words]
FROM clr.RegexMatches(@quote2,@find_duplicates,0);
-- (7) Let's normalize a string, a series of numbers or both:
SELECT Sequence, Token AS val
FROM clr.RegexMatches(N'NormalizeMe', N'([a-zA-z])', ''); --string
SELECT Sequence, Token AS val
FROM clr.RegexMatches(N'54321', N'\d', ''); --series of numbers
SELECT Sequence, Token AS val
FROM clr.RegexMatches(N'text1234', N'([a-zA-z0-9])', ''); --text and numbers
Query that returns valid Email addresses and one that only returns invalid addresses:
IF OBJECT_ID('tempdb..#customer_email') IS NOT NULL
DROP TABLE #customer_email;
CREATE TABLE #customer_email
( email_id int identity primary key,
cust_id int NOT NULL,
email nvarchar(100) unique NOT NULL);
INSERT INTO #customer_email (cust_id, email)
SELECT 1,'john.doe@a+++cme. com' UNION ALL
SELECT 1,'JohnDoe@somewhere.com' UNION ALL
SELECT 2,'Sally.Smith@Microsoft' UNION ALL
SELECT 3,'AlChurch@turning.net' UNION ALL
SELECT 4,'JaneDoe@aol.com';
GO
DECLARE @validEmail nvarchar(1000);
SET @validEmail='^(([a-zA-Z0-9!#\$%\^&\*\{\}''`\+=-_\|/\?]+(\.[a-zA-Z0-9!#\$%\^&\*\{\}''`\+=-_\|/\?]+)*){1,64}@(([A-Za-z0-9]+[A-Za-z0-9-_]*){1,63}\.)*(([A-Za-z0-9]+[A-Za-z0-9-_]*){3,63}\.)+([A-Za-z0-9]{2,4}\.?)+){1,255}$';
-- (1) This will return records with only valid EMail addresses
SELECT e. *
FROM #customer_email e
CROSS APPLY clr.RegexMatches(e.email,@validEmail, 0);
-- (2) This will return records with only invalid EMail addresses
SELECT e.email_id, e.cust_id, e.email
FROM #customer_email e
EXCEPT
SELECT ex.email_id, ex.cust_id, ex.email
FROM #customer_email ex
CROSS APPLY clr.RegexMatches(ex.email, @validEmail, 0);
DROP TABLE #customer_email;
... EMail Query results:
Conclusion
Today we reviewed the steps for setting up and using mdq.RegexMatches. Thanks for reading!
--ab
Last Updated: 12/7/2013 (minor updates, repaired broken links)