Updated on 9/28/2015
Update: Exactly 2 and a half years ago I posted what I believed to be a solution to the Longest Common Subsequence problem using a Tally table. Unfortunately I got a substring and a subsequence mixed up. Ironically, if you look at the Wikipedia page for the Longest Common Subsequence the first line reads, "Not to be confused with longest common substring problem." and for the Longest Common Substring the first line reads, "Not to be confused with longest common subsequence problem.". I guess I'm not the only one who has gotten these confused. Anyhow, I have spent the past 2 and a half years trying to develop a purely set-based solution to the Longest Common Subsequence and have failed. It's been a great learning experience though; this type of exercise has dramatically sharpened my SQL and math skills. For a great solution to this problem see Phil Factor's excellent solution from earlier this year. I do have an updated solution to the Longest Common Substring however. It uses my N-Grams Function. Below is my updated NGrams8K function and LCSS
A Nasty Fast Set-Based Solution to the Longest Common Substring Problem:
CREATE
FUNCTION dbo.NGrams8K (@string
varchar
(
8000), @n
int
)
/********************************************************************
Created by: Alan Burstein
Created on: 3/10/2014
Last Updated on: 09/09/2015
n-gram
defined:
In the fields of computational linguistics and probability,
an n-gram is a contiguous sequence of n items
from a given
sequence of text or speech. The items can be
phonemes, syllables,
letters, words or base pairs according to the
application.
For more information see: http://en.wikipedia.org/wiki/N-gram
Use:
Outputs a stream of tokens based on an input string.
Similar to mdq.nGrams:
http://msdn.microsoft.com/en-us/library/ff487027(v=sql.105).aspx.
Except it only returns characters as long as K.
nGrams8K also includes the position of the
"Gram" in the string.
Revision History:
Rev 00 - 03/10/2014 Initial Development - Alan
Burstein
Rev 01 - 05/22/2015 Removed DQS N-Grams
functionality,
improved iTally - Alan Burstein
Rev 02 - 05/22/2015 Changed TOP logic to
remove implicit conversion
- Alan Burstein
Rev 03 - 9/9/2015 Added logic to only return
values if @n is greater
than 0 and
less then length of @string - Alan Burstein
********************************************************************/
RETURNS
TABLE
WITH
SCHEMABINDING
AS
RETURN
WITH
L1(
N)
AS
(
SELECT 1
FROM
(
VALUES
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)
)
t(
N)
),
iTally
(
N) AS
(
SELECT
TOP
(
CONVERT(BIGINT,(DATALENGTH(@string)-(@n-1)),0))
ROW_
NUMBER(
)
OVER
(
ORDER
BY
(
SELECT
NULL))
FROM L1 a CROSS JOIN L1 b
-- add two more
cross joins to support varchar(max)
)
SELECT
position = N,
token = SUBSTRING(@
string,N,@n
)
FROM
iTally
WHERE @n > 0 AND @n <=
DATALENGTH
(
@string);
GO
/********************************************************************
Created by: Alan
Burstein
Created on: 3/10/2013
Last Updated on: 09/20/2015
Use:
Returns the longest common substring between two strings.
Revision History:
Rev 00 - 03/10/2013 Initial Development - Alan
Burstein
Rev 03 - 09/20/2015 Performance tuned using
NGrams8K - Alan Burstein
********************************************************************/
CREATE
FUNCTION dbo.LCSS8K
(
@string1
varchar
(
8000),
@string2
varchar
(
8000)
)
RETURNS
TABLE
WITH
SCHEMABINDING
AS
RETURN
WITH
Strings AS
(
SELECT
String1 = CASE WHEN
LEN
(
@string1)>LEN(@string2) THEN @string1 ELSE @string2 END,
String2 = CASE WHEN
LEN
(
@string1)>LEN(@string2) THEN @string2 ELSE @string1 END
),
I(
N)
AS (SELECT position FROM
Strings CROSS APPLY
dbo.NGrams8K(String2,1))
SELECT
TOP (1) WITH TIES
TokenLength
= I.N,
NG.Token
FROM
I CROSS APPLY Strings s CROSS
APPLY
dbo.NGrams8K(
String2,I.N) NG
WHERE
CHARINDEX
(
NG.token,String1) > 0
ORDER
BY N DESC;
GO
Conclusion
Another problem solved without cursors, loops or recursive CTEs. Let's mark this one up as another win for the Tally Table. Thanks for reading!
--ab
Further Reading
- http://www.columbia.edu/~cs2035/courses/csor4231.F11/lcs.pdf
- Lecture 15: Dynamic Programming, Longest Common Subsequence
Last Updated: 09/28/2015