Monday, April 29, 2013

A Set-Based Solution to the Longest Common Subsequence Problem


Updated on 9/28/2015


Update: Exactly 2 and a half years ago I posted what I believed to be a solution to the Longest Common Subsequence problem using a Tally table. Unfortunately I got a substring and a subsequence mixed up. Ironically, if you look at the Wikipedia page for the Longest Common Subsequence the first line reads, "Not to be confused with longest common substring problem." and for the Longest Common Substring the first line reads, "Not to be confused with longest common subsequence problem.". I guess I'm not the only one who has gotten these confused. Anyhow, I have spent the past 2 and a half years trying to develop a purely set-based solution to the Longest Common Subsequence and have failed. It's been a great learning experience though; this type of exercise has dramatically sharpened my SQL and math skills. For a great solution to this problem see Phil Factor's excellent solution from earlier this year. I do have an updated solution to the Longest Common Substring however. It uses my N-Grams Function. Below is my updated NGrams8K function and LCSS

A Nasty Fast Set-Based Solution to the Longest Common Substring Problem:

CREATE FUNCTION dbo.NGrams8K (@string varchar ( 8000), @n int )

/********************************************************************

Created by:       Alan Burstein

Created on:      3/10/2014

Last Updated on: 09/09/2015

 

n-gram defined:

In the fields of computational linguistics and probability,

an n-gram is a contiguous sequence of n items from a given

sequence of text or speech. The items can be phonemes, syllables,

letters, words or base pairs according to the application.

For more information see: http://en.wikipedia.org/wiki/N-gram

 

Use:

Outputs a stream of tokens based on an input string.

Similar to mdq.nGrams:

http://msdn.microsoft.com/en-us/library/ff487027(v=sql.105).aspx.

Except it only returns characters as long as K.

nGrams8K also includes the position of the "Gram" in the string.

 

Revision History:

 Rev 00 - 03/10/2014 Initial Development - Alan Burstein

 Rev 01 - 05/22/2015 Removed DQS N-Grams functionality,

          improved iTally - Alan Burstein

 Rev 02 - 05/22/2015 Changed TOP logic to remove implicit conversion

          - Alan Burstein

 Rev 03 - 9/9/2015 Added logic to only return values if @n is greater

          than 0 and less then length of @string - Alan Burstein

********************************************************************/

RETURNS TABLE WITH SCHEMABINDING AS RETURN

  WITH

  L1( N) AS

  (

  SELECT 1

  FROM ( VALUES

(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)

        ) t( N)

  ),

  iTally ( N) AS

  (

        SELECT TOP ( CONVERT(BIGINT,(DATALENGTH(@string)-(@n-1)),0))

          ROW_ NUMBER( ) OVER ( ORDER BY ( SELECT NULL))

        FROM L1 a CROSS JOIN L1 b -- add two more cross joins to support varchar(max)

  )

  SELECT

        position = N,

        token = SUBSTRING(@ string,N,@n )

  FROM iTally

  WHERE @n > 0 AND @n <= DATALENGTH ( @string);

GO

 

/********************************************************************

Created by:      Alan Burstein

Created on:      3/10/2013

Last Updated on: 09/20/2015

 

Use:

Returns the longest common substring between two strings.

 

Revision History:

  Rev 00 - 03/10/2013 Initial Development - Alan Burstein

  Rev 03 - 09/20/2015 Performance tuned using NGrams8K - Alan Burstein

********************************************************************/

CREATE FUNCTION dbo.LCSS8K

(

  @string1 varchar ( 8000),

  @string2 varchar ( 8000)

)

RETURNS TABLE WITH SCHEMABINDING AS RETURN

WITH

Strings AS

(

  SELECT

   String1 = CASE WHEN LEN ( @string1)>LEN(@string2) THEN @string1 ELSE @string2 END,

   String2 = CASE WHEN LEN ( @string1)>LEN(@string2) THEN @string2 ELSE @string1 END

),

I( N) AS (SELECT position FROM Strings CROSS APPLY dbo.NGrams8K(String2,1))

SELECT TOP (1) WITH TIES

  TokenLength = I.N,

  NG.Token

FROM I CROSS APPLY Strings s CROSS APPLY dbo.NGrams8K( String2,I.N) NG

WHERE CHARINDEX ( NG.token,String1) > 0

ORDER BY N DESC;

GO

Conclusion

Another problem solved without cursors, loops or recursive CTEs. Let's mark this one up as another win for the Tally Table. Thanks for reading!


--ab

The new xmlsqlninja logo ('X-M-L'-'See-Quel' 'ninja')

Further Reading


Last Updated: 09/28/2015

6 comments:

  1. I don't think your answer is quite correct. The LCS of your provided example should be "LongestCommonSequence123123123123". See the example provided on Wikipedia for the First Property (http://en.wikipedia.org/wiki/Longest_common_subsequence_problem#First_property).

    Also see http://en.wikipedia.org/wiki/Subsequence#Substring_vs._subsequence.

    ReplyDelete
  2. The answer is correct. Note that there is a 6 after the e in the word "sequence".

    @s1 varchar(100)='XYXYXYXYXYXYXYLongestCommonSequence123123123123',

    @s2 varchar(100)
    ='abcdefgHIJKLongestCommonSequence6123123123123$$$$$';

    ReplyDelete
  3. Right. And after the 6, the common subsequence begins again.

    Or, do this: run the examples provided by the Wikipedia page through the function provided above. The answers do not match those given on the Wikipedia page.

    From the examples used by the Wikipedia page:

    @s1 varchar(100)='XMJYAUZ',
    @s2 varchar(100)='MZJAWXU';

    Wikipedia says LCS is: MJAU.
    Function says LCS is: nothing.

    @s1 varchar(100)='BANANA',
    @s2 varchar(100)='ATANA';

    Wikipedia says LCS is: AANA
    Function says LCS is: ANA

    The function provided finds the longest common substring, but not the longest common subsequence. The point of LCS is to find the common ordered elements of two entities; nothing is ever specified about the common elements being in the same specific position within the compared strings.

    In the first paragraph of the Wikipedia article ("Complexity") it even states that the LCS for "ABC" and "ACB" is both "AB" and "AC". This couldn't be true if it required contiguous positioning.

    ReplyDelete
    Replies
    1. Brian,

      Thanks and good catch! You are 100% correct. I have mixed up the Longest Common Substring with the Longest Common Subsequence. Thus, I will need to re-write this article. String Metrics are still new to me.

      Cheers.

      Delete
  4. Great blog!!! Thank you for such amazing and informative stuff. http://www.sqiar.com/services/tableau-software-consultants/

    ReplyDelete