the xmlsqlninja: A Set-Based Solution to the Longest Common Subsequence Problem

Updated on 9/28/2015

Update: Exactly 2 and a half years ago I posted what I believed to be a solution to the Longest Common Subsequence problem using a Tally table. Unfortunately I got a substring and a subsequence mixed up. Ironically, if you look at the Wikipedia page for the Longest Common Subsequence the first line reads, "Not to be confused with longest common substring problem." and for the Longest Common Substring the first line reads, "Not to be confused with longest common subsequence problem.". I guess I'm not the only one who has gotten these confused. Anyhow, I have spent the past 2 and a half years trying to develop a purely set-based solution to the Longest Common Subsequence and have failed. It's been a great learning experience though; this type of exercise has dramatically sharpened my SQL and math skills. For a great solution to this problem see Phil Factor's excellent solution from earlier this year. I do have an updated solution to the Longest Common Substring however. It uses my N-Grams Function. Below is my updated NGrams8K function and LCSS

A Nasty Fast Set-Based Solution to the Longest Common Substring Problem:

    CREATE
    
      FUNCTION dbo.NGrams8K (@string 
        varchar
        (
      8000), @n 
        int
      )
    
    /********************************************************************
    
      Created by:       Alan Burstein
    
      Created on:      3/10/2014
    
    Last Updated on: 09/09/2015
    
      n-gram
    
     defined:
    
      In the fields of computational linguistics and probability,
    
      an n-gram is a contiguous sequence of n items
      from a given
    
      sequence of text or speech. The items can be
      phonemes, syllables,
    
      letters, words or base pairs according to the
      application.
    
      For more information see: http://en.wikipedia.org/wiki/N-gram
    
    Use:
    
      Outputs a stream of tokens based on an input string.
    
      Similar to mdq.nGrams:
    
      http://msdn.microsoft.com/en-us/library/ff487027(v=sql.105).aspx.
    
      Except it only returns characters as long as K.
    
      nGrams8K also includes the position of the
      "Gram" in the string.
    
    Revision History:
    
       Rev 00 - 03/10/2014 Initial Development - Alan
      Burstein
    
       Rev 01 - 05/22/2015 Removed DQS N-Grams
      functionality,
    
                improved iTally - Alan Burstein
    
       Rev 02 - 05/22/2015 Changed TOP logic to
      remove implicit conversion
    
                - Alan Burstein
    
       Rev 03 - 9/9/2015 Added logic to only return
      values if @n is greater
    
                than 0 and
      less then length of @string - Alan Burstein
    
    ********************************************************************/
    
    RETURNS
    
      TABLE
      WITH
      SCHEMABINDING
      AS
      RETURN
      
      WITH
      
        L1(
      N)
      AS 
    
    (
    
        SELECT 1
    
      FROM 
      (
      VALUES
      
      (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
    
      (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
    
      (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
    
      (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
    
      (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
    
      (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
    
      (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
    
      (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
    
      (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)
    
              ) 
        t(
      N)
    
      ),
      
        iTally
      
        (
      N) AS 
    
    (
    
              SELECT 
        TOP
        (
      CONVERT(BIGINT,(DATALENGTH(@string)-(@n-1)),0)) 
    
        ROW_
          NUMBER(
        
      )
      OVER 
      (
      ORDER
      BY 
      (
      SELECT
      NULL))
      
              FROM L1 a CROSS JOIN L1 b 
        -- add two more
        cross joins to support varchar(max)
      
      )
      
      SELECT
      
              position = N,
    
              token = SUBSTRING(@
        string,N,@n
      )
    
      FROM
      iTally
      
        WHERE @n > 0 AND @n <= 
        DATALENGTH
        (
      @string);
    
      GO
    
    /********************************************************************
    
      Created by:      Alan
      Burstein
    
      Created on:      3/10/2013
    
    Last Updated on: 09/20/2015
    
    Use:
    
  Returns the longest common substring between two strings.
    
    Revision History:
    
        Rev 00 - 03/10/2013 Initial Development - Alan
      Burstein
    
        Rev 03 - 09/20/2015 Performance tuned using
      NGrams8K - Alan Burstein
    
    ********************************************************************/
    
    CREATE
    
      FUNCTION dbo.LCSS8K
    
    (
    
        @string1 
        varchar
        (
      8000),
    
        @string2 
        varchar
        (
      8000)
    
    )
    
    RETURNS
    
      TABLE
      WITH
      SCHEMABINDING
      AS
      RETURN
      
    WITH
    
      Strings AS 
    
    (
    
      SELECT
      
         String1 = CASE WHEN 
        LEN
        (
      @string1)>LEN(@string2) THEN @string1 ELSE @string2 END,
    
         String2 = CASE WHEN 
        LEN
        (
      @string1)>LEN(@string2) THEN @string2 ELSE @string1 END
    
    ),
    
        I(
      
      N)
      AS (SELECT position FROM
      Strings CROSS APPLY
      dbo.NGrams8K(String2,1))
    
    SELECT
    
      TOP (1) WITH TIES
    
        TokenLength
      = I.N,
    
      NG.Token
      
    FROM
    
      I CROSS APPLY Strings s CROSS
      APPLY 
        dbo.NGrams8K(
      String2,I.N) NG
    
    WHERE
    
        CHARINDEX
        (
      NG.token,String1) > 0
    
    ORDER
    
      BY N DESC;
    
      GO

Conclusion

Another problem solved without cursors, loops or recursive CTEs. Let's mark this one up as another win for the Tally Table. Thanks for reading!

--ab

the xmlsqlninja

Monday, April 29, 2013

A Set-Based Solution to the Longest Common Subsequence Problem

Updated on 9/28/2015

A Nasty Fast Set-Based Solution to the Longest Common Substring Problem:

Conclusion

Further Reading

3 comments: