Monday, April 29, 2013

A Set-Based Solution to the Longest Common Subsequence Problem


Updated on 9/28/2015


Update: Exactly 2 and a half years ago I posted what I believed to be a solution to the Longest Common Subsequence problem using a Tally table. Unfortunately I got a substring and a subsequence mixed up. Ironically, if you look at the Wikipedia page for the Longest Common Subsequence the first line reads, "Not to be confused with longest common substring problem." and for the Longest Common Substring the first line reads, "Not to be confused with longest common subsequence problem.". I guess I'm not the only one who has gotten these confused. Anyhow, I have spent the past 2 and a half years trying to develop a purely set-based solution to the Longest Common Subsequence and have failed. It's been a great learning experience though; this type of exercise has dramatically sharpened my SQL and math skills. For a great solution to this problem see Phil Factor's excellent solution from earlier this year. I do have an updated solution to the Longest Common Substring however. It uses my N-Grams Function. Below is my updated NGrams8K function and LCSS

A Nasty Fast Set-Based Solution to the Longest Common Substring Problem:

CREATE FUNCTION dbo.NGrams8K (@string varchar ( 8000), @n int )

/********************************************************************

Created by:       Alan Burstein

Created on:      3/10/2014

Last Updated on: 09/09/2015

 

n-gram defined:

In the fields of computational linguistics and probability,

an n-gram is a contiguous sequence of n items from a given

sequence of text or speech. The items can be phonemes, syllables,

letters, words or base pairs according to the application.

For more information see: http://en.wikipedia.org/wiki/N-gram

 

Use:

Outputs a stream of tokens based on an input string.

Similar to mdq.nGrams:

http://msdn.microsoft.com/en-us/library/ff487027(v=sql.105).aspx.

Except it only returns characters as long as K.

nGrams8K also includes the position of the "Gram" in the string.

 

Revision History:

 Rev 00 - 03/10/2014 Initial Development - Alan Burstein

 Rev 01 - 05/22/2015 Removed DQS N-Grams functionality,

          improved iTally - Alan Burstein

 Rev 02 - 05/22/2015 Changed TOP logic to remove implicit conversion

          - Alan Burstein

 Rev 03 - 9/9/2015 Added logic to only return values if @n is greater

          than 0 and less then length of @string - Alan Burstein

********************************************************************/

RETURNS TABLE WITH SCHEMABINDING AS RETURN

  WITH

  L1( N) AS

  (

  SELECT 1

  FROM ( VALUES

(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)

        ) t( N)

  ),

  iTally ( N) AS

  (

        SELECT TOP ( CONVERT(BIGINT,(DATALENGTH(@string)-(@n-1)),0))

          ROW_ NUMBER( ) OVER ( ORDER BY ( SELECT NULL))

        FROM L1 a CROSS JOIN L1 b -- add two more cross joins to support varchar(max)

  )

  SELECT

        position = N,

        token = SUBSTRING(@ string,N,@n )

  FROM iTally

  WHERE @n > 0 AND @n <= DATALENGTH ( @string);

GO

 

/********************************************************************

Created by:      Alan Burstein

Created on:      3/10/2013

Last Updated on: 09/20/2015

 

Use:

Returns the longest common substring between two strings.

 

Revision History:

  Rev 00 - 03/10/2013 Initial Development - Alan Burstein

  Rev 03 - 09/20/2015 Performance tuned using NGrams8K - Alan Burstein

********************************************************************/

CREATE FUNCTION dbo.LCSS8K

(

  @string1 varchar ( 8000),

  @string2 varchar ( 8000)

)

RETURNS TABLE WITH SCHEMABINDING AS RETURN

WITH

Strings AS

(

  SELECT

   String1 = CASE WHEN LEN ( @string1)>LEN(@string2) THEN @string1 ELSE @string2 END,

   String2 = CASE WHEN LEN ( @string1)>LEN(@string2) THEN @string2 ELSE @string1 END

),

I( N) AS (SELECT position FROM Strings CROSS APPLY dbo.NGrams8K(String2,1))

SELECT TOP (1) WITH TIES

  TokenLength = I.N,

  NG.Token

FROM I CROSS APPLY Strings s CROSS APPLY dbo.NGrams8K( String2,I.N) NG

WHERE CHARINDEX ( NG.token,String1) > 0

ORDER BY N DESC;

GO

Conclusion

Another problem solved without cursors, loops or recursive CTEs. Let's mark this one up as another win for the Tally Table. Thanks for reading!


--ab

The new xmlsqlninja logo ('X-M-L'-'See-Quel' 'ninja')

Further Reading


Last Updated: 09/28/2015

Sunday, April 28, 2013

mdq.XmlTransform -- Part 2: Creating an MS Word Doc


Intro

mdq.xmlTransform can do amazing things. Today I will show you how to write a T-SQL query that returns a Microsoft Office Word document viewable in Word 2003 and later (and most programs that open .doc or .docx files). With mdq.xmlTransform it's very easy. If you don't yet have it, see Setting up mdq.XmlTransform for DDL and setup instructions; it takes less than a minute to setup.


Let's get right to it

Some may be intimidated at the idea of writing a query that returns an MS Word document. Don't be, it's simple. That's why I'll keep this real short and just show you.


(1) A Quick WordProcessingML (WordML) tutorial

1. Open notepad (Start > Run > notepad.exe)
2. Copy/paste the code below into your blank notepad file:

<?xml version="1.0" encoding="utf-8"?>
   
<?mso-application progid="Word.Document"?>
   
       
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
           
<w:body>
                               
           
<w:p>
               
<w:r>
                   
<w:t>Hello Word!!!</w:t>
               
</w:r>
           
</w:p>
   
            
</w:body>
       
</w:wordDocument>

3. Go to file > save as and for "Save as Type" select All Files
4. For the file name type helloWord.xml
5. Click Save then close the file.

Now open the file with Microsoft Word version 2003 or later and you should see this:

Congrats! You just created your first word document using WordML.


(2) Using mdq.xmlTransform to produce a Word document

SQL Server supports the XML Data Type. WordML is XML. mdq.xmlTransform transforms XML. Now that you have seen how WordML works lets write a SQL query that returns a Word Document.

Using a database with mdq.xmlTranform, copy/paste and execute this code into SSMS:

DECLARE @xslt xml='

<xsl:stylesheet version="1.0"

    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"

    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">

    <xsl:output method="xml"  xml:space="default" />

 

    <!--  This template does all the work -->

    <xsl:template match="/" xml:space="default">

        <xsl:processing-instruction name="mso-application" xml:space="default">

            <xsl:text>progid="Word.Document"</xsl:text>

        </xsl:processing-instruction>

  

        <w:wordDocument><!-- This creates the word doc -->

            <w:body>

            <w:p>

                <w:r>

                    <w:t>Created with mdq.xmlTransform (no loops)</w:t>

                </w:r>

            </w:p>

            </w:body>

        </w:wordDocument>

    </xsl:template>  

    <xsl:template match="DataSourceID"/>

</xsl:stylesheet>';

 

SELECT REPLACE(mdq.xmlTransform('',@xslt),

                      '<?mso-application progid="Word.Document"?>',

                      '<?xml version="1.0" encoding="ISO-8859-1"?>

                       <?mso-application progid="Word.Document"?>') AS xml_output;

 

You should already know what will happen if you: copy/paste the result set into a new notepad file, save it like we did earlier, and then open it with Word (but feel free to do it again if you thought it was cool).


...and now for something a little more impressive...

Hopefully some developers, DBAs and BI people see the potential here. Think of how many SQL objects that are (or can be) stored in XML format. Query plans, traces, SSRS reports (RDL) and SSIS packages (DTSX packages) to name a few. Word and Excel files, your whatever.config files, a multitude of SharePoint objects, RSS feeds, web service data, Extended Events, etc, etc... All XML. Thanks to the XML data type and mdq.xmlTransform, information from all these things can be stored, measured, analyzed and, as I will show you in a moment, stuffed into XML files or fragments.

(3) Create a Word doc with SSRS Report (RDL) data

The code below contains two XML documents. For the first (1) I grabbed an some data from an SSRS RDL file (just part of it to keep things simple). Using T-SQL REPLACE I removed the rd: namespace references. The second (2) is an XML transform that will extract the Data Provider and Connection String from an SSRS report and use it to create a Word Doc.

<!--  (1) Grab an RDL XML FRAGMENT from an SSRS report -->
<!--  (using T-SQL REPLACE to remove the rd: namespace references) -->

<DataSource Name="ReportingDemo">
   
<DataSourceID>f34d206b-ca72-4ca6-9d5c-4151cd7eaxxx</DataSourceID>
   
<ConnectionProperties>
       
<DataProvider>SQL</DataProvider>
       
<ConnectString>
            Data Source=ABC;EFG Catalog=XYZ
       
</ConnectString>
   
</ConnectionProperties>
</DataSource>

<!—(2) Create an XSLT function that puts this data into a MS Word Doc -->

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
   
<xsl:output method="xml"  xml:space="default" omit-xml-declaration="no" />


    <!--  This template does all the work -->
    <xsl:template match="/DataSource/ConnectionProperties" xml:space="preserve">
        <xsl:processing-instruction name="mso-application" xml:space="default">
            <xsl:text>progid="Word.Document"</xsl:text>
        </xsl:processing-instruction>
   
        <w:wordDocument><!-- This creates the word doc -->
            <w:body>

                <!-- These two templates collect the data -->
                <xsl:apply-templates select="DataProvider"/>
                <xsl:apply-templates select="ConnectString"/>
            </w:body>
        </w:wordDocument>
    </xsl:template>
   
    <xsl:template match="DataProvider" xml:space="preserve">
            <w:p>
                <w:r>
                    <w:t><xsl:apply-templates /></w:t>
                </w:r>
            </w:p>
    </xsl:template>

    <xsl:template match="ConnectString" xml:space="preserve">
            <w:p>
                <w:r>
                    <w:t><xsl:apply-templates /></w:t>
                </w:r>
            </w:p>
    </xsl:template>


    <!-- This disposes of DataSourceID-->
    <xsl:template match="DataSourceID"/>

</xsl:stylesheet>

If we feed that RDL data and transform to mdq.xmlTransform like this:

DECLARE @xml xml='

<DataSource Name="ReportingDemo">

    <DataSourceID>f34d206b-ca72-4ca6-9d5c-4151cd7eaxxx</DataSourceID>

    <ConnectionProperties>

        <DataProvider>SQL</DataProvider>

        <ConnectString>

            Data Source=ABC;EFG Catalog=XYZ

        </ConnectString>

    </ConnectionProperties>

</DataSource>',

 

@xslt xml='

<xsl:stylesheet version="1.0"

    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"

    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">

    <xsl:output method="xml"  xml:space="default" />

 

    <!--  This template does all the work -->

    <xsl:template match="/DataSource/ConnectionProperties" xml:space="default">

        <xsl:processing-instruction name="mso-application" xml:space="default">

            <xsl:text>progid="Word.Document"</xsl:text>

        </xsl:processing-instruction>

  

        <w:wordDocument><!-- This creates the word doc -->

            <w:body>

                <!-- These two templates collect the data -->

                <xsl:apply-templates select="DataProvider"/>

                <xsl:apply-templates select="ConnectString"/>

            </w:body>

        </w:wordDocument>

    </xsl:template>

  

    <xsl:template match="DataProvider" xml:space="preserve">

            <w:p>

                <w:r>

                    <w:t><xsl:apply-templates /></w:t>

                </w:r>

            </w:p>

    </xsl:template>

 

    <xsl:template match="ConnectString" xml:space="preserve">

            <w:p>

                <w:r>

                    <w:t><xsl:apply-templates /></w:t>

                </w:r>

            </w:p>

    </xsl:template>

 

    <!-- This disposes of DataSourceID-->

    <xsl:template match="DataSourceID"/>

</xsl:stylesheet>';

 

SELECT REPLACE(mdq.xmlTransform(@xml,@xslt),

                      '<?mso-application progid="Word.Document"?>',

                      '<?xml version="1.0" encoding="ISO-8859-1"?>

                       <?mso-application progid="Word.Document"?>') AS xml_output;

 

We get this:

<?xml version="1.0" encoding="utf-8"?>
   
<?mso-application progid="Word.Document"?>
   
       
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
           
<w:body>
               
           
<w:p>
               
<w:r>
                   
<w:t>SQL</w:t>
               
</w:r>
           
</w:p>
               
           
<w:p>
               
<w:r>
                   
<w:t>Data Source=ABC;EFG Catalog=XYZ</w:t>
               
</w:r>
           
</w:p>
   
            
</w:body>
       
</w:wordDocument>

...which, if we copy/paste into notepad and save as wow.xml, then open with Microsoft Office Word, we get:

Conclusion

mdq.xmlTransform is a powerful tool. Period. Today I showed you how to create a basic Word document using just T-SQL and mdq.xmlTransform. Thanks for reading!


--ab

The new xmlsqlninja logo ('X-M-L'-'See-Quel' 'ninja')

Last Updated: 4/29/2013 11:35am (Posted, fixed code issues)