10/31/14

SQL Server Parse HTML Content from Data

With this TSQL script, we can remove HTML content as long as there are valid start and ending HTML tags.

Script:

Create FUNCTION [dbo].[udf_StripHTMLContent]
(@content VARCHAR(MAX),@tagStart varchar(55),  @tagEnd varchar(55))
RETURNS VARCHAR(MAX)
AS
BEGIN

DECLARE @start INT=1
DECLARE @end INT=1
DECLARE @len INT=1
           
WHILE @start > 0 AND @end > 0 AND @len > 0
BEGIN
                       
      SET @start = CHARINDEX(@tagStart,@content)
      SET @end = (len(@tagEnd)-1) + CHARINDEX(@tagEnd,@content,CHARINDEX(@tagStart,@content))        
      SET @len = (@end - @start) + 1     
      IF @start > 0 AND @end > 0 AND @len > 0 and (@len > 1 and @end > len(@tagEnd))
      BEGIN
            SET @content = STUFF(@content,@start,@len,'')              
      END
      ELSE
      BEGIN
            break;           
      END                                
END
RETURN REPLACE(@content,' ',' ')

This script parses the content looking for the start and end tags. It replaces all the content that is contained by those tags. It continues to parse the string until no more matching tags are found or the end of the string is reached.

Example:

Use the following examples to see how this function can be used. The first couple of examples just parse a string. The last example does an inline SQL update to remove HTML tags from some inventory records.

declare @html varchar(max), @start varchar(55), @end varchar(55)
set @start = '<div><a href="bad.com">'
set @end = '</a></div>'
set @html = 'This item has many features<div><a href="bad.com">Unwanted content</a></div>'
select @html as before
set @html  =dbo.udf_StripHTMLContent(@html,@start, @end)
select @html as after

set @start = '<p style="display:none">'
set @end = '</p>'
set @html = 'This item has many features<p style="display:none"><a href="bad.com">Unwanted content</a></p>'
select @html as before
set @html  =dbo.udf_StripHTMLContent(@html,@start, @end)
select @html as after

--TABLE UPDATE CALL

update a set ItemDetail = dbo.udf_StripHTMLContent(ItemDetail,@start, @end)
from Inventory..Item a
...

This function can also be used to strip out content that matches a particular start and ending pattern.

I hope this is useful.