Strip/sanitize HTML with Perl

pradeep · May 27, 2009

Introduction

Sanitizing HTML means removing unwanted HTML elements from submitted markup; it does not validate the markup. Many sites let you post comments using only a few HTML elements such as <a>, <b>, and <i>, and strip every other tag automatically. You may instead want to remove all HTML tags, or to allow some tags only under certain conditions: for example, <img> tags whose src attribute holds a relative URL, or <span> tags without a style attribute.

Solution

Several modules on CPAN handle this, including HTML::Sanitizer, HTML::Strip, and HTML::Scrubber. I prefer HTML::Scrubber: it is easy to use, it is fast, and it supports complex conditions when you need them.

The code

Example: Strip all HTML from a string or file
Code:
#!/usr/bin/perl

use strict;
use warnings;
use HTML::Scrubber;

my $html = q(<style type="text/css"> .myStyle { background: #afe; color: #000;} </style>
    <script language="javascript" type="text/javascript"> alert("We are testing HTML::Scrubber");    </script>
    <HR>
        a   => <a href=1>link </a>
        br  => <br>
        b   => <B> bold </B>
        u   => <U> UNDERLINE </U>
     <img src="http://go4expert.com/text.png" border=0>);
    
my $scrubber = HTML::Scrubber->new;
$scrubber->default(0); ## default to no HTML

## scrub a string
my $clean_html = $scrubber->scrub($html);

## OR scrub the contents of a file

$clean_html = $scrubber->scrub_file('myHtml.html');

Wasn't that easy? Let's take a look at some more interesting examples.
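
The comment-form case from the introduction (only a few tags such as <a>, <b> and <i> are allowed) could look roughly like the sketch below. It assumes HTML::Scrubber is installed from CPAN; the sample comment text, the variable names, and the choice to accept only http/https links are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Scrubber;

# Hypothetical comment text submitted by a user.
my $comment = q(<b>Nice</b> <u>post</u>, see <a href="http://example.com/" onclick="x()">this</a>);

# Tags that are not listed are removed, because the default is to deny.
my $comment_scrubber = HTML::Scrubber->new(
    allow => [qw( b i )],
    rules => [
        a => {
            href  => qr{^https?://}i,   # assumption: only http/https links
            title => 1,                 # title attribute allowed
            '*'   => 0,                 # every other attribute is dropped
        },
    ],
);

my $clean_comment = $comment_scrubber->scrub($comment);
print "$clean_comment\n";
```

The <u> tag and the onclick attribute should be gone from the printed result, while the text inside <u> stays.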

Example: Strip <script> and <style> tags
Code:
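use strict;
use warnings;
use HTML::Scrubber;

my $scrubber = HTML::Scrubber->new;
$scrubber->default(1); ## default to allow HTML

$scrubber->script(0); ## no script
$scrubber->style(0); ## no style

# OR deny the two tags by name

$scrubber->deny(qw[script style]);

## $html is the sample string from the first example
my $clean_html = $scrubber->scrub($html);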
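
To see what this does to real input, here is a small self-contained run. It is only a sketch: the sample markup and variable names are invented, and it relies on my reading of the module's documentation that the script and style switches suppress those elements together with everything between them.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Scrubber;

# Made-up sample input for illustration.
my $page = q(<p>Hello</p><script type="text/javascript">alert("x");</script><style>p { color: red; }</style>);

my $page_scrubber = HTML::Scrubber->new;
$page_scrubber->default(1);   # allow tags by default
$page_scrubber->script(0);    # suppress <script> elements
$page_scrubber->style(0);     # suppress <style> elements

my $result = $page_scrubber->scrub($page);
print "$result\n";
```

The paragraph should survive, while neither the JavaScript nor the CSS text should appear in the printed result.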

Example: Allow the href attribute on anchor tags only for relative URLs
Code:
my $scrubber = HTML::Scrubber->new;
$scrubber->default(1); ## default to allow HTML

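my @rules = (
        a => {
            # only relative URLs: reject any scheme (http:, https:,
            # javascript:, ...) and protocol-relative //host URLs
            href => qr{^(?!\s*(?:[a-z][a-z0-9+.\-]*:|//))}i,
            title => 1,                # title attribute allowed
        },
);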

$scrubber->rules( @rules );

my $clean_html = $scrubber->scrub($html);
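
The other two conditions mentioned in the introduction (<img> with a relative src only, <span> without a style attribute) can be written with the same kind of rules. A minimal sketch, assuming HTML::Scrubber is installed; the sample markup and variable names are invented. Note that, as far as I can tell, an attribute that fails its rule is dropped while the tag itself stays.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Scrubber;

# Made-up sample input for illustration.
my $markup = q(<span style="color:red" class="note">hi</span> )
           . q(<img src="http://example.com/a.png" alt="remote"> )
           . q(<img src="images/b.png" alt="local">);

# No default() call: tags without a rule are denied.
my $rule_scrubber = HTML::Scrubber->new(
    rules => [
        img => {
            # src must not start with a scheme or with //
            src => qr{^(?!\s*(?:[a-z][a-z0-9+.\-]*:|//))}i,
            alt => 1,
            '*' => 0,    # drop every other attribute
        },
        span => {
            style => 0,  # never keep style
            '*'   => 1,  # keep the remaining attributes
        },
    ],
);

my $clean_markup = $rule_scrubber->scrub($markup);
print "$clean_markup\n";
```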

References

http://search.cpan.org

shabbir · Jun 3, 2009

Nominating this article for Article of the Month, May 2009.
