Strip/sanitize HTML with Perl

Discussion in 'Perl' started by pradeep, May 27, 2009.


    Introduction



    Sanitizing HTML means removing unwanted elements and attributes from HTML input; it does not validate the markup. Many sites let you post comments using only a few HTML elements such as <a>, <b> and <i>, and automatically remove every other tag. You may instead want to remove all HTML tags, or allow certain tags only under conditions: for example, an <img> tag whose src attribute must be a relative URL, or <span> tags that may not carry a style attribute.

    Solution


    Several modules on CPAN handle this, including HTML::Sanitizer, HTML::Strip and HTML::Scrubber. I personally prefer HTML::Scrubber: it is easy to use, it is fast, and it supports complex conditions when you need them.

    The code



    Example: We want to strip all HTML from a string or file

    Code:
    #!/usr/bin/perl
    
    use strict;
    use warnings;
    use HTML::Scrubber;
    
    my $html = q(<style type="text/css"> myStle { background: #afe; color: #000;} </style>
        <script language="javascript" type="text/javascript"> alert("We are testing HTML::Scrubber");    </script>
        <HR>
            a   => <a href=1>link </a>
            br  => <br>
            b   => <B> bold </B>
            u   => <U> UNDERLINE </U>
         <img src="http://go4expert.com/text.png" border=0>);
        
    my $scrubber = HTML::Scrubber->new;
    $scrubber->default(0); ## default to no HTML
    
    my $clean_html = $scrubber->scrub($html);
    
    ## OR scrub the contents of a file instead of a string
    
    $clean_html = $scrubber->scrub_file('myHtml.html');
    
    Wasn't that easy? Let's take a look at some more interesting examples.

    Example: Strip <script> and <style> tags

    Code:
    my $scrubber = HTML::Scrubber->new;
    $scrubber->default(1); ## allow all tags by default
    
    ## Turn off <script> and <style> handling
    $scrubber->script(0);
    $scrubber->style(0);
    
    ## OR deny the tags by name
    $scrubber->deny(qw[script style]);
    
    my $clean_html = $scrubber->scrub($html);
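
The comment-form case from the introduction (only <a>, <b> and <i> allowed) is the opposite policy: deny everything, then allow a short list. A minimal sketch, assuming a deny-by-default scrubber with per-tag rules; the sample comment text and the href pattern are made up for illustration and are not part of the original post:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Scrubber;

# Deny every tag unless a rule below allows it
my $scrubber = HTML::Scrubber->new;
$scrubber->default(0);

$scrubber->rules(
    b => 1,    # allowed
    i => 1,    # allowed
    a => {
        # illustrative pattern: absolute http(s) links or site-rooted paths only
        href => qr{^(?:https?://|/)}i,
    },
);

# Hypothetical comment text, for illustration only
my $comment = q{<b>Nice</b> <i>post</i>, see <a href="/faq" onclick="x()">the FAQ</a><font color="red">!</font>};

my $clean = $scrubber->scrub($comment);
print "$clean\n";
```

Matching href against an allow-list pattern, rather than setting it to 1, keeps values such as javascript: URLs out of the result.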
    
    Example: Anchor tags may link only to relative URLs

    Code:
    my $scrubber = HTML::Scrubber->new;
    $scrubber->default(1); ## default to allow HTML
    
    my @rules = (
            a => {
                # only relative URLs: reject any scheme (http:, https:, javascript:, ...) and //host
                href => qr{^(?!\s*(?:[a-z][a-z0-9+.-]*:|//))}i,
                title => 1,                # title attribute allowed
            },
    );
    
    $scrubber->rules( @rules );
    
    my $clean_html = $scrubber->scrub($html);
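
The two conditional cases mentioned in the introduction, <img> with a relative src only and <span> without a style attribute, can be sketched with the same rules mechanism. This is a sketch under assumptions: the sample markup and the choice of allowed attributes (alt, class) are illustrative, and as far as I can tell a rejected attribute is dropped while the tag itself is kept.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Scrubber;

my $scrubber = HTML::Scrubber->new;
$scrubber->default(0);    # deny every tag not listed in the rules

$scrubber->rules(
    img => {
        # relative URLs only: no scheme (http:, javascript:, ...) and no //host
        src => qr{^(?!\s*(?:[a-z][a-z0-9+.-]*:|//))}i,
        alt => 1,    # alt attribute allowed
        '*' => 0,    # every other attribute denied
    },
    span => {
        class => 1,  # class attribute allowed
        style => 0,  # style attribute explicitly denied
    },
);

# Hypothetical input, for illustration only
my $html = q{<span style="color:red" class="note">hi</span> }
         . q{<img src="http://example.com/x.png" alt="remote"> }
         . q{<img src="images/y.png" alt="local">};

my $clean = $scrubber->scrub($html);
print "$clean\n";
```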
    

    References


    http://search.cpan.org
     