Clean User Input HTML using HTML::Scrubber

pradeep · May 24, 2012

Most modern websites accept user input in the form of comments, reviews, private messages and so on. The HTML tags in that content need to be controlled to prevent XSS attacks, URL spam, embedded videos (which might raise copyright problems) and similar issues. Many sites publish a list of allowed HTML tags and either strip out everything else or show an error message to the user.

Stripping the unwanted tags is usually the better option, because many users are not aware of the tags in their content or do not know how to fix them. In this article we'll explore the Perl module HTML::Scrubber, which is highly configurable, and use it to strip unwanted HTML tags and to write rules that strip tags and attributes based on certain conditions.

Basic Usage

The following example shows the basic usage of HTML::Scrubber. We allow only the B, I, U and BR tags, so all other tags are stripped.
Code:
use strict;
use warnings;
use HTML::Scrubber;

# Only these tags survive; every other tag is stripped.
my $basic_scrubber = HTML::Scrubber->new( allow => [qw/b i u br/] );
print $basic_scrubber->scrub('Hi, Check out <a href="http://whatanindianrecipe.com">WhatAnIndianRecipe</a> for delicious dishes from India.');
Output:
Code:
Hi, Check out WhatAnIndianRecipe for delicious dishes from India.
As you can see from the output, the A tag has been stripped and only its text is left. Only B, I, U and BR tags would have been kept.
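To see the allow list keep tags as well as strip them, here is a minimal sketch that passes a string with both allowed and disallowed tags through the same kind of scrubber. The input string and the variable names are made up for illustration; only new, its allow option and scrub come from the module.

```perl
use strict;
use warnings;
use HTML::Scrubber;

# Same allow list as the basic example.
my $scrubber = HTML::Scrubber->new( allow => [qw/b i u br/] );

# Hypothetical user comment mixing allowed (<b>, <i>) and disallowed (<a>) tags.
my $input  = 'A <b>bold</b> and <i>italic</i> <a href="http://example.com/">link</a>.';
my $output = $scrubber->scrub($input);

# The <b> and <i> tags should survive; the <a> tag should be dropped,
# leaving only its text behind.
print "$output\n";
```

Note that the allow list works on tag names only. Deciding which attributes are kept needs the rules and default options covered under Advanced Usage.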

Advanced Usage

In more advanced use we can control which attributes of certain tags to allow, or set default rules such as never allowing the onmouseover attribute. Have a look at the example code below; it should help you understand the idea behind the package.
Code:
#!/usr/bin/perl

use strict;
use warnings;
use HTML::Scrubber;

## allowed tags
my @allow = qw/br i a/;

## allow/disallow tags & attributes
my @rules = (
 script => 0,
 img => {
 ## allow images only from a specific domain
 ## (dots escaped and trailing slash required, so look-alike hosts do not match)
 src => qr{^http://www\.go4expert\.com/}i,
 ## allow
 alt => 1, # alt attribute allowed
 '*' => 0, # deny all other attributes
 },
);

## default rules
my @default = (
 0 => {
 ## allow all attributes
 '*' => 1,
 ## title attribute in all tags will be removed
 title => 0,
 ## set to disallow all JS event attributes
 'onblur' => 0,
 'onchange' => 0,
 'onclick' => 0,
 'ondblclick' => 0,
 'onerror' => 0,
 'onfocus' => 0,
 'onkeydown' => 0,
 'onkeypress' => 0,
 'onkeyup' => 0,
 'onload' => 0,
 'onmousedown' => 0,
 'onmousemove' => 0,
 'onmouseout' => 0,
 'onmouseover' => 0,
 'onmouseup' => 0,
 'onreset' => 0,
 'onselect' => 0,
 'onsubmit' => 0,
 'onunload' => 0
 }
);

my $advanced_scrubber = HTML::Scrubber->new(
 allow => \@allow,
 rules => \@rules,
 default => \@default
);

print $advanced_scrubber->scrub('Hi, Check out <a href="http://www.google.com" title="Search">Google</a> for delicious dishes from India. More info at <a href="http://en.wikipedia.org/Recipes">Recipes at Wikipedia</a>. <img src="/images/avatar.jpg" alt="Avatar Image" onMouseOver="alert(window.location)"> <embed src="api.flv"></embed> <img src="http://www.go4expert.com/images/logo.png" alt="Avatar Image" onMouseOver="alert(window.location)">');
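In the rules above, the A tag is only in the allow list, so its href attribute falls under the default attribute rules and is not restricted. As a hedged sketch of how the same rule syntax could tighten that, the following keeps href only when it is an http or https URL. The $link_scrubber name, the regex and the sample inputs are illustrative assumptions, not part of the original example.

```perl
use strict;
use warnings;
use HTML::Scrubber;

my $link_scrubber = HTML::Scrubber->new(
    rules => [
        a => {
            href => qr{^https?://}i,    # keep href only if it is an http(s) URL
            '*'  => 0,                  # drop every other attribute on <a>
        },
    ],
);

# Hypothetical inputs: one ordinary link, one carrying script in its attributes.
my $good = $link_scrubber->scrub('<a href="http://example.com/" title="t">ok</a>');
my $bad  = $link_scrubber->scrub('<a href="javascript:alert(1)" onclick="x()">click</a>');

print "$good\n$bad\n";
```

In the second call the tag itself is kept, because a rule exists for it, but the javascript: href and the onclick attribute should both be removed. Listing the permitted attributes and denying the rest with '*' => 0 is generally easier to keep safe than allowing everything and listing the event attributes to block.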
References

http://search.cpan.org/dist/HTML-Scrubber/

FredTighe · Jun 17, 2012

Great post! I like this blog very much. I learned a lot of important information from it.
Keep up the good work.

Scripting · Jun 17, 2012

Very interesting, I think I will learn Perl more!
