Most modern websites accept user input in the form of comments, reviews, private messages and so on, so the HTML tags in that content need to be controlled to prevent XSS attacks, URL spam, embedded videos (which might attract copyright problems) and similar issues. Many sites publish a list of allowed HTML tags and either strip out the rest or show an error message to the user. Stripping is usually the better choice, because many users are not aware of the tags present in their content or do not know how to fix them.

In this article we'll explore the Perl module HTML::Scrubber, which is highly configurable. We'll use it to strip unwanted HTML tags and to write validation rules that strip tags and attributes based on certain conditions.

Basic Usage

In the following example we allow only the B, I, U and BR tags; every other tag is stripped.

Code:
use HTML::Scrubber;

my $basic_scrubber = HTML::Scrubber->new( allow => [qw/b i u br/] );

print $basic_scrubber->scrub('Hi,<br> Check out <a href="http://whatanindianrecipe.com">WhatAnIndianRecipe</a> for <b>delicious</b> dishes from <i>India</i>.');

Output:

Code:
Hi,<br> Check out WhatAnIndianRecipe for <b>delicious</b> dishes from <i>India</i>.

As the output shows, the A tag has been stripped (its text is kept), leaving only the BR, B and I tags.

Advanced Usage

For more advanced use, we can control which attributes of particular tags are allowed, and set default rules such as never allowing the onmouseover attribute. The example code below should help you understand the idea behind the package.
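Before the full listing, here is a smaller sketch of a per-tag attribute rule. It is only an illustration: the variable names, the example.com URL and the `^https?://` pattern are assumptions made for this sketch and do not come from the module's documentation. The idea is that a tag named in `rules` with a hash of attribute checks keeps an attribute only when its check passes.

```perl
use strict;
use warnings;
use HTML::Scrubber;

# Hypothetical rule set: allow <a>, but keep href only when it
# starts with http:// or https://, and drop every other attribute.
my $link_scrubber = HTML::Scrubber->new(
    allow => [qw/b i br/],
    rules => [
        a => {
            href => qr{^https?://}i,    # kept only if the value matches
            '*'  => 0,                  # all other attributes on <a> are removed
        },
    ],
);

my $dirty_links =
      '<a href="javascript:alert(1)" onclick="evil()">bad</a> and '
    . '<a href="http://example.com/" title="t">good</a>';

my $clean_links = $link_scrubber->scrub($dirty_links);
print "$clean_links\n";
```

With these rules the first link should lose both its `javascript:` href and its `onclick` attribute, while the second keeps its href and loses only `title`. The link text is preserved in both cases.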
Code:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Scrubber;

## allowed tags
my @allow = qw/br i a/;

## allow/disallow tags & attributes
my @rules = (
    script => 0,
    img    => {
        ## allow images only from a specific domain
        ## (dots escaped and trailing slash required, so look-alike hosts do not match)
        src => qr{^http://www\.go4expert\.com/}i,
        alt => 1,    # alt attribute allowed
        '*' => 0,    # deny all other attributes
    },
);

## default rules
my @default = (
    0 => {    # 0: tags without a rule of their own are stripped
        ## allow all attributes
        '*' => 1,
        ## title attribute in all tags will be removed
        title => 0,
        ## disallow all JS event attributes
        'onblur'      => 0,
        'onchange'    => 0,
        'onclick'     => 0,
        'ondblclick'  => 0,
        'onerror'     => 0,
        'onfocus'     => 0,
        'onkeydown'   => 0,
        'onkeypress'  => 0,
        'onkeyup'     => 0,
        'onload'      => 0,
        'onmousedown' => 0,
        'onmousemove' => 0,
        'onmouseout'  => 0,
        'onmouseover' => 0,
        'onmouseup'   => 0,
        'onreset'     => 0,
        'onselect'    => 0,
        'onsubmit'    => 0,
        'onunload'    => 0
    }
);

my $advanced_scrubber = HTML::Scrubber->new(
    allow   => \@allow,
    rules   => \@rules,
    default => \@default
);

print $advanced_scrubber->scrub(
      'Hi,<br> Check out <a href="http://www.google.com" title="Search">Google</a> for <b>delicious</b> dishes from <i>India</i>. '
    . 'More info at <a href="http://en.wikipedia.org/Recipes">Recipes at Wikipedia</a>.<br>'
    . '<img src="/images/avatar.jpg" alt="Avatar Image" onMouseOver="alert(window.location)"> '
    . '<embed src="api.flv"></embed> '
    . '<img src="http://www.go4expert.com/images/logo.png" alt="Avatar Image" onMouseOver="alert(window.location)">'
);

References

http://search.cpan.org/dist/HTML-Scrubber/
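As a follow-up to the advanced listing, the long run of event attributes in the default rules could also be generated rather than typed out. The sketch below is one possible way to do that, not something the module requires: `@events`, `%deny_events` and the sample input are names and data made up for this illustration, and the event list simply mirrors the one used in the article.

```perl
use strict;
use warnings;
use HTML::Scrubber;

# Event names from the advanced listing, without the "on" prefix.
my @events = qw/
    blur change click dblclick error focus keydown keypress keyup load
    mousedown mousemove mouseout mouseover mouseup reset select submit unload
/;

# Build ( onblur => 0, onchange => 0, ... ) from the list above.
my %deny_events = map { ( "on$_" => 0 ) } @events;

my $default_scrubber = HTML::Scrubber->new(
    allow   => [qw/br i a/],
    default => [
        0,    # tags without a rule of their own are stripped
        {
            '*'   => 1,     # keep attributes by default
            title => 0,     # except title
            %deny_events,   # and the JavaScript event handlers
        },
    ],
);

my $clean_default = $default_scrubber->scrub(
    '<a href="http://www.google.com" title="Search" onClick="alert(1)">Google</a>'
);
print "$clean_default\n";
```

The mixed-case `onClick` in the sample input is deliberate: the rule keys are lowercase, and the scrubbed link should keep only its href.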