While modern browsers will work around many syntax problems in your HTML, if you want consistent pages across multiple browsers, it's a good idea to check the syntax. That's where HTML::Lint comes in. We'll show you how to use this syntax-checking tool.

Modern browsers include sophisticated routines to work around bad HTML and render a page without generating a series of ugly error messages about "unterminated tags" or "invalid doctypes". But the fact that the browser tries to handle errors is no reason for you to ignore them. To have your pages render consistently, you should vet your HTML documents against the W3C's latest specification to make sure they comply with its rules and syntax.

There are online tools to do this, the most famous being the W3C's own Markup Validator Service. The problem with an online service, however, is that it can be slow and may even get swamped if you send it a large number of pages. A validator on your local computer is a better fit, especially if you plan to validate a large batch of files. That's where the HTML::Lint Perl module comes in.

Installing HTML::Lint

The HTML::Lint module is built on top of the very popular HTML::Parser and HTML::Tagset modules. It's designed to check, or "lint", your HTML code for errors that might cause it to break or render incorrectly. Written in Perl and needing no external libraries beyond those modules, HTML::Lint can parse either an HTML file or a string containing HTML markup. Each error is classified into one of three categories, and the module lets you filter the error list by category. HTML::Lint is licensed under the GPL, and is maintained by Andy Lester.
Detailed installation instructions are provided in the download archive, but the simplest way to install it is to use the CPAN shell:

Code:
shell> perl -MCPAN -e shell
cpan> install HTML::Lint

This tutorial uses version 1.28 of HTML::Lint, the current version at the time of writing.

Linting a string or file

With the module installed, let's try a simple example that demonstrates how it works:

Code:
#!/usr/bin/perl
use strict;
use warnings;

# import module
use HTML::Lint;

# create an HTML string with an error in it
my $html = "<html><head></head><body><center>This is a simple HTML document with an unclosed element</body></html>";

# create a Lint object
my $lint = HTML::Lint->new;

# parse the HTML string
$lint->parse($html);

# signal the end of the input so any buffered text is processed
$lint->eof;

# check for errors and print a message
# (in scalar context, errors() returns the number of errors)
($lint->errors) ? print "The HTML is invalid\n" : print "The HTML is valid\n";

This is fairly self-explanatory: once you create an instance of HTML::Lint, most of the heavy lifting is done by the parse() method. This method accepts a string of HTML and checks it for validity. Errors, if any, are collected in the object's error list, which the errors() method returns. By checking this list, your script can display a message indicating whether the string is valid or not.

Of course, it's unlikely that you're going to be writing HTML strings inside your lint scripts. Luckily, you can also scan existing HTML documents on your computer. Besides the plain parse() method, HTML::Lint comes with a parse_file() method, which accepts a file path instead of a string as its argument:

Code:
#!/usr/bin/perl
use strict;
use warnings;

# import module
use HTML::Lint;

# create a Lint object
my $lint = HTML::Lint->new;

# parse a file
$lint->parse_file("/usr/local/apache/htdocs/site1/welcome.html") or die("Cannot find file!");

# check for errors and print a message
($lint->errors) ? print "The HTML is invalid\n" : print "The HTML is valid\n";

Here, the HTML::Lint parser opens the file, scans it and adds any errors it finds to the same error list.
You could, obviously, make the file name and path an input argument to the script for maximum flexibility. We'll do that in the next example, but first a word about handling errors.

Handling errors found by HTML::Lint

While the previous examples showed the basics of how HTML::Lint works, they didn't show you how to identify which errors were found. For that we have to process the list returned by errors(), which contains the detailed error messages. Each error is an instance of the HTML::Lint::Error class and is one of three types:

STRUCTURE - Incorrect attribute values or improperly terminated or nested elements.

HELPER - Hints about optional attributes that are missing from the document but would make your code "better", such as ALT attributes for images.

FLUFF - Miscellaneous errors, usually unknown elements or attributes. Browsers generally ignore these, but even if they're harmless, you don't want them lurking in your document.

Each error in the list carries the line and column number where it was found. To see this in action, consider the following revision of the previous example:

Code:
#!/usr/bin/perl
use strict;
use warnings;

# get the file name from the command line or display an error
if (!$ARGV[0]) {
die("ERROR: No file name provided");
}

# import module
use HTML::Lint;

# create a Lint object
my $lint = HTML::Lint->new;

# parse a file
$lint->parse_file($ARGV[0]) or die("ERROR: Cannot find file");

# process error list and print
foreach my $error ($lint->errors) {
print $error->where(), ": ", $error->errtext(), "\n";
}

# print error count
print "Total errors: ", scalar($lint->errors), "\n";

Here, the name of the file to be linted is passed to the script from the console, through Perl's special @ARGV command-line array. This file is scanned by parse_file(), and the resulting error list is processed using a foreach loop.
For each error, the where() method returns the line and column number of the error, while the errtext() method returns the exact text of the error message. Here is how you might use the script (called lint.pl) and the possible output:

Code:
$ ./lint.pl ../projects/form.html
(31:36): <IMG> tag has no HEIGHT and WIDTH attributes.
(174:2): Unknown attribute "height" for tag <tr>
Total errors: 2

To clear the error list of all messages (useful if you're parsing multiple files and want a clean slate before each run), use the clear_errors() method:

Code:
$lint->clear_errors();

Finally, you can restrict the error list to specific error types by adding an optional argument to the HTML::Lint object's new() constructor. For example, to limit the error list to structural errors only:

Code:
$lint = HTML::Lint->new(only_types => HTML::Lint::Error::STRUCTURE);

If you want HTML::Lint to check an entire site, you can wrap the script above in a shell script and pass it filenames one after another, or alter the script to retrieve a file list using Perl's directory functions and then pass the files to parse_file() one by one.

Parsing remote files with LWP

This final example shows you how to use HTML::Lint to check remote files, by passing the script a URL instead of a local file path. This behavior is not built into HTML::Lint; the module itself only parses local files and strings. But by combining it with the CGI and LWP modules, you can retrieve and check a stream of HTML data from a remote Web server.
The example below shows the code to accomplish this (this script should be named lint.cgi and placed in your Web server's CGI-BIN directory):

Code:
#!/usr/bin/perl
use strict;
use warnings;

# import modules (use statements run at compile time, so they are grouped here)
use CGI;
use CGI::Carp qw/fatalsToBrowser warningsToBrowser/;
use LWP::UserAgent;
use HTML::Lint;

# create CGI object
my $cgi = CGI->new;

# send header
print $cgi->header;

# check if form is submitted
# if not display form
if (!$cgi->param()) {
# print form
print <<HTML;
<html>
<head></head>
<body>
<form action="/cgi-bin/lint.cgi" method="post">
<input type="text" name="url"><input type="submit" name="submit" value="Check">
</form>
</body>
</html>
HTML
}
else {
# form has been submitted
# check for value or die()
my $url = $cgi->param('url');
if ($url) {
# create user agent and send request for URL
my $agent = LWP::UserAgent->new;
my $request = HTTP::Request->new(GET => $url);
my $result = $agent->request($request);

# check for return value, else die()
if (!$result->is_success) {
print "ERROR: Could not connect";
die;
}
else {
# file has come in, now start linting...
# create a Lint object
my $lint = HTML::Lint->new;

# parse data from the HTTP result, then signal end of input
$lint->parse($result->content);
$lint->eof;

# print the result page
print <<SHTML;
<html>
<head></head>
<body>
Errors:
SHTML

# process error list and print
# (messages contain tag names such as <IMG>, so escape them for display)
print "<ul>";
foreach my $error ($lint->errors) {
print "<li>", $error->where(), ": ", $cgi->escapeHTML($error->errtext()), "</li>";
}
print "</ul>";

# print error count
print "Total errors: ", scalar($lint->errors);

# close the result page
print <<EHTML;
</body>
</html>
EHTML
}
}
else {
print "ERROR: Please enter a URL";
die;
}
}
# EOF

This might seem complicated, but it's really not that bad. The script is split into two sections, one displaying the initial form and the other displaying the form results. Once a URL is entered into the text field and the form submitted, the script uses the LWP module to connect to the remote server and request the page.
The page data is then read into a variable and passed to HTML::Lint for linting. Errors, if any, are displayed in a neat bulleted list.

With this script and the command-line script shown earlier, you now have the tools to check both local and remote files for errors with HTML::Lint. So what are you waiting for? Start linting!
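The whole-site check suggested earlier (build a file list with Perl's directory functions, then pass each file to parse_file()) might be sketched as follows. This is only a minimal sketch under some assumptions: it uses the core File::Find module to walk the tree, treats any file ending in .html or .htm as HTML, and the helper names find_html_files, lint_file and report_directory are made up for illustration and are not part of HTML::Lint.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: collect every .html/.htm file below $dir.
sub find_html_files {
    my ($dir) = @_;
    my @files;
    find(
        {
            # no_chdir keeps $_ equal to the full path of each entry
            wanted   => sub { push @files, $_ if -f $_ && /\.html?$/i },
            no_chdir => 1,
        },
        $dir
    );
    my @sorted = sort @files;
    return @sorted;
}

# Hypothetical helper: lint one file and return its list of errors.
sub lint_file {
    my ($file) = @_;
    require HTML::Lint;            # loaded on first use
    my $lint = HTML::Lint->new;    # fresh object, so no stale errors carry over
    $lint->parse_file($file) or die "ERROR: Cannot read $file\n";
    return $lint->errors;
}

# Hypothetical helper: lint a whole directory tree and print a report.
sub report_directory {
    my ($dir) = @_;
    my $total = 0;
    foreach my $file (find_html_files($dir)) {
        my @errors = lint_file($file);
        foreach my $error (@errors) {
            print "$file ", $error->where(), ": ", $error->errtext(), "\n";
        }
        $total += scalar @errors;
    }
    print "Total errors: $total\n";
    return $total;
}

# Run only when a directory is given on the command line.
if (@ARGV) {
    report_directory($ARGV[0]);
}
else {
    warn "Usage: $0 DIRECTORY\n";
}
```

Saved as, say, lintdir.pl (a hypothetical name), it would be run as ./lintdir.pl /path/to/site. Creating a new HTML::Lint object for each file is a design choice made here for simplicity; reusing one object and calling clear_errors() between files, as described above, is the alternative.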