Validating HTML with Perl Module HTML::Lint

Discussion in 'Perl' started by pradeep, Nov 6, 2007.

  1. pradeep

    While modern browsers will work around many syntax problems in your HTML, if you want to ensure consistent pages across multiple browsers, it's a good idea to check the syntax. That's where HTML::Lint comes in. We'll show you how to use this powerful syntax-checking tool.

    Modern browsers include sophisticated routines to work around your bad HTML and render a page without generating a series of ugly error messages about "unterminated tags" or "invalid doctypes". But just because the browser tries to handle errors is no reason for you to ignore the problem. To have your pages render consistently, you should vet the HTML documents against the W3C's latest specification to ensure you are in compliance with the latest rules and syntax.

    There are online tools to do this, the most famous being the W3C's own Markup Validator Service. The problem with an online service, however, is that it can be slow and may even get swamped if you send it a large number of pages. It's a good idea to use a validator on your local computer, especially if you are planning to validate a large batch of files. That's where the HTML::Lint Perl module comes in.

    Installing HTML::Lint



    The HTML::Lint module is built on top of the very popular HTML::Parser and HTML::Tagset modules. It's designed to check, or "lint", your HTML code for errors that might cause it to break or render incorrectly. Written entirely in Perl, with no dependencies on external libraries, HTML::Lint can parse either an HTML file or a string containing HTML markup. Errors are classified into one of three categories according to their severity, and the module lets you filter the error list by category, for example to report only the most severe errors.

    HTML::Lint is licensed under the GPL, and is maintained by Andy Lester. Detailed installation instructions are provided in the download archive, but the simplest way to install it is to use the CPAN shell:

    Code:
    shell> perl -MCPAN -e shell
     cpan> install HTML::Lint
    This tutorial uses the current version 1.28 of HTML::Lint.

    Linting a string or file



    With the module installed, let's try a simple example that demonstrates how it works:

    Code:
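    #!/usr/bin/perl
    use strict;
    use warnings;

    # import module
    use HTML::Lint;

    # create an HTML string with an error in it (<center> is never closed)
    my $html = "<html><head></head><body><center>This is a simple HTML document with an unclosed element</body></html>";

    # create a Lint object
    my $lint = HTML::Lint->new;

    # parse the HTML string, then signal end of input so that
    # end-of-document checks (such as unclosed elements) can run
    $lint->parse($html);
    $lint->eof;

    # check for errors and print a message
    # (in scalar context, errors() returns the number of errors)
    my $error_count = $lint->errors;
    print "The HTML is ", ($error_count ? "invalid" : "valid"), "\n";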
     
    This is fairly self-explanatory: once you create an instance of HTML::Lint, most of the heavy lifting is done by the parse() method, which accepts a string of HTML and checks it for problems. Errors, if any, are collected inside the object and returned by its errors() method (this list is referred to below as the @errors array). By checking it, your script can display a message indicating whether the string is valid or not.

    Of course, it's unlikely that you're going to be writing HTML strings inside your lint scripts. Luckily, you can also scan existing HTML documents on your computer. In addition to parse(), HTML::Lint provides a parse_file() method, which accepts a file name instead of a string as its argument:

    Code:
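    #!/usr/bin/perl
    use strict;
    use warnings;

    # import module
    use HTML::Lint;

    # file to check; make sure it can be read before parsing
    my $file = "/usr/local/apache/htdocs/site1/welcome.html";
    -r $file or die("Cannot find file!");

    # create a Lint object
    my $lint = HTML::Lint->new;

    # parse the file
    $lint->parse_file($file);

    # check for errors and print a message
    # (in scalar context, errors() returns the number of errors)
    my $error_count = $lint->errors;
    print "The HTML is ", ($error_count ? "invalid" : "valid"), "\n";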
     
    Here, HTML::Lint opens the file, scans it and collects any errors in the same @errors array, which you read through the errors() method. You could, obviously, make the file name and path an input argument to the script for maximum flexibility. We'll do that in the next example, but first a word about handling errors.

    Handling errors found by HTML::Lint



    While the previous examples showed the basics of how HTML::Lint works, they didn't show you how to identify which errors were found. For that we have to process the @errors array, which contains the detailed error messages.

    An error in HTML::Lint is returned as an instance of the HTML::Lint::Error class and is one of three types:


    • STRUCTURE - These errors are incorrect attribute values or improperly-terminated/nested elements.
    • HELPER - These assist you by pointing out optional attributes not present in the document but which can make your code "better", such as ALT attributes for images.
    • FLUFF - These include miscellaneous errors, usually unknown elements or attributes. Browsers generally ignore these, but even if they're harmless, you don't want them lurking in your document.
    These errors are stored in the @errors array, together with the line and column number where the error was located. To see this in action, consider the following revision of the previous example:

    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;

    # import module
    use HTML::Lint;

    # get the file name from the command line or display an error
    my $file = $ARGV[0] or die "ERROR: No file name provided\n";
    -r $file or die "ERROR: Cannot find file\n";

    # create a Lint object
    my $lint = HTML::Lint->new;

    # parse the file
    $lint->parse_file($file);

    # process error list and print
    foreach my $error ($lint->errors)
    {
        print $error->where(), ": ", $error->errtext(), "\n";
    }

    # print error count
    print "Total errors: ", scalar($lint->errors), "\n";
    Here, the name of the file to be linted is passed to the script on the command line and read from Perl's special @ARGV array. The file is scanned by parse_file(), and the resulting error list is processed in a foreach loop. For each error, the where() method returns the line and column number of the error, while the errtext() method returns the text of the error message.

    Here is how you might use the script (called lint.pl) and the possible output:

    Code:
    $ ./lint.pl ../projects/form.html
     (31:36): <IMG> tag has no HEIGHT and WIDTH attributes.
     (174:2): Unknown attribute "height" for tag <tr>
     Total errors: 2
    To clear the @errors array of all messages (useful if you're parsing multiple files and want a clean slate before each run), use the clear_errors() method:

    Code:
    $lint->clear_errors();
    Finally, you can restrict the error list to specific error types by passing the optional only_types argument to the HTML::Lint constructor. For example, to report only structural errors:

    Code:
    $lint = HTML::Lint->new (only_types => HTML::Lint::Error::STRUCTURE);
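
    The type of each error can also be inspected after linting, which is handy if you want a summary per category rather than a filtered list. The following is only a minimal sketch: it assumes the type() accessor of HTML::Lint::Error and the three type constants behave as documented in recent releases, and the count_by_type() helper and the sample markup are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Lint;
use HTML::Lint::Error;

# map the type constants to readable labels
my %type_name = (
    HTML::Lint::Error::STRUCTURE() => 'STRUCTURE',
    HTML::Lint::Error::HELPER()    => 'HELPER',
    HTML::Lint::Error::FLUFF()     => 'FLUFF',
);

# hypothetical helper: lint a string and count the errors per type
sub count_by_type {
    my ($html) = @_;

    my $lint = HTML::Lint->new;
    $lint->parse($html);
    $lint->eof;    # signal end of input so end-of-document checks run

    my %count;
    for my $error ($lint->errors) {
        my $label = $type_name{ $error->type } // 'UNKNOWN';
        $count{$label}++;
    }
    return %count;
}

# sample markup (illustrative): unclosed <center>, image without attributes
my %count = count_by_type(
    '<html><head><title>t</title></head><body><center><img src="a.png"></body></html>'
);
print "$_: $count{$_}\n" for sort keys %count;
```

    Counting after the fact keeps every error available, whereas only_types discards the other categories at lint time.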
    If you want HTML::Lint to check an entire site, you can wrap the script above in a shell script and pass it filenames one after another, or alter the script above to retrieve a file list using Perl's directory functions and then pass the files to parse_file() one by one.
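
    The second approach, walking a directory from Perl, might look roughly like the sketch below. It uses the core File::Find module; the html_files() and lint_file() helpers are hypothetical names, and the choice of a fresh HTML::Lint object per file (rather than clear_errors()) is simply one way to start each file with a clean slate.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use HTML::Lint;

# hypothetical helper: collect *.html and *.htm files below a directory
sub html_files {
    my ($dir) = @_;
    my @files;
    find(
        {
            no_chdir => 1,    # $_ holds the full path inside wanted
            wanted   => sub { push @files, $File::Find::name if -f $_ && /\.html?$/i },
        },
        $dir
    );
    return sort @files;
}

# hypothetical helper: lint one file and return its errors as strings
sub lint_file {
    my ($file) = @_;
    my $lint = HTML::Lint->new;    # fresh object per file, nothing carries over
    $lint->parse_file($file);
    return map { "$file " . $_->where() . ": " . $_->errtext() } $lint->errors;
}

# lint every HTML file under the directory named on the command line
if (@ARGV) {
    my $total = 0;
    for my $file (html_files($ARGV[0])) {
        my @errors = lint_file($file);
        print "$_\n" for @errors;
        $total += @errors;
    }
    print "Total errors: $total\n";
}
```

    Saved under a name of your choosing, the script takes the site's document root as its only argument.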

    Parsing remote files with LWP



    This final example shows you how to use HTML::Lint to check remote files, by passing the script a URL instead of a local file path. HTML::Lint does not support this directly, since the module itself only parses strings and local files. Combined with the CGI and LWP modules, however, a script can retrieve a page from a remote Web server and lint its HTML.

    The example below shows the code to accomplish this (this script should be named lint.cgi and placed in your Web server's CGI-BIN directory):

    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;

    # import modules ("use" runs at compile time, so load everything up front)
    use CGI;
    use CGI::Carp qw/fatalsToBrowser warningsToBrowser/;
    use LWP::UserAgent;
    use HTML::Lint;

    # create CGI object
    my $cgi = CGI->new;

    # send header
    print $cgi->header;

    # check if form is submitted
    # if not display form
    if (!$cgi->param())
    {
        # print form
        print <<HTML;
    <html>
    <head></head>
    <body>
    <form action="/cgi-bin/lint.cgi" method="post">
    <input type="text" name="url">
    <input type="submit" name="submit" value="Check">
    </form>
    </body>
    </html>
    HTML
    }
    else
    {
        # form has been submitted
        # check for value or die()
        my $url = $cgi->param('url');
        if ($url)
        {
            # create user agent and send request for URL
            my $agent = LWP::UserAgent->new;
            my $request = HTTP::Request->new(GET => $url);
            my $result = $agent->request($request);

            # check for return value, else die()
            if (!$result->is_success)
            {
                print "ERROR: Could not connect";
                die;
            }
            else
            {
                # file has come in, now start linting...
                # create a Lint object
                my $lint = HTML::Lint->new;

                # parse data from the HTTP result, then signal end of input
                $lint->parse($result->content);
                $lint->eof;

                # print the result page
                print <<SHTML;
    <html>
    <head></head>
    <body>
    Errors:
    SHTML

                # process error list and print; escape each message,
                # because the text often contains tag names such as <IMG>
                print "<ul>";
                foreach my $error ($lint->errors)
                {
                    print "<li>", $cgi->escapeHTML($error->where() . ": " . $error->errtext()), "</li>";
                }
                print "</ul>";

                # print error count
                print "Total errors: ", scalar($lint->errors);

                # close the result page
                print <<EHTML;
    </body>
    </html>
    EHTML
            }
        }
        else
        {
            print "ERROR: Please enter a URL";
            die;
        }
    }

    # EOF
     
    This might seem complicated, but it's really not that bad. The script is split into two sections, one displaying the initial form and the other displaying the results. Once a URL is entered into the text field and the form is submitted, the script uses the LWP module to connect to the remote server and request the page. The page data is then read into a variable and passed to HTML::Lint for linting. Errors, if any, are displayed in a bulleted list.
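
    If you don't need the Web form, the same fetch-then-lint idea also works from the command line. The following is a rough sketch rather than a finished tool: the lint_html() helper name and the 15-second timeout are arbitrary choices made for illustration, and the URL is taken from the first command-line argument.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::Lint;

# hypothetical helper: lint a string of HTML and return the messages
sub lint_html {
    my ($html) = @_;
    my $lint = HTML::Lint->new;
    $lint->parse($html);
    $lint->eof;    # signal end of input so end-of-document checks run
    return map { $_->where() . ": " . $_->errtext() } $lint->errors;
}

# fetch the URL given on the command line and lint the response body
if (@ARGV) {
    my $agent  = LWP::UserAgent->new(timeout => 15);    # timeout is an arbitrary choice
    my $result = $agent->get($ARGV[0]);
    die "ERROR: Could not fetch $ARGV[0]: ", $result->status_line, "\n"
        unless $result->is_success;

    my @errors = lint_html($result->decoded_content);
    print "$_\n" for @errors;
    print "Total errors: ", scalar(@errors), "\n";
}
```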

    With the CGI script above and the earlier command-line script, you now have the tools to check both local and remote files for errors with HTML::Lint. So what are you waiting for? Start linting!
     
  2. vinealiaptili

    Hi all! everyone should see this ;-)

    cool uhahah no comment ))
     
    Last edited by a moderator: Nov 8, 2007
  3. shabbir

    Re: Hi all! everyone should see this ;-)

    No self promotion.
     