Parsing HTML in PHP

Discussion in 'PHP' started by pradeep, Jan 29, 2013.

    Parsing HTML has always been a tough nut to crack, even for seasoned programmers, yet it is now used extensively for scraping websites, crawling, detecting errors on websites, and many other useful purposes. In this article we'll look at parsing HTML with PHP. For this purpose I have selected Simple HTML DOM Parser, which I found easier to use than PHP's own DOMDocument parser. Simple HTML DOM Parser lets you work in an object-oriented manner, and code written with it is much easier to follow and implement.

    Getting Simple HTML DOM Parser



    Download Simple HTML DOM Parser from http://sourceforge.net/projects/simplehtmldom/files/ and save the class file, simple_html_dom.php, to any directory of your choice. That's all you need to do.

    Getting Started



    In a small example we'll include the class and get all hyperlinks on the go4expert.com homepage.

    PHP:
    <?php
    // include the parser class (adjust the path to where you saved it)
    include 'simple_html_dom.php';

    // create new DOM object from URL
    $html_obj = file_get_html('http://www.go4expert.com/');

    // scan for all hyperlinks and print
    foreach ($html_obj->find('a') as $element) {
        print $element->href . '<br>';
    }
    ?>
    You can see how easy this was; now you can explore your own ideas.
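    For comparison, here is a minimal sketch of the same link extraction using PHP's built-in DOMDocument, the alternative mentioned at the start. It parses an inline HTML string that is made up for illustration rather than fetching a live page, and the helper name extract_links is hypothetical.

```php
<?php
// Hypothetical helper: return the href of every <a> tag in an HTML string,
// using PHP's built-in DOM extension instead of Simple HTML DOM Parser.
function extract_links(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is often malformed; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // skip anchors that have no href attribute
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}

// Sample markup, made up for this example.
$sample = '<html><body><a href="/forums/">Forums</a> <a href="/articles/">Articles</a> <a name="top">no href</a></body></html>';

foreach (extract_links($sample) as $href) {
    print $href . "\n";
}
```

    The extra error handling and attribute calls give a feel for why the one-line find('a') of Simple HTML DOM Parser reads more easily.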

    Advanced Usage



    Now we'll look at using selectors to find specific elements and at traversing the DOM tree.

    PHP:
    <?php
    // get an array of all anchor tags with class name 'cool'
    $all_anchor_objects = $html_obj->find('a.cool');

    // we can also ask for the nth matching element (zero-based), say the first span tag
    $first_span = $html_obj->find('span', 0);

    // or the last span if you need
    $last_span = $html_obj->find('span', -1);

    // say we need to look by an attribute, like the button with id 'buynow'
    // (without an index, find() returns an array of matches)
    $buynow_button = $html_obj->find('input[id=buynow]');

    // we can easily chain methods to traverse the DOM tree
    print $html_obj->find("form[id=cart]", 0)->children(1)->children(1)->id;
    ?>
    Well, this should be enough to get you started. You can adapt the method chaining to suit your needs. Enjoy!
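    If you want to compare, the selectors used above have rough equivalents in PHP's built-in DOMXPath. The sketch below runs on made-up markup, and the XPath expressions are my own translation of the CSS-style selectors, not something Simple HTML DOM Parser provides.

```php
<?php
// Sample markup, made up for this example.
$markup = <<<HTML
<html><body>
  <a class="cool" href="/one">One</a>
  <a class="cool big" href="/two">Two</a>
  <a href="/three">Three</a>
  <span>first</span><span>middle</span><span>last</span>
  <form id="cart"><input id="buynow" type="submit" value="Buy"></form>
</body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

// 'a.cool': anchors whose class list contains the word 'cool'
$cool = $xpath->query("//a[contains(concat(' ', normalize-space(@class), ' '), ' cool ')]");

// first and last span in document order
$spans = $xpath->query('//span');
$first_span = $spans->item(0);
$last_span = $spans->item($spans->length - 1);

// 'input[id=buynow]'
$buynow = $xpath->query("//input[@id='buynow']")->item(0);

print $cool->length . "\n";
print $first_span->textContent . "\n";
print $last_span->textContent . "\n";
print $buynow->getAttribute('value') . "\n";
```

    On this sample the script prints 2, first, last and Buy, one per line.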
     
