Go4Expert

pradeep 29Jan2013 18:41

Parsing HTML in PHP
 
Parsing HTML has always been a tough nut to crack, even for seasoned programmers, yet it is now used extensively for scraping websites, crawling, error detection, and many other purposes. In this article we'll look at parsing HTML in PHP using Simple HTML DOM Parser. I found it easier to use than PHP's own DOMDocument parser: it lets you work in an object-oriented manner, and code written with it is easy to follow and implement.
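
For comparison, here is a rough sketch of listing link targets with the built-in DOMDocument class mentioned above. The sample markup and variable names are made up for illustration; this only shows the general shape of the built-in API and is not a full treatment of it.

```php
<?php
// made-up sample markup, for illustration only
$markup = '<p><a href="/one">One</a> <a href="/two">Two</a></p>';

// collect libxml parse warnings internally instead of printing them
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($markup);

// gather the href attribute of every anchor element
$links = array();
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $links[] = $anchor->getAttribute('href');
}

print implode("\n", $links);
?>
```

Every lookup goes through explicit method calls such as getElementsByTagName() and getAttribute(), with no CSS-style selectors.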

Getting Simple HTML DOM Parser



Get the Simple HTML DOM parser class PHP file from http://sourceforge.net/projects/simplehtmldom/files/ and save it to any directory of your choice. That's all you need to do.

Getting Started



In a small example we'll include the class and get all the hyperlinks on the go4expert.com homepage.

Code: PHP

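<?php
// include the parser class (adjust the path to wherever you saved the file)
include 'simple_html_dom.php';

// create new DOM object from URL
$html_obj = file_get_html('http://www.go4expert.com/');

// scan for all hyperlinks and print
foreach ($html_obj->find('a') as $element) {
    print $element->href . '<br>';
}
?>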


As you can see, this was easy; now you can explore your own ideas.
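
While experimenting, it can be handy to parse a string instead of fetching a live page. The following is a minimal sketch, assuming simple_html_dom.php sits in the same directory as the script; the sample markup and the $links array are made up for illustration. It uses str_get_html(), the library's string counterpart to file_get_html().

```php
<?php
// assumption: the library file is in the same directory as this script
include 'simple_html_dom.php';

// made-up sample markup, for illustration only
$markup = '<div><a href="/one">One</a> <a href="/two">Two</a></div>';

// build the DOM object from a string instead of a URL
$html_obj = str_get_html($markup);

// str_get_html() returns false when it cannot build the DOM (e.g. empty input)
if ($html_obj === false) {
    die('Could not parse the markup');
}

// collect the href of every anchor tag
$links = array();
foreach ($html_obj->find('a') as $element) {
    $links[] = $element->href;
}

print implode('<br>', $links);
?>
```

The same false check is worth adding after file_get_html(), since a URL may fail to load.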

Advanced Usage



Now we'll look at using selectors to find specific elements and at traversing the DOM tree.

Code: PHP

<?php
// $html_obj is the DOM object created in the previous example
// get an array of all anchor tags with classname 'cool'
$all_anchor_objects = $html_obj->find('a.cool');

// we can also ask for the nth matching element, say the first span tag
$first_span = $html_obj->find('span', 0);

// or the last span if you need
$last_span = $html_obj->find('span', -1);

// say we need to look by an attribute, like the button with id 'buynow'
// (passing index 0 returns the single element instead of an array)
$buynow_button = $html_obj->find('input[id=buynow]', 0);

// we can easily chain methods to traverse the DOM tree
print $html_obj->find("form[id=cart]", 0)->children(1)->children(1)->id;
?>
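
To see what these selectors and chained calls return, here is a small sketch run against made-up markup. The form, ids, and class names are invented for illustration, and simple_html_dom.php is assumed to be in the same directory as the script.

```php
<?php
// assumption: the library file is in the same directory as this script
include 'simple_html_dom.php';

// made-up sample markup, for illustration only
$markup = '<form id="cart">'
        . '<div id="row1"><span>Item</span><a class="cool" href="/a">A</a></div>'
        . '<div id="row2"><span>Total</span><input id="buynow" type="submit" value="Buy"></div>'
        . '</form>';

$html_obj = str_get_html($markup);

// index 0 is the first match, index -1 the last
$first_span = $html_obj->find('span', 0)->plaintext;  // "Item"
$last_span  = $html_obj->find('span', -1)->plaintext; // "Total"

// without an index, find() returns an array of matches
$cool_count = count($html_obj->find('a.cool'));       // 1

// with an index, find() returns null when nothing matches,
// so check before chaining further calls
$form = $html_obj->find('form[id=cart]', 0);
$row_id = null;
$button_id = null;
if ($form !== null) {
    // children(1) is the second child element (indexes start at 0)
    $row_id    = $form->children(1)->id;              // "row2"
    $button_id = $form->children(1)->children(1)->id; // "buynow"
}

print "$first_span, $last_span, $cool_count, $row_id, $button_id";
?>
```

Checking for null before chaining matters on real pages, where an element you expect may be missing and the chained call would otherwise fail.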


This should be enough to get you started; you can adapt the method chaining to suit your needs. Enjoy!

