HTML Parsing In Ruby

Discussion in 'Ruby on Rails' started by pradeep, Feb 6, 2013.

    HTML parsing is a common need on the web: tools, online HTML tutorials, lint programs and crawlers all need to parse HTML. Ruby's popularity has grown with frameworks such as Ruby on Rails, and developers have written capable HTML parsers for it. This article looks at parsing HTML in Ruby with the Nokogiri gem.

    Installing Nokogiri



    Nokogiri is a Ruby gem, so it is straightforward to install. Run the following command on the Linux command line.

    Code:
    $ gem install nokogiri
    
    If you run into a problem, there are no fixed steps to resolve it; search for the error message to find a solution.

    Using Nokogiri



    We'll use CSS selectors to access and traverse the DOM. The simple example below shows how.

    Code:
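    require 'rubygems' # only needed on Ruby 1.8
    require 'nokogiri'
    
    # parse a local file; File.open makes it explicit that this is a file, not a URL
    html_doc = File.open("index.html") { |f| Nokogiri::HTML(f) }
    
    # selects the title tags as a NodeSet (an array-like collection)
    html_doc.css('title')
    
    # prints the text of the first title tag
    puts html_doc.css('title')[0].text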
    
    You can also fetch HTML directly from a URL, use CSS selectors to filter elements on various criteria, and use XPath to traverse and access the DOM tree. The example below gives an idea.

    Code:
    require 'rubygems' # only needed on Ruby 1.8
    require 'nokogiri'
    ## use open-uri to fetch HTML from URL
    require 'open-uri'
    
    # URI.open is required on Ruby 3.0+, where a plain open() no longer fetches URLs
    html_doc = Nokogiri::HTML(URI.open("http://www.go4expert.com"))
    
    # search using CSS selectors for all anchor tags inside the div with id "list"
    html_doc.css('div#list a').each do |link|
        puts link.content
    end
    
    # search using XPath for all li elements in a ul directly under a div
    html_doc.xpath('//div/ul/li').each do |li|
        puts li.content
    end
    
    I hope this helps you get started. For more information, visit http://nokogiri.org/
     
