Scraping Websites with Ruby

Discussion in 'Ruby on Rails' started by pradeep, Jul 3, 2013.

  1. pradeep

    pradeep Team Leader

    Web scraping means extracting or harvesting data from websites using programs that automatically fetch pages at pre-determined intervals. It is quite similar to a search engine bot crawling a website, the only difference being that here we are looking for specific data. We may scrape websites to fetch data into specific formats, to have data automatically available when required, or to automate a specific process. For example, I once wrote a small script to scrape the Indian Railways' website to check the status of my PNR and notify me via email whenever the status changed.

    In this article we'll look at scraping web pages using the Ruby language and the Ruby module Mechanize. Mechanize makes the underlying tasks of following links, submitting forms, etc. very easy, so that you can concentrate on the logic of data extraction.

    Installing Mechanize



    Installing Mechanize is very easy; just issue the following command in a terminal (as root or with sudo if your gems are installed system-wide).

    Code:
    $ gem install mechanize
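
    To verify the install, you can load the gem and print its version from the command line (a quick sanity check; on older Ruby versions you may also need to require rubygems, as below):

    Code:
    $ ruby -rrubygems -rmechanize -e 'puts Mechanize::VERSION'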

    Using Mechanize



    Say we'd like to search distrowatch.com for all available Linux distributions and save the data. First, let's see how we can submit forms with Mechanize.

    Code:
    #!/usr/bin/ruby
    
    require 'rubygems'
    require 'mechanize'
    
    mech = Mechanize.new
    mech.get('http://distrowatch.com/') do |page|
      # Fill in and submit the search form (its action is table.php)
      my_page = page.form_with(:action => 'table.php') do |form|
        form.distribution = 'ubuntu'
      end.click_button
    end
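
    If you prefer not to use a block, form_with simply returns the form object; its fields can be set by name before calling submit, which behaves like clicking the form's default button. A minimal equivalent sketch:

    Code:
    page = mech.get('http://distrowatch.com/')
    # Grab the search form by its action attribute
    form = page.form_with(:action => 'table.php')
    form.distribution = 'ubuntu'
    # submit returns the resulting page, like click_button above
    my_page = form.submit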
    
    
    We can also click on links programmatically and set a custom user-agent; some websites do not allow programmatic access, so presenting a browser-like user-agent string can help. Let's see how.

    Code:
    #!/usr/local/bin/ruby
    
    require 'rubygems'
    require 'mechanize'
    
    mech = Mechanize.new
    # Identify ourselves as Safari on a Mac instead of the default
    mech.user_agent_alias = 'Mac Safari'
    mech.get('http://distrowatch.com/') do |page|
      # Submit the search form
      my_page = page.form_with(:action => 'table.php') do |form|
        form.distribution = 'ubuntu'
      end.click_button
    
      # print the current URL
      puts my_page.uri.to_s
    
      # follow the link whose href carries the ostype parameter
      next_page = my_page.link_with(:href => /\?ostype=/).click
    
      # print the new URL
      puts next_page.uri.to_s
    end
    
    You can use Nokogiri to parse the HTML and traverse the DOM to extract data; I have written about it in [THREAD=29471]HTML Parsing in Ruby[/THREAD], and a short sketch follows below. Enjoy scraping the web.
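
    In fact, every Mechanize page already carries its parsed Nokogiri document, so you can pull data out of the search results with CSS selectors or XPath without a separate parsing step. A minimal sketch (the selectors are illustrative; check the actual markup of the results page before relying on them):

    Code:
    # my_page is the search-result page from the example above;
    # Mechanize::Page#search delegates to Nokogiri and accepts CSS or XPath
    my_page.search('table tr').each do |row|
      puts row.text.strip
    end
    
    # page.parser exposes the underlying Nokogiri::HTML::Document directly
    doc = my_page.parser
    puts doc.at('title').text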