Simple Web Crawler in Perl

Discussion in 'Perl' started by blitzcoder, Aug 24, 2010.

  1. blitzcoder

    blitzcoder New Member

    Have you ever wondered how a search engine like Google works? It uses web crawlers (also called web spiders) that "crawl" the web from one URL to every connected URL, and so on, retrieving relevant data from each page, classifying it according to some criteria, and storing the URL and its related keywords in a database.

    In this article, we'll take a look at how a web crawler actually goes about crawling the web. The algorithm can be expressed as follows --

    Code:
    Initialize List_URLs
    Add starting URL to List_URLs
    While (List_URLs not finished)
         Pick URL
         If (URL is HTTP)
             Fetch Page
             Parse Page
             Add URLs to List_URLs
         End
    End
    This algorithm keeps scanning through websites, proceeding in a tree-like manner so that it eventually covers most of the World Wide Web. Now let us look at the Perl code for this web crawler. :)

    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    # "links" is the work list of URLs, "cur_link" holds the URL of the current page,
    # and "var" holds the page content.
    my (@links, $cur_link, $var, $temp);
    push(@links, "the-starting-website");
    # Iterate by index so that URLs pushed onto @links inside the loop are also
    # visited (adding elements to an array while foreach-ing over it is unsafe).
    for (my $i = 0; $i < @links; $i++)
    {
            $cur_link = $links[$i];
            if ($cur_link =~ /^http/)
            {
                    # In the next few lines, we run the system command "curl"
                    # and slurp the page content.
                    open(my $fh, '-|', 'curl', '-s', $cur_link)
                            or die "Cannot run curl: $!";
                    {
                            local $/;       # slurp mode: read the whole page at once
                            $var = <$fh>;
                    }
                    close $fh;
                    next unless defined $var;
                    print "\nCurrently Scanning -- " . $cur_link;
                    # In the next line we extract all links in the current page.
                    my @p_links = $var =~ /<a href="(.*?)">/g;
                    foreach $temp (@p_links)
                    {
                            if (!($temp =~ /^http/) && ($temp =~ /^\//))
                            {
                                    # This part of the code corrects internal addresses
                                    # like "/index.aspx" by prefixing the scheme and
                                    # host of the current page.
                                    my ($base) = $cur_link =~ m{^(https?://[^/]+)};
                                    $temp = $base . $temp if defined $base;
                            }
                            # In the next line we add the links to the main "links" list.
                            push(@links, $temp);
                    }
            }
    }
    
    The code, together with its comments and the algorithm above, is self-explanatory. We have created a web crawler that crawls the web endlessly. You can use the extracted information directly or store it in a database, as your requirements dictate.
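
    If you would rather stay inside Perl than shell out to curl, and parse the HTML properly instead of matching it with a regular expression, a rough sketch of the same crawler could be built on the CPAN modules LWP::UserAgent and HTML::LinkExtor (assuming they are installed; the starting URL below is just a placeholder):

    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::LinkExtor;

    my @links = ("http://example.com/");          # placeholder starting URL
    my $ua    = LWP::UserAgent->new(timeout => 10);

    for (my $i = 0; $i < @links; $i++) {
        my $cur_link = $links[$i];
        next unless $cur_link =~ /^http/;

        my $response = $ua->get($cur_link);
        next unless $response->is_success;
        print "\nCurrently Scanning -- $cur_link";

        # Collect href attributes of <a> tags, resolved against the page's base URL.
        my $extractor = HTML::LinkExtor->new(undef, $response->base);
        $extractor->parse($response->decoded_content);
        for my $link ($extractor->links) {
            my ($tag, %attr) = @$link;
            push @links, $attr{href} if $tag eq 'a' && defined $attr{href};
        }
    }

    Because HTML::LinkExtor resolves relative links against the page's base URL, this version does not need the manual "/index.aspx" correction used above.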

    So now you are acquainted with how a web crawler works. This code was written on a Linux machine, hence the use of the shell command "curl". You can now modify the code to limit the crawl to a maximum depth (for example with a depth-first search), or to search all of the visited pages for a particular keyword. The World Wide Web is under your control now! ;)
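
    As a rough sketch of the maximum-depth idea (the $MAX_DEPTH value, the @queue name and the pairing of each URL with its depth are only illustrations, not part of the code above), you could carry a depth alongside every queued URL:

    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $MAX_DEPTH = 3;                            # illustrative limit
    # Each queue entry pairs a URL with the depth at which it was found.
    my @queue = ( [ "the-starting-website", 0 ] );

    while (my $item = shift @queue) {
        my ($cur_link, $depth) = @$item;
        next if $depth > $MAX_DEPTH;

        # ... fetch and parse $cur_link exactly as in the crawler above ...
        # For every extracted link $temp, queue it one level deeper:
        # push @queue, [ $temp, $depth + 1 ];
    }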
     
  2. shabbir

    shabbir Administrator Staff Member

  3. PradeepKr

    PradeepKr New Member

    I'd advise making it a hash rather than an array, so that the same link is never added twice and the crawler doesn't get stuck going around in circles.
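
    A minimal sketch of that idea (the add_link helper and the example URL are only illustrations):

    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;

    my (@links, %seen);

    # Only queue a URL the first time it is seen.
    sub add_link {
        my ($url) = @_;
        push @links, $url unless $seen{$url}++;
    }

    add_link("http://example.com/");   # placeholder URL
    add_link("http://example.com/");   # duplicate, silently skipped
    print scalar(@links), " unique link(s) queued\n";   # prints "1 unique link(s) queued"

    Because every URL is checked against %seen before it is pushed, each page is queued (and therefore fetched) at most once, so the crawler cannot loop forever on pages that link to each other.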
     
  4. shabbir

    shabbir Administrator Staff Member

  5. seoguru

    seoguru New Member

    Google has a big internal process for crawling a website.
     
  6. Monstar

    Monstar New Member

    I found an open-source web crawler called IRobot. I'm seriously thinking about downloading it and using it to automate a few things and make my life easier; watching this little guy in action got the little squirrels in my head doing some exercise.
    I was wondering whether there is a robot or web crawler that could find articles for me if I gave it a keyword or key phrase, and bring them back so I could run them through an article spinner. IRobot can do this, but only if you already know the URL.
    Is there a web crawler or spider that will just go out onto the World Wide Web and find these articles automatically from keywords or key phrases alone? If it doesn't exist, would somebody here be willing to build one for me?
    I'm trying to make money online via affiliate programs, AdSense, etc. What I would like to do is submit as many articles as I can to as many article directories as I can, creating a sort of broken link wheel so that it looks a little more natural to the search engines.
    I would be able to pay somebody to create one, but I only get so much every month, so hopefully that person would be kind enough to take small monthly payments until it's paid off.
    Anyway, thank you for taking the time to read this.
    Rob
     
