Have you ever wondered how a search engine like Google works? It relies on web crawlers, also called web spiders, which “crawl” the web from one URL to every URL it links to, and so on. Along the way they retrieve relevant data from each page, classify the page according to some criteria, and store the URL and related keywords in a database.
The crawling algorithm in this article starts from a single page and follows every link it finds, proceeding in a tree-like manner so that, given enough time, it can scan a large part of the World Wide Web. It is presented first as pseudocode and then as Perl code.
The comments in the Perl code, together with the algorithm, should make the crawler largely self-explanatory. The result is a web crawler that keeps crawling the web for as long as it finds new links. You can use the extracted information directly or store it in a database, as your requirements dictate.
In this article, we’ll take a look at how a web crawler actually goes about crawling the web. The algorithm can be expressed as follows:
Code:
Initialize List_URLs
Add starting URL to List_URLs
While (List_URLs not finished)
    Pick URL
    If (HTTP)
        Fetch Page
        Parse Page
        Add URLs to List_URLs
    End
End

Code:
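#!/usr/bin/perl
use strict;
use warnings;
# @links is the list of URLs waiting to be scanned and %seen records the URLs
# already scanned, so that no page is fetched twice.
# $var is used to take in the page content.
my (@links, %seen, $var);
# Replace this placeholder with a full URL beginning with "http".
push(@links, "the-starting-website");
# Take URLs from the front of the list until none are left.
while (defined(my $cur_link = shift @links))
{
    next if $seen{$cur_link}++;
    if ($cur_link =~ /^http/)
    {
        # In the next few lines, we run the system command "curl" and retrieve the page content.
        # The list form of open passes the URL straight to curl without going through the shell.
        open(my $fh, '-|', 'curl', '-s', $cur_link) or next;
        {
            local $/;
            $var = <$fh>;
        }
        close $fh;
        next unless defined $var;
        print "\nCurrently Scanning -- " . $cur_link;
        # Scheme and host of the current page, used to correct internal addresses
        my ($root) = $cur_link =~ m{^(https?://[^/]+)};
        # In the next line we extract all links in the current page
        my @p_links = $var =~ /<a\s[^>]*?href="([^"]*)"/gi;
        foreach my $temp (@p_links)
        {
            if (defined $root && $temp =~ m{^/})
            {
                # This part of the code lets us correct internal addresses like "/index.aspx"
                $temp = $root . $temp;
            }
            # In the next line we add the links to the main "links" list.
            push(@links, $temp);
        }
    }
}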
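
The link-extraction step of the crawler can be tried on its own, without fetching anything. The following is a minimal sketch under a few assumptions: `extract_links` is a hypothetical helper name, the sample HTML and the `example.com` addresses are made up, and the pattern is slightly more permissive than a plain `<a href="...">` match so that anchors carrying extra attributes are also picked up. Internal addresses beginning with "/" are joined to the scheme and host of the page rather than to its full URL.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the links found in $html. Internal addresses
# such as "/index.aspx" are resolved against the site of $page_url.
sub extract_links {
    my ($page_url, $html) = @_;
    # Scheme and host of the page, e.g. "http://example.com"
    my ($root) = $page_url =~ m{^(https?://[^/]+)};
    my @found;
    while ($html =~ /<a\s[^>]*?href="([^"]*)"/gi) {
        my $link = $1;
        $link = $root . $link if defined $root && $link =~ m{^/};
        push @found, $link;
    }
    return @found;
}

# Sample page content, made up for this example
my $sample_html = '<p><a href="http://example.org/a">A</a> '
                . '<a class="nav" href="/index.aspx">Home</a></p>';
my @sample_links = extract_links('http://example.com/dir/page.html', $sample_html);
print "$_\n" for @sample_links;
```

A regular expression is enough for a sketch like this, but it will miss links written with single quotes or without quotes; a real HTML parser is the more robust choice for production use.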
So now you are acquainted with how a web crawler works. This code was written on a Linux machine, hence the use of “curl”, a command-line program that is run from the shell. You can go on to modify the code to include a maximum depth level using depth-first search, or to search for a particular keyword in all these pages. The World Wide Web is under your control now!
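As one possible starting point for the two modifications suggested above, here is a minimal sketch of a depth-first crawl with a maximum depth level that also records the pages containing a keyword. The names `crawl` and `fetch_with_curl` are hypothetical, and the pages in `%pages` are made up so that the example runs without network access; the fetching step is passed in as a code reference for the same reason.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: visits $url depth-first, goes at most $max_depth
# levels below the starting page, and records in @$hits every visited page
# whose content contains $keyword. $fetch is a code reference that returns
# the content of a URL, or undef on failure.
sub crawl {
    my ($url, $depth, $max_depth, $keyword, $fetch, $seen, $hits) = @_;
    return if $depth > $max_depth || $seen->{$url}++;
    my $html = $fetch->($url);
    return unless defined $html;
    push @$hits, $url if index($html, $keyword) >= 0;
    # Scheme and host of the page, used to resolve internal addresses
    my ($root) = $url =~ m{^(https?://[^/]+)};
    my @page_links = $html =~ /<a\s[^>]*?href="([^"]*)"/gi;
    foreach my $link (@page_links) {
        $link = $root . $link if defined $root && $link =~ m{^/};
        next unless $link =~ /^http/;
        crawl($link, $depth + 1, $max_depth, $keyword, $fetch, $seen, $hits);
    }
    return;
}

# Fetches a page with the "curl" command, as in the crawler above.
sub fetch_with_curl {
    my ($url) = @_;
    open(my $fh, '-|', 'curl', '-s', $url) or return;
    local $/;
    my $html = <$fh>;
    close $fh;
    return $html;
}

# Made-up pages standing in for the web
my %pages = (
    'http://example.com/'  => '<a href="/a">a</a> <a href="/b">b</a>',
    'http://example.com/a' => 'perl crawler notes <a href="/c">c</a>',
    'http://example.com/b' => 'nothing here',
    'http://example.com/c' => 'perl again, but two levels down',
);
my (%seen, @hits);
crawl('http://example.com/', 0, 1, 'perl', sub { $pages{ $_[0] } }, \%seen, \@hits);
print "$_\n" for @hits;
```

With a maximum depth of 1, the page two levels down is never fetched, so only `http://example.com/a` is reported. To crawl real pages, pass `\&fetch_with_curl` in place of the anonymous sub. Note that each page is visited at most once, so a page first reached at the deepest allowed level is not expanded again if a shorter path to it turns up later.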

