perl + curl hyperlink parser

I go through seasons where I struggle to find project ideas that interest me. Now is not one of those times; if anything, I am finding it hard to make time to work on my ideas. I like that. I wanted to see how hard it would be to implement my own web crawler. At the core of this idea is the ability to find all the hyperlinks in a given web page.

After some trial and error (and no googling), I have a pretty decent hyperlink finder. My script fetches the page content with curl and saves the output to a text file. I then use perl to read the text file, with a regular expression to locate the hyperlinks. At first I matched whole anchor tags to find links. That was somewhat effective, unless there were multiple links on a single line of text. I scrapped this approach because I used the '.+' method of getting the inner contents of the anchor, and '.+' is greedy. It had the undesirable (but predictable) effect of mushing multiple links on a single line together. I had more luck looking for the value inside the href="" attribute of the anchor tags.
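
To make the difference between the two approaches concrete, here is a small sketch. The sample line is made up, and the href pattern uses `[^"\s]+` for the attribute value, which is my own slight tightening so that a match always stops at the closing quote.

```perl
use strict;
use warnings;

# Made-up sample line with two links on it.
my $line = '<p><a href="/one.html">One</a> and <a href="/two.html">Two</a></p>';

# Greedy: .+ runs to the LAST </a>, so both links come back as one match.
my @greedy = $line =~ /<a (.+)<\/a>/g;

# Attribute-based: each href="..." value is captured on its own.
my @hrefs = $line =~ /\shref="([^"\s]+)"/g;

print scalar(@greedy), " greedy match(es): $greedy[0]\n";
print scalar(@hrefs), " href match(es): @hrefs\n";
```

The greedy pattern reports a single match that swallows everything between the first `<a ` and the last `</a>`, while the href pattern reports the two links separately.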

Here is a hyperlink finder script.

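#!/usr/bin/perl -w
use strict;

# URL to scan, passed as the first command-line argument
my $url = $ARGV[0] or die "usage: $0 <url>\n";

print "looking up $url...\n";

# fetch the page with curl and save it to pl.txt
# (list-form system keeps the URL away from the shell)
system("curl", "-s", "-o", "pl.txt", $url) == 0 or die "curl failed: $?\n";

print "analyzing $url...\n";

# line, word and byte counts of the downloaded page
my $wc = `wc pl.txt`;

print $wc;

open(XL, "<", "pl.txt") or die $!;
my $nlink = 0;
while(my $line = <XL>){
	# /g lets the inner loop pick up every href on the same line;
	# [^"\s]+ stops at the closing quote of the attribute
	while($line =~ /\shref="([^"\s]+)"/g){
		$nlink++;
		print "$nlink\t$1\n";
	}#end inner while
}#end while

print "num links on page: $nlink\n";

close XL;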

Here is a sample output using my homepage.

[Image: hyperlink parser using cURL and perl regular expressions]
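
Since the eventual goal is a crawler, it may help to have the link-finding step as a reusable function that works on a string instead of a temp file. The following is only a sketch: `extract_links` is a hypothetical name I made up, the sample HTML is invented, and the pattern is the same href match with `[^"\s]+` for the attribute value.

```perl
use strict;
use warnings;

# extract_links is a hypothetical helper, not part of the original script.
# It returns every href="..." value found in a string of HTML.
sub extract_links {
    my ($html) = @_;
    my @links;
    while ($html =~ /\shref="([^"\s]+)"/g) {
        push @links, $1;
    }
    return @links;
}

# Made-up sample page
my $html = <<'HTML';
<a href="/about">About</a> <a href="/posts">Posts</a>
<link rel="stylesheet" href="style.css">
HTML

my @links = extract_links($html);
my $n = 0;
print ++$n, "\t$_\n" for @links;
print "num links on page: ", scalar(@links), "\n";
```

Note that the pattern matches any href attribute, so the stylesheet in the `<link>` tag is counted along with the two anchors. To run it against a real page, the sample string could be replaced with the output of curl captured in backticks.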
