
Recursive Web Download Following Links According to DOM Criteria

MSDN is a huge hierarchical doc site. To be more precise, the content is organized hierarchically, but the URLs are not: the URL space is flat, making it look like everything is in the same directory. Is there a way to download a subtree of the documentation by following only the links that match some DOM criteria?

Solution 1:

Mojo::UserAgent returns response objects that understand CSS3 selectors or XPath. For instance, I just showed an example in Painless RSS processing with Mojo. I'm really enjoying this new(ish) web client stuff: almost everything I want is already there (no additional modules), and it's integrated very well.
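
For a flavor of what that looks like, here is a minimal sketch. The URL is a placeholder and the CSS selector simply mirrors the XPath used in Solution 2 below, so treat both as assumptions to adapt to the real page:

#!/usr/bin/env perl

use strict;
use warnings;

use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;

# Fetch the page; result() dies with a useful message if the request failed
my $dom = $ua->get('http://example.com/nmake-ref.html')->result->dom;

# find() takes a CSS3 selector and returns a Mojo::Collection of nodes
$dom->find('div.sectionblock > * > a')->each(sub {
    my $a = shift;
    printf "%s => %s\n", $a->text, $a->attr('href');
});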

Solution 2:

This might get you started in the right direction or lead you astray. Note that I first saved the page to a local file so as not to download it repeatedly while working on the script.

#!/usr/bin/env perl

use strict;
use warnings;

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;

$tree->parse_file('nmake-ref.html');

# Collect one { link text => href } pair per anchor inside each section block
my @links = map { { $_->as_text => $_->attr('href') } }
            $tree->findnodes(q{//div[@class='sectionblock']/*/a});

for my $link (@links) {
    my ($entry, $url) = %{ $link };
    # Build a filesystem-safe local filename from the link text
    ($link->{ file } = "$entry.html" ) =~ s/[^A-Za-z_0-9.]+/_/g;
    # List-form system() bypasses the shell, so the URL needs no extra quoting
    system wget => $url, '-O', $link->{ file };
}
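
One caveat: the hrefs harvested from a saved page may be relative, and wget needs absolute URLs. A short sketch, assuming the URI module and a made-up base address, showing how to resolve them before shelling out:

#!/usr/bin/env perl

use strict;
use warnings;

use URI;

# Assumed base: the address the page was originally fetched from
my $base = 'http://msdn.microsoft.com/en-us/library/nmake-ref.aspx';

# Two made-up hrefs, one relative to the page and one site-rooted
for my $href ('dd446959.aspx', '/en-us/library/dd446958.aspx') {
    my $abs = URI->new_abs($href, $base);   # no-op if $href is already absolute
    print "$abs\n";
}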
