Recursive Web Download Following Links According To DOM Criteria
MSDN is a huge hierarchical doc site. To be more precise, the content is organized hierarchically, but the URLs are not: the URL space is flat, making it look like everything lives at the same level.
Solution 1:
Mojo::UserAgent returns response objects whose DOM understands CSS3 selectors or XPath. For instance, I just showed an example in Painless RSS processing with Mojo. I'm really enjoying this new(ish) web client stuff. Almost everything I want is already there (no additional modules) and it's integrated very well.
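A minimal sketch of that approach, assuming the page uses the same div.sectionblock markup targeted in Solution 2 below (the URL and selector are placeholders, not confirmed details of the MSDN page):

#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::UserAgent;

my $ua  = Mojo::UserAgent->new;
my $url = 'http://example.com/nmake-ref.html';   # placeholder for the MSDN page

# Fetch the page and walk every link inside the section blocks.
my $dom = $ua->get($url)->result->dom;
$dom->find('div.sectionblock a')->each(sub {
    my $a = shift;
    printf "%s => %s\n", $a->text, $a->attr('href');
});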
Solution 2:
This might get you started in the right direction or lead you astray. Note that I first saved the page to a local file so as not to constantly download it while I was working on it.
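One way to do that one-time save (a sketch; the URL is a placeholder) is LWP::Simple's mirror, which only re-downloads when the remote copy has changed:

#!/usr/bin/env perl
use strict;
use warnings;
use LWP::Simple qw(mirror);

# Fetch the reference page once and keep working against the local copy.
mirror('http://example.com/nmake-ref.html', 'nmake-ref.html');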
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file('nmake-ref.html');

# Collect the text and href of every link inside a sectionblock div.
my @links = map { { $_->as_text => $_->attr('href') } }
    $tree->findnodes(q{//div[@class='sectionblock']/*/a});

for my $link (@links) {
    my ($entry, $url) = %{ $link };
    # Build a safe local filename from the link text.
    ($link->{ file } = "$entry.html") =~ s/[^A-Za-z_0-9.]+/_/g;
    # List form of system: no shell involved, so no extra quoting around the URL.
    system wget => $url, '-O', $link->{ file };
}
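If the hrefs on the page are relative rather than absolute (not confirmed here), they would need to be resolved against the page's base URL before being handed to wget; a small sketch with the URI module, using placeholder values:

#!/usr/bin/env perl
use strict;
use warnings;
use URI;

my $base = 'http://example.com/nmake-ref.html';   # placeholder: page the links came from
my $href = 'relative/page.html';                  # hypothetical relative href from the scrape

# Resolve against the base before handing the URL to wget.
print URI->new_abs($href, $base)->as_string, "\n";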