HTTrack Website Copier
Free software offline browser - FORUM
Subject: How to do this simple crawl?
Author: Yang
Date: 03/08/2012 01:19
 
Hi, I'm trying to mirror a website that has a bunch of pages of the following
form:

http://foo.com/dir/az-1.html
http://foo.com/dir/az-2.html
http://foo.com/dir/az-1.html?letter=A
http://foo.com/dir/az-2.html?letter=A
http://foo.com/dir/az-3.html?letter=A
http://foo.com/dir/az-1.html?letter=B
http://foo.com/dir/az-1.html?letter=C
...

Each of these pages is reachable from any other via links (though possibly only
indirectly, through other az-* pages).

I only care about these az-* pages and want to ignore all other links.  I also
don't want any page prerequisites such as images or CSS - just the HTML.

How do I use httrack to mirror this?  Ideally it would create no local
directories and would fetch the az-* pages straight into the working directory.

I tried a bunch of invocations including:

httrack -S -p1 -z -v -r99999 http://foo.com/dir/az-1.html -'*' +'*az-*'
httrack -S -g -z -v -r99999 http://foo.com/dir/az-1.html -'*' +'*az-*'
httrack http://foo.com/dir/az-1.html -%v '*az-*'  # generated with wizard

But none of these works as intended: they download irrelevant files, wander
into unrelated parts of the site, or stop right away after az-1.html.
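
For reference, here is roughly the single invocation I imagine should do this,
combining the filters with a flat, HTML-only mirror (the -O and -N usage is my
guess at the right options, not something I've gotten to work):

httrack 'http://foo.com/dir/az-1.html' -O . -r99999 -p1 -N "%n%q.%t" '-*' '+*az-*'

The idea is that -O . would write into the working directory, -p1 would save
only the HTML, -N "%n%q.%t" would flatten the file names and fold the
?letter=... query string into them, and the -* / +*az-* pair would exclude
everything except the az-* pages.  If I'm misusing any of these options,
corrections are welcome.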

In case it helps clarify what I'm after, this wget command comes close; its one
shortcoming is that wget still follows every HTML link (and thus wanders into
irrelevant pages):

wget -r -l inf -nc -np -nH -nd -A 'az-*' http://foo.com/dir/az-1.html
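
(Here -r -l inf recurses with no depth limit, -nc skips files that already
exist, -np keeps wget from going above /dir/, -nH and -nd drop the host and
directory structure so everything lands in the working directory, and -A 'az-*'
keeps only files whose names match az-*.  The catch is that wget still downloads
non-matching HTML pages in order to scan them for links, deleting them
afterwards, so it ends up crawling everything under /dir/.)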

Thanks in advance for any answers.
 