HTTrack Website Copier
Free software offline browser - FORUM
Subject: Command Line Spidering
Author: rex
Date: 03/25/2008 14:03
 
Hey everyone,

I've recently been trying to move a system I've built from using wget to
using httrack.

The main reason for this is: I want to "spider" the whole of a site but only
save the HTML/PHP etc. pages that match a particular filter. As far as I could
see, wget had no way to do this... which is a pain, as I don't want to download
a WHOLE website just to gather data from maybe 10 pages inside the site.

The way I understand it, I should be able to do something very close to this
with httrack. The command I've built up so far is something like:

 httrack "http://www.websitetospider.com/page.cfm" -O "./spidereddata" "-*"
 "+*.cfm" "+*.htm" "+*.html" "+*websitetospider.com/listings.cfm/listing/*"
 -r6 -v

Basically I want to start at an arbitrary starting point and then spider
through the whole site (to a limit of maybe 6 levels of recursion), but only
store the pages that match my listings.cfm/listing/ pattern (these pages have
the data on them).
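
Written out a bit more readably, I think the scan rules come down to something
like the below (the site-scoped wildcards are my own guess at the right syntax,
so treat this as a sketch rather than something I've verified against the
filter docs):

 httrack "http://www.websitetospider.com/page.cfm" \
   -O "./spidereddata" \
   -r6 -v \
   "-*" \
   "+*websitetospider.com/*.cfm" \
   "+*websitetospider.com/*.htm*" \
   "+*websitetospider.com/listings.cfm/listing/*"

My (possibly wrong) understanding is that the rules are applied in order, so
the trailing "+" rules should win over the leading "-*" for anything they
match.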

The issue is that, as with wget, I don't know whether httrack will follow the
links that don't match my patterns in order to find the ones that do. I really
want it to exhaust all the pages on the site, checking whether any links match
my pattern, and download those that do.
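
In the meantime I was going to sanity-check it with a small, shallow run and
then compare what actually got requested against what got kept. Something along
these lines (assuming the log and cache files end up where I think they do,
i.e. hts-log.txt and hts-cache/new.txt under the project directory):

 httrack "http://www.websitetospider.com/page.cfm" -O "./spidertest" \
   "-*" "+*websitetospider.com/*.cfm" "+*websitetospider.com/*.htm*" \
   "+*websitetospider.com/listings.cfm/listing/*" -r2 -v
 # which URLs were touched at all during the crawl
 grep "websitetospider.com" ./spidertest/hts-log.txt
 # which files were actually saved
 grep "listings.cfm/listing/" ./spidertest/hts-cache/new.txt

But that still feels like guesswork, which is why I'm asking.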

I suppose I'm looking for someone who understands how httrack works to give me
a little bit of assurance/guidance that I'm on the right path...

Any help is greatly appreciated.

- rex
 