Hey everyone,
I've recently been trying to move a system I've built from using wget to
using httrack.
The main reason for this is: I want to "spider" the whole of a site but only
save the HTML/PHP etc. pages that match a particular filter. As far as I could see,
wget had no way to do this... which is a pain, as I don't want to download a
WHOLE website just to gather data from maybe 10 pages inside the site.
The way I understand it, I should be able to do something very close to
this with httrack. The command I've built up so far is something like:
httrack http://www.websitetospider.com/page.cfm -O "./spidereddata" "-*" "+*.cfm" "+*.htm" "+*.html" "+*websitetospider.com/listings.cfm/listing/*" -r6 -v
Basically, I want to start at an arbitrary starting point and then spider
through the whole site (to a limit of maybe 6 levels of recursion) but only
store the pages that match my listings.cfm/listing/ pattern (these pages have the
data on them).
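To illustrate what I'm after (I'm guessing at the exact wildcard form of the filters here, so treat this as a sketch rather than something I know works), a "keep only the listings pages" version would presumably drop the generic *.cfm/*.htm/*.html filters and just be:
httrack http://www.websitetospider.com/page.cfm -O "./spidereddata" "-*" "+*websitetospider.com/listings.cfm/listing/*" -r6 -v
i.e. exclude everything by default ("-*") and whitelist only the listings URLs.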
The issue is that, as with wget, I don't know if httrack will follow the links
that don't match my patterns in order to find the ones that do. I really want it to
exhaust all the pages in the site, checking each one for links that
match my pattern, and download only those.
I suppose I'm looking for someone who understands how httrack works to maybe
give me a little assurance/guidance that I'm on the right path...
Any help is greatly appreciated.
- rex