Hello,
Xavier, if you see this post, thank you very much for HTTrack. It's a
tremendously useful tool that I regularly use in my academic work to save
fan-generated websites about early arcade games before they disappear.
I'm currently trying to use it in a slightly different scenario; based on the
documentation this seems to be possible, but I'm not quite sure how some of
the options interact.
Here's what I would like to do:
1. Capture an entire website, plus one degree of out-links (i.e. a complete
mirror of the website, plus every URL that the website links to).
2. Do the same, but only capture the list of URLs.
Here's what I have so far, using my personal website as an example:
httrack http://www.ludist.com/ -O [path directory] -e -%e1 (for the full capture)
and
httrack http://www.ludist.com/ -O [path directory] -e -%e1 -p0 (for the URL scan)
As I understand it, those options are:
-e = search the whole web, rather than staying within the top-level domain
-%e1 = limit option; one degree of external links from the top-level domain
-p0 = switch from capture to scan (i.e. only list the URLs rather than saving files)
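To spell out how I expect those options to combine, here are the same two
commands again with each flag annotated; the output directory is just a
hypothetical placeholder for wherever the mirror should go.

# hypothetical output directory, standing in for [path directory] above
OUT="/path/to/output"

# full capture: crawl beyond the start domain (-e), but stop one degree
# of external links out from the mirrored site (-%e1)
httrack http://www.ludist.com/ -O "$OUT" -e -%e1

# URL scan: same crawl, but -p0 should switch it from saving files to
# just recording the URLs it finds
httrack http://www.ludist.com/ -O "$OUT" -e -%e1 -p0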
Is there an argument or something that the -%eN option requires? The problem
I'm running into is that it isn't limiting the crawl at all and tries to
download the entire internet. I've tried it without -e, and it still tried to
download the entire internet, oddly enough.
I saw someone else ask for similar help, and --near was suggested; based on
the documentation, I don't see how that would resolve the issue.
I deeply appreciate any help you can provide. Thanks again for HTTrack!
Best,
Tommy Rousse
JD-PhD student
Northwestern University | Media, Technology and Society