Unlike most of you, I am trying to de-limit rather than limit my spidering as
much as possible for an art project I'm working on. I have been using wget,
with reasonable results, but it has a tendency to die rather quickly. I've
been
experimenting with httrack for a few days, and it seems to have some
advantages, but I am having trouble crossing from one domain to another: I'll
get the homepage, but no more. I'm using the following options:
httrack http://www.somesite.org -O /Volumes/sounds/httrack_get
-C0N1003s0K%e9999r9999zI0b1nBe
.httrackrc:
assume sp=text/html,php3=text/html,cgi=image/gif
ext-depth 512
user-agent "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
What I get is about 27 files downloaded, and then nothing more downloads,
though the program is still running, displaying messages like this:
channels.netscape.com/ns/search/hotsearch.jsp (168 bytes) - OK
The matching line in my log:
23:35:33 Info: engine: save-name: local name: channels.netscape.com/ns/search/hotsearch.html -> hotsearch.html
It has apparently not been downloaded, just checked. (I realize that, being a
.jsp, it may not download, but HTML files off the main domain don't download
either.)
Now, that's Netscape, and who knows what protection they have, but I do get
this link:
23:31:54 Info: engine: transfer-status: link recorded: www.throughthecracks.org/index.html -> /Volumes/sounds/httrack_get_pan2/index-9.html
I have that file -- it's another homepage, but I'm not getting anything from
the throughthecracks.org site past that point. If I try the site directly, it
downloads, no problem.
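(By "try the site directly" I mean starting a fresh mirror of just that one
site, roughly like this; the scratch output directory is just something I made
up for the test:

httrack http://www.throughthecracks.org/ -O /Volumes/sounds/httrack_test -r9999

Run that way it pulls pages down without trouble; it's only when the site is
reached as an external link from the first mirror that nothing past the
homepage arrives.)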
I thought the e flag, plus the %e depth, would cover this. What am I doing
wrong? And have you any other tips for promiscuous downloading?
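One more thing I've been wondering: do I also need an explicit catch-all scan
rule after the URL? Something along these lines, with the "+*" filter taken
from the examples in the httrack documentation (I haven't tested this variant
yet):

httrack http://www.somesite.org -O /Volumes/sounds/httrack_get -C0N1003s0K%e9999r9999zI0b1nBe "+*"

Or is the e flag supposed to make that unnecessary?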
Thanks,
\M