I tend to leave HTTrack set without a maximum depth,
instead preferring to exclude sites that become
problematic on a case-by-case basis. Anyway, I was
working on this site:
www.staff.uiuc.edu/~ehowes/
and it appeared to be turning into a 'huge' crawl, with
thousands of links to grab and hundreds of unprocessed
pages. (I think that's what "Links scanned: 334/1457
(+143)" means: 143 more links still to scan. Right?)
Looking at the details of the Actions section showed
that many links were coming from these sites:
intel.com
pcpitstop.com
moosoft.com
neuro-tech.net
os2site.com
all of which I added to the 'exclude' scan rules for
this crawl. This led to HTTrack finishing with a
reasonable number of links and files.
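For reference, the setup looks roughly like this on the
command line (just a sketch; the output directory is my
own example, and the same -*site* filters can be pasted
into WinHTTrack's Scan Rules box instead). Note there is
no -rN depth option, so the depth stays at HTTrack's
default and only the problem sites are excluded:

  httrack "http://www.staff.uiuc.edu/~ehowes/" -O ~/mirrors/ehowes \
    "-*intel.com*" "-*pcpitstop.com*" "-*moosoft.com*" \
    "-*neuro-tech.net*" "-*os2site.com*"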
Trying to isolate the cause by adding them back one by
one confirmed that these sites were definitely making
HTTrack go on and on. I think these are badly behaved
servers, or they have some CGI scripts that snag
HTTrack. I'm testing neuro-tech.net more thoroughly and
will report my findings soon.
Just a note: using -*intel.com/scripts-df/* blocks the
same things as -*.intel.com* (when coming from my
source site).
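In other words, with everything else kept the same,
either of these (hypothetical command lines, output
path mine) keeps the crawl out of intel.com:

  httrack "http://www.staff.uiuc.edu/~ehowes/" -O ~/mirrors/ehowes "-*intel.com/scripts-df/*"
  httrack "http://www.staff.uiuc.edu/~ehowes/" -O ~/mirrors/ehowes "-*.intel.com*"

presumably because the only intel.com URLs linked from
the source pages are under /scripts-df/.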