HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: (creating a) problematic sites list
Author: Haudy Kazemi
Date: 05/14/2002 01:10
 
I tend to leave HTTrack set without a maximum depth, 
instead preferring to exclude sites that become 
problematic on a case-by-case basis.  Anyway, I was 
working on this site:
www.staff.uiuc.edu/~ehowes/

and it appeared to be turning into a 'huge' crawl, with 
thousands of links to grab and hundreds of unprocessed 
pages.  (I think that's what "Links Scanned: 334/1457 
(+143)" means...143 more links still to scan.  Right?)

Looking at the details of the Actions section showed 
that many links were coming from these sites:
intel.com
pcpitstop.com
moosoft.com
neuro-tech.net
os2site.com
all of which I added to the 'exclude' scan rules for 
this crawl.  This led to HTTrack finishing with a 
reasonable number of links and files.
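The rules I used were along these lines (I'm writing 
them from memory, so treat the exact patterns as a 
rough sketch; they may need to be broader or narrower 
depending on how much of each site is being dragged in):

  -*.intel.com*
  -*.pcpitstop.com*
  -*.moosoft.com*
  -*.neuro-tech.net*
  -*.os2site.com*

In WinHTTrack these go into the Scan Rules box in the 
options; on the command line they can simply be 
appended after the start URL.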

Trying to isolate the cause by adding them back one by 
one told me that these were definitely what was making 
HTTrack go on and on and on.  I think these are bad 
servers...or have some CGIs that snag HTTrack.
neuro-tech.net

I'm testing this more thoroughly and will report my 
findings soon.

Just a note: using -*intel.com/scripts-df/* blocks the 
same things as -*.intel.com* (when starting from my 
source site).
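
For anyone who wants to try this from the command line 
rather than WinHTTrack, the filters are just appended 
after the start URL.  Something roughly like this, where 
the output directory and the exact filter set are only 
for illustration:

  httrack http://www.staff.uiuc.edu/~ehowes/ -O ./ehowes "+*staff.uiuc.edu/~ehowes/*" "-*.intel.com*" "-*.pcpitstop.com*"

-O sets the mirror/output path, + patterns allow URLs 
and - patterns exclude them.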
 