I tend to leave HTTrack set without a maximum depth,
instead preferring to exclude sites that become
problematic on a case-by-case basis. Anyway, I was
working on this site:
www.staff.uiuc.edu/~ehowes/
and it appeared to be turning into a 'huge' crawl, with
thousands of links to grab and hundreds of unprocessed
pages. (I think that's what "Links scanned: 334/1457
(+143)" means: 143 more links still to scan. Right?)
Looking at the details of the Actions section showed
that many links were coming from these sites:
intel.com
pcpitstop.com
moosoft.com
neuro-tech.net
os2site.com
all of which I added to the 'exclude' scan rules for
this crawl. This led to HTTrack finishing with a
reasonable number of links and files.
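For reference, the setup looks roughly like this on the
command line (just a sketch; the output directory is my
own example, and the same -*site* filters can be pasted
into WinHTTrack's Scan Rules box instead). Note there is
no -rN depth option, so the depth stays at HTTrack's
default and only the problem sites are excluded:

  httrack "http://www.staff.uiuc.edu/~ehowes/" -O ~/mirrors/ehowes \
    "-*intel.com*" "-*pcpitstop.com*" "-*moosoft.com*" \
    "-*neuro-tech.net*" "-*os2site.com*"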
Trying to isolate the cause by adding them back one by
one confirmed that these sites were definitely making
HTTrack go on and on. I think these are badly behaved
servers, or they have some CGI scripts that snag
HTTrack. I'm testing neuro-tech.net more thoroughly and
will report my findings soon.
Just a note: using -*intel.com/scripts-df/* blocks the
same things as -*.intel.com* (when coming from my
source site).
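In other words, with everything else kept the same,
either of these (hypothetical command lines, output
path mine) keeps the crawl out of intel.com:

  httrack "http://www.staff.uiuc.edu/~ehowes/" -O ~/mirrors/ehowes "-*intel.com/scripts-df/*"
  httrack "http://www.staff.uiuc.edu/~ehowes/" -O ~/mirrors/ehowes "-*.intel.com*"

presumably because the only intel.com URLs linked from
the source pages are under /scripts-df/.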