Re: httrack 'escaping' from expected crawling area,XML

Subject: Re: httrack 'escaping' from expected crawling area,XML

Author: Haudy Kazemi

Date: 01/05/2003 08:45

Hi Xavier,

> > Two reasons:
> > - the server stupidly replies with a 200 ('OK, page found')
> > message instead of a 404 message, as requested by the RFC

If I figured out a way to automatically detect these 
misidentified 404 pages that return a 200 status, would that 
have fixed the problem in this case? (Or would HTTrack have 
continued following that link to 'images/sm-duck.gif' 
anyway?)

> > - httrack still sees in the fake 404 page the code:
> > 
> > // preload images to be placed in tooltip
> > // place your images in this array
> > var imgAr = new Array(
> > 	'images/sm-duck.gif'
> > );
> > 
> > .. and attempts to fetch the file again and again.

If the page had been reported with status 404, but otherwise 
contained the same code as above, would HTTrack still have 
kept spidering it indefinitely?

> > I will try to find a way to avoid this, but the problem is
> > not trivial..

I've put together some preliminary algorithms to help 
identify 'high-risk' URLs and pages that may be problematic 
because they don't return 404s.  They work by comparing the 
file type implied by the URL with the MIME type the server 
reports, and then checking the response status.  I think 
some of this could usefully be incorporated into HTTrack to 
make it smarter in unusual, non-RFC cases.  Please let me 
know what you think about all this.  :)
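
To make the idea concrete, here is a minimal sketch of that kind of check, written in Python rather than HTTrack's own C. It is only an illustration under stated assumptions, not the actual algorithm and not anything HTTrack implements: the function name `is_high_risk` is hypothetical, and the rule used (URL extension suggests one major MIME type, the server answers 200 with a different one) is just one plausible reading of the description above.

```python
import mimetypes
from urllib.parse import urlsplit


def is_high_risk(url, status, content_type):
    """Guess whether a 200 response is really a disguised error page.

    Hypothetical heuristic: the URL's extension implies one kind of
    content (e.g. an image), but the server reports a different major
    MIME type (e.g. text) while still claiming status 200.
    """
    # A genuine error status is not "disguised", so it is not flagged here.
    if status != 200:
        return False

    # MIME type implied by the file extension in the URL path, if any.
    expected, _encoding = mimetypes.guess_type(urlsplit(url).path)
    if expected is None:
        return False  # unknown extension: nothing to compare against

    # Strip parameters such as "; charset=..." from the Content-Type header.
    actual = content_type.split(";", 1)[0].strip().lower()
    if not actual:
        return False  # no Content-Type reported: cannot judge

    # Compare only the major type ("image" vs "text") to tolerate
    # harmless variations such as image/jpeg vs image/pjpeg.
    return expected.split("/")[0] != actual.split("/")[0]


# Example: the case from this thread, a .gif URL answered with an HTML page.
print(is_high_risk("http://example.com/images/sm-duck.gif", 200, "text/html"))
```

A crawler using something like this could decline to parse links out of a flagged page, which is the step that feeds the repeated 'images/sm-duck.gif' requests. Whether that alone would stop the loop depends on how HTTrack treats such pages, which is exactly the question above.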
