HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: httrack 'escaping' from expected crawling area,XML
Author: Haudy Kazemi
Date: 12/31/2002 11:51
 
> >completed successfully (AFAIK) except for one
> >endless/infinite loop on an image (actually a 404 error)
> 
> Two reasons:
> - the server stupidly replies with a 200 ('OK, page found') 
> message instead of a 404 message, as requested by the RFC

Yes, I see that.  This is also clear from looking at the 
new.txt file (which I must add is a very useful tool).  I 
know this issue has popped up before, I think on a few 
sites I was working on months ago.

> - httrack still sees in the fake 404 page the code:
> 
> // preload images to be placed in tooltip
> // place your images in this array
> var imgAr = new Array(
> 	'images/sm-duck.gif'
> );
> 
> .. and attempts to fetch the file again and again.
> 
> I will try to find a way to avoid this, but the problem is 
> not trivial..

Indeed...how to deal with broken servers is definitely a 
problem...if only servers would follow the RFCs.  Anyhow, 
from looking at the new.txt file, I observed something that 
may be a (partial) solution: the broken server returned a 
text/html MIME type for the sm-duck.gif file, while it 
returned normal image/gif MIME types for the various other 
gif files it served up.  It did the same thing (text/html 
MIME) for the PDF document too.  On working servers, I 
presume the MIME types match the actual file types better.

Perhaps by watching for major MIME type/file extension 
clashes (such as text/html reported for an obvious image 
file ending in .jpg/.gif/.png/.bmp, or for an obvious non-
text file like a PDF, XLS, or PS document), HTTrack could 
detect 'potentially troublesome' links.  The test would be 
to compare the advertised MIME type against the file type 
implied by the extension taken from the URL.  This would be 
in addition to the current Check Document Type test used to 
identify whether a CGI's output is HTML, GIF, etc.
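
To make that concrete, here is a rough sketch in C (my own 
illustration, not actual HTTrack code; the function name, 
the extension list, and the text/html-only rule are just 
assumptions) of the kind of comparison I mean:

#include <string.h>
#include <strings.h>   /* strncasecmp */

/* extensions that should never come back as text/html */
static const char *binary_exts[] = {
    ".gif", ".jpg", ".jpeg", ".png", ".bmp", ".pdf", ".xls", ".ps", NULL
};

/* returns 1 if the advertised MIME type clashes with the URL's
 * extension, 0 otherwise; only the text/html-vs-binary case is
 * checked here */
int mime_extension_clash(const char *mime, const char *url) {
    const char *q = strchr(url, '?');
    const char *end = (q != NULL) ? q : url + strlen(url);
    const char *dot = NULL;
    const char *p;
    int i;

    /* find the last '.' in the path part of the URL (ignore the
       query string) */
    for (p = url; p < end; p++) {
        if (*p == '.')
            dot = p;
        else if (*p == '/')
            dot = NULL;
    }
    if (dot == NULL)
        return 0;                 /* no extension, nothing to compare */

    if (strncasecmp(mime, "text/html", 9) != 0)
        return 0;                 /* only flag the case seen in new.txt */

    for (i = 0; binary_exts[i] != NULL; i++) {
        size_t n = strlen(binary_exts[i]);
        if ((size_t)(end - dot) == n &&
            strncasecmp(dot, binary_exts[i], n) == 0)
            return 1;             /* e.g. text/html for images/sm-duck.gif */
    }
    return 0;
}

For the sm-duck.gif case from new.txt, 
mime_extension_clash("text/html", "images/sm-duck.gif") 
would return 1, while the same URL advertised as image/gif 
would return 0.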

Then, if a link is considered 'potentially troublesome', a 
text search for the strings '404', 'not', and 'found' could 
be done on the page returned for that link.
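
Again only a rough illustration of that second check (the 
function name and the fixed buffer size are hypothetical):

#include <ctype.h>
#include <string.h>

/* returns 1 if the page body looks like a disguised 'not found' page */
int looks_like_fake_404(const char *body) {
    char lower[4096];
    size_t i;

    /* lowercase a (truncated) copy so "Not Found", "NOT FOUND",
       etc. all match */
    for (i = 0; body[i] != '\0' && i + 1 < sizeof(lower); i++)
        lower[i] = (char)tolower((unsigned char)body[i]);
    lower[i] = '\0';

    /* crude: 'not' also matches inside words like 'note', so this is
       only a hint to flag the link, not a final verdict */
    return strstr(lower, "404") != NULL ||
           (strstr(lower, "not") != NULL && strstr(lower, "found") != NULL);
}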

Alternatively, links flagged as 'potentially troublesome' 
could be added to a list the user must double-check and 
override if they want those links downloaded anyway.

Yet another workaround is figuring out a way to detect 
infinite loops.  I don't have an algorithm for doing 
that...yet...
 