> >completed successfully (AFAIK) except for one
> >endless/infinite loop on an image (actually a 404 error)
>
> Two reasons:
> - the server stupidly replies with a 200 ('OK, page found')
> message instead of a 404 message, as required by the RFC
Yes, I see that. This is also clear from looking at the
new.txt file (which I must add is a very useful tool). I
know this issue has popped up before; I think it was on a
few sites I was working on months ago.
> - httrack still sees in the fake 404 page the code:
>
> // preload images to be placed in tooltip
> // place your images in this array
> var imgAr = new Array(
> 'images/sm-duck.gif'
> );
>
> .. and attempts to fetch the file again and again.
>
> I will try to find a way to avoid this, but the problem is
> not trivial..
Indeed... how to deal with broken servers is definitely a
problem... if only servers would follow the RFCs.
Anyhow, from looking at the NEW.TXT, I observed something
that may be a (partial) solution: the broken server
returned a text/html MIME type for the sm-duck.gif file,
while it returned normal image/gif MIME types for various
other gif files it served up. It did the same thing
(text/html MIME) for the PDF document too. On working
servers, I presume the MIME types match the file
extensions more reliably.
Perhaps by watching for major MIME type/file extension
clashes (like text/html with an obvious image file such as
.jpg/.gif/.png/.bmp, or text/html with an obvious non-text
file like PDF, XLS, PS, etc.), HTTrack can detect
'potentially troublesome' links. The test would be to
compare the advertised MIME type with the file type implied
by the file extension taken from the URL. This would be in
addition to the current Check Document type option, which is
used to identify whether the result of a CGI is HTML, GIF,
etc. output.
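
A rough sketch of the MIME type/extension comparison, in Python
just for illustration (this is not HTTrack code). The extension
table and the function name is_potentially_troublesome are made
up for the example, and the table is deliberately incomplete. It
only flags the case seen here: a text/* type advertised for a
file whose extension says it is not text.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical, incomplete table: extension -> expected major MIME type.
EXPECTED_MAJOR_TYPE = {
    ".gif": "image", ".jpg": "image", ".jpeg": "image",
    ".png": "image", ".bmp": "image",
    ".pdf": "application", ".xls": "application", ".ps": "application",
}

def is_potentially_troublesome(url, content_type):
    """True when a text/* type is advertised for an obviously non-text file."""
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    expected = EXPECTED_MAJOR_TYPE.get(ext)
    if expected is None or not content_type:
        return False  # unknown extension or no header: nothing to compare
    # Drop parameters such as "; charset=..." and keep the major type only.
    advertised = content_type.split(";")[0].strip().lower()
    advertised_major = advertised.split("/")[0]
    return advertised_major == "text" and expected != "text"
```

With this, .../images/sm-duck.gif served as text/html is flagged,
while the same URL served as image/gif is not.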
Then, if a link is considered 'potentially troublesome', a
text search for the strings '404', 'not', and 'found' could
be done on the page at that link.
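
The text search might look something like the sketch below
(Python again, purely illustrative). How the three strings
should be combined is my assumption: here a page counts as an
error page if it contains '404', or both 'not' and 'found'. The
function name looks_like_error_page is hypothetical, and the tag
stripping is deliberately crude.

```python
import re

def looks_like_error_page(html_text):
    """Heuristic: does the page body read like a 'not found' message?"""
    # Crude tag removal; good enough for a keyword scan.
    text = re.sub(r"<[^>]+>", " ", html_text).lower()
    words = set(re.findall(r"[a-z0-9]+", text))
    # Assumed rule: '404' alone, or 'not' together with 'found'.
    return "404" in words or ("not" in words and "found" in words)
```

This would only be run on links already flagged as 'potentially
troublesome', since plenty of ordinary pages contain the words
'not' and 'found'.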
Alternatively, links flagged as 'potentially troublesome'
could be added to a list the user must double check and
override if they want the link to download anyway.
Yet another workaround is figuring out a way to detect
infinite loops. I don't have an algorithm for doing
that...yet...
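
One possible starting point, offered only as a guess and not as
a real algorithm: cap how many times the same URL may be fetched,
and refuse URLs where one path segment repeats suspiciously often
(e.g. images/images/images/...). The class name LoopGuard and
both limits are invented for this sketch, and I am assuming the
loop shows up as one of these two patterns.

```python
from collections import Counter
from urllib.parse import urlparse

class LoopGuard:
    """Hypothetical guard that refuses fetches which look like a crawl loop."""

    def __init__(self, max_fetches_per_url=3, max_segment_repeats=3):
        self.max_fetches_per_url = max_fetches_per_url
        self.max_segment_repeats = max_segment_repeats
        self.fetches = Counter()

    def allow(self, url):
        parts = urlparse(url)
        segments = [s for s in parts.path.split("/") if s]
        # Refuse paths where a single segment repeats too many times.
        if segments and max(Counter(segments).values()) > self.max_segment_repeats:
            return False
        # Refuse a URL once it has been requested too many times.
        key = (parts.netloc.lower(), parts.path)
        self.fetches[key] += 1
        return self.fetches[key] <= self.max_fetches_per_url
```

The crawler would call allow(url) before each request and skip
the link (or put it on the user's review list) when it returns
False.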