Hi Xaviar,
> > Two reasons:
> > - the server stupidly replies with a 200 ('OK, page found')
> > message instead of a 404 message, as requested by the RFC
If I figured out a way to automatically detect these
misidentified 404 pages that return a 200 status, would that
have fixed the problem in this case? (Or would HTTrack have
continued following the link to 'images/sm-duck.gif'
anyway?)
> > - httrack still sees in the fake 404 page the code:
> >
> > // preload images to be placed in tooltip
> > // place your images in this array
> > var imgAr = new Array(
> > 'images/sm-duck.gif'
> > );
> >
> > .. and attempts to fetch the file again and again.
If the page had been reported with status 404, but otherwise
contained the same code as above, would HTTrack still have
kept spidering it indefinitely?
> >
> > I will try to find a way to avoid this, but the problem is
> > not trivial..
I've got some preliminary algorithms put together to help
identify 'high-risk' URLs and pages that may be problematic
because they don't return 404s. They work by comparing the
file type implied by the URL with the MIME type the server
reports, and then checking the status code of the response.
I think some of this could usefully be incorporated into
HTTrack to give it more intelligence in unusual, non-RFC
cases. Please let me know what you think about all
this. :)
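
To make the idea concrete, here is a minimal Python sketch of that kind of check. It is only an illustration under stated assumptions, not HTTrack code and not the actual algorithms mentioned above: the function name `is_suspect_200`, the small extension table, and the example.com URLs are all hypothetical.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical, deliberately small table: file extension -> MIME type
# one would normally expect the server to report for it.
EXPECTED_MIME = {
    ".gif": "image/gif",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".css": "text/css",
    ".zip": "application/zip",
    ".pdf": "application/pdf",
}


def is_suspect_200(url, status, content_type):
    """Return True when a 200 response looks like a disguised error page.

    The heuristic: the URL's extension implies a non-HTML type, yet the
    server answered 200 with an HTML body.
    """
    if status != 200:
        return False  # a real 404 (or any other status) is not our case
    ext = posixpath.splitext(urlparse(url).path.lower())[1]
    expected = EXPECTED_MIME.get(ext)
    if expected is None or not content_type:
        return False  # unknown extension or missing header: no opinion
    # Drop parameters such as "; charset=utf-8" before comparing.
    actual = content_type.split(";", 1)[0].strip().lower()
    return actual == "text/html" and expected != "text/html"


if __name__ == "__main__":
    print(is_suspect_200("http://example.com/images/sm-duck.gif",
                         200, "text/html; charset=utf-8"))
```

A crawler could treat a URL flagged this way as 'high-risk' and decline to parse the returned page for further links, which is the step that would break the loop on 'images/sm-duck.gif'.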