Re: detecting missing/404 errors for broken file links

Author: Haudy Kazemi

Date: 03/14/2002 01:17

> If a file is not found on a server, it is programmed
> to display a 404 page, and in that page it SHOULD
> have a '404 Not Found' HTTP status header.  If it
> doesn't then it would be recognised as an ordinary
> page.

Well, I guess we have a case where the server side 
is 'broken' or implemented improperly.  It is quite 
unlikely that we will be able to get that fixed, 
especially on sites like Geocities, which leaves a 
client-side workaround as the best option.  BTW, that 
is how web browsers tend to operate: they try their 
best to handle broken or poorly implemented pages.  
Lastly, that file link/error page is handled 
correctly in Internet Explorer, meaning that IE 
displays the error page instead of telling the user 
'here is the file you requested, where shall I save 
it?'.
 
> Coupled with that is also the 'type' of a file.  I
> believe this is also up to the server. It can look
> at the requested file, see what type it is, and set
> the appropriate header.

Maybe HTTrack isn't reading the content type header 
correctly from the Geocities servers' responses?

> I don't think it would be very good 'looking' at an
> HTML page for <html>, as not every HTML page starts
> with this, and sometimes doesn't even have one at 
> all.

Looking for the <HTML> tag is just an idea...it 
could also be handled by verifying that a file's 
first bytes are "PK" for zip, "Rar!" for rar, "MZ" 
for exe, etc. for non-html file types.  Most binary 
file types have a signature of some sort at the 
beginning.  DLLs start with "MZ" as well, though 
plain DOS COM files have no signature at all.

A better implementation is to use a combination of 
techniques: if not checking for <HTML>, then check 
whether the file matches the type implied by its 
extension.  (I.e. read the beginning of a 'ZIP' file; 
IF the first 2 bytes do not equal "PK", THEN save it 
as 404filename.zip.html.)  (Do similar tests for 
other file types like RAR, etc.)
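
For what it's worth, the extension check above could be sketched roughly 
like this in Python.  This is only an illustration and not HTTrack code: 
SIGNATURES and local_name are names I made up, the table covers just a 
handful of types, and the function assumes it is handed a bare filename 
plus the first few bytes of the downloaded data.

```python
import os

# Leading "magic" bytes for a few download types.
# Illustrative only; a real table would need many more entries.
SIGNATURES = {
    ".zip": (b"PK",),
    ".rar": (b"Rar!",),
    ".exe": (b"MZ",),
    ".dll": (b"MZ",),
}


def local_name(filename: str, head: bytes) -> str:
    """Pick the name to save under, given the first bytes of the data.

    If the extension has a known signature and the data does not start
    with it, assume the server sent an error page with a normal status
    and flag it as 404<filename>.html (hypothetical naming scheme taken
    from the suggestion above).
    """
    ext = os.path.splitext(filename)[1].lower()
    expected = SIGNATURES.get(ext)
    if expected is None:
        # Unknown extension: nothing to compare against, keep the name.
        return filename
    if head.startswith(expected):
        # Data matches the type implied by the extension.
        return filename
    return "404" + filename + ".html"


if __name__ == "__main__":
    print(local_name("tool.zip", b"PK\x03\x04..."))
    print(local_name("tool.zip", b"<html><body>Not found"))
```

Only the first few bytes are needed, so the decision can be made as 
soon as the download starts, before the whole file has been written.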

Other things that might be affected are cgi/php 
scripts that return files (maybe...I'm not sure.)
