> If a file is not found on a server, it is programmed
> to display a 404 page, and in that page it SHOULD
> have a '404 Not Found' HTTP status header. If it
> doesn't then it would be recognised as an ordinary
> page.
Well, I guess we have a case where the server side
is 'broken' or improperly implemented. It is quite
unlikely that we will be able to get that fixed,
especially on sites like Geocities, which leaves a
client-side workaround as the best option. BTW, that
is how web browsers tend to operate: they try their
best to handle broken or poorly implemented pages.
Lastly, that file link/error page is handled
correctly in Internet Explorer, meaning that IE
displays the error page instead of asking the user
'here is the file you requested, where shall I save
it?'.
> Coupled with that is also the 'type' of a file. I
> believe this is also up to the server. It can look
> at the requested file, see what type it is, and set
> the appropriate header.
Maybe HTTrack isn't reading the header type correctly
from Geocities servers' responses?
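To make the header idea concrete, here is a rough sketch in Python of the kind of check a client could do on the response headers. This is not HTTrack's actual code; the function name, the extension list, and the decision rule are all my own assumptions for illustration.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical list of extensions we expect to be binary downloads.
BINARY_EXTENSIONS = {".zip", ".rar", ".exe"}


def is_error_response(status, content_type, url):
    """Guess whether a response is an error page rather than the file.

    status: numeric HTTP status code from the response.
    content_type: raw Content-Type header value (may be None).
    url: the URL that was requested.
    """
    # A well-behaved server signals the error in the status code.
    if status >= 400:
        return True
    # Otherwise compare the declared media type with the extension:
    # a ".zip" link answered with "text/html" is suspicious.
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    return ext in BINARY_EXTENSIONS and media_type == "text/html"
```

This only helps if the server at least sends an honest Content-Type; if it sends both a 200 status and a wrong type, the file contents have to be inspected instead.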
> I don't think it would be very good 'looking' at an
> HTML page for <html>, as not every HTML page starts
> with this, and sometimes doesn't even have one at
> all.
Looking for the <HTML> tag is just an idea...it
could also be handled by checking whether a file's
first bytes are "PK" for zip, "Rar!" for rar, "MZ"
for exe, etc. for non-HTML file types. Most
file types have a signature of some sort at the
beginning. DLL files start with "MZ" too, although
plain DOS COM files have no signature header.
A better implementation is to use a combination of
techniques: if not checking for <HTML>, then check
whether the file matches the type implied by its
extension. (I.e. read the beginning of a 'ZIP' file;
IF the first 2 bytes do not equal "PK", THEN save it
as 404filename.zip.html.) Do similar tests for other
file types like RAR, etc.
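The extension test described above could look roughly like this in Python. It is only a sketch: the signature table is deliberately tiny, and the function name and the way the "404" prefix is applied to a bare file name are my own assumptions, not anything HTTrack does today.

```python
import os

# Leading bytes ("magic numbers") expected for a few extensions.
# A small illustrative table, not a complete list.
SIGNATURES = {
    ".zip": b"PK",
    ".rar": b"Rar!",
    ".exe": b"MZ",
    ".dll": b"MZ",
}


def checked_save_name(filename, data):
    """Return the name to save under, given the first bytes of the body.

    filename: bare file name taken from the link (no directory part).
    data: the beginning of the downloaded content, as bytes.
    """
    ext = os.path.splitext(filename)[1].lower()
    magic = SIGNATURES.get(ext)
    # Known extension, but the content does not start with its
    # signature: treat it as an error page served in place of the file.
    if magic is not None and not data.startswith(magic):
        return "404" + filename + ".html"
    # Unknown extension or matching signature: keep the name as-is.
    return filename
```

For example, `checked_save_name("filename.zip", b"<html>...")` gives `404filename.zip.html`, while the same name with content starting with `PK` is left alone. Extensions that are not in the table are never renamed, so ordinary HTML pages are unaffected.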
Other things that might be affected are cgi/php
scripts that return files (maybe...I'm not sure.)