detecting missing/404 errors for broken file links

Subject: detecting missing/404 errors for broken file links

Author: Haudy Kazemi

Date: 03/13/2002 19:55

*Note: this is long (sorry), but I've tried to 
identify the problem with a specific example and 
make a reasonable suggestion for (what I think is) a 
good solution.  Jump to the end for a more concise 
summary.*
----------------------
I've run across several cases where HTTrack finds a 
link to a file (ZIP/RAR/etc.) that is missing, which 
produces a 'missing' error page in regular web 
browsers.  HTTrack treats the error page as the ZIP or 
RAR file itself and saves it under that name.  If the 
ZIP file was called 'LinuxFAQ.zip' but was missing, the 
custom error HTML page is saved as LinuxFAQ.zip.  (I 
verified this by opening it in NotePad: it was the 
HTML code for the error page.)

I've noticed it on sites with custom 404 pages, such as 
geocities.com, which do not appear to redirect the web 
browser to a page like '404.html' but instead serve 
the 404 page directly in place of LinuxFAQ.zip.

Fixing this may require detecting the type of 
downloaded files; this is quite easy for ZIPs (their 
contents start with 'PK') and RARs (they start with 
'Rar!').  Maybe it is better to detect whether the 
'ZIP' or 'RAR' file in question is actually HTML code 
(starts with '<HTML>'), giving it the HTML extension 
if so.
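
As a rough illustration of the check described above, here is a 
minimal Python sketch.  It is not HTTrack code; the function name 
sniff_type is made up for this example.  It looks only at the 
first bytes of a downloaded file.  Accepting '<!DOCTYPE html' and 
skipping leading whitespace or a byte-order mark are my own 
assumptions on top of the plain '<HTML>' test.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def sniff_type(head: bytes) -> str:
    """Guess a file's real type from its first bytes (hypothetical helper)."""
    if head.startswith(b"PK"):
        return "zip"
    if head.startswith(b"Rar!"):
        return "rar"
    # Assumption: error pages may begin with a BOM, whitespace or a doctype,
    # so strip those before looking for the opening markup.
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    text = head.lstrip().lower()
    if text.startswith((b"<html", b"<!doctype html")):
        return "html"
    return "unknown"
```

A file named LinuxFAQ.zip whose first bytes give "html" rather 
than "zip" would then be a candidate for renaming.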

Here are some ideas for naming the saved error page.
Example link to a file that is supposed to work:
www.geocities.com/somesite/LinuxFAQ.zip

Renaming ideas:
LinuxFAQ.zip404.html
404a.html, 404b.html, etc.
404LinuxFAQ.zip.html

The 3rd idea is the most useful for anyone who wants to 
know the name of the original download, yet also wants 
to tell at a quick glance at his folders that some 
files were missing, and what they were called.

-----------------------------
Example/quick summary
-----------------------------
Here is one link that is treated improperly (supposed 
to be a WAV file, but was missing and replaced with a 
404 page):
<http://www.geocities.com/SiliconValley/Bay/5498/welcome2.wav>
linked from 
members.aol.com/donnaskani/aolwavs.html

HTTrack grabs and saves WELCOME2.WAV as if it were the 
correct file.  When the mirror is browsed locally, 
everything looks 'ok' until you click on a link to 
open the WAV/ZIP/RAR.  Then you can either 'save' 
or 'open' the file, but that fails because it is 
really an HTML error page renamed as WAV/ZIP/RAR/etc.

I think the fix for these nonstandard error pages is 
to detect the types of all supposedly non-HTML files 
to be downloaded.  If they begin with '<HTML>', the 
local copy could be named something 
like '404LinuxFAQ.zip.html' for a file supposed to be 
called 'LinuxFAQ.zip'.  
Reasons for this naming scheme:
1.) putting 404 in front of the name will make them 
easily noticeable,
2.) including the original filename will preserve more 
of the original content information (so you can 
manually track down the lost file by using 
FTP/filesearch engines)
3.) adding a second extension '.html' will ensure the 
browser opens the saved error page as HTML instead of 
trying to send the link to an external program.
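
To make the naming scheme concrete, here is a small Python sketch 
of the renaming rule.  Again, this is only an illustration, not 
how HTTrack works internally; the names local_name and 
looks_like_html are hypothetical, and the caller is assumed to 
have already decided whether the downloaded content is HTML.

```python
import posixpath
from urllib.parse import urlsplit


def local_name(url: str, looks_like_html: bool) -> str:
    """Pick the local filename for a link to a non-HTML file (sketch)."""
    # Take the last path component of the link as the original filename.
    name = posixpath.basename(urlsplit(url).path)
    if looks_like_html:
        # '404' prefix flags the file; '.html' makes the browser open it.
        return "404" + name + ".html"
    return name


print(local_name("http://www.geocities.com/somesite/LinuxFAQ.zip", True))
```

With the example link from above, a detected error page would be 
saved as 404LinuxFAQ.zip.html, and a genuine archive would keep 
the name LinuxFAQ.zip.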
