| *Note: this is long (sorry), but I've tried to
identify the problem with a specific example and
make a reasonable suggestion for (what I think is) a
good solution. Jump to the end for a more concise
summary.*
----------------------
I've run across several cases where HTTrack finds a
link to a file (ZIP/RAR/etc.) that is missing on the
server, so a regular web browser shows a 'missing'
error page. HTTrack treats the error page as the ZIP
or RAR file itself and saves it under that name. (If
the ZIP file was called 'LinuxFAQ.zip' but was missing,
the custom error HTML page is saved as LinuxFAQ.zip. I
verified this by opening it in NotePad: it was the HTML
code for the error page.)
I've noticed it on sites with custom 404 pages, like
geocities.com, which do not appear to redirect the web
browser to a page like '404.html' but instead provide
the 404 page directly in place of LinuxFAQ.zip.
Fixing this may require detecting the type of
downloaded files; this is quite easy for ZIPs (they
start with 'PK') and RARs (they start with 'Rar!').
Maybe it is better to try to detect whether the 'ZIP'
or 'RAR' file in question is actually HTML code (starts
with '<HTML>'), giving it the HTML extension if so.
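To make the idea concrete, here is a minimal sketch of such a check in Python. It is not HTTrack code; the function name `sniff_type` and the constants are made up for illustration. It also assumes an error page might begin with whitespace, a byte-order mark, a lowercase tag, or a doctype rather than a literal '<HTML>', which is my own guess about what real 404 pages look like.

```python
# Sketch only: these names are hypothetical and not part of HTTrack.
ZIP_MAGIC = b"PK"      # ZIP archives begin with the bytes 'PK'
RAR_MAGIC = b"Rar!"    # RAR archives begin with the bytes 'Rar!'
# Assumption: an HTML page may open with a doctype instead of <html>.
HTML_MARKERS = (b"<html", b"<!doctype html")


def sniff_type(head: bytes) -> str:
    """Guess a type from the first bytes: 'zip', 'rar', 'html' or 'unknown'."""
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if head.startswith(RAR_MAGIC):
        return "rar"
    # Skip a UTF-8 byte-order mark and leading whitespace, then compare
    # case-insensitively, since tags may be written in either case.
    text = head.lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    if text.startswith(HTML_MARKERS):
        return "html"
    return "unknown"
```

Only the first few dozen bytes of the download need to be passed in, so the check could run before the file is written under its final name.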
Some ideas for naming the erroneous link are:
Example link to a file that is supposed to work:
www.geocities.com/somesite/LinuxFAQ.zip
Renaming ideas:
LinuxFAQ.zip404.html
404a.html, 404b.html, etc.
404LinuxFAQ.zip.html
The 3rd idea is the most useful for anyone wishing to
know the name of the original download, while still
being able to tell at a quick glance through the
folders that some files were missing, and what they
were called.
-----------------------------
Example/quick summary
-----------------------------
Here is one link that is treated improperly (supposed
to be a WAV file, but it was missing and replaced with
a 404 page):
<http://www.geocities.com/SiliconValley/Bay/5498/welcome2.wav>
linked from
members.aol.com/donnaskani/aolwavs.html
HTTrack grabs and saves WELCOME2.WAV as if it were the
correct file. When the mirror is browsed locally,
everything looks 'ok' until you click on a link to
open the WAV/ZIP/RAR. Then you can either 'save' or
'open' the file, but that fails because it is really
an HTML error page renamed as WAV/ZIP/RAR/etc.
I think the fix for these nonstandard error pages is
to detect the types of all supposedly non-HTML files
to be downloaded. If they begin with '<HTML>', the
local copy could be named something
like '404LinuxFAQ.zip.html' for a file supposed to be
called 'LinuxFAQ.zip'.
Reasons for this naming scheme:
1.) putting 404 in front of the name will make them
easily noticeable,
2.) including the original filename will preserve more
of the original content information (so you can
manually track down the lost file by using
FTP/filesearch engines)
3.) adding a second extension '.html' will ensure the
browser opens the saved error page as HTML instead of
trying to send the link to an external program.
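The naming scheme above could be sketched like this, again in Python and purely as an illustration rather than anything HTTrack actually does. The function `local_name` is hypothetical; it takes the path from the link and the first bytes of what was downloaded, and it assumes (my assumption) that error pages may start with a doctype or lowercase tag as well as '<HTML>'.

```python
import posixpath


def local_name(url_path: str, head: bytes) -> str:
    """Pick the local file name for a download whose first bytes are `head`.

    Hypothetical helper: if a file that should not be HTML turns out to
    contain HTML, save it as '404<original name>.html'.
    """
    name = posixpath.basename(url_path)
    # Ignore leading whitespace and letter case when looking for HTML.
    start = head.lstrip().lower()
    looks_like_html = start.startswith((b"<html", b"<!doctype html"))
    expects_html = name.lower().endswith((".html", ".htm"))
    if looks_like_html and not expects_html:
        return "404" + name + ".html"
    return name


# The missing archive from the example keeps its name inside the new one.
print(local_name("www.geocities.com/somesite/LinuxFAQ.zip", b"<HTML><HEAD>"))
```

Real archives and genuine HTML pages keep their original names; only the mismatched case is renamed, so the other links in the mirror are unaffected.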
| |