| ========== Cut-n-Paste
URIs are case sensitive in httrack (not hostnames);
therefore index.html and INDEX.HTML will be saved as two
different files by httrack. Besides, httrack considers that
index.html and INDEX.HTML are the same local resource
names, and therefore will save them with different names.
========== End Cut-n-Paste
...this makes sense, but it assumes that there are links
pointing to both of them; then, as HTTrack processed the
pages, it would find both (assuming they both exist).
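To make the quoted point concrete, here is a minimal Python sketch of that comparison rule: hostnames compare case-insensitively, paths do not. The function name same_resource is made up for illustration, it ignores ports and query strings, and it says nothing about how HTTrack is implemented internally.

```python
from urllib.parse import urlsplit

def same_resource(url_a, url_b):
    """Compare two URLs: scheme and hostname ignore case, the path does not."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    # urlsplit lowercases the scheme, and .hostname is always lowercase;
    # .path is left exactly as written in the link.
    return (a.scheme, a.hostname, a.path) == (b.scheme, b.hostname, b.path)

# Hostname case does not matter:
print(same_resource("http://WWW.Website.com/index.html",
                    "http://www.website.com/index.html"))
# Path case does matter:
print(same_resource("http://www.website.com/index.html",
                    "http://www.website.com/INDEX.HTML"))
```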
========== Cut-n-Paste
you should not have 404 errors, as httrack isn't
converting names into lowercase when doing the requests
========== End Cut-n-Paste
...exactly, this is what I was referring to. If a link
says www.website.com/index.htm, but on the host the file
is actually named INDEX.HTM, the browser (as well as
HTTrack) can't find it (at least on the site that
originally prompted my first post). This is not a bug or
problem with HTTrack, but it does cause the download to
fail. The problem is that the author of the page created a
bad link (or assumed the filename would be lowercase, not
uppercase).
So what I was proposing is that if HTTrack gets a 404, it
retries with the name converted to all uppercase, then
lowercase, then title case, and also tries both .htm and
.html. (Trying all these combinations should probably be
optional, as I am pretty sure it could slow things down.)
This could be very useful, as you could still get a page
even though the links are bad, and when it is saved, the
links would probably be corrected by HTTrack.
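The retry idea could be sketched roughly like this in Python. This is only an illustration of the proposal, not something HTTrack does today: candidate_paths and resolve are hypothetical names, "title case" is taken to mean a capitalized first letter, only the last path segment is varied, and the exists argument stands in for whatever request returns something other than a 404.

```python
import posixpath

def candidate_paths(path):
    """List alternative spellings of the file name to try after a 404."""
    directory, filename = posixpath.split(path)
    stem, ext = posixpath.splitext(filename)

    # Try the original extension first, then its .htm/.html counterpart.
    extensions = [ext]
    if ext.lower() == ".htm":
        extensions.append(".html")
    elif ext.lower() == ".html":
        extensions.append(".htm")

    seen = {path}  # never retry the spelling that already failed
    candidates = []
    for e in extensions:
        name = stem + e
        # Uppercase, then lowercase, then capitalized (assumed "title case").
        for variant in (name.upper(), name.lower(), name.capitalize()):
            candidate = posixpath.join(directory, variant)
            if candidate not in seen:
                seen.add(candidate)
                candidates.append(candidate)
    return candidates

def resolve(path, exists):
    """Return the first spelling for which exists() is true, or None."""
    if exists(path):
        return path
    for candidate in candidate_paths(path):
        if exists(candidate):
            return candidate
    return None

# Example: the link says /index.htm but the server only has /INDEX.HTM.
print(resolve("/index.htm", lambda p: p == "/INDEX.HTM"))
```

For a single bad link this means up to five extra requests, which is why I think it should be an option rather than the default.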
I wish I had the website that I found this problem on...
but I got so frustrated trying to go through and fix the
links manually that I think I deleted the link (I know I
deleted the local copy of the stuff I was working on). I
will continue to search for it, and post it if I find it
again.
Regards,
Tj | |