I tried to download DMOZ, but I ran into the following problem (a bug, or me not
understanding a feature) in HTTrack. It causes an enormous number of 404s,
so only a small part of DMOZ gets downloaded.
I set the maximum number of links to 100 000 000, to no avail.
What happens is that HTTrack appends ".html" to URLs that do not end with
".html" but with a slash.
An example where it goes wrong is the link
"www.dmoz.org/Health/Conditions_and_Diseases/Nutrition_and_Metabolism_Disorders/"
HTTrack appends ".html" and, as a result, can't follow the erroneous link,
yielding a 404. I have spent weeks trying to solve this, each time downloading
well over a GB from DMOZ. I even installed Linux and used the Linux version,
but I get the same problem every time, no matter how much I tweak the settings.
Before I bring down DMOZ, or before DMOZ folds, could anyone help me?
TL;DR: HTTrack appends a ".html" suffix to URLs that end with a slash, causing
false 404s.
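
To make clear what I would expect instead, here is a minimal Python sketch. It is
not HTTrack's code and I don't know how HTTrack does this internally; the function
name mirror_plan and the "index.html" file name are my own assumptions. The idea is
that a URL ending in a slash is requested unchanged, and a file name is only added
to the local copy on disk.

```python
from urllib.parse import urlsplit


def mirror_plan(url: str) -> tuple[str, str]:
    """Hypothetical helper: return (URL to request, local path to save to).

    The request URL is never modified. Only the local path gets a file
    name, and only when the URL path ends with a slash.
    """
    # urlsplit needs a scheme to find the host; assume http if none is given.
    parts = urlsplit(url if "://" in url else "http://" + url)
    path = parts.path or "/"
    if path.endswith("/"):
        # Assumption: store directory-style pages as index.html locally.
        path += "index.html"
    return url, parts.netloc + path


request_url, local_path = mirror_plan(
    "www.dmoz.org/Health/Conditions_and_Diseases/Nutrition_and_Metabolism_Disorders/"
)
print(request_url)
print(local_path)
```

With the link from my example, the request URL stays exactly as it is and only the
local path ends in "index.html". In my downloads it looks as if the suffix ends up
on the request URL instead, which is where the 404s come from.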