HTTrack Website Copier
Free software offline browser - FORUM
Subject: Unable to download DMOZ
Author: Frank de Groot
Date: 10/11/2010 10:54
 
I tried to download DMOZ but ran into the following problem in HTTrack (a bug,
or me not understanding a feature). It causes an enormous number of 404s,
so only a small part of DMOZ gets downloaded.

I set the maximum number of links to 100 000 000, to no avail.

What happens is that HTTrack appends ".html" to URLs that end with a slash
rather than with ".html".

An example where it goes wrong is the link
"www.dmoz.org/Health/Conditions_and_Diseases/Nutrition_and_Metabolism_Disorders/"


HTTrack appends ".html" and then cannot follow the resulting erroneous link,
which yields a 404. I have spent weeks trying to solve this, each time
downloading well over a GB from DMOZ. I even installed Linux and used the Linux
version, but I hit the same problem every time, no matter how much I tweak the
settings.

Before I bring down DMOZ, or before DMOZ folds, could anyone help me?

TL;DR: HTTrack appends a ".html" suffix to URLs that end with a slash, causing
false 404s.
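
For illustration, the behaviour one would expect from a mirroring tool can be
sketched as follows. This is a minimal Python sketch under stated assumptions,
not HTTrack's actual code: the function name local_path_for and the
"index.html" naming convention are hypothetical. The point is that any suffix
is applied only to the local file name, while the URL that is requested from
the server stays exactly as it appeared in the link.

```python
from urllib.parse import urlsplit
import posixpath


def local_path_for(url: str) -> str:
    """Return a local file path for a remote URL.

    Hypothetical helper: the remote URL itself is never modified,
    only the name under which the page is saved on disk.
    """
    # Links such as "www.dmoz.org/Health/" carry no scheme; assume http.
    parts = urlsplit(url if "://" in url else "http://" + url)
    path = parts.path or "/"
    if path.endswith("/"):
        # Directory-style URL: save as index.html inside that directory.
        path += "index.html"
    elif "." not in posixpath.basename(path):
        # Extensionless page: add .html to the local name only.
        path += ".html"
    return parts.netloc + path


link = ("www.dmoz.org/Health/Conditions_and_Diseases/"
        "Nutrition_and_Metabolism_Disorders/")
print("fetch:", link)                  # requested unchanged, slash included
print("save: ", local_path_for(link))  # local name gets index.html
```

With a mapping like this, the request for the link above would still end with
a slash, so the server would not see a ".html" URL that does not exist.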







 