HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: HTTrack and Alternatives to wget
Author: Xavier Roche
Date: 04/08/2006 16:45
 
> Recursive Download (HTTrack can do)
> Preserve Directory Structure (HTTrack can do)
> Preserve Filenames (HTTrack can do*)

Except:
- when the filename uses characters that are not compatible with the
filesystem (that is, characters like ':' or '*'), or when the name is too long
- when the file has to be renamed because of a naming collision (such as
index.html?page=1 and index.html?page=2), as you already figured out (a rough
sketch of this renaming logic is given just below)
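
To make that concrete, here is a minimal sketch in Python (not the actual
httrack code; the unsafe-character set, the length limit and the "-2" renaming
scheme are only assumptions) of how a crawler might sanitize a URL-derived
name and resolve collisions:

    import os
    import re

    # Characters commonly rejected by filesystems (assumed set, not httrack's exact list)
    UNSAFE = re.compile(r'[\\/:*?"<>|]')
    MAX_NAME = 200  # assumed limit, kept below typical filesystem maxima

    def local_name(url_path, taken):
        """Turn a URL path into a safe, unique local filename."""
        # Drop the query string, as mirroring tools often do when building names
        name = url_path.split('?', 1)[0].rsplit('/', 1)[-1] or 'index.html'
        name = UNSAFE.sub('_', name)[:MAX_NAME]
        base, ext = os.path.splitext(name)
        candidate, n = name, 1
        while candidate in taken:          # collision: index.html -> index-2.html
            n += 1
            candidate = f"{base}-{n}{ext}"
        taken.add(candidate)
        return candidate

    taken = set()
    print(local_name("index.html?page=1", taken))   # -> index.html
    print(local_name("index.html?page=2", taken))   # -> index-2.html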

> Is intelligent about MIME Types (HTTrack is pretty
> good at this)

Well, the latest (3.40) release now waits for the remote MIME type before
naming pages and making decisions, and handles redirections transparently: the
handling is generally quite good. [ Some bugs seem to remain, however, and
sometimes cause files to be named with a ".del" type, but I haven't been able
to track them down yet ]
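
The general idea, sketched here with Python's standard library (just an
illustration, not how httrack is implemented internally), is to trust the
Content-Type the server actually returns rather than the URL suffix:

    import mimetypes
    import urllib.request

    def fetch_and_name(url):
        """Fetch a URL, follow redirections, and derive the local extension
        from the Content-Type actually returned by the server."""
        with urllib.request.urlopen(url) as resp:
            data = resp.read()
            final_url = resp.geturl()                     # URL after any redirections
            ctype = resp.headers.get_content_type()       # e.g. 'text/html'
        ext = mimetypes.guess_extension(ctype) or '.bin'  # fallback for unknown types
        base = final_url.split('?')[0].rstrip('/').rsplit('/', 1)[-1] or 'index'
        if not base.endswith(ext):
            base += ext                                   # e.g. page.php -> page.php.html
        return base, data

So a URL ending in ".php" that actually serves text/html gets stored with an
.html extension, which is what lets a local browser open the mirror correctly.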

> Can be embedded in a browser (I don't think HTTrack
> can do this)

You mean, for the capture? Well, the Linux release (webhttrack) already uses a
browser as its front-end, and adapting the code for remote use is probably not
that complicated, even if this hasn't been tested yet. The biggest work would
be organizing/controlling remotely started mirrors, restarting servers, and so on.

> Can download pages/sites to strings in memory
> instead of files on hard disk (I don't think HTTrack
> can do this)

Not really - actually, not without some external plugin.

> Again, HTTrack is able to do most of this stuff
> pretty well, and I have a feeling that there isn't a
> tool out that will do all the stuff we want, but I
> feel like it was worth asking for your suggestions.

Well, efficient crawlers are quite hard to write, because of an endless number
of problems and corner cases that take *ages* to solve. Most of the coding time
spent on httrack since 1998 has gone into fixing these numerous cases and
trying to improve the mirror quality. And even with years of improvement, the
mirror is not always perfect, depending on the complexity of the remote site.

> * A somewhat unrelated issue: It seems like the
> problem of downloading multiple files (index.html
> index-2.html) is something that cannot be solved. 

Well, it could be solved by waiting for the file to be fully downloaded before
naming/storing it, and comparing its content with the already downloaded
content. But this would really slow down the mirror, IMHO, and take some extra
memory and/or temporary space.

The problem is that duplicates cannot be "guessed" easily, except AFTER you
get them. For example, index.html and INDEX.HTML can be different files
(resources), but can also be identical (with a WIN32 HTTP server, for
example); there is no way to know before downloading the URL.
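
As a rough illustration of that "compare before storing" idea (a Python sketch
under my own assumptions, not httrack's behaviour; MD5 is just one possible
fingerprint): the whole body is kept in memory, hashed, and only written if no
identical content was stored before, which is exactly the memory/temporary-space
cost mentioned above.

    import hashlib

    seen = {}   # content fingerprint -> filename already stored on disk

    def store_if_new(name, data):
        """Write 'data' under 'name' only if identical content has not been
        saved yet; otherwise return the name of the existing copy."""
        digest = hashlib.md5(data).hexdigest()
        if digest in seen:
            return seen[digest]          # duplicate body: reuse the earlier file
        with open(name, 'wb') as f:
            f.write(data)
        seen[digest] = name
        return name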

> Does anyone know an elegant way to deal with this
> problem?  We are developing a script to filter these
> duplicates out, but it is getting messy quickly.  

Well, not really. On a Un*x system, you could build an MD5 database of all
downloaded files, erase the duplicate files, and then symlink them to a unique
copy to save some space. But if you also want to patch all existing URLs (the
links inside the saved pages), this is a bit more complicated.
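
Such a post-processing pass could look roughly like this (a sketch; the
os.walk scan and relative symlinks are my own choices, not something httrack
provides):

    import hashlib
    import os

    def dedup_with_symlinks(root):
        """Replace byte-identical files under 'root' with symlinks to the
        first copy encountered, keyed by the MD5 of their content."""
        first_copy = {}
        for dirpath, _, files in os.walk(root):
            for fname in files:
                path = os.path.join(dirpath, fname)
                if os.path.islink(path):
                    continue
                with open(path, 'rb') as f:
                    digest = hashlib.md5(f.read()).hexdigest()
                if digest in first_copy:
                    os.remove(path)
                    # relative target so the mirror stays relocatable
                    os.symlink(os.path.relpath(first_copy[digest], dirpath), path)
                else:
                    first_copy[digest] = path

Note that this only saves disk space; it does not rewrite the links inside the
saved HTML, which is the "patch all existing URLs" part mentioned above.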

 