HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Questions on HTTrack
Author: Xavier Roche
Date: 12/12/2002 20:12
 
> So, I am testing this product, which is really 
> great.

Thanks :)
 
> -I thought that I had estimated the size of the site, but 
> I am watching the download process and it seems to be on 
> the 3rd passthrough of the site.

The first thing to note is that the default settings mean 
default filters, like:
+*.gif +*.jpg

Therefore, all image files linked from the site will be 
grabbed, which can take quite a while.
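
The filters can be overridden when launching the crawl from 
the command line - a rough sketch (the URL and output 
directory here are placeholders, not from the original 
question):

httrack "http://www.example.com/" -O ./mysite "-*.gif" "-*.jpg" "-*.zip"

Scan rules placed later take precedence over earlier ones, 
so explicit "-" rules like these should cancel the built-in 
image filters.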


> The first passthrough was 2 gigs, second passthrough 
> was about 2
> gigs, and now it is going through again.  Why does it do 
> this?
This is quite strange, but there might be multiple reasons:

1. cgi files (php, asp..) with bogus parameters - example: 
dynamic pages generating different links each time you 
crawl them. This tends to create infinite loops and causes 
bandwidth explosion. Filtering is generally a good 
solution, even if limiting the depth can be useful too (see 
the example command after this list).

2. multiple retries due to errors - I don't think this is 
really possible, but who knows.

3. multiple domain names which are in fact identical - 
example: www.foo.com, foo.com, www2.foo.com... these domain 
names may point to the same site, but httrack is not able 
to know it, causing all the data to be downloaded three 
times. Filters can exclude the duplicate hosts (see the 
example below).

4. "testing links" is too long - this problem should be 
partially fixed in the upcoming 3.23, and may be fixed with 
the current version using Options/MIME Types. The problem 
here is that httrack has to get some remote information 
before downloading the file, like the MIME type, to build 
the filename locally. The new keep-alive system should 
speed all this up.
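
As a rough sketch of the filtering workarounds mentioned in 
points 1 and 3 (the URL, host name and depth value are only 
placeholders - the exact patterns depend on the site):

httrack "http://www.foo.com/" -O ./mirror -r6 "-*?*" "-*www2.foo.com/*"

Here -r6 limits the mirror depth, "-*?*" skips URLs with 
CGI parameters, and "-*www2.foo.com/*" drops one of the 
duplicate hosts.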

In any case, always check the hts-log.txt file: it will 
give you precious information on missing pages, errors, and 
potential bugs in the crawl (bad javascript, server errors, 
engine bugs..)
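
For instance, under unix or cygwin a quick scan of the log 
might look like this (hts-log.txt sits in the mirror 
directory):

grep -iE "error|warning" hts-log.txt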

> How many pass-throughs will it do?
Normally only one!

> -Does the Wintrack version have a cron job feature?
No - this is on the TODO list, but it hasn't been 
implemented yet due to lack of time. You can fire the 
commandline version (httrack.exe) and use the windows 
scheduler, however. Or you can use the cygwin cron daemon.
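
A minimal sketch of a command line to point the windows 
scheduler at (the install path, URL and mirror directory 
are placeholders; --update assumes an existing mirror and a 
release that supports this shortcut):

"C:\Program Files\WinHTTrack\httrack.exe" --update "http://www.example.com/" -O C:\mirrors\example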

Anyway, the unix/linux release may fit your needs better, 
as commandline tools are quite easy to handle on such 
systems.
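
There, a nightly update can be a single crontab entry - 
again a sketch with placeholder paths and URL:

0 3 * * * httrack --update "http://www.example.com/" -O /home/user/mirrors/example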

> -Any thoughts of adding a compression feature?   

Compression is normally supported in 3.x releases, if the 
server is able to handle it.

 