> So, I am testing this product, which is really
> great.
Thanks :)
> -I thought that I had estimated the size of the site, but I
> am watching
> the download process and it seems to be on the 3rd
> passthrough of the
> site.
The first thing is that the default settings mean default
filters, like:
+*.gif +*.jpg
Therefore, all image files linked from the site will be
grabbed, which can take quite a long time.
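If you only want the pages and not the images, you can pass
the opposite filters on the command line. A minimal sketch,
assuming a placeholder URL and output path (the same rules
can also be typed into the WinHTTrack scan rules/options):

  httrack "http://www.example.com/" -O "/home/user/mirror" "-*.gif" "-*.jpg"

The quotes around the filters just keep the shell from
expanding the * wildcard.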
> The first passthrough was 2 gigs, second passthrough
> was about 2
> gigs, and now it is going through again. Why does it do
> this?
This is quite strange, but there might be multiple reasons:
1. cgi files (php, asp..) with bogus parameters - example:
dynamic pages generating different links each time you
crawl them. This tends to create infinite loops and causes
bandwidth explosions. Filtering is generally a good
solution, even if limiting the depth can be useful too (see
the sketch after this list).
2. multiple retries due to errors - I don't think this is
really possible, but who knows.
3. multiple domain names which are in fact identical -
example: www.foo.com, foo.com, www2.foo.com... these domain
names may point to the same site, but httrack has no way to
know it, causing all the data to be downloaded two or three
times.
4. "testing links" is too long - this problem should be
partially fixed in the upcoming 3.23, and may be fixed with
the current version using Options/MIME Types. the problem
here is that httrack has to get some remote information
before downloading the file, like the mime type, to build
the filename locally. The new keep-alive system should
speed up all this.
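For case 1, a possible command-line sketch - the URL, output
path, depth and filter pattern are only placeholders, so
check hts-log.txt first to see what the looping links really
look like:

  httrack "http://www.example.com/" -O "/home/user/mirror" -r6 "-*.php?*"

Here -r6 limits the mirror depth to 6 levels and the filter
drops php URLs carrying query strings; adapt both to the
actual site.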
In any case, always check the hts-log.txt file: it will
give you precious information on missing pages, errors,
and potential bugs in the crawl (bad javascript, server
errors, engine bugs..)
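A quick way to pick out the interesting lines, assuming a
unix-like shell (or cygwin) and a placeholder mirror path:

  grep -iE "error|warning" /home/user/mirror/hts-log.txt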
> How many pass-throughs will it do?
Normally only one!
> -Does the Wintrack version have a cron job feature?
No - this is on the TODO list, but it hasn't been
implemented yet due to lack of time. You can run the
commandline version (httrack.exe) from the windows
scheduler, however, or use the cygwin cron daemon.
Anyway, the unix/linux release may fit your needs better,
as commandline tools are quite easy to handle on such
systems.
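As an illustration, a cron entry (on unix/linux or under the
cygwin cron daemon) could look like this - the schedule,
binary path, URL and mirror directory are all placeholders:

  # refresh the mirror every night at 3:00
  0 3 * * * /usr/bin/httrack "http://www.example.com/" -O /home/user/mirror

The windows scheduler can launch the same httrack.exe
command line on a similar schedule.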
> -Any thoughts of adding a compression feature?
Compression is normally supported in 3.x releases, if the
server is able to handle it.
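Compression here means HTTP content encoding (gzip/deflate)
negotiated with the server. If curl is available, a quick
check against a placeholder URL:

  curl -s -D - -o /dev/null -H "Accept-Encoding: gzip" http://www.example.com/ | grep -i content-encoding

A "Content-Encoding: gzip" line in the output means the
server can send the pages compressed.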