HTTrack Website Copier
Free software offline browser - FORUM
Subject: A few problems with httrack
Author: Peter Baumann
Date: 09/19/2012 05:16
 
Hello,

The capabilities of httrack are impressive, and we are currently testing it
and thinking of buying a license for integration into our new software
product. But we are encountering some problems. I read the forum but found no
answers to them.

This is the scenario:

I have been using httrack release 3.45.4 heavily for about three weeks on
three PCs:
1) under Windows 7, both in VMware and natively in the OS
2) on a Windows Server 2008 machine, both in a VM and natively in the OS
3) on a Windows Server 2003 machine, natively only (not in a VM)

2) and 3) are hosted by Hetzner.de, a large hoster with very fast Internet
access. 1) is on an ADSL-like line.

In all cases, jobs are started by bat files, which are generated by my own
software with URLs from a database. Each bat file initially had 100 URLs; now
they only have 10 URLs per process with just two levels each. See a sample bat
file at the end of the posting.
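
For readers who want to reproduce the setup, here is a minimal sketch of how such bat files could be generated, 10 URLs per task. It is not the generator used here: the function names, the job_NNNN naming and the C:\jobs root are made up for illustration. The -g switch and the two-level depth come from the description above; -r2 and -O are the depth and output-path options as I understand the httrack command line, so check them against the httrack help output before relying on them.

```python
from typing import Dict, Iterator, List


def chunk(urls: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_bat(urls: List[str], out_dir: str, depth: int = 2) -> str:
    """Return the text of one bat file that starts a single httrack task."""
    # A literal % must be doubled inside a bat file, otherwise cmd.exe
    # treats it as the start of a variable reference.
    quoted = " ".join('"' + u.replace("%", "%%") + '"' for u in urls)
    # Assumed flags: -g (just get files), -rN (depth), -O (output path).
    return f'httrack {quoted} -g -r{depth} -O "{out_dir}"\r\n'


def build_all(urls: List[str], per_task: int = 10,
              root: str = "C:\\jobs") -> Dict[str, str]:
    """Map hypothetical bat file names to their contents, one per task."""
    jobs = {}
    for i, group in enumerate(chunk(urls, per_task)):
        name = f"job_{i:04d}"
        jobs[name + ".bat"] = build_bat(group, root + "\\" + name)
    return jobs
```

Giving every task its own output directory keeps the hts-cache and log files of different tasks apart, which makes it easier to see which task has stopped making progress.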

I just need the HTML files to be downloaded. No mirroring is needed, so no
links have to be adjusted in the local files. This is only for parsing the
HTML content for certain keywords, which is also done by self-written
software.
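
To make the keyword step concrete, this is roughly what such a scan could look like. It is only a sketch using the Python standard library, not the actual parser; the class and function names are invented, and it assumes a plain case-insensitive substring count is good enough.

```python
from html.parser import HTMLParser
from typing import Dict, Iterable, List


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._skip = 0  # nesting depth of script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def count_keywords(html: str, keywords: Iterable[str]) -> Dict[str, int]:
    """Count case-insensitive substring hits of each keyword in the text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    return {kw: text.count(kw.lower()) for kw in keywords}
```

Since only the text of each page is needed, nothing here depends on links in the downloaded files having been rewritten.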

Unfortunately, I am having some problems:

1) httrack tasks are hardly ever finished
After starting a process, a lot of data is downloaded quickly within the
first 10 to 15 hours of processing. After a couple of hours this decreases
dramatically. Switching to only 10 URLs per process made no difference. Same
behaviour in a VM as well as natively.

Unfortunately most of the processes are never finished, not even after a week
of running. Restarting the processes does not help either.

Some tasks just seem to hang for hours, with nothing in their entire
sub-directory being accessed and with the newest *.ref files in hts-cache
being 6 hours old or more.
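
The hang check described above can be automated along these lines. This is a sketch under assumptions: the 6-hour threshold is simply the figure mentioned above, the function names are invented, and "no file written recently" is only a hint that a task is stuck, not proof.

```python
import os
import time
from typing import Optional


def newest_mtime(directory: str) -> Optional[float]:
    """Return the latest modification time of any file below `directory`."""
    newest = None
    for root, _dirs, files in os.walk(directory):
        for name in files:
            try:
                mtime = os.path.getmtime(os.path.join(root, name))
            except OSError:
                continue  # file vanished between listing and stat
            if newest is None or mtime > newest:
                newest = mtime
    return newest


def is_stalled(directory: str, max_idle_hours: float = 6.0,
               now: Optional[float] = None) -> bool:
    """True if nothing below `directory` was modified within the limit."""
    now = time.time() if now is None else now
    newest = newest_mtime(directory)
    if newest is None:
        return True  # no files at all is treated as stalled
    return (now - newest) > max_idle_hours * 3600
```

Calling is_stalled on each task's sub-directory from a scheduled script would at least show which tasks to look at first.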

In the beginning, I was using the -w setting with 100 URLs per process. I then
switched over to -g and only 10 URLs per task, hoping that the overall
processing time would decrease and that the tasks would finish properly.

This is not the case. Instead, tmp files are still created and the tasks
remain unfinished even after 48 or more hours of continuous processing.

2) Overall slow performance
After the first hours of quite fast downloading, the processes seem to become
very slow or inefficient. Only relatively few more data files are downloaded
although the process is not yet finished. The task managers of the VMs are
mostly at 80 to 100% load, with enough free memory available, when processing
10 to 15 httrack tasks per VM. However, the overall task manager of the very
powerful Win7 PC (very new high-end Intel i7 3770 quad-core CPU with 16GB)
shows only about 30% with 8 VMs, so a lot of reserve capacity should be
available.

The same behaviour occurs under Windows Server 2008 (same httrack release) on
a slightly slower PC.
Running the processes directly on Windows 7 (without VMs) does not deliver any
better results. In both cases it seems that after a while the processes are
mainly busy with themselves. Same behaviour on the 2008 server and on the 2003
server (older, slower CPU).

3) Disabling hts-log files
It seems that the -I parameter does not work. I have no way of switching off
the logging function. The hts-log.txt files are always produced, no matter
what setting I choose.

4) Searchable index
I refer to this:
%I  make an searchable index for this mirror (* %I0 don't make)
(--search-index)
but unfortunately I have never succeeded in having httrack produce such an
index file. When I use this setting, nothing happens or changes with respect
to an index file.

There is also no index file (except index.html) created in the few tasks
that do finish properly.

5) Question concerning the number of URLs per task
Should I reduce the number of URLs processed per task even further (to just
1, say) in order to have the tasks finish? How many URLs per task would be
appropriate?

6) Percentage performed
When running httrack via a bat file, I am missing information equivalent to
the "Links scanned" and remaining-work figures that are available in the GUI.
If this were available, one could decide whether the current task can be
aborted because most of it has been processed. This information could be
shown in the cmd window.

Thank you very much in advance for your help and proposals.

Thank you very much for your reply, and
yours sincerely
 