HTTrack Website Copier
Free software offline browser - FORUM
Subject: * * PLEASE READ if you have performance problems
Author: Xavier Roche
Date: 06/30/2001 16:50
 
Hi,

Some people (in this forum or via email) have 
experienced performance problems when using huge URL 
lists, or when scanning big HTML files, or big 
websites with many links (often of unknown types, like 
asp pages).

If you are one of them, please read the text below, 
thanks!

The upcoming version of HTTrack (3.03) has been 
*greatly* optimized, especially in the scan routines. A 
new option has also been included, which allows you to 
set MIME types for "unknown" filetypes.
It would be great if several people could run some 
tests using the beta-3.03 release of HTTrack, to check 
the speed improvements brought by these optimizations, 
but also to ensure that the very deep changes made in 
critical macros and functions in the engine do not 
cause other problems.
Indeed, many optimizations have been done, and this is 
a potential threat to the overall engine stability!

If you are interested:
---------------------

- contact me ASAP (roche@httrack.com)

- please test the beta 3.03 at 
<http://www.httrack.com/beta.zip>
(replace the existing .exe in the WinHTTrack program 
files folder)
AND do not forget to send me any feedback, remarks, 
impressions, bug reports, or any other problems which 
may have occurred during your tests!


My preliminary tests for 3.03beta:
---------------------------------
(tested on a PIII@800/256MB)


1. Including 100,000 links using "URL list" parameter:

version 3.02 : = 11 minutes and 50 seconds
version 3.03 : < 1 second


2. Scanning a 15MB HTML file with 10,000 "html links":

version 3.02 : = 31 minutes and 10 seconds
version 3.03 : = 27 seconds


3. Besides, many people have experienced performance 
problems when scanning/downloading many cgi-generated 
pages, like "php3" or "asp" links.
This problem occurs because the engine has to test 
each script to know the MIME type, before forming the 
final destination filename.

However, in many cases, "php3" or "asp" pages are 
always "text/html", and therefore testing these files 
is just a waste of time.
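
To make the cost concrete, here is a minimal Python sketch 
(not HTTrack's actual code) of a naming step that asks the 
server for the type only when it is not already known. The 
names destination_name, known_types and probe, and the 
small extension table, are all hypothetical and only 
serve the illustration.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical table: MIME type -> extension used for the saved file.
EXT_FOR_MIME = {"text/html": "html", "image/gif": "gif", "image/jpeg": "jpg"}

def destination_name(url, known_types, probe):
    """Return a local filename for url, calling probe(url) only if needed."""
    path = urlsplit(url).path
    base, ext = posixpath.splitext(posixpath.basename(path))
    ext = ext.lstrip(".").lower()
    mime = known_types.get(ext)
    if mime is None:
        # Slow path: one extra round trip to the server for this link.
        mime = probe(url)
    # Fall back to the original extension when the type is not in the table.
    return base + "." + EXT_FOR_MIME.get(mime, ext or "bin")
```

With 10,000 script links and an empty known_types table, 
probe would run 10,000 times; with the extension listed, 
it would not run at all for those links.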

A new option, called "assume", lets you tell 
the engine that these cgi's always have the same 
types.

The syntax is:
--assume filesystemtype=mimetype/mimesubtype
[,filesystemtype=mimetype/mimesubtype[,...]]

Example:
httrack www.foo.com/bar.asp --assume php3=text/html,asp=text/html,sgif=image/gif,sjpg=image/jpeg
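
As a rough illustration of the syntax above, the value 
given to --assume can be read as a comma-separated list of 
extension=type/subtype pairs. The Python sketch below 
parses such a string into a lookup table; parse_assume is 
a hypothetical helper, not part of HTTrack, and the 
validation rules are my own assumption.

```python
def parse_assume(spec):
    """Parse 'ext=type/subtype[,ext=type/subtype...]' into a dict."""
    mapping = {}
    for item in spec.split(","):
        ext, sep, mime = item.strip().partition("=")
        main, slash, sub = mime.partition("/")
        # Reject entries that do not look like ext=type/subtype.
        if not (sep and ext and slash and main and sub):
            raise ValueError("bad --assume entry: %r" % item)
        mapping[ext.lower()] = mime.lower()
    return mapping

rules = parse_assume("php3=text/html,asp=text/html,sgif=image/gif,sjpg=image/jpeg")
print(rules["sjpg"])  # image/jpeg
```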

This feature will speed up many mirrors, for sure!

3. Scanning a 15MB HTML file with 10,000 "PHP3 links":

version 3.02 : > a few hours (interrupted)
version 3.03 with --assume php3=text/html : = 19 seconds for the scan

 