HTTrack Website Copier
Free software offline browser - FORUM
Subject: answers and suggestions requested
Author: Charles Whittle
Date: 04/28/2006 12:01
 
Howdy!
     HTTRACK is a mighty fine program, but problems exist:

I have searched this forum and found that others are also
having problems mirroring huge web sites.  As far as I can
tell, HTTRACK is locking up due to the sheer number of links
it encounters in a huge site.  Setting a low download speed
does not help, and neither does defragging or increasing
the maximum number of links in the Limits options.  I have
several questions:

1.  Does HTTRACK keep a list of pages it has already
    scanned, so that it does not rescan them when a later
    page links back to them?
2.  Does anyone know of a program like HTTRACK that will
    mirror really enormous web sites without locking up?
3.  If the option "Do Not Purge Old Files" is turned on and
    the option "Use a cache for updates" is turned off, what
    does HTTRACK do when it is re-run for a given web site?
    Does it copy over existing files, ignore them and go on
    to write files that are not yet in the mirror, or make
    another copy of each file?
4.  Suppose the scan limits option is set to go down 20
    levels and outward 3 sites; does each site encountered
    get scanned down 20 levels (assuming such levels exist)?
5.  What is the upper limit for the maximum number of links
    to be scanned?
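
Regarding the first question, my understanding is that most
crawlers keep a "visited" set so that a link pointing back to
an earlier page is simply skipped rather than fetched again.
A rough Python sketch of the idea (my own illustration, not
HTTrack's actual code) would be:

    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkParser(HTMLParser):
        """Collect href targets from an HTML page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_links=100_000):
        visited = set()         # pages already scanned
        queue = [start_url]
        while queue and len(visited) < max_links:
            url = queue.pop(0)
            if url in visited:  # a link back to an old page is skipped
                continue
            visited.add(url)
            try:
                page = urllib.request.urlopen(url, timeout=10)
                html = page.read().decode("utf-8", "replace")
            except Exception:
                continue
            parser = LinkParser()
            parser.feed(html)
            for href in parser.links:
                queue.append(urljoin(url, href))
        return visited

If HTTRACK works anything like that, the question becomes
how big that "visited" bookkeeping is allowed to grow.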

Below are the messages exchanged between Xavier and me:


Hello:

   I have been attempting to make a mirror of a VERY large
web site ( <http://community.webshots.com> ), which as of
April 22, 2006 has 365,600,739 photos (522,459 new in the
last 24 hours).  Even with the option for downloading HTML
files first, I only get about 49,000 files.  Aaron's
WebVacuum will get several hundred thousand JPEGs, but then
it gets bogged down, even with 2.6 gigabytes of RAM and a
Pentium 4 processor (3.3 gigahertz).  The large-format
images are located on the sites <http://image##.webshots.com>
and the thumbnails on the sites <http://thumb##.webshots.com>
(each # is a single digit).  If you visit the parent URL and
follow its links for a bit, you will find that it has
recursive links in abundance that lead back to earlier
pages, so the option to go down only does not seem to work
well (I could be wrong about that).  Even setting the
download speed to dial-up rates does not help.
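
   In case it helps to spell that layout out: my reading of
it, written as HTTrack-style +/- scan rules, would be
something like the sketch below.  The patterns are my guess
at the layout, not rules I know to be correct.

    # Rough sketch of the host layout above as HTTrack-style
    # scan rules; the exact patterns are guesses on my part.
    scan_rules = [
        "+community.webshots.com/*",  # HTML pages linking to the photos
        "+image*.webshots.com/*",     # full-size images (image1, image2, ...)
        "+thumb*.webshots.com/*",     # thumbnails
    ]
    print("\n".join(scan_rules))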

   This problem also occurs with other huge web sites.  I'm
in no hurry to download the entire site, so how do I set up
the options so that the entire site is mirrored?  Even
excluding all files except HTML and image files does not
make a difference.  HTTRACK locks up, the elapsed-time
counter and download displays freeze, and the program can
only be ended with the Task Manager.  I am downloading the
files onto a freshly formatted hard drive with nothing else
on it, so a lack of defragging is not the problem.  Any
help would be most welcome.  Thanks!

           Aloha,
              Charles Whittle

PS:  I am running HTTRACK on a Gateway 550GR, Pentium 4
(3.3 GHz) with 2.5 GB RAM, using Windows XP Home Edition
with all updates.  The primary drive has 250 GB and the
secondary drive has 200 GB (the location of My Web Sites,
the target folder for HTTRACK).


Xavier replied:

Wow. With embedded HTML pages and links, that means more than
a million URLs to catch. This is getting a bit big for a small
crawler like httrack, which is dimensioned by default for
100,000 links.

You first have to adjust the "maximum number of links" in the
httrack options, or else the mirror will die when it reaches
100K links.

Then, take care not to clobber the site - one million URLs IS
REALLY big for a regular server, and it may cause some
bandwidth problems.

Apart from that, there are no specific options to enable, if
everything's on the site.
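
To give an idea of the scale, here is a very rough
back-of-the-envelope; the 100 KB average photo size and the
1 Mbit/s DSL rate are assumptions of mine, not measured
figures:

    # Back-of-the-envelope only; photo size and link speed are assumed.
    photos = 365_600_739               # count quoted for April 22, 2006
    avg_photo_bytes = 100 * 1024       # assume ~100 KB per JPEG
    dsl_bytes_per_sec = 1_000_000 / 8  # assume ~1 Mbit/s downstream
    total_tb = photos * avg_photo_bytes / 1e12
    days = photos * avg_photo_bytes / dsl_bytes_per_sec / 86_400
    print(f"roughly {total_tb:.0f} TB of photos,"
          f" about {days:,.0f} days at 1 Mbit/s")

Even if those guesses are off by a factor of a few, the full
photo collection is far larger than a 200 GB drive can hold.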

Howdy, Xavier!

   Does HTTRACK go back to pages it has already scanned if
the option is set to go down only, but links farther down
lead back to previous pages anyway?  At any rate, I've
already reset the maximum-number-of-links option to
1,000,000,000, but I don't know if the program will handle
that value.  Will it?  Also, I don't think that the server
is getting clobbered; that web site is designed for lots
of traffic.  As I stated in the original post, Aaron's
WebVacuum will get several hundred thousand files before
getting bogged down due to the number of links.  I am
using a DSL connection.  If I set the option to not purge
old files, will HTTRACK try to re-download existing files?
I have unchecked the option for loading updates to a cache;
is that a problem?  <http://community.webshots.com> is just
one web site that will lock up HTTRACK, so it would be
great if you can find a way to fix this problem.  Thanks!

             Aloha,
               Charles Whittle
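
PS:  A quick back-of-the-envelope on that 1,000,000,000
setting (my own rough numbers, not anything from the HTTrack
documentation): just remembering that many URLs takes a lot
of memory.

    # Rough estimate; ~100 bytes per stored URL is an assumption.
    bytes_per_url = 100
    for max_links in (100_000, 1_000_000, 100_000_000, 1_000_000_000):
        gb = max_links * bytes_per_url / 1e9
        print(f"{max_links:>13,} links -> about {gb:,.2f} GB of bookkeeping")

So a billion-link setting would need far more than the 2.5 GB
of RAM in my machine, which may be part of why the program
freezes.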


 