HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Whole of domain copying
Author: Xavier Roche
Date: 04/22/2002 07:56
 
> Please forgive what may seem like a newbie question. 
> How do I copy the whole of a domain? eg .NZ Yes, I 
> know what I've just asked. I really do want to do 
> this. 

Arghhl..
This is quite a huge job, and you may want to build 
your own crawler, but anyway, here is some advice:

Okay, basically the first idea is to use filters, which 
is quite simple:
-* +*.nz
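
To illustrate how the two rules combine, here is a minimal 
Python sketch. It is not HTTrack's own matcher: it assumes 
that rules are tried in order and that the last matching rule 
decides, and, as a simplification, it tests patterns against 
the host name only. The function name `allowed` and the 
example hosts are made up for the illustration.

```python
from fnmatch import fnmatchcase

def allowed(host, rules):
    """Return True if `host` is accepted by the +/- rules.

    Assumption: rules are tried in order and the LAST matching rule wins.
    Simplification: patterns are matched against the host name only.
    """
    verdict = False  # sketch choice: nothing is accepted until a '+' rule matches
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if fnmatchcase(host.lower(), pattern.lower()):
            verdict = (sign == "+")
    return verdict

rules = ["-*", "+*.nz"]
for host in ("www.example.co.nz", "www.example.com"):
    print(host, allowed(host, rules))
```

Under these assumptions the order matters: writing the rules 
the other way round ("+*.nz" then "-*") would reject 
everything, because "-*" would be the last rule to match.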

Of course, you'll have to start from a "good" page, 
that is, one that contains links to pages which contain 
links to further pages, and so on.
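
Why the starting page matters can be pictured with a small 
Python sketch: a crawler only ever sees what can be reached by 
following links from the seed. The link graph, the host names 
and the `reachable` function below are invented for the 
illustration and say nothing about HTTrack's internals.

```python
from collections import deque

def reachable(seed, links, max_links=100_000):
    """Breadth-first walk of `links` (page -> list of linked pages).

    Returns pages in discovery order and stops recording new pages
    once `max_links` pages are known, mimicking a link cap.
    """
    seen = {seed}
    order = [seed]
    queue = deque([seed])
    while queue and len(order) < max_links:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen and len(order) < max_links:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order

# Hypothetical link graph: nobody links to lonely.example.nz.
links = {
    "portal.example.nz": ["a.example.nz", "b.example.nz"],
    "a.example.nz": ["c.example.nz"],
    "lonely.example.nz": ["a.example.nz"],
}
print(reachable("portal.example.nz", links))
print(reachable("c.example.nz", links))
```

Starting from the portal page finds everything that is linked 
from it, directly or indirectly; starting from a dead-end page 
finds only that page. A page nobody links to is never found, 
whatever the filters say.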

These are the filters to grab EVERYTHING. If you want 
to grab only html data, you may use 
the "Options/Experts Only/Store HTML files" option 
(this is one of the very rare cases where you can use 
these options, I think)

Then, you may use the %F option ('footer' option in 
Options/Browser ID) if you want to change or remove 
the comments in the html code - but removing them is 
generally NOT a good idea: the (invisible) comments put 
in the html code contain the original site name, the 
location and, if you want, the timestamp. This is quite 
useful when doing a mirror or an archive: people can do 
"view source" and judge the accuracy of the information 
they're reading. I suggest you select the most detailed 
comment (the one containing the date of the mirror).

Also raise the link limit above the 100,000 default: 
use "Options/Limits/Maximum Number of Links" and 
set it to the desired number of links (example: 
50,000,000 if you can - BEWARE: the higher the value, 
the more memory will be used to prebuild the table)

And finally, you may use bandwidth limiters or 
connection limiters, depending on what you are 
crawling; this may not be necessary if you are only 
scanning sites without the "heavy" files (images, and 
so on..) and many sites are scanned in parallel (which 
may be the case due to the "heap" system in httrack)

Ah, one last piece of advice: leave "follow robots.txt" 
checked (this is the default), as the use here is 
typically a "spider" use, and not only a copy of a small 
part of a website. Also ask your provider if you can 
use that much bandwidth - and how much it will cost :)
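
For those who prefer the command line, the settings discussed 
in this post can be put together roughly as follows. This is a 
Python sketch that only builds and prints the argument list; 
it does not start a crawl. The flag spellings (-p1, -%F, -#L, 
-A, -c, -s2) are the ones I know from the `httrack --help` 
text of recent releases and should be checked against your own 
version; the start URL, output directory, footer text and 
numeric values are placeholders, and `build_httrack_args` is a 
made-up helper name.

```python
import shlex

def build_httrack_args(start_url, out_dir, max_links, max_rate, sockets):
    """Assemble an httrack argument list for a whole-TLD crawl (nothing is run)."""
    return [
        "httrack", start_url,
        "-O", out_dir,        # output directory (placeholder)
        "-*", "+*.nz",        # the two filters from the post
        "-p1",                # save only html files
        "-%F", "Mirrored [from host %s [file %s [at %s]]]",  # footer comment
        f"-#L{max_links}",    # maximum number of links
        f"-A{max_rate}",      # maximum transfer rate, bytes/second
        f"-c{sockets}",       # number of simultaneous connections
        "-s2",                # always follow robots.txt
    ]

# Placeholder values; pick limits that suit your memory and bandwidth.
args = build_httrack_args("http://www.example.co.nz/", "./nz-mirror",
                          max_links=1_000_000, max_rate=25_000, sockets=4)
print(shlex.join(args))
```

The printed line is quoted for a POSIX shell, so the filters 
and the footer text survive being pasted into a terminal.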
 

