> Please forgive what may seem like a newbie question.
> How do I copy the whole of a domain? eg .NZ Yes, I
> know what I've just asked. I really do want to do
> this.
Arghh...
This is quite a huge job, and you may want to build your
own crawler, but anyway, here is some advice:
Okay, basically the first idea is to use filters; this
is quite simple:
-* +*.nz
Of course, you'll have to start from a "good" page,
that is, one that contains links to pages that
themselves contain links, and so on.
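In command-line terms, that would be something like this
(the starting URL is just a placeholder - pick any
link-rich .nz page; -O sets the output directory, and
quote the filters so the shell doesn't expand the
wildcards):
httrack "http://www.example.net.nz/" -O /mirror/nz "-*" "+*.nz"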
Those filters will grab EVERYTHING. If you want to grab
only the HTML data, you may use
the "Options/Experts Only/Store HTML files" option
(this is one of the very rare cases where using these
options makes sense, I think)
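On the command line this should be the -p "priority
mode" option - if I remember the values right, -p1
stores HTML files only (the default, I believe, is
-p3 = save all files):
httrack "http://www.example.net.nz/" -O /mirror/nz "-*" "+*.nz" -p1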
Then, you may use the %F option ('footer' option in
Options/Browser ID) if you want to change or remove
the comments inserted in the HTML code - but removing
them is generally NOT a good idea: the (invisible)
comments put in the HTML code contain the original site
name, the location and, if you want, the timestamp.
This is quite useful when doing a mirror, or archiving:
people can do "view source" and see how accurate the
information they're reading is. I suggest you select
the most accurate comment (containing the date of the
mirror).
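On the command line the footer is set with the -%F
option; passing an empty string removes the comment
entirely (again, not what I'd recommend):
-%F ""
Leaving the option out keeps the default comment, which
already contains the original host, the file location
and the date of the mirror.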
Also raise the limit on the number of links above the
100,000 default: use "Options/Limits/Maximum Number of
Links" and set it to the desired number of links
(example: 50,000,000 if you want - BEWARE: the higher
the value, the more memory will be used to prebuild the
table)
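The command-line counterpart is, if I recall the option
letter correctly, -#L:
-#L50000000
(again, watch the memory usage with a value that large)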
And finally, you may use bandwidth limiters or
connection limiters, depending on what you are
crawling; this may not be necessary if you are only
scanning sites without the "heavy" files (images, and
so on) and in parallel (which may be the case due to
the "heap" system in httrack)
Ah, a last piece of advice: leave "follow robots.txt"
checked (it is by default), as the use here is
typically a "spider" use, and not just a copy of a
small part of a website. Also ask your provider whether
you can use that much bandwidth - and how much it will
cost :)
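Putting it all together, a full command line might look
something like this (the starting URL is just a
placeholder, and you should double-check every option
letter against "httrack --help" before launching a
crawl this size):
httrack "http://www.example.net.nz/" -O /mirror/nz "-*" "+*.nz" -p1 -#L50000000 -A25000 -c4 -s2
(-s2 forces robots.txt rules to be obeyed, which should
already be the default)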