> Does HTTrack use the referrer?
Yes
> Is it possible to configure an automatic wait period between
> requests?
Yes - you can select 1 connection per second, and also
limit the number of simultaneous connections to 1 or 2.
> How can I enable HTTrack to mirror this web-site?
You may also limit the bandwidth to something like 8KB/s;
the bandwidth limiter in httrack is now very precise and
allows you to avoid bandwidth abuse.
> The above measures should shield SL from the most
> offensive scripts. What if you would still like to
> mirror/download SL? Use a friendly script such as wget
> which obeys robots.txt.
Therefore, if you leave all httrack options as they are (follow
robots.txt), and use the bandwidth limiter (1 conn/second, 1
simultaneous connection, + bw limit), this should be okay.
> If you use wget don't forget to specify a
> wait period between the requests (at least '-w 3'). Yes,
Err, 3 seconds? I'll have to implement a larger delay in
httrack (which is currently limited to 1 second) in the future - but
using a lower bandwidth limit should be okay (maybe 3 or 4KB/s).
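To see why a lower bandwidth limit can stand in for a longer wait, here is a rough back-of-the-envelope sketch. It assumes a single connection and a hypothetical average page size of 12,000 bytes (that figure is a guess for illustration, not measured on this site); the function name is made up too.

```python
def seconds_per_page(avg_page_bytes: int, max_rate_bytes_per_s: int,
                     min_gap_s: float = 1.0) -> float:
    """Approximate spacing between requests on one connection.

    The spacing is whichever is larger: the time needed to transfer
    one page at the capped rate, or the minimum gap between connections
    (1 second, the largest delay httrack currently offers).
    """
    transfer_time = avg_page_bytes / max_rate_bytes_per_s
    return max(transfer_time, min_gap_s)


AVG_PAGE = 12_000  # hypothetical average page size in bytes

for rate in (8_000, 4_000, 3_000):
    print(f"{rate} B/s -> {seconds_per_page(AVG_PAGE, rate):.1f} s per page")
# 8000 B/s -> 1.5 s per page
# 4000 B/s -> 3.0 s per page
# 3000 B/s -> 4.0 s per page
```

So under that page-size assumption, a 3 or 4KB/s limit spaces requests about as far apart as 'wget -w 3' would. With much smaller pages the 1 second floor takes over and the spacing is shorter.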
Also, please cut/paste this filter into the 'Scan rules'
options of httrack (Options/Scan rules):
-*/*?edit=* -*/*?copy=* -*/*?diff=* -*/*?header=* -*/*?info=* -*/*?search=*
-*/*?blockme=* -*/*?random=*
as the current (basic) handling of robots.txt does not
understand the URL format of this site (/?foo..). I have added
this to the todo list.
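If you want to check which URLs such a filter would skip, here is a small sketch that mimics the matching. It is only an approximation and not httrack's actual code: it assumes '*' matches any run of characters, every other character (including '?') is literal, and the last matching rule wins. The example.com URLs are placeholders.

```python
import re

# Exclusion rules as given above (a leading '-' means "do not fetch").
RULES = ("-*/*?edit=* -*/*?copy=* -*/*?diff=* -*/*?header=* "
         "-*/*?info=* -*/*?search=* -*/*?blockme=* -*/*?random=*").split()


def rule_to_regex(rule: str) -> re.Pattern:
    """Turn one '+pattern' or '-pattern' rule into a compiled regex.

    Assumption: '*' is the only wildcard; everything else is literal.
    """
    body = rule[1:]  # drop the leading '+' or '-'
    return re.compile(".*".join(re.escape(part) for part in body.split("*")))


def is_excluded(url: str, rules=RULES) -> bool:
    """Return True if the last rule matching the URL is a '-' rule."""
    excluded = False
    for rule in rules:
        if rule_to_regex(rule).fullmatch(url):
            excluded = rule.startswith("-")
    return excluded


print(is_excluded("http://www.example.com/wiki/?edit=FrontPage"))  # True
print(is_excluded("http://www.example.com/wiki/FrontPage"))        # False
```

Plain page URLs pass through, while the edit/diff/search style links that a crawler should not follow are dropped.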