> Glad to hear that your tool will enforce a hard limit for
> resources used. I hope you will keep your word.
I always keep my word. It will be merged for the next release (4 simultaneous
connections, 5 connections/second, and a maximum of 100 KB/s).
BUT an option will still be available for experts (such as researchers
building web archives, administrators who want to run load tests, and other
authorized people) to bypass these limits. This option will not be available
through the GUI, so that regular users (who don't need it) will not be
tempted to "click it". The option will be explicitly documented as "extremely
dangerous", so that users are aware of what they are doing.
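
To make the idea concrete, here is a minimal sketch (in Python, not
HTTrack's actual code) of what such a throttle could look like; the class
and function names are invented for this illustration:

    import threading
    import time

    class PoliteLimiter:
        # Caps taken from the figures above: 4 simultaneous connections,
        # 5 new connections per second. Sketch only; HTTrack's real
        # implementation is different.
        def __init__(self, max_parallel=4, max_per_second=5):
            self._slots = threading.Semaphore(max_parallel)  # concurrency cap
            self._interval = 1.0 / max_per_second            # min gap between starts
            self._lock = threading.Lock()
            self._last_start = 0.0

        def acquire(self):
            self._slots.acquire()            # wait for a free connection slot
            with self._lock:                 # space out connection starts
                wait = self._last_start + self._interval - time.monotonic()
                if wait > 0:
                    time.sleep(wait)
                self._last_start = time.monotonic()

        def release(self):
            self._slots.release()

    limiter = PoliteLimiter()

    def fetch(url):
        limiter.acquire()
        try:
            pass  # do the actual request here (a bandwidth cap would apply while reading)
        finally:
            limiter.release()
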
> I have downloaded HTTrack and tried it. I was alarmed to see
> that respecting robots.txt rules can be disabled
Yes, because offline browsers are not robots; they are software sitting
between browsers and robots, depending on how they are used. For large-scale
mirrors, they can be considered crawlers. For a small number of pages, an
offline browser is nothing more than a browser.
> setting the User-Agent string
Also mandatory, as many servers deliver different content according to the
User-Agent (IE, Mozilla... some servers won't deliver anything if the
User-Agent string doesn't match a known Internet Explorer version).
But, again, the default User-Agent clearly identifies the client as HTTrack.
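
For illustration only, here is how a client can send an explicit, honest
User-Agent (using Python's standard urllib; the string shown is a
placeholder, not HTTrack's real identifier):

    import urllib.request

    # The User-Agent below is a made-up placeholder for this sketch.
    req = urllib.request.Request(
        "http://example.com/",
        headers={"User-Agent": "ExampleMirror/1.0 (offline browser)"},
    )
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
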
> It doesn't work for "Disallow: /" (disallow everything)
That will also be fixed ("Disallow: /" will be followed by default).
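
As a sketch of how a crawler can honour robots.txt by default (using
Python's standard urllib.robotparser, not HTTrack's implementation): a
"Disallow: /" rule makes can_fetch() refuse every path, so the whole site
is skipped.

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()                              # fetch and parse robots.txt

    # With "Disallow: /" in robots.txt, can_fetch() returns False for
    # every path on the site.
    if rp.can_fetch("ExampleMirror/1.0", "http://example.com/some/page.html"):
        print("allowed by robots.txt")
    else:
        print("blocked by robots.txt")
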
> Those are not polite for a spider, either.
HTTrack is not a spider.
> Not all content is GPL or GDL.
But you can still copy it for your own use.
> No, there is no option for simultaneous connections.
But there is no default bandwidth limit, either; HTTrack already has one. And
there is no delay (even a small one) between connections, as far as I can see.