Hi,
Thanks for your comments.
You wrote:
>As you know, you can set disallow rules for various agents,
>but it sounds like you want something like the following
>which allows all agents but prevents an "httrack" agent from
>accessing /forum/
>User-agent: *
>Disallow:
>User-agent: httrack
>Disallow: /forum/
I agree, but httrack is not the only mirroring utility I'd
want to filter. I have also had such accesses from "Arachmo",
"Spider", and other mirroring utilities that come with a
fake Mozilla or IE signature.
My idea was a "meta-user agent", for example
"User-agent: *mirroring*", that would be recognised by all
mirroring utilities, in order to block them all with a
single record. It would just need all the developers of such
utilities to agree on a term and to recognize it in robots.txt.
Such a general agreement would greatly ease the job of
webmasters.
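To illustrate (hypothetical syntax, of course; no crawler
recognises this today, and "*mirroring*" is just a placeholder
token the developers would have to agree on), a single record
could then cover every such tool:

  # hypothetical convention, not part of the robots.txt standard
  User-agent: *mirroring*
  Disallow: /forum/

Any utility that identifies itself as a mirroring tool would
match that record, regardless of its individual user-agent
string.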
>But you know, I am not sure whether HTTrack has a user-
>agent string that it obeys in robots.txt, and the other
>problem is that a user can ignore robots.txt anyway.
>Sometimes I think the program should enforce a speed/
>connection limit to prevent abuse, but there ARE legitimate
>uses for hitting a server (we've done it with internal
>stress-testing...) and anyway the source code is available
>and that limit could be removed... It's a difficult
>situation.
Well, if the user explicitly chooses to ignore "robots.txt"
and/or to increase the number of simultaneous connections,
then he does it at his own risk. I will then blacklist him
from the whole site through my firewall, and he'll have to
live with it...
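Just to be concrete (a minimal sketch, assuming a Linux
firewall with iptables; the IP address is a documentation
placeholder, not a real offender), the blacklisting itself
is a one-liner:

  # drop all traffic from the offending host
  iptables -A INPUT -s 203.0.113.45 -j DROP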