> I could use robots.txt to prevent such users from mirroring
> the content of my webboard, but it would also prevent the
> Web spiders from doing so. These spiders load pages at a
> very slow rate, and I'd like them to continue indexing the
> webboard.
Very annoying when a few irresponsible users spoil things
for everyone else.
As you know, you can set disallow rules for various agents,
but it sounds like you want something like the following,
which allows all agents but blocks an "httrack" agent from
accessing /forum/:
    User-agent: *
    Disallow:

    User-agent: httrack
    Disallow: /forum/
That said, I'm not sure exactly which user-agent string
HTTrack honors in robots.txt, and the other problem is that a
user can simply ignore robots.txt anyway.
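If you want something a user can't simply opt out of, you can
refuse the agent at the server instead of asking politely in
robots.txt. A minimal sketch, assuming Apache with
mod_rewrite enabled (put it in the web root's .htaccess; the
agent string and path here are just illustrative):

    # Refuse /forum/ to any client whose User-Agent mentions HTTrack
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} HTTrack [NC]
    RewriteRule ^forum/ - [F]

Of course a determined user can spoof the User-Agent header
too, so this only stops the lazy ones.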
Sometimes I think the program should enforce a speed/
connection limit to prevent abuse, but there ARE legitimate
uses for hammering a server (we've done it ourselves for
internal stress-testing...), and anyway the source code is
available, so any such limit could simply be removed... It's
a difficult situation.
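If you'd rather throttle than block, and your Apache is new
enough (2.4+) to ship mod_ratelimit, you could cap the
bandwidth on the forum in the server config; a sketch (the
path and rate are just examples):

    # Cap responses under /forum/ at roughly 50 KiB/s per connection
    <Location "/forum">
        SetOutputFilter RATE_LIMIT
        SetEnv rate-limit 50
    </Location>

A slow, well-behaved spider will never notice the cap, but a
mirroring tool trying to slurp the whole board gets slowed to
a crawl.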