> I could use robots.txt to prevent such users from mirroring
> the content of my webboard, but it would also prevent the
> Web spiders from doing so. These spiders load pages at a
> very slow rate, and I'd like them to continue indexing the
> webboard.
Very annoying when a few irresponsible users spoil things
for everyone else.
As you know, you can set disallow rules for various agents,
but it sounds like you want something like the following,
which allows all agents but blocks an "httrack" agent from
accessing /forum/:
    User-agent: *
    Disallow:

    User-agent: httrack
    Disallow: /forum/
That said, I'm not sure exactly which user-agent string
HTTrack honors in robots.txt, and the other problem is that a
user can simply ignore robots.txt anyway.
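If you want something a user can't simply opt out of, you can
refuse the agent at the server instead of asking politely in
robots.txt. A minimal sketch, assuming Apache with
mod_rewrite enabled (put it in the web root's .htaccess; the
agent string and path here are just illustrative):

    # Refuse /forum/ to any client whose User-Agent mentions HTTrack
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} HTTrack [NC]
    RewriteRule ^forum/ - [F]

Of course a determined user can spoof the User-Agent header
too, so this only stops the lazy ones.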
Sometimes I think the program should enforce a speed/
connection limit to prevent abuse, but there ARE legitimate
uses for hammering a server (we've done it ourselves for
internal stress-testing...), and anyway the source code is
available, so any such limit could simply be removed... It's
a difficult situation.
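If you'd rather throttle than block, and your Apache is new
enough (2.4+) to ship mod_ratelimit, you could cap the
bandwidth on the forum in the server config; a sketch (the
path and rate are just examples):

    # Cap responses under /forum/ at roughly 50 KiB/s per connection
    <Location "/forum">
        SetOutputFilter RATE_LIMIT
        SetEnv rate-limit 50
    </Location>

A slow, well-behaved spider will never notice the cap, but a
mirroring tool trying to slurp the whole board gets slowed to
a crawl.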