HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: robot.txt and excessive CPU usage
Author: Leto
Date: 03/03/2005 00:53
 
> I could use robots.txt to prevent such users from mirroring
> the content of my webboard, but it would also prevent the Web
> spiders from doing so. These spiders load pages at a very slow
> rate, and I'd want them to continue indexing the webboard.

Very annoying when a few irresponsible users spoil things 
for everyone else.

As you know, you can set Disallow rules for various agents, 
but it sounds like you want something like the following, 
which allows all agents but blocks an "httrack" agent from 
accessing /forum/:

User-agent: *
Disallow:
User-agent: httrack
Disallow: /forum/

(Since a robot is supposed to obey only the most specific 
User-agent record that matches it, the "httrack" record 
overrides the "*" record above.) But you know, I'm not sure 
exactly whether HTTrack has a user-agent token that it obeys 
in robots.txt, and the other problem is that a user can tell 
it to ignore robots.txt anyway.
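If you really need to keep it out of /forum/, a server-side 
block is the only thing a user can't switch off. A rough 
sketch for Apache (assuming mod_rewrite is available, and 
assuming the abusive mirrors keep "HTTrack" somewhere in the 
default agent string, which is an assumption on my part):

RewriteEngine On
# return 403 Forbidden for any agent string containing "httrack"
RewriteCond %{HTTP_USER_AGENT} httrack [NC]
RewriteRule ^forum/ - [F]

(That would go in a .htaccess at the web root. It only stops 
the casual cases, since the agent string is trivial to 
change.)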

Sometimes I think the program should enforce a speed/
connection limit to prevent abuse, but there ARE legitimate 
uses for hitting a server hard (we've done it with internal 
stress-testing...) and anyway the source code is available, 
so that limit could simply be removed...  It's a difficult 
situation.
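
On the user side, by the way, HTTrack already ships with 
throttling options, so a considerate mirror can be run along 
these lines (flag spellings from memory, so check them 
against the documentation; the URL is just a placeholder):

httrack "http://www.example.com/forum/" -A25000 -c2 -%c1 -s2

where -A25000 caps transfer at ~25 KB/s, -c2 allows at most 
2 simultaneous connections, -%c1 allows at most 1 new 
connection per second, and -s2 means always follow 
robots.txt. The problem, of course, is getting the 
irresponsible users to actually use them.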
 