Re: robot.txt and excessive CPU usage - HTTrack Website Copier Forum

Subject: Re: robot.txt and excessive CPU usage

Author: olivier

Date: 03/04/2005 13:48

Hi,

Thanks for your comments.

You wrote:

>As you know, you can set disallow rules to various agents, 
>but it sounds like you want something like the following 
>which allows all agents but prevents an "httrack" agent from 
>accessing /forum/
>User-agent: *
>Disallow:
>User-agent: httrack
>Disallow: /forum/

I agree, but HTTrack is not the only mirroring utility I'd
want to filter. I have also seen similar accesses from "Arachmo",
"Spider", and other mirroring utilities that present a
fake Mozilla or IE signature.

My idea was a "meta-user agent", for example
"User-agent: *mirroring*", that all mirroring utilities
would recognise, so that a single rule could block them
all. It would only need the developers of such utilities
to agree on a term and to recognise it in robots.txt.
Such a general agreement would greatly ease the job of
webmasters.
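
To make the idea concrete, here is a minimal sketch of how a cooperating mirroring tool could honour such a shared token. It uses Python's standard urllib.robotparser; the token "mirroring", the function name, and the sample rules are assumptions for illustration only, since no such shared token is defined anywhere today.

```python
from urllib.robotparser import RobotFileParser

# Sample rules: everyone is allowed everywhere, but any tool that
# honours the (hypothetical) shared "mirroring" token stays out of /forum/.
ROBOTS_TXT = """\
User-agent: *
Disallow:

User-agent: mirroring
Disallow: /forum/
"""


def mirroring_tool_may_fetch(robots_txt, own_agent, path, shared_token="mirroring"):
    """Return True only if both the tool's own name and the shared token are allowed.

    `shared_token` is the hypothetical term all mirroring tools would agree on.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # A cooperating tool checks the rules twice: once under its own
    # user-agent name and once under the agreed shared token.
    return parser.can_fetch(own_agent, path) and parser.can_fetch(shared_token, path)


if __name__ == "__main__":
    print(mirroring_tool_may_fetch(ROBOTS_TXT, "HTTrack", "/forum/index.html"))
    print(mirroring_tool_may_fetch(ROBOTS_TXT, "HTTrack", "/index.html"))
```

An ordinary crawler that only checks its own name is unaffected by the "mirroring" group, so the extra rule costs the webmaster one entry and changes nothing for other agents.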

>But you know, I am not sure exactly if HTTrack has a user-
>agent string which it obeys in robots.txt, and the other 
>problem is that a user can ignore robots.txt anyway.
>Sometimes I think the program should enforce a speed/
>connection limit to prevent abuse, but there ARE legitimate 
>uses for hitting a server (we've done it with internal 
>stress-testing...) and anyway the source code is available 
>and that limit could be removed...  It's a difficult 
>situation.

Well, if the user explicitly chooses to ignore robots.txt
and/or to increase the number of simultaneous connections,
then he does so at his own risk. I will then blacklist him
from the whole site through my firewall, and he'll have to
live with it...
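
A rough sketch of how such clients could be spotted in request logs before blacklisting them. The function name, the thresholds, and the (address, timestamp) input format are assumptions for illustration only; the actual firewall rule is left out because it depends on the firewall in use.

```python
from collections import defaultdict, deque


def find_abusive_clients(requests, max_requests=100, window_seconds=10.0):
    """Return the set of client addresses that exceed the request limit.

    `requests` is an iterable of (client_address, timestamp_in_seconds)
    pairs sorted by timestamp. A client is flagged as soon as it makes
    more than `max_requests` requests within `window_seconds`.
    """
    recent = defaultdict(deque)  # per-client timestamps inside the window
    flagged = set()
    for address, timestamp in requests:
        window = recent[address]
        window.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while window and timestamp - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_requests:
            flagged.add(address)
    return flagged


if __name__ == "__main__":
    sample = [("10.0.0.1", t * 0.1) for t in range(5)] + [("10.0.0.2", 0.5)]
    sample.sort(key=lambda item: item[1])
    print(find_abusive_clients(sample, max_requests=3, window_seconds=1.0))
```

The flagged addresses would then be fed to whatever blocking mechanism the site uses.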

