HTTrack Website Copier
Free software offline browser - FORUM
Subject: robot.txt and excessive CPU usage
Author: olivier
Date: 02/27/2005 10:44

As a webmaster, I am concerned both about proper indexing of
my pages by Web search engines and about keeping my server
running smoothly. On my site, there are some CGI scripts
(webboard, etc.) that can use a lot of CPU resources when
accessed intensively.

Yesterday, I had to set up automatic blacklisting of
mirroring utilities, because an HTTrack user made over 1500
accesses to my webboard script within 10 minutes,
significantly slowing down the whole server.
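The kind of automatic blacklisting described above can be sketched as a per-IP counter over a sliding time window. This is only an illustration of the idea, not the actual mechanism used on the server; the threshold and window are taken from the incident figures above, and all names are hypothetical:

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 600   # 10-minute window, as in the incident above
THRESHOLD = 1500       # requests within the window before blacklisting


class Blacklister:
    """Track recent request timestamps per IP and flag heavy clients."""

    def __init__(self, window=WINDOW_SECONDS, threshold=THRESHOLD):
        self.window = window
        self.threshold = threshold
        self.hits = defaultdict(deque)   # ip -> timestamps of recent requests
        self.blacklist = set()

    def record(self, ip, timestamp):
        """Record one request; return True if the IP is now blacklisted."""
        q = self.hits[ip]
        q.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window
        while q and timestamp - q[0] > self.window:
            q.popleft()
        if len(q) >= self.threshold:
            self.blacklist.add(ip)
        return ip in self.blacklist
```

In practice something like this would be fed from the access log (e.g. a tail of the Apache log), and a blacklisted IP would then be denied at the web server or firewall level.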

I could use robots.txt to prevent such users from mirroring
the content of my webboard, but it would also prevent Web
spiders from doing so. These spiders load pages at a very
slow rate, and I'd like them to continue indexing the
webboard.

So, is there a way to let all Web indexing spiders index a
part of my site, while preventing that part from being
downloaded locally by a mirroring program?
I know I could specifically allow all known indexing
engines and deny all the others, but that would result in
quite a big robots.txt file, and it would have to be updated
each time a new indexing engine appears. Is there a way to
tell robots.txt to allow or deny access according to the
category of program, i.e. Indexing / HTML validation / Link
validation / "What's New" monitoring / Mirroring? I couldn't
find such a command in the robots.txt documentation.
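For reference, the per-engine workaround mentioned above would look something like this in robots.txt, with one section per allowed spider and a catch-all deny. The path and agent name are illustrative only, and this is exactly the approach whose maintenance burden is the problem:

```
# Allow a known indexing spider everywhere
User-agent: Googlebot
Disallow:

# Everyone else (including mirroring tools that honor robots.txt)
# is kept out of the CGI area (hypothetical path)
User-agent: *
Disallow: /cgi-bin/
```

Each newly appearing indexing engine would need its own `User-agent` section added by hand, since the robots.txt format matches on agent names, not on categories of program.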

Thanks in advance for any help you can provide,
