HTTrack Website Copier
Free software offline browser - FORUM
Subject: robots.txt warning/FYI
Author: Haudy Kazemi
Date: 10/23/2002 21:01
I just found this; HTTrack users who ignore robots.txt may
want to take note and reconsider their copying strategy.

"Robotcop is an open source module for webservers which 
helps webmasters prevent spiders from accessing parts of 
their sites they have marked off limits."

Robotcop enforces robots.txt 
"The Robots.txt file is a cooperative way to request that 
crawlers and spiders avoid certain parts of web sites. This 
free server module watches for spiders which read pages 
disallowed in robots.txt, and blocks all further requests 
from that IP address. It is particularly useful for 
blocking email address harvesters, while still allowing 
legitimate search engine spiders. Be sure to double-check 
your robots.txt file (use one or more of the robots.txt 
checkers), before implementing it, and to watch your server 
logs carefully. The August 2002 version (0.6) works with 
Apache 1.3 on FreeBSD and Linux."
