HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Spider identification in robot.txt
Author: Jase
Date: 05/04/2011 11:04
 
You're still disobeying the robots.txt file though, aren't you? The robots.txt
file is specifically there for webmasters to implement and stop certain bots or
programs from accessing their web site, and you're overriding that, or allowing
users to do so with the program, when they can simply select "no robots.txt
rules". Technically that is as good as allowing ANY damned spider to disobey
the robots.txt file on every web site. How would you like it if every bot out
there combed your whole web site and just ignored your robots.txt file, as it
stands here:

# robots.txt for http://www.httrack.com

User-agent: Googlebot
Allow: /
Allow: /page
Allow: /html
Allow: /src
Disallow: /*.zip$
Disallow: /*.exe$
Disallow: /*.tar.gz$
Disallow: /*.deb$

User-agent: *
Disallow:


and just indexed every single folder you had asked it not to allow bots to
index???
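(Those Disallow lines use the common wildcard extension to robots.txt: '*'
matches any run of characters and a trailing '$' pins the rule to the end of
the URL. A rough sketch in Python of how a bot might turn one rule into a
check, using a made-up rule_to_regex helper:

import re

def rule_to_regex(rule):
    # Robots.txt wildcard extension: '*' matches anything, a trailing '$'
    # anchors the end of the URL path; everything else is literal.
    pattern = re.escape(rule).replace(r"\*", ".*").replace(r"\$", "$")
    return re.compile(pattern)

# The archive paths the file asks bots to skip really are caught:
print(bool(rule_to_regex("/*.zip$").match("/download/httrack.zip")))  # True

So the rules are perfectly clear to any bot that bothers to read them.)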
There has to be somewhere on the web to report this way of getting around such
things, both on this site and with this program, and believe me, when I find
out where, you can expect to hear about it!

When a webmaster puts this in their robots.txt file:

User-agent: httrack
Disallow: /

It should damned well mean you're disallowed from crawling the web site, pal.
You may like to let your users rob folks who don't know how to protect their
domains or web content, but give the webmasters who do know the ways to stop
it some credit! Have some morals, you scum bags!
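Honouring that rule isn't even hard. Here's a minimal sketch of what a
compliant crawler would do, using nothing but Python's standard
urllib.robotparser (example.com is just a stand-in for any site carrying the
rule):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # example.com is a placeholder
rp.read()  # fetch and parse the site's robots.txt

# With "User-agent: httrack / Disallow: /" in place this returns False,
# and a well-behaved crawler stops before requesting a single page.
if not rp.can_fetch("httrack", "https://example.com/"):
    print("robots.txt disallows httrack here; do not crawl")

That's all it would take to respect the webmaster's wishes.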
 