> Disallow: /
> and people are using it, so
> why is -s2 ignoring it?
Well, this hack was implemented to bypass (too) strict
rules from websites that did not want any indexing robot to
crawl them. Generally, HTTrack is not used as a regular
robot, but as something between a spider and a regular
browser: users manually select a URL and choose to crawl
it. This was a compromise to bypass blanket "don't index at
all" rules while still respecting "don't index this because
it is not suitable" rules.
I admit this is a bit questionable, but for "regular"
(GUI) use, it is generally fine. For command-line
crawlers, I will add an s3 option for that:
sN  follow robots.txt and meta robots tags
    (0=never, 1=sometimes, *2=always, 3=always (even strict rules))