HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Allows follow robots.txt doesn't quite
Author: Xavier Roche
Date: 02/25/2004 19:52
 
> Disallow: /
> and people are using it, so
> why is -s2 ignoring it?
Well, this hack was implemented to bypass (too) strict 
rules from websites that did not want any indexing robot 
to crawl them at all. Generally, HTTrack is not used as a 
regular robot, but as something between a spider and a 
regular browser: users manually select a URL and choose 
to crawl it. The hack was a compromise: bypass blanket 
"don't index at all" rules, but respect selective "don't 
index this because it is not suitable" rules.
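
For example, with -s2 the difference looks like this (the 
paths below are only illustrations):

  # "don't index at all" rule - bypassed by -s2
  User-agent: *
  Disallow: /

  # "don't index this" rule - respected by -s2
  User-agent: *
  Disallow: /private/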

I admit this is a bit questionable, but for "regular" 
(GUI) use, it is generally fine. For command-line 
crawlers, I will add an s3 option for that:

sN  follow robots.txt and meta robots tags 
    (0=never, 1=sometimes, *2=always, 3=always (even 
    strict rules))
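
For example (the URL and output path below are only 
placeholders; -s3 does not exist in any release yet):

  httrack "http://www.example.com/" -O "/tmp/example" -s2
  httrack "http://www.example.com/" -O "/tmp/example" -s3

The first call keeps the current behaviour (a blanket 
"Disallow: /" is bypassed); the second one would obey 
even that strict rule.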

 
