> Disallow: /
> and people are using it, so
> why is -s2 ignoring it?
Well, this hack was implemented to bypass (too) strict
rules from websites that did not want any indexing robot to
crawl them. Generally, HTTrack is not used as a regular
robot, but as something between a spider and a regular
browser: users manually select a URL and choose to crawl
it. This was a compromise to bypass blanket "don't index at
all" rules while still respecting "don't index this because
it is not suitable" rules.
I admit this is a bit questionable, but for "regular"
(GUI) use, it is generally fine. For command-line
crawlers, I will add an s3 option for that:
sN  follow robots.txt and meta robots tags
    (0=never, 1=sometimes, *2=always, 3=always (even strict rules))