HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Crawl-Delay and Honored robots.txt lines
Author: Xavier Roche
Date: 04/07/2010 18:42
 
> What robots.txt lines are recognized and honored by
> httrack, including extensions?
No extensions are recognized -- only the basic robots rules (i.e. the original
Netscape specification).
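
For illustration, a minimal robots.txt using only those basic rules might look like this (the paths are hypothetical examples, not from the original post):

```
# Honored: User-agent and Disallow records (the original robots.txt rules)
User-agent: *
Disallow: /cgi-bin/        # hypothetical CPU-heavy scripts
Disallow: /search          # hypothetical dynamic page

# Extension lines such as Crawl-delay are NOT recognized and are ignored
Crawl-delay: 10
```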

For the rate limiter, the default is 25KB/s, and it cannot easily be
increased beyond 100KB/s. You may see isolated peaks due to TCP buffering, but
the average rate should be respected if the limits have not been overridden.
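
That behaviour (bursty instants, bounded long-run average) can be sketched with a simple pacing loop -- a hypothetical illustration of the general technique, not HTTrack's actual code:

```python
import time

def rate_limited_read(chunks, max_bps):
    """Yield byte chunks, sleeping so the long-run average stays at or
    below max_bps bytes per second. Individual chunks still arrive as
    bursts (like TCP buffering), but the average rate is respected."""
    start = time.monotonic()
    total = 0
    for chunk in chunks:
        total += len(chunk)
        # Earliest time at which 'total' bytes are allowed to have passed
        expected = total / max_bps
        elapsed = time.monotonic() - start
        if expected > elapsed:
            time.sleep(expected - elapsed)
        yield chunk
```

Each chunk is delivered whole (the burst), but the cumulative byte count is held back to the configured average rate.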

In any case, even this rate may cause some issues; you can, by default, put
CPU-aggressive pages in robots.txt, or, in this case, temporarily blacklist
the bad citizen that is causing these slowdowns.
 
