HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Crawl-Delay and Honored robots.txt lines
Author: Michael Mol
Date: 03/30/2010 20:40
 
iftop was showing me a pull of 110-140 Kb/s with only one connection, so the
throughput rate limiter might need some tweaking.

I could, of course, block access to some of the dynamically-generated pages
(page history is probably the worst), but I prefer not to, and it's rare that
it's a problem.
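
If I ever did go that route, a minimal sketch would look something like the
following (assuming the usual MediaWiki layout where article views live under
/wiki/ and dynamic actions such as page history go through /index.php; the
exact paths depend on how the wiki's URLs are configured):

    User-agent: *
    # Keep crawlers out of script-generated views (history, diffs, edit forms)
    Disallow: /index.php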

The site's already as optimized as I can make it, short of using Squid, but
MediaWiki requires a patched version of Squid that recognizes a few unusual
HTTP headers, and that's not something I want to mess with. All in all,
though, I didn't come for advice on configuring the server, but on what kinds
of rate-limiting directives I can put in robots.txt.
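
For reference, the sort of thing I have in mind is the non-standard
Crawl-delay line. It isn't part of the original robots exclusion standard,
only some crawlers honor it, and interpretations vary (the common reading is
a minimum number of seconds between successive requests), but a minimal
sketch would be:

    User-agent: *
    # Ask well-behaved crawlers to wait roughly 10 seconds between requests
    Crawl-delay: 10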

If someone chooses to override robots.txt and it causes problems I notice,
I can go draconian, but I'd rather give the benefit of the doubt.
 