Hello!
My crawling rate is no more than 2-3 requests per second, but I expect it to be
many times higher.
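I estimate that rate with a rough check like the one below (it assumes each file
saved under the -O directory corresponds to one completed request; GNU find):

# count files written in the last minute as a proxy for request rate
find /data/site -type f -newermt '1 minute ago' | wc -l

At 2-3 rps this prints roughly 120-180.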
It's not a problem with the site - it works fast, responds quickly, and there is
no IP/cookie/... limit or anything like that.
It's not a problem with the machine running the crawl - there are plenty of
CPU/memory/network resources available.
It's not a problem with the filters.
How can I find the bottleneck? What should I grep the logs for?
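The only thing I have tried so far is scanning the log for stalls; a rough
sketch, assuming HTTrack's hts-log.txt in the -O directory and log lines that
start with an HH:MM:SS timestamp:

# flag log lines preceded by a quiet gap of more than 5 seconds
awk '{
  n = split($1, t, ":")
  if (n == 3) {
    s = t[1]*3600 + t[2]*60 + t[3]
    if (prev != "" && s - prev > 5) print "gap before: " $0
    prev = s
  }
}' /data/site/hts-log.txt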
There are nearly 80k log messages like "Waiting for type to be known: ... .html"
over a 24-hour period - is that normal?
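For what it's worth, I count them like this (again assuming the standard
hts-log.txt in the output directory):

# count the type-detection waits over the whole run
grep -c "Waiting for type to be known" /data/site/hts-log.txt

80k messages over 86,400 seconds is about 0.9 per second, which at 2-3 rps means
roughly one such wait for every two or three requests.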
Why does the crawler need to know that .html is text/html? Could that be the
bottleneck?
Thank you.
httrack "http://site/" \
-O "/data/site" \
-r50 \
-A1000000 \
-%c50 \
-c10 \
-T30 \
-R5 \
-K4 \
-n \
-N "%h%p/%n%q_%M.%t" \
-s0 \
-F "Mozilla/5.0" \
-%F "" \
-%l "ru" \
-q \
-z \
-Z \
-v \
--debug-headers \
--disable-security-limits \
"some filters here" | |