I'm doing regular downloads of www.jp.dk, and I notice a
disturbing effect: when a page has moved, the redirect
headers (301/302) may get downloaded many times without the
page itself ever appearing. Here's the tail end of a count of
how many times specific URLs were downloaded, together with
the response code:
4 <http://www.jp.dk/madvin:aid=2351664> 301
5 <http://www.jp.dk/aar/vejretarhus> 301
48 <http://www.jp.dk/explorer/klimaet_artikler> 301
48 <http://www.jp.dk/explorer/klimaet_logbogen> 301
78 <http://www.jp.dk/vejret> 301
4679 <http://www.jp.dk/telefonsalg> 302
4722 <http://www.jp.dk/info> 302
4722 <http://www2.jp.dk/info> 302
www.jp.dk/info redirects to www2.jp.dk/info, which in turn
redirects to www2.jp.dk/info/, which is never downloaded or
even mentioned in the log. However, it seems that HTTrack
doesn't figure out this dead end and tries to download
www.jp.dk/info every time it encounters it.
www.jp.dk/telefonsalg redirects to
www2.jp.dk/abonnement/job/, for which the log only says:
  engine: save-name: local name:
  www2.jp.dk/abonnement/job/index.html ->
  www2.jp.dk/abonnement/job/index.html
These entries make up about half of the downloaded pages.
Shouldn't it be recorded somehow that the redirects have
been followed?
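To illustrate what I mean by "recorded", here is a rough Python sketch. It is not HTTrack's code and says nothing about how the engine works internally; `resolve`, `redirects`, and `cache` are hypothetical names, and the redirect table is a plain dict standing in for real HTTP responses. The idea is only that once a chain has been followed, every hop remembers its final target, so the same URL is not fetched again on the next encounter.

```python
def resolve(url, redirects, cache, max_hops=10):
    """Follow redirects starting at url and remember the final target.

    redirects: dict mapping a URL to its redirect target (hypothetical
               stand-in for the 301/302 responses seen during a crawl).
    cache:     dict filled in by this function; maps every URL on a
               followed chain to the chain's final target.
    """
    chain = []
    current = url
    # Walk the chain until we hit a URL that does not redirect,
    # or one whose final target is already known.
    while current in redirects and current not in cache:
        if current in chain or len(chain) >= max_hops:
            break  # redirect loop or chain too long: stop here
        chain.append(current)
        current = redirects[current]
    final = cache.get(current, current)
    # Record the result for every hop, so later lookups are free.
    for hop in chain:
        cache[hop] = final
    return final


# The chain from this report, as a hypothetical redirect table.
redirects = {
    "http://www.jp.dk/info": "http://www2.jp.dk/info",
    "http://www2.jp.dk/info": "http://www2.jp.dk/info/",
}
cache = {}
target = resolve("http://www.jp.dk/info", redirects, cache)
```

After the first call, `cache` holds an entry for both www.jp.dk/info and www2.jp.dk/info, so any later link to either one can be answered from the cache without another download.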
Crawler setup:
HTTrack3.31-noV6-nossl launched on Wed, 07 Apr 2004 04:30:00
at www.jp.dk
(httrack -%W receive-header=httrack-arc:get_header -%W
transfer-status=httrack-arc:dump_chunk -F "HTTrack 3.30.102
(non-archiving test version, see
www.netarkivet.dk/website/info.html)" -B -c10 -i -C2 -n -z
-a -A100000 -#L10000000 www.jp.dk )
-Lars
P.S. I find it amusing that the first line says essentially
'HTTrack launched at <site>'. Sounds like a cruise missile
or something :)