Redirects getting redownloaded many times - HTTrack Website Copier Forum

Subject: Redirects getting redownloaded many times

Author: Lars Clausen

Date: 04/07/2004 10:40

I'm doing regular downloads of www.jp.dk, and I notice a
disturbing effect: when a page has moved, the 301/302 redirect
headers may get downloaded many times without the page itself ever
appearing.  Here's the tail end of a count of how many times
specific URLs were downloaded, together with the response code:

   4 <http://www.jp.dk/madvin:aid=2351664> 301
   5 <http://www.jp.dk/aar/vejretarhus> 301
  48 <http://www.jp.dk/explorer/klimaet_artikler> 301
  48 <http://www.jp.dk/explorer/klimaet_logbogen> 301
  78 <http://www.jp.dk/vejret> 301
4679 <http://www.jp.dk/telefonsalg> 302
4722 <http://www.jp.dk/info> 302
4722 <http://www2.jp.dk/info> 302
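
For anyone who wants to produce a similar tally, here is a minimal sketch in Python. It assumes the fetches have already been reduced to lines of the form "<url> <status>"; that input format, the function name count_fetches and the sample data are assumptions made for illustration, not HTTrack's actual log layout.

```python
from collections import Counter


def count_fetches(lines):
    """Count how often each (URL, status) pair occurs.

    Each input line is assumed to look like "<url> <status>". This format
    is an assumption for illustration, not HTTrack's real log layout.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        url, status = parts
        counts[(url, status)] += 1
    # Sort ascending by count so the most re-fetched URLs end up at the tail.
    return sorted(counts.items(), key=lambda item: item[1])


# Hypothetical sample input, not taken from a real crawl log.
sample = [
    "http://www.jp.dk/info 302",
    "http://www2.jp.dk/info 302",
    "http://www.jp.dk/info 302",
    "http://www.jp.dk/vejret 301",
]

for (url, status), n in count_fetches(sample):
    print(f"{n:5d} <{url}> {status}")
```

Sorting ascending keeps the worst offenders at the bottom, which matches the "tail end" view in the list above.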

www.jp.dk/info redirects to www2.jp.dk/info, which in turn
redirects to www2.jp.dk/info/, which is never downloaded or
even mentioned in the log.  However, it seems that HTTrack
doesn't figure out this dead end and tries to download
www.jp.dk/info every time it encounters it.  
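
To make concrete what I would expect, here is a rough Python sketch of remembering where a redirect chain ended, so that a URL already known to redirect is not fetched again. This is not how HTTrack works internally; the function resolve, the redirects table and the resolved cache are all hypothetical names, and the table stands in for real HTTP 301/302 responses.

```python
def resolve(url, redirects, resolved, max_hops=10):
    """Follow redirects from url and record the outcome for every hop.

    redirects: dict mapping a URL to its redirect target (a hypothetical
               stand-in for real HTTP 301/302 responses).
    resolved:  dict shared across calls; maps each URL already followed to
               its final target, or to None if the chain was a dead end.
    """
    chain = []
    current = url
    # Stop as soon as we hit a non-redirecting URL or one we already know.
    while current in redirects and current not in resolved:
        if current in chain or len(chain) >= max_hops:
            current = None  # loop or over-long chain: treat as a dead end
            break
        chain.append(current)
        current = redirects[current]
    if current is None:
        final = None
    else:
        # Reuse a cached result if the chain ran into a known URL.
        final = resolved.get(current, current)
    for hop in chain:
        resolved[hop] = final
    return final


# The chain described above, written out as a lookup table.
redirects = {
    "http://www.jp.dk/info": "http://www2.jp.dk/info",
    "http://www2.jp.dk/info": "http://www2.jp.dk/info/",
}
resolved = {}
print(resolve("http://www.jp.dk/info", redirects, resolved))
print(resolved)
```

With a cache like this, a second encounter with www.jp.dk/info is answered from the resolved table without following the chain again, and a dead end is recorded as None rather than retried.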

www.jp.dk/telefonsalg redirects to
www2.jp.dk/abonnement/job/, for which the log only says
engine: save-name: local name:
www2.jp.dk/abonnement/job/index.html ->
www2.jp.dk/abonnement/job/index.html

These entries make up about half of the downloaded pages. 
Shouldn't it be recorded somehow that the redirects have
been followed?

Crawler setup:
HTTrack3.31-noV6-nossl launched on Wed, 07 Apr 2004 04:30:00
at www.jp.dk
(httrack -%W receive-header=httrack-arc:get_header -%W
transfer-status=httrack-arc:dump_chunk -F "HTTrack 3.30.102
(non-archiving test version, see
www.netarkivet.dk/website/info.html)" -B -c10 -i -C2 -n -z
-a -A100000 -#L10000000 www.jp.dk )

-Lars

P.S. I find it amusing that the first line says essentially
'HTTrack launched at <site>'.  Sounds like a cruise missile
or something:)
