HTTrack Website Copier
Free software offline browser - FORUM
Subject: Redirected link is broken
Author: Alain Desilets
Date: 02/09/2012 21:10
 
I am trying to crawl the first 2 levels of a site with the following command:

httrack http://www.hc-sc.gc.ca/ -O
C:\wbtwrite\prealigner_data\site_mirrors\www.hc-sc.gc.ca -v -r 2 --update -I0
-s2 +*lang* +*.js -*.jpg -*.jpeg -*.gif -*.mov -*.mp3 -*.zip -*.wav -*.mpg
-*.mpeg -*.tiff

You can try it yourself; it takes at most 5 minutes to complete.

It mostly works, except for the interlanguage links. For example, if you load
the home-accueil/text-eng.html file from the local mirror, you will see a
"Français" link (to the French version) in the upper left corner. Clicking on
it takes you to:

<http://www.hc-sc.gc.ca/cgi-bin/lang_change.pl>

i.e., it takes you outside of the mirror and onto the original server. As the
name lang_change.pl suggests, it is a script that automatically redirects to
the French version of the English page it was referred from (or the other way
around if the referrer was a French page).
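
To make the behaviour concrete, here is a minimal Python sketch of what such a referrer-based language switch might do. It is not the actual lang_change.pl (which I have not seen); the function name other_language_url and the "-eng.html" / "-fra.html" suffix convention are assumptions, the latter inferred only from the two file names mentioned in this post.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Assumed naming convention, inferred from the two file names in this post:
# English pages end in "-eng.html", French pages in "-fra.html".
SUFFIX_SWAP = {"-eng.html": "-fra.html", "-fra.html": "-eng.html"}


def other_language_url(referrer: str) -> Optional[str]:
    """Return the URL of the referring page in the other language.

    Returns None when the referrer does not follow the assumed convention.
    """
    parts = urlsplit(referrer)
    for suffix, replacement in SUFFIX_SWAP.items():
        if parts.path.endswith(suffix):
            # Swap only the language suffix; keep host and directory as-is.
            new_path = parts.path[: -len(suffix)] + replacement
            return urlunsplit(parts._replace(path=new_path))
    return None


# The redirect target depends only on the page the request came from.
target = other_language_url("http://www.hc-sc.gc.ca/home-accueil/text-eng.html")
```

Under this scheme the same script URL resolves to a different page for every referrer, so the target cannot be determined from the link's URL alone.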

The funny thing is that the French page for home-accueil/text-eng.html is
indeed in the local mirror (it's called home-accueil/text-fra.html). So
obviously, HTTrack was able to "follow" the Français link and save its
redirected content to disk. It just didn't rewrite the link in the mirrored
home-accueil/text-eng.html to point to the mirrored French file instead of
lang_change.pl on the original server.

I'm puzzled by this, because I tried to reproduce the problem by creating a
3-page web site locally on my computer that uses the same kind of
lang_change.pl approach. But when I crawl it, the interlanguage links in the
mirror are fine.

Note that on the hc-sc.gc.ca server, robots.txt prohibits access to the
cgi-bin directory where lang_change.pl resides. But I am using the -s2 option,
so it shouldn't matter, should it? And for good measure, I added a +*lang*
filter to force processing of any file whose name contains "lang".
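
For reference, here is a small Python sketch of how a crawler that honours robots.txt would treat the two URLs involved. The robots.txt content below is an assumed reconstruction (all I know is that cgi-bin is disallowed), and it uses Python's standard urllib.robotparser rather than anything HTTrack-specific.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: the only known fact is that the cgi-bin
# directory is disallowed on the server.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ASSUMED_ROBOTS_TXT.splitlines())

script_url = "http://www.hc-sc.gc.ca/cgi-bin/lang_change.pl"
page_url = "http://www.hc-sc.gc.ca/home-accueil/text-fra.html"

# A robots-respecting crawler may fetch the French page directly,
# but not the script that redirects to it.
script_allowed = parser.can_fetch("*", script_url)
page_allowed = parser.can_fetch("*", page_url)
print(script_allowed, page_allowed)
```

If these rules are applied, the script URL is off limits even though its redirect target is not. The HTTrack command-line help appears to list -sN as 0=never, 1=sometimes, 2=always follow robots.txt, so it may be worth checking which value actually turns robots.txt handling off.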

Any idea what might be the matter?
Thanks.

Alain Désilets
 