I am trying to crawl the first 2 levels of a site with the following command:
httrack http://www.hc-sc.gc.ca/ -O C:\wbtwrite\prealigner_data\site_mirrors\www.hc-sc.gc.ca -v -r 2 --update -I0 -s2 +*lang* +*.js -*.jpg -*.jpeg -*.gif -*.mov -*.mp3 -*.zip -*.wav -*.mpg -*.mpeg -*.tiff
You can try it yourself; it takes at most 5 minutes to complete.
It mostly works, except for the interlanguage links. For example, if you load
the home-accueil/text-eng.html file from the local mirror, you will see a link
Français (link to the French version) in the upper left corner. Clicking on
it takes you to:
<http://www.hc-sc.gc.ca/cgi-bin/lang_change.pl>
i.e., it takes you outside of the mirror and onto the original server. As the
name lang_change.pl suggests, it is a script that automatically redirects to
the French page corresponding to the English page it was referred from
(or the other way around if the referrer was a French page).
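
To make the behaviour concrete, here is a minimal Python sketch of what I assume a script like lang_change.pl does with the referrer. I have not seen the real script, so the function name and the suffix table are hypothetical; the only thing taken from the site is the text-eng.html / text-fra.html naming.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Hypothetical suffix table, inferred from text-eng.html / text-fra.html.
# The real lang_change.pl may use entirely different rules.
LANG_SWAP = {"-eng.": "-fra.", "-fra.": "-eng."}


def other_language_url(referrer: str) -> Optional[str]:
    """Return the URL of the same page in the other language, or None
    if the referrer path carries no recognised language suffix."""
    parts = urlsplit(referrer)
    for old, new in LANG_SWAP.items():
        if old in parts.path:
            swapped = parts.path.replace(old, new, 1)
            return urlunsplit(parts._replace(path=swapped))
    return None


if __name__ == "__main__":
    print(other_language_url("http://www.hc-sc.gc.ca/home-accueil/text-eng.html"))
    # prints http://www.hc-sc.gc.ca/home-accueil/text-fra.html
```

The point is that the redirect target depends only on the Referer header, so the same lang_change.pl URL leads to a different page depending on where it was clicked.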
The funny thing is that the French page for home-accueil/text-eng.html is
indeed on the local mirror (it's called home-accueil/text-fra.html). So
obviously, HTTrack was able to "follow" the Français link, and save its
redirected content to disk. It's just that it didn't change the actual link in
the mirrored home-accueil/text-eng.html to point to the mirrored French file
instead of the original lang_change.pl on the original server.
I'm puzzled by this, because I tried to reproduce the problem by creating a
3-page web site locally on my computer. That site uses the same kind of
lang_change.pl approach, but when I crawl it, the interlanguage links in the
mirror are fine.
Note that on the hc-sc.gc.ca server, the robots.txt prohibits access to the
cgi-bin directory where lang_change.pl resides. But I am using the -s2 option,
so it shouldn't matter, should it? And for good measure, I added a +*lang* filter
to force treatment of any file whose name contains "lang".
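
For reference, this is a small Python sketch of how a crawler that honors robots.txt would treat the two URLs involved. The robots.txt text below is an assumption (I am only reproducing the cgi-bin rule, not the server's actual file), and allowed_by_robots is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: only the cgi-bin rule mentioned above.
# The real file on the server may contain other rules.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""


def allowed_by_robots(url: str, robots_txt: str = ASSUMED_ROBOTS_TXT,
                      agent: str = "*") -> bool:
    """Return True if a robots.txt-respecting crawler may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.hc-sc.gc.ca/cgi-bin/lang_change.pl",
        "http://www.hc-sc.gc.ca/home-accueil/text-eng.html",
    ):
        print(url, allowed_by_robots(url))
```

Under that assumed rule, the lang_change.pl URL is refused while the regular HTML pages are allowed.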
Any idea what might be the matter?
Thanks.
Alain Désilets