HTTrack Website Copier
Free software offline browser - FORUM
Subject: Another strange behaviour with redirected links
Author: Alain Desilets
Date: 02/10/2012 16:30
 
If I run the following command:

httrack <http://www.ic.gc.ca/> -O
C:\wbtwrite\prealigner_data\site_mirrors\www.ic.gc.ca -v -r 9999 -c1 --update
-I0 +*lang* +*.js -*.jpg -*.jpeg -*.gif -*.mov -*.mp3 -*.zip -*.wav -*.mpg
-*.mpeg -*.tif

Then:
* The crawl finishes within 20 seconds.
* The mirror contains the URL <http://www.ic.gc.ca/ic_wp-pa.htm>, which is the
URL that the web site redirects you to when you go to <http://www.ic.gc.ca/>.
* The links contained in <http://www.ic.gc.ca/ic_wp-pa.htm> are also put in the
mirror.
* But that's where it stops. HTTrack never puts grandchildren of
<http://www.ic.gc.ca/ic_wp-pa.htm> in the mirror, even though I used -r 9999
(i.e., effectively no depth limit).

The strange thing is that if I try to crawl starting from
<http://www.ic.gc.ca/ic_wp-pa.htm> instead of <http://www.ic.gc.ca/>:

httrack <http://www.ic.gc.ca/ic_wp-pa.htm> -O
C:\wbtwrite\prealigner_data\site_mirrors\www.ic.gc.ca -v -r 9999 -c1 --update
-I0 +*lang* +*.js -*.jpg -*.jpeg -*.gif -*.mov -*.mp3 -*.zip -*.wav -*.mpg
-*.mpeg -*.tif

Then:
* The crawl goes on well beyond 20 seconds. I stopped it after 2 mins.
* The crawl does go beyond the grandchildren of
<http://www.ic.gc.ca/ic_wp-pa.htm>.

Note that I have run those two commands several times, and the pattern is
consistent. The first command never goes beyond the children, and the second
command always does. So it can't be explained by traffic conditions, or the
server deciding that I am abusing it.
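
To pin down exactly which pages the first crawl misses, one option is to diff the two mirrors on disk. Below is a minimal Python sketch, assuming each command is run with its own -O directory (the two paths in the usage note are hypothetical, not the ones from my commands above). It skips the hts-cache folder so that only mirrored content is compared.

```python
import os


def list_mirrored_files(mirror_root):
    """Return the set of file paths under mirror_root, relative to it.

    Paths use '/' separators so results compare the same way on Windows
    and Unix. The hts-cache directory (HTTrack's own cache) is skipped.
    """
    found = set()
    for dirpath, dirnames, filenames in os.walk(mirror_root):
        # Prune HTTrack's cache directory in place so os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != "hts-cache"]
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), mirror_root)
            found.add(rel.replace(os.sep, "/"))
    return found


def compare_mirrors(root_a, root_b):
    """Return (files only in root_a, files only in root_b), each sorted."""
    a = list_mirrored_files(root_a)
    b = list_mirrored_files(root_b)
    return sorted(a - b), sorted(b - a)
```

Calling compare_mirrors(r"C:\mirrors\from_root", r"C:\mirrors\from_redirect_target") returns two lists. The second list would be the files that only the crawl started from ic_wp-pa.htm managed to fetch, which should show at what link depth the first crawl gives up.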

What am I doing wrong here?
Thx.

Alain
 