If I run the following command:
httrack http://www.ic.gc.ca/ -O C:\wbtwrite\prealigner_data\site_mirrors\www.ic.gc.ca -v -r 9999 -c1 --update -I0 +*lang* +*.js -*.jpg -*.jpeg -*.gif -*.mov -*.mp3 -*.zip -*.wav -*.mpg -*.mpeg -*.tif
Then:
* The crawl finishes within 20 seconds.
* The mirror contains the URL <http://www.ic.gc.ca/ic_wp-pa.htm>, which is the
URL that the web site redirects you to when you go to <http://www.ic.gc.ca/>.
* The links contained in <http://www.ic.gc.ca/ic_wp-pa.htm> are also put in the
mirror.
* But that's where it stops. HTTrack never puts the grandchildren of
<http://www.ic.gc.ca/ic_wp-pa.htm> in the mirror, even though I used -r 9999
(i.e., effectively no depth limit).
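To make "children" and "grandchildren" concrete, here is a minimal Python sketch that computes how many link hops each page is from the start page. The link graph is hypothetical (only the two www.ic.gc.ca URLs come from the runs above; the `child-*` and `grandchild-*` names are invented for illustration), and this is not a description of how HTTrack counts depth internally.

```python
from collections import deque


def link_depths(links, start):
    """Return {url: number of link hops from start} using breadth-first search."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            # Record only the first (shortest) path found to each page.
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical link graph: the child/grandchild page names are made up.
links = {
    "http://www.ic.gc.ca/": ["http://www.ic.gc.ca/ic_wp-pa.htm"],
    "http://www.ic.gc.ca/ic_wp-pa.htm": ["child-a.htm", "child-b.htm"],
    "child-a.htm": ["grandchild-a1.htm"],
}

depths = link_depths(links, "http://www.ic.gc.ca/ic_wp-pa.htm")
for url, depth in sorted(depths.items(), key=lambda item: item[1]):
    print(depth, url)
```

In this sketch the children sit at depth 1 and the grandchildren at depth 2. What I see in the first run is that nothing at depth 2 ever lands in the mirror.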
The strange thing is that if I try to crawl starting from
<http://www.ic.gc.ca/ic_wp-pa.htm> instead of <http://www.ic.gc.ca/>:
httrack http://www.ic.gc.ca/ic_wp-pa.htm -O C:\wbtwrite\prealigner_data\site_mirrors\www.ic.gc.ca -v -r 9999 -c1 --update -I0 +*lang* +*.js -*.jpg -*.jpeg -*.gif -*.mov -*.mp3 -*.zip -*.wav -*.mpg -*.mpeg -*.tif
Then:
* The crawl goes on well beyond 20 seconds. I stopped it after 2 mins.
* The crawl does reach the grandchildren of
<http://www.ic.gc.ca/ic_wp-pa.htm> and continues beyond them.
Note that I have run those two commands several times, and the pattern is
consistent. The first command never goes beyond the children, and the second
command always does. So it can't be explained by traffic conditions, or the
server deciding that I am abusing it.
What am I doing wrong here?
Thx.
Alain