I am trying to crawl a very large web site as a series of shorter batches,
using the --continue and --max-time options. This is related to my earlier
post about new.zip being corrupted when I try to pause, then stop httrack:
<http://forum.httrack.com/readmsg/28015/28003/index.html>
But when I try this batch approach, it seems httrack keeps revisiting the same
two batches of urls over and over again. It's as if it only remembers the very
last batch that it visited, as opposed to the cumulative set of all batches it
visited before.
Is this normal, or is it a bug?
This happens consistently on every site where I try this batch approach.
However, to test it in a controlled fashion, I have set up a page on my
local web server (page_with_links_to_timed_urls.html), which contains 100
links to urls of the form:
---
<http://perlcorpusminertest:8080/crawler_testingsite_with_scripts/bin/sleep_for.cgi?nsecs=1&id=0>
<http://perlcorpusminertest:8080/crawler_testingsite_with_scripts/bin/sleep_for.cgi?nsecs=1&id=1>
etc...
---
Each link is designed so that the page takes about 1 sec to load.
I crawl this page using:
------------
httrack
<http://perlcorpusminertest:8080/crawler_testing/site_with_scripts/page_with_links_to_timed_urls.html>
-c1 -O C:\wbtwrite\prealigner_data\site_mirrors\http -v --continue --max-time
10 -I0 -s0 +*.js
-------------
The first time I run this command, I end up with:
---
new.txt, new.lst
- contains page_with_links_to_timed_urls.html and the sleep_for.cgi links with
ids=0..5
old.txt, old.lst
- don't exist
---
This is what I would expect. Then I run the httrack command a second time, and
I get:
---
new.txt, new.lst
- contains page_with_links_to_timed_urls.html and the sleep_for.cgi links with
ids=6..11
old.txt, old.lst
- contains page_with_links_to_timed_urls.html and the sleep_for.cgi links with
ids=0..5
---
This too is what I would expect. The old.* files contain what the new.* files
used to contain. The new.* files still contain the URL of the starting point
page (even though it was already retrieved in the previous batch), because
httrack still has to load it as the starting point for the second batch.
But when I invoke the httrack command a third time, things start getting
weird. I get:
---
new.txt, new.lst
- contains page_with_links_to_timed_urls.html and the sleep_for.cgi links with
ids=0..5
old.txt, old.lst
- contains page_with_links_to_timed_urls.html and the sleep_for.cgi links with
ids=6..11
---
In other words, it looks like httrack forgot that it had already done
sleep_for.cgi with ids=0..5 and it did them all over again. If I run the
command a fourth time, new.* contain 6..11 and old.* contain 0..5 (i.e. the
two sets of files get swapped).
Overall, it seems that httrack only ever remembers the urls that it retrieved
in the VERY LAST batch. That memory is NOT CUMULATIVE.
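For anyone who wants to reproduce this, here is a rough sketch (in Python) of how the ids recorded by each run could be checked. It assumes the listing files are plain text containing the full sleep_for.cgi URLs; the helper names are made up for illustration and are not part of httrack.

```python
import re
from pathlib import Path

# Matches the test URLs from this post, e.g. ...sleep_for.cgi?nsecs=1&id=7
ID_PATTERN = re.compile(r"sleep_for\.cgi\?\S*?\bid=(\d+)")


def ids_in_listing(text: str) -> list[int]:
    """Return the sorted, de-duplicated ids of sleep_for.cgi URLs found in text."""
    return sorted({int(match) for match in ID_PATTERN.findall(text)})


def ids_in_file(path: Path) -> list[int]:
    """Read a listing file (e.g. new.txt or old.lst) and return the ids it mentions.

    A missing file (such as old.txt after the very first run) yields an empty list.
    """
    if not path.exists():
        return []
    return ids_in_listing(path.read_text(encoding="utf-8", errors="replace"))
```

Calling ids_in_file on new.txt and old.txt after each run makes it easy to see whether the two batches are alternating.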
In an attempt to make it cumulative, I wrote a script which does the
following:
- Append the content of new.* to file cumulative.*
- Copy cumulative.* to old.*
- invoke httrack as above
That way, old.* is always guaranteed to contain the cumulative contents of all
the new.* files that have ever been produced by httrack.
But that didn't work. The sequence of new.* files is still exactly as before
(of course now, old.* contains a concatenation of the new.* files that are
produced, namely 0..5,6..11,0..5,6..11, etc...).
So I thought, maybe httrack uses the current new.* (not old.*) files to know
what urls were crawled in the previous batch. In other words, maybe the old.*
are not meant for httrack to know what was in the previous batch, but for the
end user to know what was done in the previous batch, once httrack has done
its thing. So, I modified my script so that it does:
- Append the content of new.* to a file cumulative.*
- Copy cumulative.* to new.* (as opposed to old.*)
- invoke httrack as above
In other words, make it so that when httrack starts, new.* contain the
cumulation of all the new.* that have been produced to date.
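For reference, the two script variants described above boil down to something like the following Python sketch. It is not my exact script: the function names, the cumulative.* file names and the assumption that only the .txt and .lst files are handled are illustrative, and the directory holding new.* and old.* is passed in rather than hard-coded.

```python
import shutil
import subprocess
from pathlib import Path

# Assumption: only the text listings are accumulated, not new.zip.
EXTENSIONS = ("txt", "lst")


def accumulate(cache_dir: Path, target: str) -> None:
    """Append new.* to cumulative.*, then copy cumulative.* over <target>.*.

    target is "old" for the first variant and "new" for the second one.
    The append is literal, so any header lines in new.* get repeated.
    """
    for ext in EXTENSIONS:
        new_file = cache_dir / f"new.{ext}"
        cumulative = cache_dir / f"cumulative.{ext}"
        if not new_file.exists():
            continue  # nothing recorded yet, e.g. before the first run
        with cumulative.open("ab") as out:
            out.write(new_file.read_bytes())
        shutil.copyfile(cumulative, cache_dir / f"{target}.{ext}")


def run_batch(cache_dir: Path, target: str, command: list[str]) -> int:
    """Do the bookkeeping, then invoke httrack with the given command line."""
    accumulate(cache_dir, target)
    return subprocess.run(command, check=False).returncode
```

Here command would be the httrack command line shown earlier, split into a list of arguments.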
That STILL doesn't work. First time I run the script, httrack fetches 0..5.
The second time, it fetches 6..12. And the third time, it fetches 0..5 again
and the fourth time, it's back to 6..12.
I am at my wits' end with this. Am I misunderstanding what --continue is
supposed to do, or is this a bug? If I am misunderstanding --continue, can
someone tell me how I could achieve my goal of crawling a web site in a series
of short batches? Or, taking a step back, how can I start httrack in the
background and stop it while guaranteeing that new.zip won't be corrupted (as
I documented in this thread):
<http://forum.httrack.com/readmsg/28015/28003/index.html>
Thx,
Alain Désilets