Well, I started to make an "HTTrack library" (the lib/
folder) but this project was discontinued due to lack
of time. For now, you can control which pages have to
be parsed, which pages have to be captured, and some
other interesting things (user-defined HTML processing,
such as some linguistic analysis).
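Just to picture the idea, here is a minimal sketch of what such user control could look like, using hypothetical callback names (this is not the actual lib/ API, only an illustration of the "decide what to parse/capture and post-process the HTML" concept):

#include <stdio.h>
#include <string.h>

/* Hypothetical callback types: one to accept/refuse a page,
   one to run user-defined processing on the downloaded HTML. */
typedef int  (*check_link_t)(const char *url);      /* 1 = take it, 0 = skip */
typedef void (*process_html_t)(const char *url,
                               const char *html, size_t len);

struct engine_callbacks {
    check_link_t   check_link;
    process_html_t process_html;   /* e.g. linguistic analysis */
};

/* Example user callbacks */
static int my_check(const char *url) {
    return strstr(url, "/private/") == NULL;   /* skip a private area */
}
static void my_process(const char *url, const char *html, size_t len) {
    printf("got %s (%zu bytes of HTML)\n", url, len);
}

int main(void) {
    struct engine_callbacks cb = { my_check, my_process };
    /* The engine would call cb.check_link() / cb.process_html()
       for every page it encounters; here we just simulate one page. */
    const char *demo = "<html>hello</html>";
    const char *url  = "http://example.com/index.html";
    if (cb.check_link(url))
        cb.process_html(url, demo, strlen(demo));
    return 0;
}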
A solution here might be a supplemental handler ("give
me more links or I'll stop") through which you would
feed all the URLs step by step.
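Again only as a sketch of the idea (such a handler does not exist yet, the names are made up): the engine would call a feeder function whenever its queue runs dry, and a NULL return would mean "stop here".

#include <stdio.h>

static const char *pending[] = {
    "http://example.com/a.html",
    "http://example.com/b.html",
    NULL
};
static int next_index = 0;

/* Called by the engine when it has no more URLs; NULL = stop. */
static const char *more_links_handler(void) {
    return pending[next_index] ? pending[next_index++] : NULL;
}

int main(void) {
    const char *url;
    /* Simulated engine loop: keep asking for links until told to stop. */
    while ((url = more_links_handler()) != NULL)
        printf("engine would now fetch: %s\n", url);
    printf("no more links, mirror stops\n");
    return 0;
}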
But a problem will result: links take some memory; for
example, with 1,000,000 links you can have more than
100 MB of memory used, because HTTrack does not only
"spider" but also "remembers": it keeps every link it
has stored so that it does not fetch the same link
twice, and it remembers where each one was saved. Each
link does not take much memory (about 100 bytes), but
of course with a huge crawl this can be a problem!
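A quick back-of-the-envelope illustration of that point (the record layout below is only an assumption about what roughly has to be remembered per link, not HTTrack's real structure):

#include <stdio.h>

struct link_record {            /* roughly what must be kept per link */
    char url[64];               /* the address, so it is not fetched twice */
    char saved_path[32];        /* where the local copy was stored */
    unsigned int hash_next;     /* duplicate-lookup chaining */
};                              /* about 100 bytes in total */

int main(void) {
    const unsigned long links = 1000000UL;
    unsigned long bytes = links * sizeof(struct link_record);
    printf("per link       : %zu bytes\n", sizeof(struct link_record));
    printf("1,000,000 links: %.1f MB\n", bytes / 1e6);
    return 0;
}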
To summarize, I think such cases might be handled
through some supplemental library hacks; maybe I'll
have to continue the library once I have finished the
3.00 :)