In this forum I read the following:
The crawler descends all "layers" breadth-first; that is, it takes ALL links that can be reached
using "one mouse click" from the primary URLs (the addresses you typed in to crawl), then all
links that can be reached using "two mouse clicks", and so on.
I guess there is something like an array that holds a list of all pages that haven't been
processed yet; the crawler takes the link that is "first" in that array, then the next,
putting newly found links at the end of the array.
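
To illustrate what I mean, here is a minimal sketch of such a first-in-first-out list in C.
The names and sizes are made up; this is not HTTrack's actual code, just the order I am guessing at:

/* Hypothetical FIFO crawl frontier: new links are appended at the tail,
 * the crawler always fetches the link at the head. */
#include <stdio.h>

#define MAX_LINKS 1024

static const char *frontier[MAX_LINKS]; /* pending links, in discovery order */
static int head = 0, tail = 0;          /* next link to fetch / next free slot */

static void enqueue(const char *url) {
    if (tail < MAX_LINKS)
        frontier[tail++] = url;         /* new links go to the end of the list */
}

static const char *dequeue(void) {
    return (head < tail) ? frontier[head++] : NULL; /* take the "first" link */
}

int main(void) {
    enqueue("http://example.com/");     /* primary URL (the one you typed in) */
    const char *url;
    while ((url = dequeue()) != NULL) {
        printf("fetching %s\n", url);
        /* parsing the fetched page would enqueue() every link found on it,
         * so pages one click away are fetched before pages two clicks away */
    }
    return 0;
}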
I'm sure this is best for most users, but for some reasons it might be better if HTTrack took a
random link out of that list and then removed it (which in practice would mean "mark it as
processed", because actually removing it is difficult). When downloading external links, that
would work around the problem of servers being pulled down by massive downloading. I guess in
some cases it might also reduce the time needed to complete the task.
Maybe some non-random order other than the one currently used could also be useful for some users.
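
And here is a sketch of the random variation I am suggesting, again with made-up names and not a
patch against HTTrack: pick a random pending link and just mark it as processed instead of
removing it from the list.

/* Hypothetical random-order frontier: links are kept in place and flagged
 * once fetched, so no element ever has to be removed from the array. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAX_LINKS 1024

static const char *frontier[MAX_LINKS];
static int processed[MAX_LINKS];        /* 1 = already fetched, 0 = pending */
static int count = 0;

static void add_link(const char *url) {
    if (count < MAX_LINKS) {
        frontier[count] = url;
        processed[count++] = 0;
    }
}

/* return the index of a random unprocessed link, or -1 if none are left */
static int pick_random(void) {
    int pending = 0;
    for (int i = 0; i < count; i++)
        pending += !processed[i];
    if (pending == 0)
        return -1;
    int k = rand() % pending;           /* choose the k-th pending link */
    for (int i = 0; i < count; i++)
        if (!processed[i] && k-- == 0)
            return i;
    return -1;
}

int main(void) {
    srand((unsigned)time(NULL));
    add_link("http://example.com/a");
    add_link("http://example.org/b");
    add_link("http://example.net/c");
    int i;
    while ((i = pick_random()) != -1) {
        printf("fetching %s\n", frontier[i]);
        processed[i] = 1;               /* mark as processed, don't remove */
    }
    return 0;
}

Because successive random picks usually land on different hosts, the requests get spread out
instead of hammering one server with a long run of consecutive downloads.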
PS:
If I had some more time, I would take the code and do this on my own, but I haven't...