I am trying to scan a site hosted on IBM/Lotus Domino.
Most of the pages have a link back to the home page,
and the home page is a frame page containing three
dynamically generated pages, one of which is just a counter.
Every time the crawler downloads a page, it re-scans this
home page, and all threads end up waiting for it. So
if the site has 10,000 pages, the home page might get
downloaded and scanned 10,000 times.
The same thing happens for each sub-area's main page.
A side effect is that every fetch increments the
counter, so the page-view count shot up a lot
after I tried to scan the site.
What should I do to avoid this? I tried limiting the depth,
but that's not a good solution.
Is there any way to add a switch so that each URL is
updated only once per update operation? Or better, a
configurable maximum number of updates per URL.
When you insert a new URL into the queue, do you check
whether that link has already been updated, is already in
the queue, or has already been processed?