> Here's a thing the Heritrix does that HTTrack doesn't seem
> to do: It will attempt to download all the parts that make
> up a page, considering them all part of the same object.
> All frame contents, pictures, embedded objects etc that are
> required to redisplay the page
Hmm, this might be an idea for a future release; something
like a more powerful "near" option.
> Would it be hard to do? Can
> plugins get the info required to pick the right pages (i.e.
> context of links)?
This would require some coding - as the current "wizard"
(which decides which link has to be downloaded) does not
even know the upstream tag name which generated the link.
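
To make the idea concrete, here is a rough sketch of a link extractor that keeps the originating tag and attribute alongside each URL. It is written in Python for brevity and is not HTTrack's code; the names `LinkCollector`, `collect_links`, `EMBEDDED` and `NAVIGATION`, and the choice of which tags count as "embedded", are all assumptions made up for illustration.

```python
from html.parser import HTMLParser

# Assumption: which (tag, attribute) pairs are treated as "embedded"
# page parts versus ordinary navigation links. Illustrative only.
EMBEDDED = {("img", "src"), ("frame", "src"), ("iframe", "src"),
            ("embed", "src"), ("script", "src")}
NAVIGATION = {("a", "href")}


class LinkCollector(HTMLParser):
    """Collects links together with the tag and attribute that produced them."""

    def __init__(self):
        super().__init__()
        self.links = []  # list of (tag, attribute, url) tuples

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        for name, value in attrs:
            if value and ((tag, name) in EMBEDDED or (tag, name) in NAVIGATION):
                self.links.append((tag, name, value))


def collect_links(html):
    """Return every recognised link in `html` with its upstream tag context."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

With records like these, the code deciding what to download could tell an `img src` apart from an `a href` even when both carry the same URL.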
This would cause another problem: how to handle an a href and an
img src that point to the same file? The a href target won't be
downloaded (it is not an "embedded" file), but the img src target
will - so what to do in this case:
<a href="foo.gif">
<img src="foo.gif">
The first link would be rewritten as an absolute link, but not
the second. This might cause annoying side effects.
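
One possible way around this, sketched below in Python, is to decide per URL rather than per link: if any reference to a file is an embedded one, the file is fetched and every reference to it is rewritten to the local copy. This is only an assumed policy for illustration, not something HTTrack currently does; `plan_links` and `EMBEDDED_ATTRS` are hypothetical names.

```python
# Assumption: only img src counts as "embedded" in this small example.
EMBEDDED_ATTRS = {("img", "src")}


def plan_links(refs):
    """Decide how each URL should be rewritten.

    refs: iterable of (tag, attribute, url) tuples.
    Returns a dict mapping url -> "local" (downloaded, rewritten to the
    mirror copy) or "absolute" (not downloaded, rewritten to the
    original absolute address).
    """
    plan = {}
    for tag, attr, url in refs:
        if (tag, attr) in EMBEDDED_ATTRS:
            # An embedded reference forces the download, overriding any
            # earlier "absolute" decision for the same URL.
            plan[url] = "local"
        else:
            # A plain link stays absolute unless something embeds the file.
            plan.setdefault(url, "absolute")
    return plan


refs = [("a", "href", "foo.gif"), ("img", "src", "foo.gif"),
        ("a", "href", "bar.html")]
print(plan_links(refs))
```

Under this policy both links to foo.gif end up pointing at the same local file, so the a href and the img src can no longer disagree.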