> > Here's a thing that Heritrix does that HTTrack doesn't seem
> > to do: It will attempt to download all the parts that make
> > up a page, considering them all part of the same object.
> > All frame contents, pictures, embedded objects etc. that are
> > required to redisplay the page.
>
> Hmm, this might be an idea for a future release;
> something like a more powerful 'near' option.
Exactly. Instead of deciding near-ness based on file type,
decide it based on context. Things that will certainly be
embedded should be considered 'near'. Whether to consider
them at the same link-depth is a trickier question. Also a
bit tricky when JavaScript links are embedded -- what if
they have further links in them?
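
To make "near-ness by context" concrete, here is a rough sketch of how a
crawler could tag each link with the element that produced it. It is not
HTTrack or Heritrix code; the names (EMBEDDED_ATTRS, LinkContextParser,
classify_links) and the choice of which tags count as embedded are
assumptions made up for illustration. It ignores JavaScript-generated
links entirely.

```python
from html.parser import HTMLParser

# Assumption: this tag/attribute list is chosen for the sketch only; it is
# not taken from HTTrack or Heritrix.
EMBEDDED_ATTRS = {
    ("img", "src"),
    ("frame", "src"),
    ("iframe", "src"),
    ("embed", "src"),
    ("script", "src"),
    ("object", "data"),
}
NAVIGATION_ATTRS = {("a", "href"), ("area", "href")}


class LinkContextParser(HTMLParser):
    """Collect (url, context) pairs; context is 'embedded' or 'navigation'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names already lowercased.
        for name, value in attrs:
            if value is None:
                continue  # attribute present without a value
            if (tag, name) in EMBEDDED_ATTRS:
                self.links.append((value, "embedded"))
            elif (tag, name) in NAVIGATION_ATTRS:
                self.links.append((value, "navigation"))


def classify_links(html):
    """Return the links found in html, in document order, with their context."""
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

With something like this, links marked 'embedded' could be fetched as part
of the page regardless of file type, and only 'navigation' links would
count against link-depth.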
> > Would it be hard to do? Can
> > plugins get the info required to pick the right pages
> > (i.e. context of links)?
> This would require some coding - as the current 'wizard'
> (which decides which link has to be downloaded) does not
> even know the upstream tag name which generated the link.
Ugh -- sounds like it'll take a while to do.
> This would cause another problem: how to handle identical
> links in a href's and img src's? a href's won't be
> downloaded (not an 'embedded' file), but img src's will -
> so what to do in this case:
>
> <a href='foo.gif'>
> <img src='foo.gif'>
>
> The first link will be rewritten as an absolute link; the
> second will not. This might cause annoying side effects?
Me, I don't care about rewriting problems. I store in ARC
format anyway, which doesn't rewrite at all.
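
For anyone who does rewrite links, one possible way around the
a href / img src clash is to decide per URL rather than per link: if any
occurrence of a URL is embedded, treat the URL as embedded everywhere.
The sketch below assumes the links have already been collected as
(url, context) pairs with context either 'embedded' or 'navigation';
merge_contexts is a hypothetical helper, not something HTTrack provides.

```python
def merge_contexts(links):
    """Map each URL to 'embedded' if any occurrence is embedded,
    otherwise to 'navigation'."""
    merged = {}
    for url, context in links:
        # An embedded occurrence always wins; a navigation occurrence
        # only sets the value when the URL has not been seen yet.
        if context == "embedded" or url not in merged:
            merged[url] = context
    return merged
```

For the foo.gif case above this yields 'embedded' for the URL, so both
links would point at the same downloaded copy and could be rewritten the
same way.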
-Lars