HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Downloading a 'page' only
Author: Lars Clausen
Date: 12/01/2003 14:26
> > Here's a thing the Heritrix does that HTTrack doesn't seem
> > to do:  It will attempt to download all the parts that make
> > up a page, considering them all part of the same object.
> > All frame contents, pictures, embedded objects etc that are
> > required to redisplay the page
> Humm, this might be an idea for a future release ;
> something like a more powerful 'near' option

Exactly.  Instead of deciding near-ness based on file type,
decide it based on context.  Things that will certainly be
embedded should be considered 'near'.  Whether to consider
them at the same link-depth is a trickier question.  Also a
bit tricky when JavaScript links are embedded -- what if
they have further links in them?
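
To make the context-based idea concrete, here is a rough sketch in Python. It is not how HTTrack or Heritrix actually do it; the tag/attribute table and the names LinkContextParser and classify_links are my own assumptions for illustration. It records the upstream tag for each link and marks the link 'near' when that tag embeds its target in the page.

```python
from html.parser import HTMLParser

# Assumed (tag, attribute) pairs whose targets are rendered as part of the page.
# This table is illustrative, not taken from any crawler's source.
EMBEDDED = {
    ("img", "src"), ("frame", "src"), ("iframe", "src"),
    ("embed", "src"), ("script", "src"), ("object", "data"),
}
# Assumed pairs that are ordinary navigation links.
NAVIGATION = {("a", "href"), ("area", "href")}


class LinkContextParser(HTMLParser):
    """Collects (url, upstream_tag, is_embedded) for each link found."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        for name, value in attrs:
            if value is None:
                continue
            if (tag, name) in EMBEDDED:
                self.links.append((value, tag, True))
            elif (tag, name) in NAVIGATION:
                self.links.append((value, tag, False))


def classify_links(html):
    """Return every link in the page together with its context."""
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Links built by JavaScript are not seen by a parser like this at all, which is the tricky part mentioned above.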
> > Would it be hard to do?  Can
> > plugins get the info required to pick the right pages (i.e.
> > context of links)?
>
> This would require some coding - as the current 'wizard' 
> (which decides which link has to be downloaded) does not 
> even know the upstream tag name which generated the link.

Ugh -- sounds like it'll take a while to do.

> This would cause another problem: how to handle a href's
> and img src's identical links?
> a href's won't be downloaded (not 'embedded' file), but img
> src's will - so what to do in this case:
> <a href='foo.gif'>
> <img src='foo.gif'>
> The first link will be rewritten as absolute link ; not the 
> second. This might cause annoying side effects?
Me, I don't care about rewriting problems.  I store in ARC
format anyway, which doesn't rewrite at all.
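
For the a href / img src case quoted above, one possible rule (my assumption, not something either tool is known to implement) is to collect every context a URL appears in and fetch it if any of those contexts is an embedding one. A small Python sketch, with ContextCollector and fetch_decisions as made-up names:

```python
from html.parser import HTMLParser


class ContextCollector(HTMLParser):
    """Records, for each URL, the set of tag names that referenced it."""

    def __init__(self):
        super().__init__()
        self.contexts = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Only the two tags from the quoted example are handled here.
        if tag == "a" and attrs.get("href"):
            self.contexts.setdefault(attrs["href"], set()).add("a")
        elif tag == "img" and attrs.get("src"):
            self.contexts.setdefault(attrs["src"], set()).add("img")


def fetch_decisions(html):
    """Map each URL to True if at least one reference embeds it."""
    collector = ContextCollector()
    collector.feed(html)
    collector.close()
    return {url: "img" in tags for url, tags in collector.contexts.items()}
```

With this rule foo.gif is fetched once, because the img src reference wins over the a href one. How the two links should then be rewritten is a separate question, and one that does not come up when storing to ARC.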
