Re: Download single page and all it's dependencies

Subject: Re: Download single page and all it's dependencies

Author: Xavier Roche

Date: 02/29/2004 23:07

> Please add this feature "--singlepage", so httrack can 
> save a single page and all its content as displayed in 
> IExplorer (images, swf, mid, frame, iframe, css, js, ...) 
> without downloading the whole site.

Yes, yes, this is on the TODO list, and at the top of it. 
But this will require some coding (I mean, some hard coding, 
as the current parser is not really designed to allow it 
right away).

> I'm looking at the source code, but I'm not able to find the 
> <a href="..."> parser... where is it? :)

htsparse.c, line 961:
p=rech_tageq(adr,"href");

The rest is a bit more complex: there's a list of 
attributes that can be parsed, with some heuristics.
The link extraction itself is a real mess (due to the "" and '' 
quotes, which may be missing, and possibly embedded trash 
such as carriage returns and so on), and looks pretty bad, 
with many, many hacks to adapt to "bogus" html.
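
To give an idea of the kind of hacks involved, here is a minimal sketch. It is not the actual htsparse.c code: match_attr_eq() and extract_link() are hypothetical names made up for this illustration, and the only assumption about the real code is that an attribute match yields a position from which the value is read. The sketch tolerates "" or '' quotes, a missing closing quote, unquoted values, and carriage returns embedded in the value.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (NOT HTTrack's rech_tageq): if `adr` starts with
   `name` (case-insensitive), then optional whitespace and '=', return the
   offset of the first value character; return 0 when there is no match. */
static size_t match_attr_eq(const char *adr, const char *name)
{
    size_t n = strlen(name);
    size_t i;

    for (i = 0; i < n; i++) {
        /* A shorter `adr` fails here on its terminating '\0'. */
        if (tolower((unsigned char)adr[i]) != tolower((unsigned char)name[i]))
            return 0;
    }
    while (isspace((unsigned char)adr[i]))
        i++;
    if (adr[i] != '=')
        return 0;
    i++;
    while (isspace((unsigned char)adr[i]))
        i++;
    return i;
}

/* Copy a possibly bogus attribute value starting at `p` into `out`
   (truncating if needed) and return the number of characters written. */
static size_t extract_link(const char *p, char *out, size_t out_size)
{
    char quote = 0;
    size_t len = 0;

    assert(out != NULL && out_size > 0);

    if (*p == '"' || *p == '\'')
        quote = *p++;                 /* remember which quote opened the value */

    for (; *p != '\0'; p++) {
        if (quote ? (*p == quote) : isspace((unsigned char)*p))
            break;                    /* normal end of the value */
        if (*p == '>')
            break;                    /* end of tag: covers a missing closing quote */
        if (*p == '\r' || *p == '\n' || *p == '\t')
            continue;                 /* drop trash embedded in a quoted value */
        if (len + 1 < out_size)
            out[len++] = *p;
    }
    out[len] = '\0';
    return len;
}
```

A caller would first test the offset returned by match_attr_eq(adr, "href") and, when it is non-zero, call extract_link(adr + offset, buf, sizeof buf). Stopping at '>' inside a quoted value is a deliberate simplification: it rescues tags with a missing closing quote, at the cost of truncating the rare URL that really contains '>'.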

> In htsparse.c I found this piece of code:
> p=rech_tageq(adr,"href");

Yes, yes

> but it doesn't affect subpages scanning...

If the tag matches, "p" triggers the link extraction.

For the subpage scanning, the heuristics are located in 
htswizard.c (be prepared to die, the algorithms are a 
little bit complex... or maybe a little bit messy).
