> Please add this feature "--singlepage", so httrack can
> save a single page and all its content as displayed in
> IExplorer (images, swf, mid, frame, iframe, css, js, ...)
> without downloading the whole site.
Yes, yes, this is on the TODO list - and at the top of it.
But this will require some coding (I mean, some hard coding,
as the current parser is not really designed to allow it
directly).
> I'm looking the source code, but I'm not able to find the
> <a href="..."> parser... where is it? :)
htsparse.c, line 961:
p=rech_tageq(adr,"href");
The rest is a bit more complex - there's a list of
attributes that can be parsed, with some heuristics.
The link extraction itself is a real mess (due to "" and ''
quotes, possibly missing, and possibly with embedded trash
such as carriage returns and so on), and looks pretty bad,
with many, many hacks to adapt to "bogus" html.
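
To give an idea of the kind of tolerance described above, here is a minimal sketch. It is NOT the code from htsparse.c and it does not use rech_tageq(); the names extract_attr() and starts_with_nocase() are hypothetical, made up for this example. It only shows one way to pull an attribute value out of a tag when the quotes may be "", '', or missing, and when the value may contain stray carriage returns.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not HTTrack code): case-insensitive prefix test. */
static int starts_with_nocase(const char *s, const char *prefix) {
    while (*prefix) {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/*
 * Hypothetical sketch (not HTTrack code): copy the value of attribute
 * `name` found in `tag` into `out` (truncated to out_size - 1 bytes).
 * Returns 1 if the attribute was found, 0 otherwise.
 * Accepts "...", '...' and unquoted values, stops at '>' when the
 * closing quote is missing, and drops CR/LF/TAB inside quoted values.
 */
int extract_attr(const char *tag, const char *name,
                 char *out, size_t out_size) {
    size_t name_len = strlen(name);
    const char *p;

    if (out_size == 0 || name_len == 0)
        return 0;

    for (p = tag; *p; p++) {
        const char *q;
        char quote = 0;
        size_t n = 0;

        /* The attribute name must start the string or follow whitespace. */
        if (p != tag && !isspace((unsigned char)p[-1]))
            continue;
        if (!starts_with_nocase(p, name))
            continue;

        /* Expect optional spaces, '=', optional spaces. */
        q = p + name_len;
        while (isspace((unsigned char)*q)) q++;
        if (*q != '=')
            continue;
        q++;
        while (isspace((unsigned char)*q)) q++;

        /* Remember which quote character opened the value, if any. */
        if (*q == '"' || *q == '\'')
            quote = *q++;

        for (; *q; q++) {
            if (*q == '>')
                break;                      /* end of tag (quote may be missing) */
            if (quote && *q == quote)
                break;                      /* proper closing quote */
            if (!quote && isspace((unsigned char)*q))
                break;                      /* unquoted value ends at whitespace */
            if (*q == '\r' || *q == '\n' || *q == '\t')
                continue;                   /* embedded trash: skip it */
            if (n + 1 < out_size)
                out[n++] = *q;
        }
        out[n] = '\0';
        return 1;
    }
    return 0;
}
```

Calling extract_attr("<a href='a.html'>", "href", buf, sizeof buf) fills buf with a.html. The sketch is deliberately naive: it can be fooled by the attribute name appearing inside another attribute's quoted value, and it treats '>' as the end of the value even inside quotes. A real parser needs many more special cases than this.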
> into htsparse.c I found this piece of code:
> p=rech_tageq(adr,"href");
Yes, yes
> but it doesn't affect subpages scanning...
If it matches, "p" triggers the link extraction.
For the subpage scanning, the heuristics are located in
htswizard.c (be prepared to die, the algorithms are a
little bit complex... or maybe a little bit messy).