Thanks for the reply! I've gotten a prototype up this
afternoon, but now I have a bunch of questions (of course :))
> > We're looking at the option of using HTTrack to harvest
> > sites, but to dump the results into the ARC format also
> > used by the Internet Archive. Has anybody done any work in
> > this direction previously?
> No - but I guess that archiving would not change the
> internal page link format, would it? Then it would not be
> too difficult to do: you have to fetch headers and related
> data (URL, size..)
Indeed, the ARC format doesn't do any transformation of the
page, but dumps it verbatim into the file.
> > If not, what functions would I need
> > to look at to get headers+body
>
> You will have to use the external callbacks ; see
> <http://www.httrack.com/html/plug.html> for more information.
> The idea is not to recompile httrack at all, but to compile
> very small standalone object files, that will be probed by
> httrack on startup using the --external option.
>
> In your case, the best way would be to wrap two essential
> callbacks:
>
> - 'receive-header'
That one isn't included by default, but only when HTS_ANALYSTE
is != 0 -- and setting that kills main(). I added an
hts_wrap for it and it worked.
Now for the questions:

- Can I trust that the htsblk passed to receive-header is not
  deallocated before the corresponding call to transfer-status?
  Is it perhaps the same htsblk in both cases?
- How can I get the IP corresponding to the address?
- Where's the actual content passed to transfer-status?
- Is the size field in lien_back the one sent by the server?
Thanks in advance,
-Lars