Thanks for the reply! I've gotten a prototype up this
afternoon, but now I have a bunch of questions (of course :))
> > We're looking at the option of using HTTrack to harvest
> > sites, but to dump the results into the ARC format also
> > used by the Internet Archive. Has anybody done any work in
> > this direction previously?
> No - but I guess that archiving would not change the
> internal page link format, would it? Then it would not be
> too difficult to do: you have to fetch headers and related
> data (URL, size..)
Indeed, the ARC format doesn't do any transformation of the
page, but dumps it verbatim into the file.
> > If not, what functions would I need
> > to look at to get headers+body
>
> You will have to use the external callbacks ; see
> <http://www.httrack.com/html/plug.html> for more information.
> The idea is not to recompile httrack at all, but to compile
> very small standalone object files, that will be probed by
> httrack on startup using the --external option.
>
> In your case, the best way would be to wrap two essential
> callbacks:
>
> - 'receive-header'
That one isn't included by default, but only when HTS_ANALYSTE
is != 0 -- and setting that kills main(). I added an
hts_wrap for it and it worked.
Now for the questions:

- Can I trust that the htsblk passed to receive-header is not
  deallocated before the corresponding call to transfer-status?
  Is it perhaps the same htsblk in both cases?
- How can I get the IP corresponding to the address?
- Where's the actual content passed to transfer-status?
- Is the size field in lien_back the one sent by the server?
Thanks in advance,
-Lars