> We're looking at the option of using HTTrack to harvest
> sites, but to dump the results into the ARC format also
> used by the Internet Archive. Has anybody done any work in
> this direction previously?
No - but I guess that archiving would not change the
internal page link format, would it? In that case it would
not be too difficult to do: you have to fetch the headers and
related data (URL, size, and so on).
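For the ARC side, if I remember the v1 layout correctly, each
record is a single line giving URL, IP address, fetch date,
content type and length, followed by the raw HTTP headers and
body; something like the sketch below (the field order is from
memory, so check it against the Internet Archive spec):

#include <stdio.h>
#include <string.h>
#include <time.h>

/* Append one record to an open ARC file. The "URL IP date content-type
   length" record line is my recollection of ARC v1 -- verify it. */
static void arc_write_record(FILE *arc, const char *url, const char *ip,
                             const char *content_type,
                             const char *headers, size_t headers_len,
                             const char *body, size_t body_len) {
  char date[15];
  time_t now = time(NULL);
  strftime(date, sizeof(date), "%Y%m%d%H%M%S", gmtime(&now));
  fprintf(arc, "%s %s %s %s %lu\n", url, ip, date, content_type,
          (unsigned long)(headers_len + body_len));
  fwrite(headers, 1, headers_len, arc);    /* raw HTTP response headers */
  fwrite(body, 1, body_len, arc);          /* response body */
  fputc('\n', arc);                        /* blank line between records */
}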
> If not, what functions would I need
> to look at to get headers+body
You will have to use the external callbacks; see
<http://www.httrack.com/html/plug.html> for more information.
The idea is not to recompile httrack at all, but to compile
very small standalone object files that will be probed by
httrack on startup using the --external option.
In your case, the best way would be to wrap two essential
callbacks (see the sketch after this list):
- "receive-header"
Called when HTTP headers are received from the remote
server. The buff buffer contains the text headers, adr and fil
the URL, and referer_adr and referer_fil the referer URL.
The incoming structure contains all information related to
the current slot.
Return value: 1 if the mirror can continue, 0 if the mirror
must be aborted.
Prototype:
int (* myfunction)(char* buff, char* adr, char* fil, char*
referer_adr, char* referer_fil, htsblk* incoming);
- "transfer-status"
Called when a file has been processed (downloaded, updated,
or error)
return value: must return 1
Prototype:
int (* myfunction)(lien_back* back);
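To make this concrete, here is a minimal sketch of the two
wrappers, built only from the prototypes above; the function
names, the include, and the lien_back field names I read
(url_adr, url_fil) are my own assumptions, so check the
HTTrack headers for the real declarations:

#include <stdio.h>
#include "htscore.h"  /* assumed: declares htsblk and lien_back in the sources */

/* "receive-header" wrapper: log the raw headers for this URL */
int mywrap_receive_header(char* buff, char* adr, char* fil,
                          char* referer_adr, char* referer_fil,
                          htsblk* incoming) {
  fprintf(stderr, "headers for %s%s:\n%s\n", adr, fil, buff);
  /* stash buff keyed by adr+fil here (see the buffering sketch below) */
  return 1;  /* 1 = continue the mirror */
}

/* "transfer-status" wrapper: called once a file has been processed */
int mywrap_transfer_status(lien_back* back) {
  /* url_adr and url_fil are assumed member names -- verify in the sources */
  fprintf(stderr, "done: %s%s\n", back->url_adr, back->url_fil);
  /* look up the saved headers for this URL and emit the ARC record here */
  return 1;  /* must return 1 */
}

Something like "gcc -shared -fPIC myheaders.c -o myheaders.so"
against the HTTrack include directory should give an object you
can probe with --external (the exact compile line is a guess;
callbacks-example.c shows the canonical way).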
The most "difficult" part is keeping the headers from
"receive-header" in a buffer until the corresponding URL is
seen in "transfer-status" (using a hashtable, which can be
grabbed from htsinthash.*).
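I do not remember the htsinthash.* interface offhand, so here
is a stand-in sketch that buffers the headers in a plain linked
list keyed by the full URL (adr+fil); swapping in the real
hashtable from htsinthash.* should only touch these two
functions:

#include <stdlib.h>
#include <string.h>

/* maps a full URL (adr + fil) to the header block from "receive-header" */
typedef struct hdr_entry {
  char *url;                /* adr concatenated with fil */
  char *headers;            /* copy of the buff argument */
  struct hdr_entry *next;
} hdr_entry;

static hdr_entry *hdr_list = NULL;

/* called from the "receive-header" wrapper */
static void headers_store(const char *adr, const char *fil, const char *buff) {
  hdr_entry *e = malloc(sizeof(*e));
  e->url = malloc(strlen(adr) + strlen(fil) + 1);
  strcpy(e->url, adr);
  strcat(e->url, fil);
  e->headers = strdup(buff);
  e->next = hdr_list;
  hdr_list = e;
}

/* called from the "transfer-status" wrapper; caller frees the result */
static char *headers_take(const char *adr, const char *fil) {
  char *url = malloc(strlen(adr) + strlen(fil) + 1);
  hdr_entry **p;
  char *h = NULL;
  strcpy(url, adr);
  strcat(url, fil);
  for (p = &hdr_list; *p; p = &(*p)->next) {
    if (strcmp((*p)->url, url) == 0) {
      hdr_entry *e = *p;
      h = e->headers;
      *p = e->next;
      free(e->url);
      free(e);
      break;
    }
  }
  free(url);
  return h;  /* NULL if no headers were recorded for this URL */
}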
Don't hesitate to ask me for more details; the system
should be straightforward: a single .c file with several
functions. Also see the callbacks-example.c file!