> We're looking at the option of using HTTrack to harvest
> sites, but to dump the results into the ARC format also
> used by the Internet Archive. Has anybody done any work in
> this direction previously?
No - but I guess that archiving would not change the
internal page link format, would it? In that case it would
not be too difficult to do: you have to fetch the headers and
related data (URL, size, and so on).
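For the ARC side, if I remember the v1 layout correctly, each
record is a single line giving URL, IP address, fetch date,
content type and length, followed by the raw HTTP headers and
body; something like the sketch below (the field order is from
memory, so check it against the Internet Archive spec):

#include <stdio.h>
#include <string.h>
#include <time.h>

/* Append one record to an open ARC file. The "URL IP date content-type
   length" record line is my recollection of ARC v1 -- verify it. */
static void arc_write_record(FILE *arc, const char *url, const char *ip,
                             const char *content_type,
                             const char *headers, size_t headers_len,
                             const char *body, size_t body_len) {
  char date[15];
  time_t now = time(NULL);
  strftime(date, sizeof(date), "%Y%m%d%H%M%S", gmtime(&now));
  fprintf(arc, "%s %s %s %s %lu\n", url, ip, date, content_type,
          (unsigned long)(headers_len + body_len));
  fwrite(headers, 1, headers_len, arc);    /* raw HTTP response headers */
  fwrite(body, 1, body_len, arc);          /* response body */
  fputc('\n', arc);                        /* blank line between records */
}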
> If not, what functions would I need
> to look at to get headers+body
You will have to use the external callbacks; see
<http://www.httrack.com/html/plug.html> for more information.
The idea is not to recompile httrack at all, but to compile
very small standalone object files that will be probed by
httrack on startup using the --external option.
In your case, the best way would be to wrap two essential
callbacks (see the sketch after this list):
- "receive-header"
Called when HTTP headers are received from the remote
server. The buff buffer contains the text headers, adr and fil
the URL, and referer_adr and referer_fil the referer URL.
The incoming structure contains all information related to
the current slot.
Return value: 1 if the mirror can continue, 0 if the mirror
must be aborted.
Prototype:
int (* myfunction)(char* buff, char* adr, char* fil, char*
referer_adr, char* referer_fil, htsblk* incoming);
- "transfer-status"
Called when a file has been processed (downloaded, updated,
or error)
return value: must return 1
Prototype:
int (* myfunction)(lien_back* back);
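To make this concrete, here is a minimal sketch of the two
wrappers, built only from the prototypes above; the function
names, the include, and the lien_back field names I read
(url_adr, url_fil) are my own assumptions, so check the
HTTrack headers for the real declarations:

#include <stdio.h>
#include "htscore.h"  /* assumed: declares htsblk and lien_back in the sources */

/* "receive-header" wrapper: log the raw headers for this URL */
int mywrap_receive_header(char* buff, char* adr, char* fil,
                          char* referer_adr, char* referer_fil,
                          htsblk* incoming) {
  fprintf(stderr, "headers for %s%s:\n%s\n", adr, fil, buff);
  /* stash buff keyed by adr+fil here (see the buffering sketch below) */
  return 1;  /* 1 = continue the mirror */
}

/* "transfer-status" wrapper: called once a file has been processed */
int mywrap_transfer_status(lien_back* back) {
  /* url_adr and url_fil are assumed member names -- verify in the sources */
  fprintf(stderr, "done: %s%s\n", back->url_adr, back->url_fil);
  /* look up the saved headers for this URL and emit the ARC record here */
  return 1;  /* must return 1 */
}

Something like "gcc -shared -fPIC myheaders.c -o myheaders.so"
against the HTTrack include directory should give an object you
can probe with --external (the exact compile line is a guess;
callbacks-example.c shows the canonical way).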
The most "difficult" part is keeping the headers from
"receive-header" in a buffer until the corresponding URL is
seen in "transfer-status" (using a hashtable, which can be
grabbed from htsinthash.*).
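I do not remember the htsinthash.* interface offhand, so here
is a stand-in sketch that buffers the headers in a plain linked
list keyed by the full URL (adr+fil); swapping in the real
hashtable from htsinthash.* should only touch these two
functions:

#include <stdlib.h>
#include <string.h>

/* maps a full URL (adr + fil) to the header block from "receive-header" */
typedef struct hdr_entry {
  char *url;                /* adr concatenated with fil */
  char *headers;            /* copy of the buff argument */
  struct hdr_entry *next;
} hdr_entry;

static hdr_entry *hdr_list = NULL;

/* called from the "receive-header" wrapper */
static void headers_store(const char *adr, const char *fil, const char *buff) {
  hdr_entry *e = malloc(sizeof(*e));
  e->url = malloc(strlen(adr) + strlen(fil) + 1);
  strcpy(e->url, adr);
  strcat(e->url, fil);
  e->headers = strdup(buff);
  e->next = hdr_list;
  hdr_list = e;
}

/* called from the "transfer-status" wrapper; caller frees the result */
static char *headers_take(const char *adr, const char *fil) {
  char *url = malloc(strlen(adr) + strlen(fil) + 1);
  hdr_entry **p;
  char *h = NULL;
  strcpy(url, adr);
  strcat(url, fil);
  for (p = &hdr_list; *p; p = &(*p)->next) {
    if (strcmp((*p)->url, url) == 0) {
      hdr_entry *e = *p;
      h = e->headers;
      *p = e->next;
      free(e->url);
      free(e);
      break;
    }
  }
  free(url);
  return h;  /* NULL if no headers were recorded for this URL */
}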
Don't hesitate to ask me for more details; the system
should be straightforward: a single .c file with several
functions. Also see the callbacks-example.c file!