HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Notes regarding callback functions
Author: Xavier Roche
Date: 10/03/2001 10:04
 
Hi,

> Anyway, here is a list of the callback functions. I've
> tried to trace them a bit to see what they do. I'm not
> sure about these descriptions, perhaps Xavier Roche or
> someone else more familiar with HTTrack could check
> them when time allows. For what they are worth, here
> they are:

Seems to be well documented :)) - here are some remarks:

First, you can look at the httrack.c and httrack.h functions for a
working example (httrack.c is only useful to launch the mirror, and
display some funny information).

The steps to follow to launch the engine are quite simple. You can do
	hts_init();
	return hts_main(argc,argv);
filling argc and argv with proper C-style arguments, as if you were
calling httrack from the command line.
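
To illustrate the argc/argv part, here is a minimal sketch of filling the argument array by hand. The helper name build_mirror_args is hypothetical (it is not part of HTTrack), and the hts_init()/hts_main() calls are left in a comment because they need the HTTrack sources to link.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of HTTrack: fill argv the way the shell
   would for "httrack <url> -O <outdir>" and return the resulting argc.
   argv must have room for at least 5 entries (4 arguments + NULL). */
int build_mirror_args(char **argv, char *url, char *outdir)
{
    int argc = 0;

    assert(argv != NULL && url != NULL && outdir != NULL);

    argv[argc++] = "httrack"; /* argv[0]: program name, as on the command line */
    argv[argc++] = url;       /* address to mirror */
    argv[argc++] = "-O";      /* output path option */
    argv[argc++] = outdir;    /* where the mirror is written */
    argv[argc] = NULL;        /* C convention: argv[argc] is NULL */

    /* With the HTTrack sources linked in, the caller would then do:
         hts_init();
         return hts_main(argc, argv);                               */
    return argc;
}
```

Any other command-line option can be appended the same way, one token per argv entry.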

You can also define wrappers, so that you can control and add
features when mirroring:

	hts_init();
	htswrap_add("init",mywrapper_init);
	htswrap_add("free",mywrapper_uninit);
	htswrap_add("start",mywrapper_start);
	htswrap_add("change-options",mywrapper_chopt);
	htswrap_add("end",mywrapper_end);
	htswrap_add("check-html",mywrapper_checkhtml);
	htswrap_add("loop",mywrapper_loop);
	htswrap_add("query",mywrapper_query);
	htswrap_add("query2",mywrapper_query2);
	htswrap_add("query3",mywrapper_query3);
	htswrap_add("check-link",mywrapper_check);
	htswrap_add("pause",mywrapper_pause);
	htswrap_add("save-file",mywrapper_filesave);
	htswrap_add("link-detected",mywrapper_linkdetected);
	return hts_main(argc,argv);

In this case, that's right, you'll have to define all the
mywrapper_* functions.

Here are some inline comments:



 
> List of callback functions
> --------------------------
> 
> Their 'name' is the special identifier which must be
> passed to htswrap_add()
> ---------------------------------------------------------
> 
>    Name:           'change-options'
>    Purpose:        Called when options have been changed by HTTrack
>    Params:         httrackp* opt
>                    (all the options for this session, see 'htsopt.h')
>    Valid Return:   int 1 (ignored as far as I can tell, but to be safe use 1)

Some options can NOT be changed, however, such as the path or the
proxy.

>    Name:           'link-detected'
>    Purpose:        Called when a link is detected
>    Params:         char* link (the text of the 'href=' attribute)
>                    links are usually relative, so this text will likely
>                    be something like 'filename.ext' or 'subdir/filename.ext'

The link is given 'as is'. It was extracted from any tag (a, img,
object...) or from javascript, and it can be relative or absolute
(http://).
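
Since the callback can receive both kinds of link, here is a minimal sketch of a 'link-detected' wrapper that tells them apart and counts them. The return type and the return value 1 are assumptions (the notes above give no return value for this callback), and link_is_absolute and the two counters are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical counters, only for illustration. */
int absolute_links = 0;
int relative_links = 0;

/* Returns 1 if the link starts with a scheme such as "http://",
   0 otherwise (relative link). */
int link_is_absolute(const char *link)
{
    const char *p = link;

    assert(link != NULL);
    while (isalpha((unsigned char)*p))  /* skip the scheme letters */
        p++;
    return p != link && strncmp(p, "://", 3) == 0;
}

/* Sketch of a 'link-detected' wrapper. The int return value is an
   assumption, it is not documented in this thread. */
int mywrapper_linkdetected(char *link)
{
    if (link_is_absolute(link))
        absolute_links++;
    else
        relative_links++;
    return 1;
}
```

Checking only the start of the string avoids treating something like 'go.cgi?to=http://www.foo.com/' as an absolute link.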

>    Name:           'check-html'
>    Purpose:        Called to check if an HTML file 
>                    should be parsed after download
>    Params:         char* buffer_html (address of the HTML buffer)
>                    int buffer_html_size (size of this buffer in bytes)
>                    char* host_name (eg: www.foo.com)
>                    char* filename (eg: /index.html)
>    Valid Return:   int
>                    0 = do not parse this file
>                    1 = parse this file (default behaviour)
>    Notes:          This function is also called for the primary URL,
>                    before downloading.
>                    In this case, it is passed the URL in 'buffer_html'
>                    and the word 'primary' in 'host_name'
>                    and '/primary' in 'filename'.

In this function, you can add some code for linguistic analysis,
search features, and so on.
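
As an example of such a search feature, here is a minimal sketch of a 'check-html' wrapper that counts a keyword in each downloaded page. It assumes the parameters and the 'primary' convention described in the quoted notes. count_word, keyword_hits and the keyword itself are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical running total of keyword matches. */
int keyword_hits = 0;

/* Count non-overlapping occurrences of word in the first size bytes of
   buf. The length is passed explicitly, so buf need not end with NUL. */
int count_word(const char *buf, int size, const char *word)
{
    int n = 0, i = 0;
    int wlen = (int)strlen(word);

    if (wlen == 0)
        return 0;
    while (i + wlen <= size) {
        if (memcmp(buf + i, word, (size_t)wlen) == 0) {
            n++;
            i += wlen;
        } else {
            i++;
        }
    }
    return n;
}

/* Sketch of a 'check-html' wrapper: always lets HTTrack parse the file. */
int mywrapper_checkhtml(char *buffer_html, int buffer_html_size,
                        char *host_name, char *filename)
{
    assert(buffer_html != NULL && host_name != NULL && filename != NULL);

    /* Primary URL call (per the quoted notes): the buffer holds the URL,
       not HTML, so there is nothing to search. */
    if (strcmp(host_name, "primary") == 0 && strcmp(filename, "/primary") == 0)
        return 1;

    keyword_hits += count_word(buffer_html, buffer_html_size, "httrack");
    return 1; /* 1 = parse this file */
}
```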

>    Name:           'save-file'
>    Purpose:        Called when a file is about to be saved
>    Params:         char* filename
>                    (path to local file, starting with the prefix given
>                    with the -O command line option; if the given prefix
>                    was relative then this name will be relative also)
>    Valid Return:   int
>                    0 = don't save the file
>                    1 = save the file (default behavior)

Note that you can NOT change the filename in this routine. I may add
a wrapper to build specific target names.
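
A minimal sketch of a 'save-file' wrapper that only decides whether to save, without touching the name. It assumes the return values from the quoted notes (0 = don't save, 1 = save). has_suffix and the '.exe' rule are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if name ends with suffix (case-sensitive), 0 otherwise. */
int has_suffix(const char *name, const char *suffix)
{
    size_t n = strlen(name);
    size_t s = strlen(suffix);

    return n >= s && strcmp(name + (n - s), suffix) == 0;
}

/* Sketch of a 'save-file' wrapper: refuse files ending in ".exe",
   accept everything else. The filename is only read, never modified. */
int mywrapper_filesave(char *filename)
{
    assert(filename != NULL);
    return has_suffix(filename, ".exe") ? 0 : 1;
}
```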

>    Name:           'loop'
>    Purpose:        Called during a download loop, after every chunk of bytes
>    Params:         many, many
>    Valid Return:   int
>                    0 = HTTrack should end
>                    1 = HTTrack should continue

Here are the parameters of int __cdecl mywrapper_loop:

lien_back* back,int back_max,int back_index

The 'back' structure (see htscorer.h) is an array (back_max elements)
of lien_back entries, and can be used to show detailed stats on
current downloads. The "back_index", if non-negative, is the index of
the current file being processed (parsed, or in 'wait' state).

Here are the two main structures used:

- lien_back:

typedef struct {
  char url_adr[HTS_URLMAXSIZE*2];     // address of http document
  char url_fil[HTS_URLMAXSIZE*2];     // filename of http document
  char url_sav[HTS_URLMAXSIZE*2];     // local filename of http document (empty if file not saved)
  char referer_adr[HTS_URLMAXSIZE*2]; // address of HTTP-REFERER, if any
  char referer_fil[HTS_URLMAXSIZE*2]; // filename of HTTP-REFERER, if any
  char location_buffer[HTS_URLMAXSIZE*2];  // this http document sent a redirect to us
  char send_too[1024];    // internal - data to send
  int status;           // status (-1=not used, 0: ready, >0: operation/download in progress)
  int testmode;         // test mode
  int timeout;            // timeout, in seconds
  TStamp timeout_refresh; // internal
  int rateout;            // minimum transfer rate
  TStamp rateout_time;    // internal
  LLint maxfile_nonhtml;  // maximum size of non html file
  LLint maxfile_html;     // maximum size of html file
  htsblk r;               // current link object, see htsblk structure in htslib.h
  short int is_update;    // has been updated
  int head_request;       // head request
  LLint range_req_size;   // internal
  //
  int http11;             // must use HTTP/1.1
  int is_chunk;           // internal
  char* chunk_adr;        // internal
  LLint chunk_size;       // internal
  //
  short int* pass2_ptr;   // internal
  //
  char info[256];       // internal
  int stop_ftp;         // internal
} lien_back;

- htsblk:

typedef struct {
  int statuscode;        // status-code, -1=error, 200=OK, 201=..etc (see RFC1945)
  short int notmodified; // not modified?
  short int is_write;    // direct-to-disc
  short int is_chunk;    // internal
  char* adr;             // address if in memory (!is_write)
  FILE* out;             // internal
  LLint size;            // current downloaded size
  char msg[80];          // error message if any
  char contenttype[64];  // content-type ("text/html" for example)
  char* location;        // internal (redirect)
  LLint totalsize;       // total size
  short int is_file;     // this link is a file://
  T_SOC soc;             // internal
  FILE* fp;              // internal
  char lastmodified[64]; // Last-Modified
  char etag[64];         // Etag
  char cdispo[256];      // Content-Disposition (truncated)
  /* */
  htsrequest req;  // internal
} htsblk;

int lien_n,int lien_tot

Links scanned, total number of links

LLint stat_bytes,LLint stat_bytes_recv

Bytes received, bytes received (raw)

int stat_time

Time in seconds

int stat_nsocket

Number of connections

LLint stat_written

Bytes written

int stat_updated, int stat_errors

Files updated, number of errors

int irate

Current transfer rate, estimated every second

int nbk

Links successfully anticipated
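
To show how the 'back' array and its status field could feed a stats display from inside the loop wrapper, here is a minimal sketch. demo_lien_back is a stand-in that reproduces only the status field, so that the example compiles without the HTTrack headers. Real code would use lien_back itself. slot_summary and summarize_slots are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for lien_back, NOT the real structure: only 'status' is kept. */
typedef struct {
    int status; /* -1 = not used, 0 = ready, >0 = operation in progress */
} demo_lien_back;

/* Hypothetical summary of the download slots. */
typedef struct {
    int unused; /* slots with a negative status */
    int ready;  /* slots with status 0 */
    int active; /* slots with status > 0 */
} slot_summary;

/* Walk the back_max entries once and classify each slot by its status. */
slot_summary summarize_slots(const demo_lien_back *back, int back_max)
{
    slot_summary s = {0, 0, 0};
    int i;

    assert(back != NULL || back_max == 0);
    for (i = 0; i < back_max; i++) {
        if (back[i].status < 0)
            s.unused++;
        else if (back[i].status == 0)
            s.ready++;
        else
            s.active++;
    }
    return s;
}
```

A loop wrapper could call this on every invocation, print the three counters, and return 1 so that HTTrack continues.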

>    Name:           'pause'
>    Purpose:        Called to wait for the lock file to be deleted
>    Params:         char* lockfile
>    Valid Return:   Nothing

See option G
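
A minimal sketch of what a 'pause' wrapper could look like, assuming a POSIX system (sleep() from unistd.h) and that polling once per second is acceptable. lockfile_exists is a hypothetical helper, and the void return follows the "Nothing" in the quoted notes.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h> /* sleep(): POSIX only, an assumption of this sketch */

/* Returns 1 if the file can be opened for reading, 0 otherwise.
   A file that exists but is unreadable is treated as absent. */
int lockfile_exists(const char *path)
{
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return 0;
    fclose(fp);
    return 1;
}

/* Sketch of a 'pause' wrapper: block until the lock file is gone. */
void mywrapper_pause(char *lockfile)
{
    assert(lockfile != NULL);
    while (lockfile_exists(lockfile))
        sleep(1); /* poll once per second */
}
```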


 