HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Notes regarding callback functions
Author: Xavier Roche
Date: 10/03/2001 10:04

> Anyway, here is a list of the callback functions. I 
> tried to trace them a bit to see what they do. I'm not 
> sure about these descriptions, perhaps Xavier Roche or 
> someone else more familiar with HTTrack could check 
> them when time allows. For what they are worth, here 
> they are:

Seems to be well documented :)) - here are some additional notes.

First, you can look at the functions in httrack.c and 
httrack.h for a working example (httrack.c is only 
used to launch the mirror and to display
some fun information).

The steps to follow to launch the engine are quite 
simple. You can do
	return hts_main(argc,argv);
after filling argc and argv with proper C-style arguments, 
as if you were calling httrack from the command line.
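
As a rough sketch of that call (not taken from the HTTrack sources): the code below builds an argument vector the way a shell would and hands it to the engine. hts_main_stub() is only a stand-in so that the example compiles on its own, and the URL and the output directory are placeholders. In a real program you would call hts_main() instead.

```c
#include <stdio.h>

/* Stand-in for the real hts_main(), so that this sketch is
   self-contained. It only prints what it received and returns argc. */
static int hts_main_stub(int argc, char **argv) {
  int i;
  for (i = 0; i < argc; i++)
    printf("argv[%d] = %s\n", i, argv[i]);
  return argc;
}

/* Build a command-line style argument vector and launch the engine.
   The URL and the "mirror" directory are placeholders. */
int launch_mirror(void) {
  char arg0[] = "httrack";                  /* program name, as in a shell */
  char arg1[] = "http://www.example.com/";  /* primary URL (placeholder) */
  char arg2[] = "-O";                       /* output path option */
  char arg3[] = "mirror";                   /* output path (placeholder) */
  char *argv[] = { arg0, arg1, arg2, arg3, NULL };
  int argc = 4;
  return hts_main_stub(argc, argv);
}
```

The arguments are stored in writable arrays rather than string literals, because a char** argv may legally be modified by the callee.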

You can also define wrappers (registered with 
htswrap_add() before launching the engine), so that you 
can control and add features when mirroring:

	return hts_main(argc,argv);

In this case, that's right, you'll have to define all the 
mywrapper_* functions yourself.
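
To illustrate the idea of registering callbacks by name, here is a small self-contained sketch. wrap_add() and wrap_find() are hypothetical stand-ins written for this example, they are NOT the real htswrap_add(), whose exact prototype is not shown in this thread. The two callbacks are trivial placeholders.

```c
#include <string.h>

/* Generic function pointer type used to store any callback. */
typedef void (*generic_fn)(void);

#define MAX_WRAPPERS 16

static struct { const char *name; generic_fn fn; } wrappers[MAX_WRAPPERS];
static int wrapper_count = 0;

/* Hypothetical stand-in for htswrap_add(): store a callback under its
   name. Returns 1 on success, 0 if the table is full. */
int wrap_add(const char *name, generic_fn fn) {
  int i;
  for (i = 0; i < wrapper_count; i++) {
    if (strcmp(wrappers[i].name, name) == 0) {
      wrappers[i].fn = fn;   /* same name: replace the old callback */
      return 1;
    }
  }
  if (wrapper_count >= MAX_WRAPPERS)
    return 0;
  wrappers[wrapper_count].name = name;
  wrappers[wrapper_count].fn = fn;
  wrapper_count++;
  return 1;
}

/* Look a callback up by name, NULL if none was registered. */
generic_fn wrap_find(const char *name) {
  int i;
  for (i = 0; i < wrapper_count; i++)
    if (strcmp(wrappers[i].name, name) == 0)
      return wrappers[i].fn;
  return NULL;
}

/* Placeholder callbacks: accept everything. */
int mywrapper_link_detected(char *link) { (void) link; return 1; }
int mywrapper_save_file(char *filename) { (void) filename; return 1; }

/* Register both callbacks; returns 1 if both registrations succeeded. */
int register_wrappers(void) {
  return wrap_add("link-detected", (generic_fn) mywrapper_link_detected)
      && wrap_add("save-file", (generic_fn) mywrapper_save_file);
}
```

A pointer obtained from wrap_find() must be cast back to the callback's real type before it is called.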

Here are some inline comments:

> List of callback functions
> --------------------------
> Their 'name' is the special identifier which must be 
> passed to htswrap_add()
> --------------------------------------------------------------------------
>    Name:           'change-options'
>    Purpose:        Called when options have been 
> changed by HTTrack
>    Params:         httrackp* opt 
>                    (all the options for this 
> see 'htsopt.h')
>    Valid Return:   int 1 (ignored as far as I can 
> tell, but to be safe use 1)

However, some options can NOT be changed, such as the 
path or the proxy.

>    Name:           'link-detected'
>    Purpose:        Called when a link is detected
>    Params:         char* link (the text of the 'href=' 
> attribute)
>                    links are usually relative, so the 
> text will likely
>                    be something like 'filename.ext' 
> or 'subdir/filename.ext'

The link is given 'as is'. It was extracted either from a 
tag (a, img, object...) or from javascript, and it can be 
relative or absolute (http://).
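
Since the callback receives both kinds of links, a wrapper may want to tell them apart. A minimal sketch follows. The helper link_is_absolute() and the two counters are made up for this example, and the return value 1 is only assumed to mean "keep the link", since the return value of 'link-detected' is not documented in this thread.

```c
#include <string.h>
#include <ctype.h>

/* Returns 1 if the link starts with "http://" (case-insensitive),
   0 otherwise. Other schemes are not handled in this sketch. */
int link_is_absolute(const char *link) {
  const char *prefix = "http://";
  size_t i;
  for (i = 0; prefix[i] != '\0'; i++) {
    /* A shorter link hits its terminating '\0' here and fails the test. */
    if (tolower((unsigned char) link[i]) != prefix[i])
      return 0;
  }
  return 1;
}

static int absolute_count = 0;
static int relative_count = 0;

/* Example 'link-detected' callback: count both kinds of links. */
int mywrapper_link_detected(char *link) {
  if (link_is_absolute(link))
    absolute_count++;
  else
    relative_count++;
  return 1;   /* assumption: 1 keeps the link */
}
```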

>    Name:           'check-html'
>    Purpose:        Called to check if an HTML file 
>                    should be parsed after download
>    Params:         char* buffer_html (address of the 
> HTML buffer)
>                    int buffer_html_size (size of 
> buffer in bytes)
>                    char* host_name (eg:
>                    char* filename (eg: /index.html)
>    Valid Return:   int
>                    0 = do not parse this file
>                    1 = parse this file (default 
> behaviour)
>    Notes:          This function is also called for 
> the primary URL,
>                    before downloading.
>                    In this case, it is passed the 
> in 'buffer_html'
>                    and the word 'primary' 
> in 'host_name' 
>                    and '/primary' in 'filename'.

In this function, you can add code for linguistic 
analysis, search features, and so on.
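
For instance, a very simple search feature could look like the sketch below. It uses the parameters listed in the quoted description. The keyword, the counter and the helper buffer_contains() are made up for this example, and the buffer is treated as NOT NUL-terminated, to be on the safe side.

```c
#include <string.h>

/* Bounded substring search: only the first 'size' bytes of 'buf' are
   read, so the buffer does not need to be NUL-terminated. */
static int buffer_contains(const char *buf, int size, const char *word) {
  int wlen = (int) strlen(word);
  int i;
  if (wlen == 0)
    return 1;
  for (i = 0; i + wlen <= size; i++) {
    if (memcmp(buf + i, word, (size_t) wlen) == 0)
      return 1;
  }
  return 0;
}

static int pages_with_keyword = 0;

/* Example 'check-html' callback: count the pages that mention a
   keyword, and always let the engine parse the file. */
int mywrapper_check_html(char *buffer_html, int buffer_html_size,
                         char *host_name, char *filename) {
  (void) host_name;
  (void) filename;
  if (buffer_contains(buffer_html, buffer_html_size, "HTTrack"))
    pages_with_keyword++;
  return 1;   /* 1 = parse this file (default behaviour) */
}
```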

>    Name:           'save-file'
>    Purpose:        Called when a file is about to be 
> saved
>    Params:         char* filename
>                    (path to local file, starting with 
> the prefix given 
>                    with the -O command line option; if 
> the given prefix 
>                    was relative then this name will be 
> relative also)
>    Valid Return:   int 
>                    0 = don't save the file
>                    1 = save the file (default 
> behaviour)

Note that you can NOT change the filename in this 
routine - I may add a wrapper to build specific target 
filenames.
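
What the routine can do is accept or refuse the file. A minimal sketch, using the return values described in the quoted text (the list of refused extensions and the helper has_suffix() are arbitrary choices for this example):

```c
#include <string.h>

/* Returns 1 if 'name' ends with 'suffix', 0 otherwise. */
static int has_suffix(const char *name, const char *suffix) {
  size_t n = strlen(name);
  size_t s = strlen(suffix);
  return n >= s && strcmp(name + (n - s), suffix) == 0;
}

/* Example 'save-file' callback: refuse some file types, save the rest.
   The filename itself is only read, never modified. */
int mywrapper_save_file(char *filename) {
  if (has_suffix(filename, ".exe") || has_suffix(filename, ".zip"))
    return 0;   /* 0 = don't save the file */
  return 1;     /* 1 = save the file */
}
```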

>    Name:           'loop'
>    Purpose:        Called during a download loop, 
> after every chunk of bytes
>    Params:         many, many
>    Valid Return:   int
>                    0 = HTTrack should end
>                    1 = HTTrack should continue

Here are the parameters of int __cdecl 
mywrapper_loop: 

lien_back* back,int back_max,int back_index

The 'back' structure (see htscore.h) is an array 
(back_max elements) of lien_back entries, and can be 
used to show detailed stats on current downloads. 
The 'back_index', if non-negative, 
is the index of the file currently being processed 
(parsed, or in 'wait' state).

Here are the two main structures used:

- lien_back:

typedef struct {
  char url_adr[HTS_URLMAXSIZE*2];     // address of http document
  char url_fil[HTS_URLMAXSIZE*2];     // filename of http document
  char url_sav[HTS_URLMAXSIZE*2];     // local filename of http document (empty if file not saved)
  char referer_adr[HTS_URLMAXSIZE*2]; // address of the referer
  char referer_fil[HTS_URLMAXSIZE*2]; // filename of the referer
  char location_buffer[HTS_URLMAXSIZE*2];  // this http document sent a redirect to us
  char send_too[1024];    // internal - data to send
  int status;           // status (-1=not used, 0: ready, >0: operation/download in progress)
  int testmode;         // test mode
  int timeout;            // timeout, in seconds
  TStamp timeout_refresh; // internal
  int rateout;            // minimum transfer rate
  TStamp rateout_time;    // internal
  LLint maxfile_nonhtml;  // maximum size of non html file
  LLint maxfile_html;     // maximum size of html file
  htsblk r;               // current link object, see htsblk structure in htslib.h
  short int is_update;    // has been updated
  int head_request;       // head request
  LLint range_req_size;   // internal
  int http11;             // must use HTTP/1.1
  int is_chunk;           // internal
  char* chunk_adr;        // internal
  LLint chunk_size;       // internal
  short int* pass2_ptr;   // internal
  char info[256];       // internal
  int stop_ftp;         // internal
} lien_back;

- htsblk:

typedef struct {
  int statuscode;        // status-code, -1=error, 200=OK, 201=..etc (see RFC1945)
  short int notmodified; // not modified?
  short int is_write;    //
  short int is_chunk;    // internal
  char* adr;             // address if in memory (!
  FILE* out;             // internal
  LLint size;            // current downloaded size
  char msg[80];          // error message if any
  char contenttype[64];  // content-type ("text/html" for example)
  char* location;        // internal (redirect)
  LLint totalsize;       // total size
  short int is_file;     // this link is a file://
  T_SOC soc;             // internal
  FILE* fp;              // internal
  char lastmodified[64]; // Last-Modified
  char etag[64];         // Etag
  char cdispo[256];      // Content-Disposition 
  /* */
  htsrequest req;  // internal
} htsblk;

int lien_n,int lien_tot

Links scanned, total number of links

LLint stat_bytes,LLint stat_bytes_recv

Bytes received, bytes received (raw)

int stat_time

Time in seconds

int stat_nsocket

Number of connections

LLint stat_written

Bytes written

int stat_updated, int stat_errors

Files updated, number of errors

int irate

Current transfer rate, estimated every second

int nbk

Links successfully anticipated
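
As an illustration of what a loop wrapper might do with these values, here is a small sketch. The exact order of the parameters in the real callback is not spelled out above, so the helpers below take only the values they need. The helper names, the use of long long in place of LLint and the error limit are assumptions made for this example.

```c
#include <stdio.h>

/* Average transfer rate in bytes per second since the start of the
   mirror, 0 if no time has elapsed yet. */
long long average_rate(long long stat_bytes_recv, int stat_time) {
  if (stat_time <= 0)
    return 0;
  return stat_bytes_recv / stat_time;
}

/* Write a one-line progress summary into 'out' (at most out_size bytes,
   including the terminating NUL). Returns the value of snprintf(). */
int format_progress(char *out, size_t out_size, int lien_n, int lien_tot,
                    long long stat_bytes_recv, int stat_time) {
  return snprintf(out, out_size, "%d/%d links, %lld bytes, %lld B/s",
                  lien_n, lien_tot, stat_bytes_recv,
                  average_rate(stat_bytes_recv, stat_time));
}

/* Value to return from the loop callback: 1 = continue, 0 = end.
   Here the mirror is ended once max_errors errors were seen
   (max_errors <= 0 disables the limit). */
int loop_should_continue(int stat_errors, int max_errors) {
  if (max_errors > 0 && stat_errors >= max_errors)
    return 0;
  return 1;
}
```

A real mywrapper_loop would call such helpers with its own arguments, print or log the summary line, and return the result of loop_should_continue().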

>    Name:           'pause'
>    Purpose:        Called to wait for the lock file 
> to be deleted
>    Params:         char* lockfile
>    Valid Return:   Nothing

See option G
