HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Tracking Hundreds of HTTrack Jobs
Author: Xavier Roche
Date: 08/25/2006 21:05
 
> Thanks very much for
> your patience, advice, and programming.

Thanks :p

> We remain vexed, however, at managing the HTTrack
> jobs for hundreds (soon to be over 1,000) sites that
> have been contributed to the eGranary.

Three options, IMHO:

1. On Linux/Unix, a frontend script which will do something like:

#!/bin/sh
#
(some SQL action)
httrack "$@"
RETURNCODE=$?
(some SQL action)

2. On all platforms, use of the library, through plugins (the plugin
library has been greatly simplified), to track the start/end of mirrors and/or
their completeness

3. On all platforms too, modifying src/httrack.c (which is actually only a
frontend to the httrack core library!) to fit your needs
 
Note that the current 3.41 beta release **should** be thread-safe, and hence
you **should** be able to spawn multiple mirrors in multiple threads.

> Here's our biggest problem: keeping tabs on which
> jobs are running on which machines (we have 20
> dedicated to scraping and updating our mirrors) and
> knowing when the jobs are done.

A simple script frontend would do the trick?
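
As a rough sketch of such a frontend (not an official HTTrack tool): the
functions below append one "start" line and one "end" line per mirror to a
shared log file, tagged with the machine name, so a central handler can see
which job runs where and when it finished. JOBLOG, MIRROR_CMD, record_event
and run_tracked are names invented for this example, and the flat log file
only stands in for the "(some SQL action)" steps.

```shell
#!/bin/sh
# Sketch only. JOBLOG and MIRROR_CMD are hypothetical variables introduced
# for this example; they are not HTTrack settings.
JOBLOG="${JOBLOG:-/tmp/httrack-jobs.log}"
MIRROR_CMD="${MIRROR_CMD:-httrack}"

# Append one tab-separated line: UTC time, machine name, PID, event, detail.
# This is the stand-in for "(some SQL action)".
record_event() {
  printf '%s\t%s\t%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(uname -n)" "$$" "$1" "$2" >> "$JOBLOG"
}

# Run one mirror, logging its start and its end, and return its exit status.
run_tracked() {
  record_event start "$*"
  rc=0
  "$MIRROR_CMD" "$@" || rc=$?
  record_event end "rc=$rc"
  return "$rc"
}
```

To use it as the frontend, end the script with run_tracked "$@" and call the
script with the same arguments you would give to httrack. Pointing JOBLOG at a
location all 20 machines can write to is one way to get a central view.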

> It would be much better if we had some option inside
> HTTrack that would "send a signal" to some central
> handler that could then process the data.

Might also be done using the 3.41 library (in the following example, just
modify end_of_mirror):

/usr/share/httrack/libtest/callbacks-example-simple.c:
----------------------------------------

/* system includes */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* standard httrack module includes */
#include "httrack-library.h"
#include "htsopt.h"
#include "htsdefines.h"

/* external functions */
EXTERNAL_FUNCTION int hts_plug(httrackp *opt, const char* argv);
EXTERNAL_FUNCTION int hts_unplug(httrackp *opt);

/* local function called as "check_html" callback */
static int process_file(t_hts_callbackarg *carg, /* the carg structure, holding various information */
                        httrackp *opt,           /* the option settings */
                        /* other parameters are callback-specific */
                        char* html, int len, const char* url_address, const char* url_file) {
  void *ourDummyArg = (void*) CALLBACKARG_USERDEF(carg);    /* optional user-defined arg */

  /* call parent functions if multiple callbacks are chained. you can skip
     this part, if you don't want previous callbacks to be called. */
  if (CALLBACKARG_PREV_FUN(carg, check_html) != NULL) {
    if (!CALLBACKARG_PREV_FUN(carg, check_html)(CALLBACKARG_PREV_CARG(carg), opt,
                                                html, len, url_address, url_file)) {
      return 0;  /* abort */
    }
  }

  printf("file %s%s content: %s\n", url_address, url_file, html);
  return 1;  /* success */
}

/* local function called as "end" callback */
static int end_of_mirror(t_hts_callbackarg *carg, /* the carg structure, holding various information */
                         httrackp *opt) {         /* the option settings */
  void *ourDummyArg = (void*) CALLBACKARG_USERDEF(carg);    /* optional user-defined arg */

  /* processing */
  fprintf(stderr, "That's all, folks!\n");

  /* call parent functions if multiple callbacks are chained. you can skip
     this part, if you don't want previous callbacks to be called. */
  if (CALLBACKARG_PREV_FUN(carg, end) != NULL) {
    /* status is ok on our side, return other callback's status */
    return CALLBACKARG_PREV_FUN(carg, end)(CALLBACKARG_PREV_CARG(carg), opt);
  }

  return 1;  /* success */
}

/*
module entry point
the function name and prototype MUST match this prototype
*/
EXTERNAL_FUNCTION int hts_plug(httrackp *opt, const char* argv) {
  /* optional argument passed in the commandline we won't be using here */
  const char *arg = strchr(argv, ',');
  if (arg != NULL)
    arg++;

  /* plug callback functions */
  CHAIN_FUNCTION(opt, check_html, process_file, /* optional user-defined arg */ NULL);
  CHAIN_FUNCTION(opt, end, end_of_mirror, /*optional user-defined arg*/NULL);

  return 1;  /* success */
}

/*
module exit point
the function name and prototype MUST match this prototype
*/
EXTERNAL_FUNCTION int hts_unplug(httrackp *opt) {
  fprintf(stderr, "Module unplugged\n");

  return 1;  /* success */
}

----------------------------------------

> -- an email containing either the data or the
> location of the hts-log sent to a user-configurable
> address 

Or the digest of the hts-log.txt ?
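
A digest could be as small as a couple of counts plus the tail of the log. The
sketch below assumes that error and warning lines in hts-log.txt contain
"Error:" and "Warning:" (check this against your own logs); log_digest is a
name made up for this example.

```shell
#!/bin/sh
# Print a short digest of one HTTrack log file.
# Assumption: error and warning lines contain "Error:" and "Warning:".
log_digest() {
  log="$1"
  if [ ! -r "$log" ]; then
    echo "cannot read $log" >&2
    return 1
  fi
  # grep -c exits non-zero when the count is 0, hence the "|| true".
  errors=$(grep -c 'Error:' "$log" || true)
  warnings=$(grep -c 'Warning:' "$log" || true)
  printf 'log: %s\nerrors: %s\nwarnings: %s\n' "$log" "$errors" "$warnings"
  printf 'last lines:\n'
  tail -n 5 "$log"
}
```

It could then be mailed with something like
log_digest /path/to/project/hts-log.txt | mail -s "mirror finished" you@example.org
where the path and the address are placeholders.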

> -- the capacity to post the data or the location of
> the hts-log into an ODBC database

mysql-client ?
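
For instance, the frontend script could build an INSERT statement and pipe it
to the mysql command-line client. This is only a sketch: the mirror_jobs table,
its columns, and the sql_quote and job_insert_sql helpers are all invented for
the example, and the escaping assumes MySQL's default handling of backslashes
and quotes in string literals.

```shell
#!/bin/sh
# Escape backslashes and single quotes for use inside a MySQL string literal.
sql_quote() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e "s/'/''/g"
}

# Print an INSERT for one finished job.
# Arguments: host, project name, path of hts-log.txt, numeric return code.
# The table and column names are hypothetical.
job_insert_sql() {
  printf "INSERT INTO mirror_jobs (host, project, log_path, return_code) VALUES ('%s', '%s', '%s', %d);\n" \
    "$(sql_quote "$1")" "$(sql_quote "$2")" "$(sql_quote "$3")" "$4"
}
```

Usage would be along the lines of
job_insert_sql "$(uname -n)" mysite /mirrors/mysite/hts-log.txt "$RETURNCODE" | mysql tracking
with "tracking" being a placeholder database name and the connection settings
assumed to come from an option file such as ~/.my.cnf.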
 