Xavier,
We've been using HTTrack for several years now to mirror hundreds of sites for
the eGranary Digital Library and have marveled at its capacity to handle even
the most complex sites. Thanks very much for your patience, advice, and
programming.
We remain vexed, however, by the task of managing the HTTrack jobs for the
hundreds (soon to be over 1,000) of sites that have been contributed to the
eGranary.
To help us analyze the completeness of our mirrors, we've developed an SQL
database to track the jobs and store basic data, like time taken, pages
processed, and a few other items we snag from the hts-log. As well, we store
the command line parameters so we can compare attempts to see which parameters
are most successful.
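
For anyone wanting to picture this kind of tracking database, here is a minimal sketch in Python with SQLite. It is not our actual schema: the table name `scrape_jobs`, its columns, and the helper functions are all hypothetical, chosen only to illustrate storing each attempt with its command line so attempts can be compared.

```python
import sqlite3

# Hypothetical schema: one row per scrape attempt.
SCHEMA = """
CREATE TABLE IF NOT EXISTS scrape_jobs (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    site            TEXT NOT NULL,
    machine         TEXT NOT NULL,
    command_line    TEXT NOT NULL,
    seconds_taken   REAL,
    pages_processed INTEGER
)
"""


def open_db(path=":memory:"):
    """Open (or create) the tracking database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_attempt(conn, site, machine, command_line,
                   seconds_taken, pages_processed):
    """Store one scrape attempt and return its row id."""
    cur = conn.execute(
        "INSERT INTO scrape_jobs "
        "(site, machine, command_line, seconds_taken, pages_processed) "
        "VALUES (?, ?, ?, ?, ?)",
        (site, machine, command_line, seconds_taken, pages_processed),
    )
    conn.commit()
    return cur.lastrowid


def attempts_for_site(conn, site):
    """Return every attempt for a site, oldest first, for side-by-side comparison."""
    return conn.execute(
        "SELECT command_line, seconds_taken, pages_processed "
        "FROM scrape_jobs WHERE site = ? ORDER BY id",
        (site,),
    ).fetchall()
```

Listing the rows returned by `attempts_for_site` shows, for each set of parameters, how long the scrape took and how many pages it processed.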
(We've also developed a way to snag the command line parameters from the
doit.log, but we find that these are not always reliably updated with each
subsequent scrape.)
Here's our biggest problem: keeping tabs on which jobs are running on which
machines (we have 20 dedicated to scraping and updating our mirrors) and
knowing when the jobs are done.
Sure, we can write a program to check for the presence of a lock file or to read
the log, but this is cumbersome: we'd need to configure each scraping machine,
possibly each directory or scrape, and remember to use this procedure whenever
we do an ad hoc scrape from our own workstations.
It would be much better if we had some option inside HTTrack that would "send
a signal" to some central handler that could then process the data.
Here are some ideas:
-- an email containing either the data or the location of the hts-log sent to
a user-configurable address
-- a file containing either the data or the location of the hts-log written to
a user-configurable directory
-- the capacity to post the data or the location of the hts-log into an ODBC
database
I see on this forum some suggestions to create a "wrapper" program that will
sense the termination of HTTrack and then run another program that will update
our database. We've looked into several options for this and all of them are
awkward.
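
To make the wrapper idea concrete, here is a rough sketch in Python of the sort of thing suggested on the forum. It is an illustration under stated assumptions, not a tested solution: the function name `run_and_report`, the `finished_jobs` table, and the use of a SQLite file as a stand-in for the central database are all hypothetical. The wrapper simply runs whatever command it is given, waits for it to exit, and then records the result.

```python
import socket
import sqlite3
import subprocess
import time


def run_and_report(command, db_path, log_path=None):
    """Run `command` to completion, then record the outcome in a database.

    `command` is a list such as ["httrack", url, "-O", output_dir].
    `db_path` is a SQLite file standing in for a central database
    (an assumption made to keep the sketch self-contained).
    `log_path` is where the caller expects the hts-log to be; it is
    stored as given and is not read or checked here.
    """
    started = time.time()
    result = subprocess.run(command)   # blocks until the job terminates
    elapsed = time.time() - started

    conn = sqlite3.connect(db_path)
    try:
        # Hypothetical table: one row per finished job.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS finished_jobs ("
            "machine TEXT, command_line TEXT, exit_code INTEGER, "
            "seconds_taken REAL, log_path TEXT)"
        )
        conn.execute(
            "INSERT INTO finished_jobs VALUES (?, ?, ?, ?, ?)",
            (socket.gethostname(), " ".join(command),
             result.returncode, elapsed, log_path),
        )
        conn.commit()
    finally:
        conn.close()
    return result.returncode
```

A call might look like `run_and_report(["httrack", "http://example.org/", "-O", "/mirrors/example"], "/shared/jobs.db", "/mirrors/example/hts-log.txt")`, where the paths are placeholders. The drawback is the one described above: every job, including ad hoc ones, has to be started through the wrapper for the record to be written.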
Might it be possible to add the capacity to "send a signal" to a central
processor at the end of an HTTrack job? I'm no programmer, so I can't do this
myself, but if you could suggest someone who might be able to help out, I'd be
glad to explore with them how we could get this done.
Best regards!
-- Cliff