> > My largest problem is the MD5 hash of query strings. It
> > would be nice if an option was available to convert
> > invalid filename characters to an acceptable replacement
> > (? to @, etc). This way, an Apache rewrite rule would be
> > easier.
>
> Tuning the naming system is possible, using callbacks, but
> requires some coding. You can have a look at the python
> module integration:
> <http://www.satzbau-gmbh.de/staff/abel/httrack-py/>
>
> Which allows you to wrap naming callbacks, and others.
Thanks for the pointer, Xavier ;)
Gerald, to sketch a possible solution for the "multiple
mirror archives" problem:
- httrack has an option to replace all external links
with a link to an internal "error page", where the
original URL is passed as a CGI parameter. Replace
the default error page with your own CGI script. This
script checks a table/database to see whether the URL
belongs to a page mirrored in another archive; if so, it
sends a redirect to the browser, otherwise it returns an
error message.
- build the database mapping URLs to archive file
paths by using the callback save-name. In this callback,
simply write the original URL and the path name of the
saved file to a database of your choice. Even a simple
text file might be useful, if you don't have too many
links. If you use the httrack-py module, the required
Python code should be no longer than a few dozen lines.
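
To make the two steps above a bit more concrete, here is a rough
sketch using a plain tab-separated text file as the "database".
Everything in it is an assumption on my side: the file name
url_map.tsv, the function names, and the /archives base path are
made up for illustration, and the wiring into httrack-py's
save-name callback is left out on purpose, since its exact
signature is not shown here. You would call record_mapping() from
that callback and cgi_response() from your CGI script.

```python
import csv

MAPPING_FILE = "url_map.tsv"  # hypothetical file name


def record_mapping(archive, url, saved_path, mapping_file=MAPPING_FILE):
    """Append one row: archive name, original URL, path of the saved file."""
    with open(mapping_file, "a", newline="", encoding="utf-8") as fh:
        csv.writer(fh, delimiter="\t").writerow([archive, url, saved_path])


def load_mappings(mapping_file=MAPPING_FILE):
    """Return {url: (archive, saved_path)}; a later row overrides an earlier one."""
    mappings = {}
    try:
        with open(mapping_file, newline="", encoding="utf-8") as fh:
            for row in csv.reader(fh, delimiter="\t"):
                if len(row) == 3:
                    archive, url, saved_path = row
                    mappings[url] = (archive, saved_path)
    except FileNotFoundError:
        pass  # nothing recorded yet
    return mappings


def cgi_response(url, mappings, base="/archives"):
    """Build the raw CGI response for the replaced error page."""
    hit = mappings.get(url)
    if hit is None:
        # URL is not mirrored anywhere: return an error message.
        return ("Status: 404 Not Found\r\n"
                "Content-Type: text/plain\r\n\r\n"
                "Not mirrored in any archive: " + url + "\n")
    archive, saved_path = hit
    # URL belongs to another archive: redirect the browser there.
    location = "/".join([base.rstrip("/"), archive, saved_path.lstrip("/")])
    return "Status: 302 Found\r\nLocation: " + location + "\r\n\r\n"
```

The sketch does no URL escaping of the Location value and no
locking of the text file, so for more than a handful of links a
real database is probably the safer choice.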
The only minor annoyance I see right now with this solution
is this: you must of course identify the different archives
somehow in the database, but httrack currently does not
reveal the "project name" in any callback (or did I miss
something?), and it does not allow passing arbitrary
parameters to the callback module. The only way I know
to pass parameters to the module is to set an environment
variable. But you can also use the -O option and its
'representation' path_html in httrack-py's 'start'
callback to figure out which site is currently mirrored.
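
A small sketch of how the callback module could pick the archive
name from either source. The variable name HTTRACK_ARCHIVE and the
helper archive_name() are my own inventions, not something httrack
or httrack-py defines, and how you get hold of path_html inside
the 'start' callback is not shown here.

```python
import os


def archive_name(path_html=None, env_var="HTTRACK_ARCHIVE"):
    """Pick an archive identifier for the URL database.

    Prefers an environment variable set before httrack is started
    (the name is hypothetical); otherwise falls back to the last
    component of the output directory given with -O.
    """
    name = os.environ.get(env_var)
    if name:
        return name
    if path_html:
        # normpath drops a trailing separator, so ".../siteA/" yields "siteA".
        return os.path.basename(os.path.normpath(path_html))
    return "unknown"
```

The fallback only works if each archive gets its own output
directory, which is the case anyway when you keep separate
mirrors.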
Abel