HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Query Strings and Web Server Rewrite Rules
Author: Abel Deuring
Date: 09/19/2004 16:47
 
> > My largest problem is the MD5 hash of query strings.  It 
> > would be nice if an option was available to convert 
> > invalid filename characters to an acceptable replacement 
> > (? to @, etc).  This way, an Apache rewrite rule would be 
> > easier.
> 
> Tuning the naming system is possible, using callbacks, but 
> requires some coding. You can have a look at the Python 
> module integration:
> <http://www.satzbau-gmbh.de/staff/abel/httrack-py/>
> 
> Which allows you to wrap naming callbacks, and others.

Thanks for the pointer, Xavier ;)

Gerald, to sketch a possible solution for the "multiple 
mirror archives" problem: 

- httrack has an option to replace all external links
with links to an internal "error page", to which the
original URL is passed as a CGI parameter. Replace
the default error page with your own CGI script. This
script checks a table/database to see whether the URL
belongs to a page mirrored in another archive; if so, it
sends a redirect to the browser, otherwise it returns an
error message.

- Build the database that maps URLs to archive file
paths with the save-name callback. In this callback,
simply write the original URL and the path name of the
saved file to a database of your choice. Even a simple
text file might do if you don't have too many
links. If you use the httrack-py module, the required
Python code should be no longer than a few dozen lines.
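
To make the two steps above a bit more concrete, here is a rough
sketch in plain Python. It is not httrack-py code: the table layout,
the function names (open_db, record_saved_file, cgi_response) and the
"/archives" URL prefix are all made up for illustration, and how
record_saved_file gets called from the save-name callback depends on
the httrack-py API, which is not shown here.

```python
import sqlite3
from urllib.parse import quote


def open_db(path=":memory:"):
    """Open (or create) the URL -> archive file mapping database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS url_map ("
        "url TEXT PRIMARY KEY, archive TEXT NOT NULL, path TEXT NOT NULL)"
    )
    return conn


def record_saved_file(conn, archive, url, path):
    """Meant to be called from the save-name callback (wiring not shown):
    remember which archive and which local file hold the given URL."""
    conn.execute(
        "INSERT OR REPLACE INTO url_map (url, archive, path) VALUES (?, ?, ?)",
        (url, archive, path),
    )
    conn.commit()


def cgi_response(conn, url, base="/archives"):
    """Build the CGI response for the replacement "error page" script.

    'base' is an assumed URL prefix under which all archives are served.
    """
    row = conn.execute(
        "SELECT archive, path FROM url_map WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        # Not mirrored anywhere: return an error message.
        return (
            "Status: 404 Not Found\r\n"
            "Content-Type: text/plain\r\n\r\n"
            "Not mirrored: %s\n" % url
        )
    archive, path = row
    # Mirrored in some archive: redirect the browser to the saved file.
    return "Status: 302 Found\r\nLocation: %s/%s/%s\r\n\r\n" % (
        base, archive, quote(path)
    )
```

The CGI script itself would read the original URL from its query
string, call cgi_response and write the result to stdout. The name of
that query parameter depends on how httrack builds the error page
link, so it is left out here.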

The only minor annoyance I see right now with this solution
is this: you must of course identify the different archives
somehow in the database, but httrack currently does not
reveal the "project name" in any callback (or did I miss
something?), and it does not allow arbitrary parameters
to be passed to the callback module. The only way I know
to pass parameters to the module is to set an environment
variable. But you can also use the -O option and its
'representation' path_html in httrack-py's 'start'
callback to figure out which site is currently being mirrored.
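
A small sketch of how the two approaches could be combined, assuming
the 'start' callback hands you the path_html value as a string. The
environment variable name MIRROR_ARCHIVE and the function archive_id
are invented for this example; nothing in httrack or httrack-py
defines them.

```python
import os


def archive_id(path_html, environ=None):
    """Pick a name for the archive currently being mirrored.

    Prefer an explicitly set environment variable (hypothetical name
    MIRROR_ARCHIVE); otherwise fall back to the last directory
    component of the output path given with -O.
    """
    env = os.environ if environ is None else environ
    name = env.get("MIRROR_ARCHIVE")
    if name:
        return name
    # normpath drops a trailing slash, so basename is never empty here.
    return os.path.basename(os.path.normpath(path_html))
```

With "-O /mirrors/site-a" and no environment variable set, this would
give "site-a", which can then serve as the archive key in the database.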

Abel
 