HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: HTTrack development question.
Author: Xavier Roche
Date: 01/22/2005 14:53
 
> HTTrack has been a wonderful tool to use building The 
> eGranary Digital Library.

Thanks :p

> An example of one of 
> the things we would like to do differently would be to 
> make links between different mirrored projects work, if 
> there are links. We would also like to have the links 
> work when we acquire a new mirror and an old project 
> contains some links to the new content.

Hmm, this is a very specific feature, and it would 
require some *very* hard work. The main problem is that 
existing "live" links inside projects would have to be 
rebuilt into "local" links whenever a mirror is added or 
updated (an update may widen the mirror scope if you 
change the settings). Basically, it means that you have to 
rebuild ALL existing projects to be sure that cross-links 
still work, even for projects with no updated pages: 
local link names may have changed because of new 
collisions (http://www.example.com/ and 
http://www.example.com/index.html, for example, compete 
for the same local file name).
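
To make the collision problem concrete, here is a minimal sketch. The naming scheme below (naive_local_name, assign_names, and the "-2" suffix) is purely hypothetical and is not HTTrack's actual naming code; it only shows why the local name of an unchanged page can depend on which other pages are in the mirror.

```python
import posixpath
from urllib.parse import urlsplit

def naive_local_name(url):
    """Hypothetical URL-to-file mapping (NOT HTTrack's real rules)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"  # a directory URL needs a file name on disk
    return parts.netloc + path

def assign_names(urls):
    """Give each URL a unique local name, in the order the URLs are seen."""
    taken = set()
    result = {}
    for url in urls:
        name = naive_local_name(url)
        stem, ext = posixpath.splitext(name)
        candidate, n = name, 1
        while candidate in taken:  # collision: try index-2.html, index-3.html, ...
            n += 1
            candidate = f"{stem}-{n}{ext}"
        taken.add(candidate)
        result[url] = candidate
    return result

a = assign_names(["http://www.example.com/",
                  "http://www.example.com/index.html"])
b = assign_names(["http://www.example.com/index.html",
                  "http://www.example.com/"])
# The same two URLs end up with swapped local names depending on order,
# so links in OTHER projects that point at them would need rewriting.
print(a)
print(b)
```

With cross-linked projects, every project that links to one of these pages would have to be rebuilt whenever such a name shifts.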

So this would require a complete rewrite of the caching 
system (merging all the different caches, with scalable 
hashtables) and some rework of the mirror engine. This 
is not impossible to do, but I don't have the necessary 
resources (time) to do it, and there are already plenty of 
missing features I will have to implement one day.

One solution might be to have a "clever" proxy: when 
hitting a "live" link, such as 
http://www.example.com/foo/index.html, the proxy could 
first look in a predefined directory and check whether 
the "www.example.com" directory exists. If so, it would open the 
corresponding cache, look up the local name of 
http://www.example.com/foo/index.html, and 
return the file with the proper MIME type and headers (the .zip 
cache file contains either the data or a reference to the local 
file, plus the HTTP headers). If the file cannot be found, 
it would proceed as for any other regular external resource.
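
A rough sketch of that lookup step, under stated assumptions: the real .zip cache format is not reproduced here, so a hypothetical per-project "cache-index.json" file stands in for it, mapping each URL to a local file name and its saved HTTP headers. The names resolve, INDEX_NAME, and mirrors_root are all made up for illustration.

```python
import json
from pathlib import Path
from urllib.parse import urlsplit

# Hypothetical stand-in for the real .zip cache: one JSON file per project,
# shaped like {url: {"local_name": "...", "headers": {...}}}.
INDEX_NAME = "cache-index.json"

def resolve(url, mirrors_root):
    """Return (headers, body) from a local mirror, or None to fetch it live."""
    host = urlsplit(url).hostname
    if not host or host in (".", ".."):
        return None
    project_dir = Path(mirrors_root) / host
    index_file = project_dir / INDEX_NAME
    if not index_file.is_file():
        return None  # no mirrored project for this host
    index = json.loads(index_file.read_text(encoding="utf-8"))
    entry = index.get(url)
    if entry is None:
        return None  # host is mirrored, but this page is not
    local_file = project_dir / entry["local_name"]
    if not local_file.is_file():
        return None  # stale index entry: treat as an external resource
    return entry["headers"], local_file.read_bytes()
```

A proxy built around this would call resolve() for each request and fall back to a normal upstream fetch whenever it returns None, so each project stays independent and nothing needs to be rebuilt when a mirror is added.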

IMHO, this would require MUCH less work, because existing 
projects would remain independent, that is, without the 
hassle of handling tens (possibly hundreds) of cross-linked 
projects, and without the need to update the whole universe 
at every add/update. AND with the possibility of adding sites 
from independent sources (for example, two different 
libraries merging their own projects).

Just a suggestion - I can provide more details on the 
possible solution if you wish.

 