> HTTrack has been a wonderful tool to use building The
> eGranary Digital Library.
Thanks :p
> An example of one of
> the things we would like to do differently would be to
> make links between different mirrored projects work, if
> there are links. We would also like to have the links
> work
> when we acquire a new mirror and an old project contains
> some links to the new content.
Hmm, this is a very specific feature, and it will
require some *very* hard work. The main problem is that
existing "live" links inside projects will have to be
rebuilt into "local" links whenever a mirror is added or
updated (an update may widen the mirror scope, if you
change the settings). Basically it means that you have to
rebuild ALL existing projects to be sure that cross-links
will still work - even for projects with no updated pages:
local link names may have changed, because of new
collisions (<http://www.example.com/> and
<http://www.example.com/index.html>, for example).
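To illustrate the collision problem, here is a minimal sketch. The naming rule below is a deliberately naive stand-in and NOT HTTrack's real naming scheme; naive_local_name and assign_names are hypothetical helpers made up for this example. It only shows why a local name can change once the mirror scope widens.

```python
import posixpath
from urllib.parse import urlsplit

def naive_local_name(url):
    # Hypothetical rule: host + path, with directory URLs saved as index.html.
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"
    return parts.hostname + path

def assign_names(urls):
    # Give each URL a local name, adding a -N suffix when a name is taken.
    taken = {}
    result = {}
    for url in urls:
        name = naive_local_name(url)
        count = taken.get(name, 0) + 1
        taken[name] = count
        if count > 1:
            stem, ext = posixpath.splitext(name)
            name = f"{stem}-{count}{ext}"
        result[url] = name
    return result

page = "http://www.example.com/index.html"
# Narrow mirror: the page keeps the obvious name.
print(assign_names([page])[page])
# Wider mirror: "/" now claims index.html first, so the page is renamed.
print(assign_names(["http://www.example.com/", page])[page])
```

With this toy rule, every project that already linked to the first name would have to be rewritten after the second crawl, even though the page itself did not change.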
So this would require a complete rewrite of the caching
system (merging all the different caches, with scalable
hashtables) and some rework of the mirror engine. This
is not impossible to do, but I don't have the necessary
resources (time) to do it, and there are already plenty of
missing features I will have to implement one day.
One solution might be to have a "clever" proxy: when
hitting a "live" link, such as
<http://www.example.com/foo/index.html>, the proxy could
first look in a defined directory and check whether
the "www.example.com" directory exists. If so, it opens the
corresponding cache, looks up the local name
for <http://www.example.com/foo/index.html>, and
returns the file with the proper MIME type and headers (the .zip
cache file contains the HTTP headers, plus either the data
or a reference to the local file). If the file cannot be found,
proceed as for any other regular external resource.
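The lookup steps above could be sketched roughly as follows. This is only an outline under assumptions: resolve_live_link, mirrors_root and cache_lookup are hypothetical names, and cache_lookup stands in for whatever code actually reads the project's .zip cache (its format is not reproduced here). It is assumed to return a (local_name, headers) pair, or None when the URL is not in the cache.

```python
import os
from urllib.parse import urlsplit

def resolve_live_link(url, mirrors_root, cache_lookup):
    """Return (body, headers) from a local mirror, or None to fetch normally."""
    host = urlsplit(url).hostname
    if not host or host in (".", ".."):
        return None
    # Step 1: is there a mirrored project directory for this host?
    project_dir = os.path.join(mirrors_root, host)
    if not os.path.isdir(project_dir):
        return None
    # Step 2: ask the project's cache for the local name and stored headers.
    # cache_lookup is a placeholder supplied by the caller (assumption).
    entry = cache_lookup(project_dir, url)
    if entry is None:
        return None
    local_name, headers = entry
    # Step 3: serve the local file with the headers recorded at crawl time.
    path = os.path.join(project_dir, local_name)
    if not os.path.isfile(path):
        return None
    with open(path, "rb") as handle:
        return handle.read(), headers
```

A None result means "treat it as any other external resource", so the proxy can pass the request through unchanged. A real proxy would also need to check that local_name cannot escape project_dir.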
IMHO, this would require MUCH less work, because existing
projects would remain independent, that is, without the
hassle of handling tens (possibly hundreds) of cross-linked
projects, and without the need to update the whole universe
at every add/update. AND with the possibility to add sites
from independent sources (for example, two different
libraries merging their own projects).
Just a suggestion - I can provide more details on the
possible solution if you wish.