> We have been looking at the files in the hts-cache of
> HTTrack projects and we think we might be able to load the
> information from some of these into a database. Then we
> could use this information to merge different projects that
> hold host directories with the same URI name. We are looking
> at the new.txt as the file we would use for this.
>
> Off hand do you see a reason why this would not work?
Well, no, except perhaps for some 32-bit limits. HTTrack is
normally able to handle large caches (ZIP files), but this
has never been tested thoroughly.
> Here are my questions: do you have any documentation about
> the different fields in the new.txt file? In particular, the
> Status('servermsg') and the flags. Some of them we've been
> able to figure out. Status -> added ('servermsg') -> ('ok'),
> but not error('Object%20moved') other than it refers to a
> 302 event.
I admit HTTrack is not well documented for developers,
but I wrote a summary of the cache:
<http://www.httrack.com/html/cache.html>
> If we use the size comparison, the flags and the
> status('servermsg') do you think we could come up with a set
> of rules to determine which files to normalize, to ignore
> and to delete?
Err, the remote server status is not a reliable way to
detect files that have changed. Some servers always
respond "I have a new version" because of buggy dynamic
scripts behind them.
The only reliable way is to compute a content checksum, for
example with MD5 (a very reliable and easy way to detect changes).
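
As a rough illustration of the checksum approach, here is a minimal
Python sketch. It assumes two mirrored host directories already on
disk; the function names (file_md5, compare_mirrors) and the result
categories are hypothetical and are not part of HTTrack. It compares
the files themselves and does not read new.txt or the cache.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def list_files(root):
    """Map each file's path relative to root (POSIX style) to its full path."""
    root = Path(root)
    return {p.relative_to(root).as_posix(): p
            for p in root.rglob("*") if p.is_file()}


def compare_mirrors(dir_a, dir_b):
    """Classify files of two mirrors by relative path and content checksum.

    Hypothetical helper: the categories are only one possible starting
    point for merge rules.
    """
    files_a, files_b = list_files(dir_a), list_files(dir_b)
    result = {"same": [], "changed": [], "only_a": [], "only_b": []}
    for rel in sorted(set(files_a) | set(files_b)):
        if rel not in files_b:
            result["only_a"].append(rel)
        elif rel not in files_a:
            result["only_b"].append(rel)
        elif file_md5(files_a[rel]) == file_md5(files_b[rel]):
            result["same"].append(rel)
        else:
            result["changed"].append(rel)
    return result
```

Files listed under "same" could be kept once, while "changed" entries
are the ones that would need a rule (or a manual look) when merging.
The digests could equally be stored in a database column and compared
there.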