HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: analysis of new.txt files....
Author: Xavier Roche
Date: 04/23/2005 11:31
 
> We have been looking at the files in the hts-cache of
> HTTrack projects and we think we might be able to load the
> information from some of these into a database. Then we
> could use this information to merge different projects that
> hold host directories with the same URI name. We are looking
> at the new.txt as the file we would use for this.
> 
> Off hand do you see a reason why this would not work?
Well, no, except maybe some 32-bit limits. HTTrack is 
normally able to handle large caches (ZIP files), but this 
has never been tested thoroughly.

> Here are my questions: do you have any documentation about
> the different fields in the new.txt file? In particular, the
> Status('servermsg') and the flags. Some of them we've been
> able to figure out. Status -> added ('servermsg') -> ('ok'),
> but not error('Object%20moved') other than it refers to a
> 302 event.

I admit HTTrack is rather poorly documented for developers,
but I wrote a summary of the cache:
<http://www.httrack.com/html/cache.html>

> If we use the size comparison, the flags and the
> status('servermsg') do you think we could come up with a set
> of rules to determine which files to normalize, to ignore
> and to delete?
Err, the remote server status is not a reliable way to 
detect files that have changed. Some servers always 
respond "I have a new version" because of buggy dynamic 
scripts behind them.

The only reliable way is to compute a content checksum, for 
example with MD5 (a very reliable and easy way to detect changes).
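
As a rough illustration of the checksum idea, here is a minimal Python sketch. It is not HTTrack code: the function names (file_md5, compare_mirrors) and the assumption that the two projects are plain mirror directories on disk are mine, so adapt it to however you actually store the files.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def list_files(root):
    """Map each file's path relative to root (POSIX style) to its full path."""
    root = Path(root)
    return {p.relative_to(root).as_posix(): p for p in root.rglob("*") if p.is_file()}


def compare_mirrors(old_root, new_root):
    """Classify files of two mirror directories (hypothetical layout) by content.

    The server status is ignored entirely; only the file bytes are compared.
    """
    old_files = list_files(old_root)
    new_files = list_files(new_root)
    result = {"added": [], "removed": [], "changed": [], "unchanged": []}
    for rel in sorted(old_files.keys() | new_files.keys()):
        if rel not in old_files:
            result["added"].append(rel)
        elif rel not in new_files:
            result["removed"].append(rel)
        elif file_md5(old_files[rel]) != file_md5(new_files[rel]):
            result["changed"].append(rel)
        else:
            result["unchanged"].append(rel)
    return result
```

Calling compare_mirrors("project_a/www.example.com", "project_b/www.example.com") (placeholder paths) returns four lists of relative paths. The "unchanged" entries are the candidates to keep only once when merging, whatever status the server reported.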


 