Re: Replace/update file logic - HTTrack Website Copier Forum

Subject: Re: Replace/update file logic

Author: Xavier Roche

Date: 08/17/2002 14:59

> I'm curious as to what logic HTTrack uses to replace a file
> when it does an update.  Does it do some sort of CRC/hash on
> a file, does it just check the date/timestamp and/or
> filesize, etc.?
See the previous post (10/08/2002), "filename":

(..) the major update process is handled by the remote 
server, through two important processes:

- during the first download, the server has to send a 
reliable way to tag the file/url ; such as a timestamp 
(current date+time) or, even better, a strong etag 
identifier (which can be an md5 hash of the content ; which 
is the "ultimate weapon" for handling updates). This 
information makes it possible to identify the "freshness" 
of the data being sent.

- during the update, httrack requests the previously 
downloaded file, giving the server the "hint" previously 
sent (timestamp, and/or etag). It is the duty of the server 
to respond either with an "OK, file not modified" message 
(304), or with an "OOPS, you have to redownload this file" 
message (200).
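
As a rough illustration of the two steps above, here is a minimal Python sketch of the client side. It is not HTTrack's actual code: the function names are hypothetical, and it assumes the "hints" travel in the standard HTTP headers If-None-Match (etag) and If-Modified-Since (timestamp).

```python
from typing import Dict, Optional


def conditional_headers(etag: Optional[str],
                        last_modified: Optional[str]) -> Dict[str, str]:
    """Build the headers that hand the server its own hints back.

    `etag` and `last_modified` are whatever the server sent with the
    first download (hypothetical cache fields, for illustration only).
    """
    headers: Dict[str, str] = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def must_redownload(status: int) -> bool:
    """Interpret the server's answer to the update request."""
    if status == 304:   # "OK, file not modified": keep the cached copy
        return False
    if status == 200:   # "OOPS": a fresh body comes with the response
        return True
    # Other statuses (redirects, errors) are outside this sketch.
    raise ValueError(f"unexpected status for a conditional request: {status}")
```

If the server sent neither hint the first time, conditional_headers() returns an empty dict and the request is a plain download, which is exactly the "real world" case described next.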

With this system, the caching process is totally 
transparent, and very reliable. That's the theory. Now 
let's go back to the real world..

Some servers, unfortunately, are really dumb ; and just 
ignore the timestamp/etag ; or do not give any reliable 
information the first time. Because of that, (offline) 
browsers like httrack are forced to download a second time 
data that is identical to the previous version.. Even clever 
servers are sometimes unable to "handle cleverly" 
stupid scripts that just don't care about bandwidth waste 
and caching problems. 

Because of that, many websites (especially those 
with "dynamic" pages) are not "cache compliant", and 
browsers will always re-download their data.

But this is not something a browser can change - only 
servers could, if only webmasters were concerned about 
caching problems.

(for information, there are ALWAYS methods that make it 
possible to cache pages, even dynamic ones, and even those 
using cookies and other session-related data)
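
One such method, sketched here in Python as an assumption rather than as anything HTTrack or a particular server does: the script generates its page as usual, tags it with an md5-based strong etag, and answers 304 when the client already holds that exact content. The function names are hypothetical, and the If-None-Match handling is simplified to a single etag value.

```python
import hashlib
from typing import Dict, Optional, Tuple


def strong_etag(body: bytes) -> str:
    """Strong etag: the md5 hash of the content, in double quotes."""
    return '"' + hashlib.md5(body).hexdigest() + '"'


def respond(body: bytes,
            if_none_match: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for a freshly generated page.

    `if_none_match` is the etag the client sent back, if any.
    Simplification: a real header may carry several etags or "*".
    """
    etag = strong_etag(body)
    headers = {"ETag": etag}
    if if_none_match is not None and if_none_match.strip() == etag:
        # Same content as the client's copy: no body needs to be sent.
        return 304, headers, b""
    return 200, headers, body
```

The page is still generated on every request, so this saves bandwidth rather than server time; the point is that the content itself, not the way it was produced, decides whether the client's copy is still fresh.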

> Also, when updating a file, am I correct in assuming that
> HTTrack will overwrite the entire file lying on disk
> replacing it with the new data it downloaded?
Yes - if the file was modified (or "seen" as modified).
Note that html data is always rewritten (even data fetched 
from the cache), to match potentially changing options 
(such as link rewriting, filters..)

> Any explanation of these points would be appreciated;
> thanks, and thanks for a great program.

Thanks! :)
