> I'm curious as to what logic HTTrack uses to replace a file
> when it does an update. Does it do some sort of CRC/hash on
> a file, does it just check the date/timestamp and/or
> filesize, etc.?
See the previous post (10/08/2002), "filename":
(..) the major update process is handled by the remote
server, through two important processes:
- during the first download, the server has to send a
reliable way to tag the file/url, such as a timestamp
(current date+time) or, even better, a strong etag
identifier (which can be an md5 hash of the content, and
is the "ultimate weapon" for handling updates). This
information makes it possible to identify the "freshness"
of the data being sent.
- during the update, httrack requests the previously
downloaded file, giving the server the "hint" previously
sent (timestamp and/or etag). It is the duty of the server
to respond either with an "OK, file not modified" message
(304), or with an "OOPS, you have to redownload this file"
message (200).
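To make the two steps above more concrete, here is a minimal
sketch in Python of the client side. This is NOT httrack's actual
code; the function names and the "etag" / "last_modified" keys
are made up for the illustration. Only the header names
(If-None-Match, If-Modified-Since) and the status codes
(304, 200) are standard HTTP.

```python
def build_update_headers(cached):
    """Build the "hint" headers sent when re-requesting a URL.

    `cached` is a hypothetical dict holding whatever the server
    sent during the first download (either value may be missing).
    """
    headers = {}
    if cached.get("etag"):
        # Give back the etag the server sent the first time.
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        # Give back the timestamp the server sent the first time.
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


def must_redownload(status):
    """Interpret the server's answer to the update request."""
    if status == 304:
        return False  # "OK, file not modified": keep the cached copy
    if status == 200:
        return True   # "OOPS": the body of this response replaces the file
    raise ValueError("status not handled in this sketch: %d" % status)
```

If the server sent neither a timestamp nor an etag the first
time, build_update_headers() returns an empty dict, the request
is a plain one, and the server has no choice but to answer 200
with the full data.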
With this system, the caching process is totally
transparent, and very reliable. That's the theory. Now
let's go back to the real world..
Some servers, unfortunately, are really dumb: they just
ignore the timestamp/etag, or do not give any reliable
information the first time. Because of that, (offline)
browsers like httrack are forced to re-download data
that is identical to the previous version. Even clever
servers are sometimes unable to "handle cleverly"
stupid scripts that just don't care about bandwidth waste
and caching problems.
Because of that, many websites (especially those
with "dynamic" pages) are not "cache compliant", and
browsers will always re-download their data.
But this is not something a browser can change - only
servers could, if only webmasters were concerned about
caching problems.
(for information, there are ALWAYS methods that allow pages
to be cached, even dynamic ones, and even those using
cookies and other session-related data)
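As an illustration of one such method, here is a minimal sketch
in Python of what a script generating a dynamic page could do:
compute an etag from an md5 hash of the generated content, and
answer 304 when the client gives the same etag back. The
respond() function and its arguments are hypothetical, not taken
from any real server or framework, and the comparison is
simplified (exact match on a single etag only).

```python
import hashlib


def respond(body, request_headers):
    """Return (status, headers, data) for a generated page.

    `body` is the page content as bytes; `request_headers` is a
    dict of the headers sent by the client (hypothetical shape).
    """
    # Strong etag: md5 hash of the content, in double quotes.
    etag = '"' + hashlib.md5(body).hexdigest() + '"'
    if request_headers.get("If-None-Match") == etag:
        # Same content as before: no need to send the data again.
        return 304, {"ETag": etag}, b""
    # First download, or the content changed: send everything.
    return 200, {"ETag": etag}, body
```

The page is still generated on each request, but an unchanged
page costs only a few header bytes on the wire instead of a
full re-download.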
> Also, when updating a file, am I correct in assuming that
> HTTrack will overwrite the entire file lying on disk
> replacing it with the new data it downloaded?
Yes - if the file was modified (or "seen" as modified).
Note that html data is always rewritten (even data fetched
from the cache), to match potentially changing options
(such as link rewriting, filters..)
> Any explanation of these points would be appreciated;
> thanks, and thanks for a great program.
Thanks! :)