> I'm worried that these dynamic pages would be
> redownloaded the next time I update the mirror even though
> there's no changes in the page itself.
The way httrack saves/renames these pages *locally* does
not change the way httrack does updates, and has no
influence on the update process. The original remote
hostname, filename AND query string are stored in the
hts-cache/ data; httrack uses only this information
to perform the update.
In fact, most of the update process is handled by the
remote server, in two steps:
- during the first download, the server has to send a
reliable way to tag the file/url, such as a timestamp
(the date and time of last modification) or, even better,
a strong etag identifier (which can be an md5 hash of the
content, and is the "ultimate weapon" for handling updates).
This information identifies the "freshness" of the data
being sent.
- during the update, httrack requests the previously
downloaded file again, giving the server the "hint" it
previously sent (timestamp and/or etag). It is the duty of
the server to respond either with an "OK, file not modified"
message (304) or with an "OOPS, you have to redownload this
file" message (200).
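
To make the two steps above concrete, here is a minimal client-side sketch in Python. It is not httrack's actual code: the function names (conditional_headers, must_redownload) are made up for illustration. The only standard pieces are the HTTP header names If-None-Match and If-Modified-Since and the 304/200 status codes.

```python
from typing import Dict, Optional


def conditional_headers(etag: Optional[str],
                        last_modified: Optional[str]) -> Dict[str, str]:
    """Build the "hint" headers for an update request.

    etag and last_modified are the values the server sent with the
    first download (hypothetical cache fields, not httrack's format).
    """
    headers: Dict[str, str] = {}
    if etag:
        # Send back the etag exactly as received, quotes included.
        headers["If-None-Match"] = etag
    if last_modified:
        # Send back the timestamp exactly as received.
        headers["If-Modified-Since"] = last_modified
    return headers


def must_redownload(status: int) -> bool:
    """304 means the cached copy is still fresh.

    Simplification: any other status is treated as "fetch again".
    """
    return status != 304


# First download gave us these two validators (example values).
hints = conditional_headers('"5d41402abc4b"',
                            "Wed, 21 Oct 2015 07:28:00 GMT")
print(hints)
print(must_redownload(304), must_redownload(200))
```

If the server never sent a timestamp or an etag, the dictionary is empty and the request is an ordinary unconditional one, so the file is transferred in full again.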
With this system, the caching process is totally
transparent, and very reliable. That's the theory. Now
let's go back to the real world...
Some servers, unfortunately, are really dumb; they just
ignore the timestamp/etag, or do not give any reliable
information the first time. Because of that, (offline)
browsers like httrack are forced to re-download data
that is identical to the previous version. Even clever
servers are sometimes unable to "handle cleverly"
stupid scripts that just don't care about bandwidth waste
and caching problems.
Because of that, many websites (especially those
with "dynamic" pages) are not "cache compliant", and
browsers will always re-download their data.
But this is not something a browser can change - only
servers could, if only webmasters were concerned about
caching problems.
(For information, there are ALWAYS ways to make pages
cacheable, even dynamic ones, and even those using
cookies and other session-related data.)
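
As an illustration of one such method, here is a rough server-side sketch in Python. It assumes the script has already generated the page body; strong_etag and respond are hypothetical names, not part of any real server or framework. The idea is simply to hash the generated content with md5, send the hash as the etag, and answer 304 when the client sends the same etag back.

```python
import hashlib
from typing import Dict, Tuple


def strong_etag(body: bytes) -> str:
    """Derive an etag from the content itself (md5 of the body)."""
    return '"' + hashlib.md5(body).hexdigest() + '"'


def respond(body: bytes,
            request_headers: Dict[str, str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, response headers, payload) for a dynamic page.

    Simplification: only a single etag in If-None-Match is handled,
    not a comma-separated list or "*".
    """
    etag = strong_etag(body)
    if request_headers.get("If-None-Match") == etag:
        # Same content as last time: no need to send the body again.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body


page = b"<html>generated page</html>"

# First download: no hint from the client, full page is sent.
status1, headers1, _ = respond(page, {})

# Update: the client sends the etag back, the page has not changed.
status2, _, payload2 = respond(page, {"If-None-Match": headers1["ETag"]})

# Update after the page content changed: full page is sent again.
status3, _, _ = respond(b"<html>new content</html>",
                        {"If-None-Match": headers1["ETag"]})

print(status1, status2, status3)
```

The script still has to build the page to hash it, so this saves bandwidth rather than server CPU, but it makes the page "cache compliant" without any change on the browser side.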