> If I understand correctly, the 'update hack' update does
> not simply accept the 'Modified' status as the ultimate
> truth, but it compares the sizes of the previously saved
> request and the current request.
> It seems to me that the size comparison is done on the
> sizes as reported by the webserver, not the number of
> bytes actually downloaded.
Update checks are NEVER done using the 'number of bytes
actually downloaded', because the whole point is to NOT
download unnecessary data! The only size compared is the
size sent by the remote server in the headers, IF available.
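To make the logic concrete, here is a minimal sketch (a hypothetical helper, not HTTrack's actual code) of header-based update detection: the size recorded from the previous crawl is compared against the `Content-Length` the server reports now, and a missing size (shown as -1 by some tools) forces a retransfer.

```python
from typing import Optional

def needs_refetch(stored_size: int, reported_size: Optional[int]) -> bool:
    """Return True when the file must be downloaded again.

    reported_size is the Content-Length from the new response headers,
    or None / -1 when the server does not send one.
    """
    if reported_size is None or reported_size < 0:
        # The server gave no size: we cannot prove the file is
        # unchanged, so the engine has to retransfer it.
        return True
    return reported_size != stored_size

# A server that reports no size forces a refetch even if nothing changed:
assert needs_refetch(28431, -1) is True
# Matching sizes let the crawler skip the download:
assert needs_refetch(28431, 28431) is False
```

This is exactly the situation described above: with a reported size of -1, the 28431-byte file is retransferred every time.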
> The actual file size is 28431 bytes, the file size as
> responded by the web server is -1 (unknown).
Yes, and therefore the engine has no way to know whether the
file was updated or not.
> Thus: a retransfer and processing again of all links
> inside.
Exactly
> Effectively 90% of the site has to be downloaded and
> processed again. (Everything except static files)
Yep, nasty server! Using ETag would solve all these problems,
and would be easy to implement. But unfortunately, few
servers support it (stupid servers!)
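For reference, here is a hedged sketch of how ETag revalidation works in general (illustrative names, not HTTrack internals): the client stores the ETag from the first response, sends it back as `If-None-Match`, and the server answers `304 Not Modified` with an empty body when the content is unchanged, so there is nothing to retransfer or re-parse.

```python
def revalidate(request_headers: dict, current_etag: str, body: bytes):
    """Server-side view of conditional GET: answer 304 with no body
    when the client's cached ETag still matches, else 200 with the
    full content."""
    if request_headers.get("If-None-Match") == current_etag:
        return 304, b""          # unchanged: nothing to download
    return 200, body             # changed: full retransfer

# Unchanged resource: the matching ETag yields an empty 304 response.
status, data = revalidate({"If-None-Match": '"abc123"'}, '"abc123"', b"<html>...")
assert (status, data) == (304, b"")
```

A crawler only needs to repeat this exchange per URL; it never has to compare byte counts at all, which is why ETag support on the server side would sidestep the whole problem.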
> I know that would mean you would have to download and
> examine the new request anyway, but if the conclusion is
> that the actually downloaded content is equal to the
> previously downloaded content, it would eliminate the
> need to parse the document again.
Err, parsing the links is FAST, but downloading them is not.
If a page was NOT modified, it does ***NOT*** imply that the
links inside were not modified!! That would be too easy:
check the top html page, and if it is not modified, assume
the whole site is up to date?
> Would this mean a performance improvement?
It would not work, unfortunately :(