Hi,
If I understand correctly, the 'update hack' does not simply
accept the 'Modified' status as the ultimate truth; instead it
compares the size of the previously saved request with the
size of the current request.
It seems to me that the size comparison is done on the
sizes as reported by the webserver, not the number of bytes
actually downloaded.
I came to this conclusion because I updated a mirror of a
99% dynamic website and all *.cfm requests were downloaded
again.
A sample line from old.txt:
10:47:58 28431/-1 ---M-- 200 added ('OK') text/html date:Fri,%2026%20Jul%202002%2008:51:01%20GMT nike.ia.nob.nl/Technology/index.cfm?fuseaction=ShowSubmenuEquipment&lge_id=NO C:/My%20Web%20Sites/NikeHO2002/nike.ia.nob.nl/Technology/index3fc7.html (from nike.ia.nob.nl/Technology/index.cfm?lge_id=NO)
The actual file size is 28431 bytes; the file size as
reported by the web server is -1 (unknown).
In new.txt of the updated mirror I searched for the same URL
and found identical sizes: 28431/-1.
Still, the file is re-transferred and its links are processed
again.
Looking at the source in htsback.c, line 1872 and onwards, I
think the totalsize of the new request (-1 in my case) is
compared with the downloaded size of the old request (28431
in my case).
Conclusion: MODIFIED.
Thus: a retransfer, and all links inside are processed again.
Effectively 90% of the site has to be downloaded and
processed again (everything except static files).
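To illustrate what I think is happening, here is a simplified sketch
(made-up names and logic, not the actual htsback.c code): the
server-reported size of the new response is compared with the number
of bytes actually stored for the old response, and since a dynamic
page often sends no Content-Length, the new size is -1 and the
comparison always fails.

    #include <stdio.h>

    /* Suspected current logic (illustrative only): compare the
       server-reported total size of the new response with the number
       of bytes actually stored for the old response. */
    static int is_modified(long new_size_from_server, long old_downloaded_size)
    {
        /* -1 (unknown Content-Length) can never match 28431,
           so the page is always flagged as modified. */
        return new_size_from_server != old_downloaded_size;
    }

    int main(void)
    {
        /* My example: old.txt shows 28431/-1, the new response also reports -1 */
        printf("modified? %d\n", is_modified(-1L, 28431L)); /* prints 1 */
        return 0;
    }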
Would it be an idea to take as the 'compare size' for the
new request not the size as reported by the web server,
but the actual resulting file size?
I know that would mean you would have to download and
examine the new request anyway, but if the conclusion is
that the newly downloaded content is equal to the previously
downloaded content, it would eliminate the need to parse the
document again.
I think you could derive the list of URLs in the unmodified
document from the old cache. (This is probably what is done
for UNMODIFIED documents anyway.)
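A minimal sketch of what I mean, assuming the cache keeps both the
number of bytes stored last time and the link list that was extracted
back then (all names below are hypothetical, not the real HTTrack
structures):

    #include <stddef.h>

    typedef struct {
        long         saved_size;       /* bytes actually written to disk last run */
        const char **saved_links;      /* URLs parsed out of the cached copy      */
        size_t       saved_link_count;
    } cache_entry_t;

    /* After the new content has been fetched, decide whether the page
       must be parsed again.  Returns 1 when re-parsing (plus link
       rewriting and local saving) is needed, 0 when the link list from
       the old cache can be reused as-is. */
    int needs_reparse(const cache_entry_t *old_entry, long new_downloaded_size)
    {
        /* Compare the bytes actually downloaded now with the bytes
           actually stored last time, instead of trusting the
           server-reported size. */
        return new_downloaded_size != old_entry->saved_size;
    }

A size match alone could of course miss an edit that keeps the byte
count identical; a checksum of the downloaded data would be a safer
equality test, but the principle is the same: when the content is
judged unchanged, reuse the URL list from the old cache instead of
parsing, rewriting and saving the document again.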
Would this mean a performance improvement? So I think the
question is: is the main part of the mirroring time taken by
1) the downloads, or
2) the parsing, link replacements and local saving?
Following the approach I described above I think you could
eliminate step 2 for unmodified dynamic pages.
Any comments on this?
Remke