HTTrack Website Copier
Free software offline browser - FORUM
Subject: update hack and file size comparing
Author: Remke Rutgers
Date: 07/26/2002 13:35
 
Hi,

If I understand correctly, the 'update hack' does not simply 
accept the server's 'Modified' status as the ultimate truth, 
but also compares the size of the previously saved response 
with that of the current one.
It seems to me that this size comparison is done on the sizes 
reported by the web server, not on the number of bytes 
actually downloaded.
I came to this conclusion because I updated a mirror of a 
99% dynamic website and all *.cfm requests were downloaded 
again. 
A sample line from old.txt:
10:47:58	28431/-1	---M--	200	added ('OK')	text/html
	date:Fri,%2026%20Jul%202002%2008:51:01%20GMT
	nike.ia.nob.nl/Technology/index.cfm?fuseaction=ShowSubmenuEquipment&lge_id=NO
	C:/My%20Web%20Sites/NikeHO2002/nike.ia.nob.nl/Technology/index3fc7.html
	(from nike.ia.nob.nl/Technology/index.cfm?lge_id=NO)

The actual file size is 28431 bytes; the file size as 
reported by the web server is -1 (unknown).

In new.txt, after updating the mirror, I searched for the same 
URL and found identical sizes: 28431/-1.
Still, the file is re-transferred and its links are processed 
again.

Looking at the source in htsback.c, line 1872 and onwards, I 
think the totalsize of the new request (-1 in my case) is 
compared to the downloaded size of the old request (28431 in 
my case).
Conclusion: MODIFIED.
Thus: a re-transfer, and all links inside are processed again.

Effectively 90% of the site has to be downloaded and 
processed again (everything except static files).
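To make the suspected problem concrete, here is a minimal C sketch of the comparison I believe is happening (a paraphrase, not HTTrack's actual code; the function and parameter names are mine). The reported size of the new response is compared against the downloaded size stored in the old cache entry, so a dynamic page with no Content-Length (-1) always looks modified:

```c
#include <assert.h>

/* Hypothetical paraphrase of the check I read into htsback.c around
 * line 1872.  new_totalsize is the size the server reports for the new
 * request (-1 when no Content-Length is sent, typical for dynamic
 * pages); old_downloaded is the byte count actually saved last time. */
static int considered_modified(long new_totalsize, long old_downloaded) {
    /* -1 can never equal a real byte count, so every dynamic page
     * is flagged as modified and re-transferred. */
    return new_totalsize != old_downloaded;
}
```

With my sample entry this gives considered_modified(-1, 28431), which is always true.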

Would it be an idea to use, as the 'compare size' for the 
new request, not the size reported by the web server but the 
actual resulting file size?
I know that would mean you would have to download and 
examine the new request anyway, but if the conclusion is 
that the actually downloaded content is equal to the 
previously downloaded content, it would eliminate the need 
to parse the document again.
I think you could derive the list of URLs for the unmodified 
document from the old cache. (This is probably what is done 
for UNMODIFIED documents anyway.)
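The proposal above could be sketched like this (again an illustrative sketch with invented names, not HTTrack code): the page is downloaded anyway, but the decision to re-parse is based on comparing actual byte counts, falling back to the cached link list when they match:

```c
#include <assert.h>

typedef enum {
    ACTION_REPARSE,            /* scan the fresh document for links  */
    ACTION_REUSE_CACHED_LINKS  /* take the link list from old cache  */
} update_action;

/* Hypothetical decision routine for the proposal: both arguments are
 * *actual* downloaded byte counts, not server-reported sizes.  Equal
 * counts are taken as "content unchanged", so parsing and link
 * replacement can be skipped. */
static update_action decide(long new_downloaded, long old_downloaded) {
    if (new_downloaded == old_downloaded)
        return ACTION_REUSE_CACHED_LINKS;
    return ACTION_REPARSE;
}
```

Note the assumption: equal sizes are treated as equal content, which is the same heuristic the current size check already relies on, just applied to real byte counts.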

Would this mean a performance improvement? I think the 
question is: is the main part of the mirroring time taken by 
1) the downloads, or 
2) the parsing, link replacement, and local saving?
Following the approach I described above, I think you could 
eliminate step 2 for unmodified dynamic pages.

Any comments on this?
Remke

 