> It not only restores the deleted file fully, but also
> reloads other files unnecessarily.
> Error log shows %n option, and all 6 files have been
> written!
Yes - the first problem is that the server is not fully RFC
compliant:
$ telnet www.verbis.net 80
GET /httest/cats/cats1.html HTTP/1.0
Host: www.verbis.net
If-None-Match: "0-2e0-3d286503"
HTTP/1.1 200 OK
Date: Sun, 07 Jul 2002 19:24:16 GMT
ETag: "0-2e0-3d286503"
This means that I requested /httest/cats/cats1.html and
indicated that I already had the resource identified by the
ETag "0-2e0-3d286503". The server responded with a 200 ("I
have new data") code AND the very ETag I had just sent,
instead of responding with a 304 ("Okay, your data is
up-to-date") code!
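You can reproduce the test without telnet; here is a small
Python sketch (standard library only; the host, path and ETag
are simply the ones from the session above, and may of course
have changed since) that sends the same conditional GET and
reports whether the server honors If-None-Match:

import http.client

# Values taken from the telnet session above (may be stale by now)
HOST = "www.verbis.net"
PATH = "/httest/cats/cats1.html"
ETAG = '"0-2e0-3d286503"'

conn = http.client.HTTPConnection(HOST, 80, timeout=10)
# Conditional GET: "send the body only if the ETag has changed"
conn.request("GET", PATH, headers={"If-None-Match": ETAG})
resp = conn.getresponse()

if resp.status == 304:
    print("OK: 304 Not Modified - server honors If-None-Match")
elif resp.status == 200 and resp.getheader("ETag") == ETAG:
    print("Broken: 200 OK with the SAME ETag - full re-download")
else:
    print("Status:", resp.status, "ETag:", resp.getheader("ETag"))
conn.close()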
The server is not configured to honor conditional
(If-None-Match) update requests, and therefore updates will
ALWAYS result in a full re-download, unfortunately. Something
wicked must be lurking in its configuration files.
> Conclusion: Cache was somehow ignored,too..?
The cache problem is clearly a server-side bug.
The second problem, cats1.html being downloaded again even
though it was erased, is due to the difference between the
handling of data files and HTML files: the "do not redownload
locally erased files" rule only applies to non-HTML data,
because the engine has to rescan HTML data anyway to detect
the links inside it.
Besides, the engine cache (the hts-cache/ folder) provides
the "missing" data in the background, so erasing HTML files
will not bother the crawler. In fact the engine relies on the
cache alone for hypertext data, because the HTML files stored
on your disk are no longer usable for a rescan (links rebuilt
locally, names changed or mangled).
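To make the logic explicit - and this is only a toy model of
the behavior described above, not HTTrack's actual code - the
update decision boils down to something like:

import os

def needs_redownload(local_path, is_html, in_cache, remote_changed):
    # Toy model of the rule described above, NOT HTTrack source code.
    if is_html:
        # HTML is resolved against the hts-cache only: the local copy
        # has rebuilt links and mangled names, so an erased local HTML
        # file goes unnoticed and is restored from the cache.
        return not in_cache or remote_changed
    # Binary data (images, archives, ...) are checked on disk, so the
    # "do not redownload locally erased files" rule can apply.
    if not os.path.exists(local_path):
        return False
    return remote_changed

# The reported case: cats1.html erased locally, still in the cache,
# and the broken server pretending it changed (200 instead of 304):
print(needs_redownload("cats1.html", True, True, True))   # True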
To exclude HTML files and avoid any re-download, use filters.
In this case:
-*/httest/cats/cats1.html
This is necessary because, again, the engine does not even
"see" that the HTML file was erased: binary data (images,
archives..) are checked on the local filesystem, but HTML
data are always checked in the cache.
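For reference, a typical command line with that filter would
look like this (the URL is just the test site from the session
above; adjust it to your own project):

$ httrack http://www.verbis.net/httest/ -O ./httest-mirror "-*/httest/cats/cats1.html"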