> It not only restores the deleted file fully, but also
> reloads other files unnecessarily.
> Error log shows %n option, and all 6 files have been
> written!
Yes - the first problem is that the server is not fully RFC
compliant:
$ telnet www.verbis.net 80
GET /httest/cats/cats1.html HTTP/1.0
Host: www.verbis.net
If-None-Match: "0-2e0-3d286503"
HTTP/1.1 200 OK
Date: Sun, 07 Jul 2002 19:24:16 GMT
ETag: "0-2e0-3d286503"
This means that I requested /httest/cats/cats1.html and
indicated that I already had the resource identified by the
ETag "0-2e0-3d286503". The server responded with a 200 ("I
have new data") code AND the very ETag I had just sent,
instead of responding with a 304 ("Okay, your data is
up-to-date") code!
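You can reproduce the test without telnet; here is a small
Python sketch (standard library only; the host, path and ETag
are simply the ones from the session above, and may of course
have changed since) that sends the same conditional GET and
reports whether the server honors If-None-Match:

import http.client

# Values taken from the telnet session above (may be stale by now)
HOST = "www.verbis.net"
PATH = "/httest/cats/cats1.html"
ETAG = '"0-2e0-3d286503"'

conn = http.client.HTTPConnection(HOST, 80, timeout=10)
# Conditional GET: "send the body only if the ETag has changed"
conn.request("GET", PATH, headers={"If-None-Match": ETAG})
resp = conn.getresponse()

if resp.status == 304:
    print("OK: 304 Not Modified - server honors If-None-Match")
elif resp.status == 200 and resp.getheader("ETag") == ETAG:
    print("Broken: 200 OK with the SAME ETag - full re-download")
else:
    print("Status:", resp.status, "ETag:", resp.getheader("ETag"))
conn.close()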
The server is not configured to honor conditional
(If-None-Match) update requests, and therefore updates will
ALWAYS result in a full re-download, unfortunately. Something
wicked must be lurking in its configuration files.
> Conclusion: Cache was somehow ignored,too..?
The cache problem is clearly a server-side bug.
The second problem, cats1.html being downloaded again even
though it was erased, is due to the difference between the
handling of data files and HTML files: the "do not redownload
locally erased files" rule only applies to non-HTML data,
because the engine has to rescan HTML data anyway to detect
the links inside it.
Besides, the engine cache (the hts-cache/ folder) provides
the "missing" data in the background, so erasing HTML files
will not bother the crawler. In fact the engine relies on the
cache alone for hypertext data, because the HTML files stored
on your disk are no longer usable for a rescan (links rebuilt
locally, names changed or mangled).
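To make the logic explicit - and this is only a toy model of
the behavior described above, not HTTrack's actual code - the
update decision boils down to something like:

import os

def needs_redownload(local_path, is_html, in_cache, remote_changed):
    # Toy model of the rule described above, NOT HTTrack source code.
    if is_html:
        # HTML is resolved against the hts-cache only: the local copy
        # has rebuilt links and mangled names, so an erased local HTML
        # file goes unnoticed and is restored from the cache.
        return not in_cache or remote_changed
    # Binary data (images, archives, ...) are checked on disk, so the
    # "do not redownload locally erased files" rule can apply.
    if not os.path.exists(local_path):
        return False
    return remote_changed

# The reported case: cats1.html erased locally, still in the cache,
# and the broken server pretending it changed (200 instead of 304):
print(needs_redownload("cats1.html", True, True, True))   # True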
To exclude HTML files and avoid any re-download, use filters.
In this case:
-*/httest/cats/cats1.html
This is necessary because, again, the engine does not even
"see" that the HTML file was erased: binary data (images,
archives..) are checked on the local filesystem, but HTML
data are always checked in the cache.
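For reference, a typical command line with that filter would
look like this (the URL is just the test site from the session
above; adjust it to your own project):

$ httrack http://www.verbis.net/httest/ -O ./httest-mirror "-*/httest/cats/cats1.html"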