I'm running WinHTTrack on a Plone site, and out of 651
files retrieved, 14 are corrupted. These files all have a
few things in common:
What has been stored as the content of the file is
actually the gzip data that was sent by the server, except
that all nulls have been converted into spaces.
For each of these files there is a log entry such as:
12:49:30 Warning: File not parsed, looks like binary: zzzzzzzzzzzzzzzz/ogb/pressreleases/oxpressrelease.2004-11-15.3985788716
I masked out the first part of the URL (it's on an intranet
anyway), but the other thing that may be significant is
that the last part of the URL is quite long although the
total length is only about 80 characters. The filename
that gets used when the file is saved loses everything
from the last dot onwards and indeed this means that some
of the affected files get saved to the same file unless I
specify a user-defined structure.
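To make the name collision concrete, here is a minimal sketch that models the behaviour described above (dropping everything from the last dot onwards). It is not HTTrack's actual naming code; the function names are made up for illustration, and the second URL tail in the usage note is a hypothetical example, not one from the real site.

```python
from collections import defaultdict


def strip_after_last_dot(name: str) -> str:
    """Model the observed behaviour: drop the last dot and everything after it."""
    head, dot, _tail = name.rpartition(".")
    # If there is no dot at all, rpartition returns ("", "", name).
    return head if dot else name


def find_collisions(names):
    """Group original names by their truncated form and keep only the clashes."""
    groups = defaultdict(list)
    for name in names:
        groups[strip_after_last_dot(name)].append(name)
    return {saved: originals for saved, originals in groups.items() if len(originals) > 1}
```

For instance, `oxpressrelease.2004-11-15.3985788716` and a hypothetical `oxpressrelease.2004-11-15.1234567890` would both be saved as `oxpressrelease.2004-11-15`, so `find_collisions` would report them as one clashing group.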
It looks to me as though for some reason the fact that the
file is gzipped is being missed. I can see from the source
that once the 'looks like binary' message is output the
nulls will be replaced by spaces, so that explains where
the garbage data comes from, but I can't see in the code
why it would fail to uncompress a gzipped file.
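One way to list the affected files in a mirror is to look for the gzip magic bytes (0x1f 0x8b) at the start of each saved file, since replacing nulls with spaces leaves those two bytes untouched. The following is only a rough sketch under that assumption; the function names and the mirror path are placeholders, and any file that is legitimately a .gz download would be flagged too.

```python
from pathlib import Path

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_raw_gzip(data: bytes) -> bool:
    """Return True if the data begins with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def find_suspect_files(mirror_root):
    """Walk the mirror directory and collect files that still start with gzip data."""
    suspects = []
    for path in sorted(Path(mirror_root).rglob("*")):
        if path.is_file():
            with path.open("rb") as fh:
                if looks_like_raw_gzip(fh.read(2)):
                    suspects.append(path)
    return suspects
```

Calling `find_suspect_files("path/to/mirror")` returns the paths that appear to hold undecoded gzip data, which can then be compared against the 'looks like binary' warnings in the log.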
It always seems to be the same files that go wrong even if
I delete the mirrored directory and start again.
The entries in hts-ioinfo.txt look the same for these
files as for other files which have similar names but are
received and unzipped correctly.
Any suggestions where I can look to try to track down this
problem?