Corrupted gzip files - HTTrack Website Copier Forum

Subject: Corrupted gzip files

Author: Duncan Booth

Date: 12/03/2004 14:50

I'm running WinHTTrack on a Plone site, and 14 of the 651 
files retrieved are corrupted. These files all have a 
few things in common:

What has been stored as the content of the file is 
actually the gzip data that was sent by the server, except 
that all nulls have been converted into spaces.
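
A rough way to list the affected files in a mirror is to look for the gzip signature at the start of each saved file. The sketch below assumes the two signature bytes (0x1f 0x8b) survive, which seems likely since neither is a null. The names `looks_like_gzip` and `find_suspect_files` are made up for illustration and are not part of HTTrack.

```python
import gzip
import os

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_like_gzip(data: bytes) -> bool:
    """Return True if the data starts with the gzip signature."""
    return data.startswith(GZIP_MAGIC)


def find_suspect_files(root: str) -> list[str]:
    """Walk a mirror directory and list files that begin with the gzip signature."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                if looks_like_gzip(fh.read(2)):
                    suspects.append(path)
    return sorted(suspects)


# Simulate the damage described above: compress a page, then turn nulls into spaces.
original = gzip.compress(b"<html><body>press release</body></html>")
mangled = original.replace(b"\x00", b" ")
```

Here `mangled` still starts with the signature but no longer matches `original`, because the gzip header and trailer contain null bytes that have been overwritten. Genuine .gz downloads in the mirror would also be listed, so the output is only a starting point.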

For each of these files there is a log entry such as:

12:49:30	Warning: 	File not parsed, looks like binary: zzzzzzzzzzzzzzzz/ogb/pressreleases/oxpressrelease.2004-11-15.3985788716

I masked out the first part of the URL (it's on an intranet 
anyway). The other thing that may be significant is that 
the last part of the URL is quite long, although the total 
length is only about 80 characters. The filename used when 
the file is saved loses everything from the last dot 
onwards, which means that some of the affected files get 
saved to the same file unless I specify a user-defined 
structure.
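
The filename collision can be illustrated with a small sketch. It only models the behaviour observed here (everything from the last dot onwards is dropped) and is not taken from HTTrack's naming code. The function names and the second URL leaf are hypothetical.

```python
from collections import defaultdict


def truncate_at_last_dot(name: str) -> str:
    """Drop the last dot and everything after it, mimicking the observed filenames."""
    head, sep, _tail = name.rpartition(".")
    return head if sep else name


def find_collisions(names: list[str]) -> dict[str, list[str]]:
    """Group URL leaf names by truncated filename; keep groups with more than one member."""
    groups = defaultdict(list)
    for name in names:
        groups[truncate_at_last_dot(name)].append(name)
    return {saved: urls for saved, urls in groups.items() if len(urls) > 1}


leaves = [
    "oxpressrelease.2004-11-15.3985788716",  # from the log entry above
    "oxpressrelease.2004-11-15.1234567890",  # hypothetical sibling, invented for the example
]
collisions = find_collisions(leaves)
```

Both leaves reduce to `oxpressrelease.2004-11-15`, so under this model they would overwrite each other on disk.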

It looks to me as though, for some reason, the fact that 
the file is gzipped is being missed. I can see from the 
source that once the 'looks like binary' message is output, 
the nulls will be replaced by spaces, which explains where 
the garbage data comes from, but I can't see in the code 
why it would fail to uncompress a gzipped file.
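
One check that might narrow this down is to compare what the server declares in its Content-Encoding header with what the body actually contains, for one good URL and one bad one. This is a minimal sketch using Python's standard library rather than HTTrack itself. `classify_response` and `fetch_and_classify` are illustrative names, and the assumption is that the server only compresses when the client sends Accept-Encoding: gzip.

```python
import urllib.request

GZIP_MAGIC = b"\x1f\x8b"  # gzip signature bytes


def classify_response(content_encoding, body: bytes) -> str:
    """Compare the declared Content-Encoding with what the body actually looks like."""
    declared = (content_encoding or "").strip().lower() in ("gzip", "x-gzip")
    actual = body.startswith(GZIP_MAGIC)
    if declared and actual:
        return "gzip declared and sent"
    if actual:
        return "gzip sent but not declared"
    if declared:
        return "gzip declared but body is not gzip"
    return "not gzip"


def fetch_and_classify(url: str) -> str:
    """Request a URL with gzip allowed and classify the raw response."""
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(req) as resp:
        # urllib does not decompress for us, so read() returns the bytes as sent.
        return classify_response(resp.headers.get("Content-Encoding"), resp.read())
```

If a failing URL reports "gzip sent but not declared" while a working one reports "gzip declared and sent", the server's headers would be the place to look. If both report the same thing, the difference is more likely on the client side.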

It always seems to be the same files that go wrong even if 
I delete the mirrored directory and start again.

The entries in hts-ioinfo.txt look the same for these 
files as for other files which have similar names but are 
received and unzipped correctly.

Any suggestions on where I can look to track down this 
problem?
