| I have the same problem with WinHTTrack 3.42-2. It runs on a somewhat aged
system, with a single-core CPU (AMD Athlon XP 2000+ at 1.5 GHz) and 2 GB RAM. So
excessive CPU speed and parallel multi-core execution shouldn't be the problem
here :-). Actually, the machine is old enough to still run Windows 2000.
The error appears to be somewhat random. Every once in a while, an HTML page
will appear as binary "garbage". The error log also mentions that HTTrack
expected HTML but received binary instead. If I re-run the same project,
the broken page may download correctly this time, but other pages may get
broken instead. This makes the bug hard to track down: one has to spider a
large web site to get a chance to trigger the problem at some random page. A
1.5 GiB web site (about 29,000 HTML pages and 29,000 JPEGs, with a longest
click path of maybe 4 clicks to reach every page from the start page) may
contain a few broken pages on the first run, but may eventually get fully
downloaded with a few more mirroring attempts. Larger web sites with long
click paths become impossible to download. Each one of the long click paths
breaks pretty early at a binary page. The mirroring may stop after a few
hours, with only a fraction of the web site downloaded.
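To find the broken pages after a run without opening every file, one can scan the mirror directory. The following is a minimal sketch, assuming the "garbage" is gzip data that was saved without being decompressed (the log only says "binary", so this is a guess). It relies on the fact that every gzip stream starts with the two bytes 0x1f 0x8b. The function name and the directory name in the usage note are hypothetical.

```python
import os

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def find_gzip_pages(mirror_dir, extensions=(".html", ".htm")):
    """Return the paths of HTML files under mirror_dir that look like raw gzip data."""
    suspects = []
    for root, _dirs, files in os.walk(mirror_dir):
        for name in files:
            # Only check files that are supposed to be HTML pages.
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(root, name)
            with open(path, "rb") as fh:
                if fh.read(2) == GZIP_MAGIC:
                    suspects.append(path)
    return sorted(suspects)
```

Calling `find_gzip_pages("my_mirror")` on the project folder returns a list of suspect files. An empty list does not prove the mirror is complete, since pages broken in some other way would not be detected.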
Anyway, enough of my complaints, the title mentioned a WORKAROUND, didn't it?
:-)
As mentioned above, the problem occurs somewhat randomly, so I don't know if
the workaround really works 100% reliably, but it has appeared to do so for me
over the last 36 hours.
I guessed the binary data might come from an error in the parsing of received
HTTP header data. No, not the MIME type; this won't turn text into binary! So
what I did was to launch Proxomitron (I think I use the German-language
version 4.51-P-2.0.6, available at
<http://www.buerschgens.de/Prox/Seiten/Download/index.html>; I don't know how
much it differs from the English version at <http://proxomitron.info/> -
probably the included filter sets are different). I configured HTTrack to use
Proxomitron as a web proxy. Then, in the Proxomitron configuration, I
UNchecked everything EXCEPT "filter outgoing headers". Then I clicked the
button to edit the header filters. I UNchecked "Accept encoding: Allow
webpage encoding (out)" and checked the "Out" checkbox on "Accept encoding
Prevent webpage encoding (Out)". Then I hit OK, saved the configuration as the
default configuration (otherwise, it will be lost when Proxomitron is closed)
and started the mirroring with WinHTTrack. As mentioned above, it appears to
work for me now.
Assumption: if my workaround indeed fixes the problem, HTTrack appears to have
a random problem when receiving encoded (compressed) data. I think GZIP
compression was used by at least one of the sites with which I had problems.
Proxomitron helps to solve the problem by altering the outgoing request to
allow for UNencoded data only.
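For anyone who wants to reproduce the effect with a different proxy, the idea can be sketched in a few lines. This is only an illustration of the kind of change a header filter makes to the outgoing request, not Proxomitron's actual filter: it is an assumption that replacing the header with `Accept-Encoding: identity` has the same effect as the "Prevent webpage encoding" filter, which might simply remove the header instead. The function name is hypothetical.

```python
def force_identity_encoding(headers):
    """Return a copy of the request headers that asks for unencoded responses only.

    headers is a list of (name, value) pairs, as a filtering proxy might hold them.
    """
    # HTTP header names are case-insensitive, so compare in lower case.
    filtered = [(name, value) for name, value in headers
                if name.lower() != "accept-encoding"]
    # "identity" means "no content encoding", i.e. no gzip or deflate.
    filtered.append(("Accept-Encoding", "identity"))
    return filtered
```

A request that originally carried `Accept-Encoding: gzip, deflate` leaves the filter carrying `Accept-Encoding: identity`, so a well-behaved server should answer with plain, uncompressed HTML.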
Best regards, Klaus | |