HTTrack Website Copier
Free software offline browser - FORUM
Subject: Weird translation of German umlauts etc
Author: Andy
Date: 09/15/2008 19:05
 
Hi,
I have used HTTrack V3.43-RC1(beta) on Win XP to copy a web site to a
directory on my local NTFS drive.

Every character in the file names whose Unicode code point is > U+0080 (at
least every one I have checked) is translated into two characters when the
files are created. The names stay correct, though, in the log file new.lst.
Some of the internal links to these files with the two characters work ok;
they use the correct %-escaping for the UTF-8 encodings of these two
characters. Others don't.

Before we go into the issue with those links that don't work, I'd like to
understand why one character is sometimes translated into two characters.

Here are the translations I observed (on the names of any offline copy files
that are created):

--correct------   --incorrect---------
char      UTF-8   chars URL-escaped
--------------------------------------
ä U+00E4  C3 A4   Ã¤    %C3%83%C2%A4
ö U+00F6  C3 B6   Ã¶    %C3%83%C2%B6
ü U+00FC  C3 BC   Ã¼    %C3%83%C2%BC
Ä U+00C4  C3 84   Ã„    %C3%83%E2%80%9E

The "correct" columns show the correct character, its Unicode codepoint and
its UTF-8 encoding.  The "incorrect" columns show the two characters that are
created for the one character, and the %-escaping generated for these
characters in some internal links.

It seems the two incorrect characters represent the two bytes of the UTF-8
encoding of the correct character. This can be seen when looking at the code
points and UTF-8 encodings of the characters that are created:

char      UTF-8
--------------------
Ã U+00C3  C3 83
¤ U+00A4  C2 A4
¶ U+00B6  C2 B6
¼ U+00BC  C2 BC
„ U+201E  E2 80 9E

So for (correct) char "ä" (U+00E4), the first byte of its UTF-8 encoding (C3)
is interpreted as U+00C3 and then represented as its character "Ã". The
second byte (A4) is interpreted as U+00A4 and then represented as its
character "¤".
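
The byte-by-byte reading described above can be reproduced in a few lines of
Python. This is only a sketch of the suspected mechanism, not HTTrack's actual
code; the helper name bytes_as_code_points is made up for illustration.

```python
from urllib.parse import quote

def bytes_as_code_points(ch: str) -> str:
    """Encode ch as UTF-8, then read each byte as the code point of the same value."""
    return "".join(chr(b) for b in ch.encode("utf-8"))

for ch in "äöü":
    wrong = bytes_as_code_points(ch)
    utf8_hex = ch.encode("utf-8").hex(" ").upper()
    # quote() %-escapes the UTF-8 encoding of the two wrong characters
    print(ch, utf8_hex, wrong, quote(wrong))
```

For these three characters the printed %-escapes should be the ones listed in
the "incorrect" columns of the first table.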

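Note that this explanation does not match the observed behavior for the last
row in the first table: for "Ä" the second byte (84) would be interpreted as
U+0084, but the character actually created is "„" (U+201E).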
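
One possible explanation for the last row, offered as an assumption and not
confirmed for HTTrack: the bytes may be read as Windows-1252 (the usual ANSI
code page on a Western European Windows XP) instead of as raw code points. In
Windows-1252 the byte 84 is "„" (U+201E), while C3, A4, B6 and BC map to the
same characters as the code points of the same value. The sketch below checks
all four rows with Python's cp1252 codec; misread is a hypothetical helper
name.

```python
from urllib.parse import quote

def misread(name: str) -> str:
    """Encode as UTF-8, then decode the bytes as Windows-1252 (assumed code page)."""
    return name.encode("utf-8").decode("cp1252")

# %-escapes observed in the internal links (first table)
observed = {
    "ä": "%C3%83%C2%A4",
    "ö": "%C3%83%C2%B6",
    "ü": "%C3%83%C2%BC",
    "Ä": "%C3%83%E2%80%9E",
}

for ch, escaped in observed.items():
    wrong = misread(ch)
    # True means the Windows-1252 reading reproduces the observed escape
    print(ch, "->", wrong, quote(wrong), quote(wrong) == escaped)
```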

Again, the file names in the log files new.lst and new.txt are correct (they
use the one correct character). The file names in new.zip are also correct,
but use the %-escaped UTF-8 encodings.

It is only in the local file system that I see the incorrect two characters,
and in any internal links to them, by means of the %-escaped URLs.

Again, my local file system is NTFS and I can successfully have the correct
characters in the file names if I edit the file names manually.
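
If the names really were produced by reading UTF-8 bytes as Windows-1252 (an
assumption, not something confirmed for HTTrack), the translation is
reversible for the characters in the tables above. A minimal Python sketch
for a single name follows; repair_name is a hypothetical helper, and renaming
files this way would leave the %-escaped internal links pointing at the old
names.

```python
def repair_name(name: str) -> str:
    """Undo a UTF-8-read-as-Windows-1252 mix-up; return name unchanged if it does not apply."""
    try:
        # Map the wrong characters back to bytes, then decode those bytes as UTF-8
        return name.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Name was not damaged this way (e.g. it already contains a correct "ä")
        return name

print(repair_name("M\u00c3\u00a4rz.html"))   # damaged form of a name containing "ä"
print(repair_name("index.html"))             # plain ASCII stays as it is
```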

Any explanation as to what happens would be helpful.

If you need more information, let me know.

Andy
 