Hi,
I have used HTTrack V3.43-RC1 (beta) on Win XP to copy a web site to a
directory on my local NTFS drive.
Every character in the file names with a Unicode code point above U+0080 that
I have checked is translated into two characters when the files are created.
The names stay correct, though, in the log file new.lst. Some of the internal
links to these files with the two characters work OK: they use the correct
%-escaping for the UTF-8 encodings of these two characters. Others don't.
Before we go into the issue with those links that don't work, I'd like to
understand why one character is sometimes translated into two characters.
Here are the translations I observed (on the names of any offline copy files
that are created):
----correct-----------   ----incorrect-----------
char  codepoint  UTF-8   chars  URL-escaped
-------------------------------------------------
ä     U+00E4     C3 A4   Ã¤     %C3%83%C2%A4
ö     U+00F6     C3 B6   Ã¶     %C3%83%C2%B6
ü     U+00FC     C3 BC   Ã¼     %C3%83%C2%BC
Ä     U+00C4     C3 84   Ã„     %C3%83%E2%80%9E
The "correct" columns show the correct character, its Unicode codepoint and
its UTF-8 encoding. The "incorrect" columns show the two characters that are
created for the one character, and the %-escaping generated for these
characters in some internal links.
It seems the two incorrect characters represent the two bytes of the UTF-8
encoding of the correct character. This can be seen by looking at the code
points and UTF-8 encodings of the characters that are created:
char  codepoint  UTF-8
-------------------------
Ã     U+00C3     C3 83
¤     U+00A4     C2 A4
¶     U+00B6     C2 B6
¼     U+00BC     C2 BC
„     U+201E     E2 80 9E
So for (correct) char "ä" (U+00E4), the first byte of its UTF-8 encoding (C3)
is interpreted as U+00C3 and then represented as its character "Ã". The
second byte (A4) is interpreted as U+00A4 and then represented as its
character "¤".
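Note that this explanation does not match the observed behavior for the last
row in the first table: the second byte of "Ä" (84) would be interpreted as
U+0084, but the character actually created is "„" (U+201E).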
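
One hypothesis that would cover the last row as well: the UTF-8 bytes are
decoded with the Windows-1252 code page instead of being mapped byte-for-byte
to code points. This is only an assumption; nothing here confirms that HTTrack
actually does this. In Windows-1252, byte 84 is "„" (U+201E), while C3, A4, B6
and BC coincide with U+00C3, U+00A4, U+00B6 and U+00BC. A minimal Python sketch
to check the hypothesis against the first table (the function names are made up
for illustration):

```python
from urllib.parse import quote


def mojibake(ch: str) -> str:
    """Encode ch as UTF-8, then decode those bytes as Windows-1252.

    Assumption: cp1252 is the code page involved. This is a guess,
    not something taken from HTTrack itself.
    """
    return ch.encode("utf-8").decode("cp1252")


def escaped(ch: str) -> str:
    """%-escape the UTF-8 encoding of the mis-decoded characters."""
    return quote(mojibake(ch))


def rows(chars: str = "äöüÄ"):
    """Rebuild the columns of the first table for each character."""
    result = []
    for ch in chars:
        # Hex bytes of the correct UTF-8 encoding, e.g. "C3 A4"
        utf8 = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
        result.append((ch, f"U+{ord(ch):04X}", utf8, mojibake(ch), escaped(ch)))
    return result
```

Under this assumption, escaped() returns the four URL-escaped strings from the
first table for ä, ö, ü and Ä, including %C3%83%E2%80%9E for the last row.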
Again, the file names in the log files new.lst and new.txt are correct (they
use the one correct character). The file names in new.zip are also correct, but
use the %-escaped UTF-8 encodings.
Only the local file system shows the incorrect two characters, as do any
internal links to those files, through their %-escaped URLs.
Again, my local file system is NTFS and I can successfully have the correct
characters in the file names if I edit the file names manually.
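
For reference, the manual correction of a name can be sketched in Python as
the reverse mapping. It assumes the two characters came from UTF-8 bytes read
as Windows-1252, which is a guess on my part, and fix_name is a made-up helper
that only transforms a string; it does not touch files on disk or the internal
links.

```python
def fix_name(name: str) -> str:
    """Undo 'UTF-8 bytes read as Windows-1252' in a file name string.

    Assumption: cp1252 is the code page involved. Names that do not
    fit this pattern are returned unchanged.
    """
    try:
        # Map the characters back to the original bytes, then decode as UTF-8
        return name.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # Not encodable in cp1252, or the bytes are not valid UTF-8
        return name
```

A name that is already correct, such as "ä.html", fails the UTF-8 decoding step
and is therefore left as it is.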
Any explanation of what is happening would be helpful.
If you need more information, let me know.
Andy