> Here are the translations I observed (on the names
> of any offline copy files that are created):
> --correct------ --incorrect---------
> char UTF-8 char URL-escaped
> --------------------------------------
> ä U+00E4 C3 A4 ä %C3%83%C2%A4
> ö U+00F6 C3 B6 ö %C3%83%C2%B6
> ü U+00FC C3 BC ü %C3%83%C2%BC
> Ä U+00C4 C3 84 Ã %C3%83%E2%80%9E
Hmm, Unicode filenames are something that I will have to work on after the
3.43 release -- but this will require some coding.
To be clear: Unicode handling is simply missing. Any attempt to save files
whose names contain accented characters will break more or less randomly,
depending on how the characters were escaped (html entities, url-encoded
characters, or raw characters).
The current handling is definitely "to be fixed".
The big plan:
-------------
1. The charset of each html page has to be detected: first by checking the
"charset=" attribute of the HTTP Content-Type header, which takes priority;
then by looking at the meta tags within the html file; and, if nothing was
found, by attempting to autodetect (?) it. This charset will be used to
initialize the platform-dependent decoder (Windows'
MultiByteToWideChar()/WideCharToMultiByte() functions, or the Un*x iconv()
functions).
2. URL conversion is to be done in utf-8, by converting non-7-bit characters
to their utf-8 encoded unicode equivalents according to the page charset.
Escaped (%XX) characters have to be treated as regular charset characters
(not direct unicode characters), and html entities (&eacute; for example)
have to be converted directly to unicode.
3. All file I/O calls (fopen, unlink, mkdir, ..) have to be replaced by
utf-8 aware wrappers that dispatch to the unicode versions on WIN32
(_wfopen, _wunlink, _wmkdir, ..), and to the regular versions on Linux
(ie. we assume utf-8 encoding on the filesystem; this will work even with
an iso-8859-1 or "C" locale, as the filenames themselves are opaque 8-bit
strings).
4. Do some testing with many cases (multiple charsets, including misdeclared
ones, ..).