> Here are the translations I observed (on the names
> of any offline copy files that are created):
> --correct------ --incorrect---------
> char UTF-8 char URL-escaped
> --------------------------------------
> ä U+00E4 C3 A4 ä %C3%83%C2%A4
> ö U+00F6 C3 B6 ö %C3%83%C2%B6
> ü U+00FC C3 BC ü %C3%83%C2%BC
> Ä U+00C4 C3 84 Ã %C3%83%E2%80%9E
Hmm, Unicode filenames are something that I will have to work on after the
3.43 release -- but this will require some coding.
To be clear: Unicode handling is simply missing. Any attempt to save files
whose names contain accented characters will break more or less randomly,
depending on how the characters were escaped (html entities, url-encoded
characters, or raw characters).
The current handling is definitely "to be fixed".
The big plan:
-------------
1. The charset of each html page has to be detected: first by checking the
"charset=" attribute of the HTTP Content-Type header, which takes priority;
then by looking at the meta tags within the html file; and, if nothing was
found, by attempting to autodetect (?) it. This charset will be used to
initialize the platform-dependent decoder (Windows'
MultiByteToWideChar()/WideCharToMultiByte() functions, or the Un*x iconv()
functions).
2. URL conversion is to be done in utf-8, by converting non-7-bit characters
to their utf-8 encoded unicode equivalents according to the page charset.
Escaped (%XX) characters have to be treated as regular charset characters
(not direct unicode characters), and html entities (&eacute; for example)
have to be converted directly to unicode.
3. All file I/O calls (fopen, unlink, mkdir, ..) have to be replaced by
utf-8 aware wrappers that dispatch to the unicode versions on WIN32
(_wfopen, _wunlink, _wmkdir, ..), and to the regular versions on Linux
(ie. we assume utf-8 encoding on the filesystem; this will work even with
an iso-8859-1 or "C" locale, as the filenames themselves are opaque 8-bit
strings).
4. Do some testing with many cases (multiple charsets, including misdeclared
ones, ..).