HTTrack Website Copier
Free software offline browser - FORUM

Subject: Re: Weird translation of german umlauts etc
Author: Xavier Roche
Date: 09/15/2008 20:41
 
> Here are the translations I observed (on the names
> of any offline copy files that are created):
> --correct------   --incorrect---------
> char      UTF-8   char URL-escaped
> --------------------------------------
> ä U+00E4  C3 A4   Ã¤  %C3%83%C2%A4
> ö U+00F6  C3 B6   Ã¶  %C3%83%C2%B6
> ü U+00FC  C3 BC   Ã¼  %C3%83%C2%BC
> Ä U+00C4  C3 84   Ã„  %C3%83%E2%80%9E
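
For what it's worth, the "incorrect" column looks like UTF-8 bytes that were
read back as Windows-1252 and then encoded to UTF-8 a second time. A quick
Python sketch (this is not HTTrack code; the function name is made up for the
illustration) that reproduces the escaped values from the table:

```python
from urllib.parse import quote

def double_encoded(ch):
    """Reproduce the suspected double encoding (an assumption, not HTTrack code).

    The character is encoded to UTF-8, the bytes are misread as Windows-1252,
    and the resulting text is URL-escaped (quote() encodes to UTF-8 again).
    """
    misread = ch.encode("utf-8").decode("cp1252")
    return quote(misread)

# Map each character from the table to its double-encoded, URL-escaped form.
table = {ch: double_encoded(ch) for ch in "äöüÄ"}
```

Here double_encoded("ä") gives %C3%83%C2%A4 and double_encoded("Ä") gives
%C3%83%E2%80%9E. The last one only matches with Windows-1252 (where byte 0x84
is the „ quote), not with plain iso-8859-1.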

Hmm, Unicode filenames are something that I will have to work on after the
3.43 release, but this will require some coding.

To be clear: Unicode handling is simply missing. Any attempt to save files with
accented names will break unpredictably, depending on how the characters were
escaped (html entities, url-encoded chars, or direct chars).

The current handling is definitely "to be fixed".

The big plan:
-------------

1. The charset of html pages has to be detected, first by checking the
"charset=" attribute of the HTTP headers, which takes priority, then by looking
up the meta-data within the html file, and if nothing was found, by attempting
to autodetect (?) it. This charset will be used to initialize the
platform-dependent decoder (Windows's MultiByteToWideChar() /
WideCharToMultiByte() functions, or Un*x iconv() functions).

2. URL conversion is to be done in utf-8, by converting regular non-7-bit
characters to their utf-8 encoded unicode equivalents according to the charset.
Escaped characters (%xx) have to be treated as regular charset characters (not
direct unicode chars), and html entities (&eacute; for example) have to be
converted directly to unicode.

3. All file I/O calls (fopen, unlink, mkdir..) have to be replaced by utf-8
aware functions that will dispatch to the unicode versions on WIN32 (_wfopen,
_wunlink, _wmkdir ..), and to the regular versions on Linux (i.e. we assume
utf-8 encoding on the filesystem; this will work even with an iso-8859-1 or "C"
locale, as the filenames themselves will be opaque 8-bit filenames).

4. Do some testing, with many cases (multiple charsets, including misdeclared
ones, etc.)
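
To make steps 1 and 2 a bit more concrete, here is a rough sketch in Python
(only to show the logic; the real thing would be C on top of iconv or the
Windows functions). The names detect_charset and link_to_utf8 and the
iso-8859-1 fallback are assumptions made for this example, the fallback merely
stands in for real autodetection, and the page charset is assumed to be
ASCII-compatible:

```python
import re
from html import unescape
from urllib.parse import unquote

def detect_charset(content_type, body, fallback="iso-8859-1"):
    """Step 1: HTTP header first, then <meta>, then a fallback.

    `fallback` is a placeholder for autodetection, which is not sketched here.
    """
    m = re.search(r'charset\s*=\s*["\']?([\w.:-]+)', content_type or "", re.I)
    if m:
        return m.group(1).lower()
    # Only scan the beginning of the page for a <meta ... charset=...> tag.
    m = re.search(rb'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', body[:4096], re.I)
    if m:
        return m.group(1).decode("ascii").lower()
    return fallback

def link_to_utf8(raw_link, charset):
    """Step 2: convert a link, as found in the page, to UTF-8 bytes."""
    # Raw non-7-bit bytes are characters of the page charset.
    text = raw_link.decode(charset, errors="replace")
    # HTML entities map directly to Unicode code points.
    text = unescape(text)
    # %XX escapes are bytes in the page charset, not Unicode code points.
    text = unquote(text, encoding=charset, errors="replace")
    return text.encode("utf-8")
```

With an iso-8859-1 page, the three spellings b"caf\xe9.html", b"caf%E9.html"
and b"caf&eacute;.html" all end up as the same utf-8 name. A misdeclared
charset (step 4) gives a replacement character here instead of a crash; what
the real code should do in that case is still open.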


 