Re: Word database (RC13) - HTTrack Website Copier Forum

Subject: Re: Word database (RC13)

Author: Adem

Date: 08/28/2003 08:38

Oops, I should have guessed: index.txt is only 
created *after* the mirroring is *finished*; it 
is not created in the paused state...

And, I have 3 questions now :-)

1) Having looked at index.txt, I see that it 
is not Unicode. In fact, all the characters are 
ISO-8859-1. 

ISO-8859-1 might be useful for a search engine 
(I am not so sure though), but it definitely 
cannot be used for linguistic analysis. 
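
To illustrate the limitation, here is a minimal Python 
sketch. The sample words are made up for the example 
and are not taken from a real index.txt; it only checks 
which of them can be represented in ISO-8859-1 at all.

```python
# Illustrative sample only: these words are assumptions,
# not content from an actual HTTrack index.txt.
words = ["şeker", "ağaç", "café"]

def fits_latin1(word: str) -> bool:
    """Return True if every character of `word` exists in ISO-8859-1."""
    try:
        word.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False

# Map each word to whether ISO-8859-1 can store it without loss.
results = {w: fits_latin1(w) for w in words}
print(results)
```

Any word that comes back False would have to be altered 
or dropped in an ISO-8859-1 file.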

Is this a bug, or is it a known design feature?

2) Is there a command-line setting to apply 
'Word database' to other previously mirrored 
sites? 

Some of them were mirrored with HTTrack and 
some with other tools (before HTTrack).

3) This definitely seems to be a bug: the 'Word database'
option only works with HTML files that contain only
characters in the ASCII charset. It does not seem to work
with ISO-8859-X (where X is greater than 1). When X
is greater than 1, HTTrack splits a word wherever
it sees a non-ASCII character.
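
The symptom can be reproduced with a small Python sketch. 
This is not HTTrack's code; it only assumes that the 
indexer treats ASCII letters as the sole word characters, 
which is my guess at the cause. The sample text is made up.

```python
import re

# Bytes as they would appear in an ISO-8859-9 (Turkish) page.
# The sample text is an assumption made for this example.
raw = "güneş doğdu".encode("iso-8859-9")

# Assumed behaviour: only ASCII letters count as word characters,
# so every byte above 0x7F acts as a word separator.
ascii_only = re.findall(rb"[A-Za-z]+", raw)

# Decoding with the page's charset first keeps the words whole.
decoded = re.findall(r"\w+", raw.decode("iso-8859-9"))

print(ascii_only)  # [b'g', b'ne', b'do', b'du']
print(decoded)     # ['güneş', 'doğdu']
```

Two words turn into four fragments in the first case, 
which matches what I see in index.txt.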

Cheers,
Adem
