> 1) Having looked at index.txt, I see that it
> is not Unicode. In fact, all the characters are
> ISO-8859-1.
> Is this a bug, or is it a known design feature?
This is a limitation. The word database is really basic, and
htsindex.c contains these definitions:
#define KEYW_ACCEPT "abcdefghijklmnopqrstuvwxyz0123456789-_."
// Convert A to a, and so on.. to avoid case problems in indexing
// This can be a generic table, containing characters that are in fact not accepted by KEYW_ACCEPT
// MUST HAVE SAME SIZES!!
#define KEYW_TRANSCODE_FROM ( \
"ABCDEFGHIJKLMNOPQRSTUVWXYZ" \
"àâä" \
"ÀÂÄ" \
"éèêë" \
"ÈÈÊË" \
"ìîï" \
"ÌÎÏ" \
"òôö" \
"ÒÔÖ" \
"ùûü" \
"ÙÛÜ" \
"ÿ" \
)
#define KEYW_TRANSCODE_TO ( \
"abcdefghijklmnopqrstuvwxyz" \
"aaa" \
"aaa" \
"eeee" \
"eeee" \
"iii" \
"iii" \
"ooo" \
"ooo" \
"uuu" \
"uuu" \
"y" \
)
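To index other characters, KEYW_ACCEPT should be set to all valid
characters for a keyword (that is, adding characters 128-255), and
KEYW_TRANSCODE_FROM/KEYW_TRANSCODE_TO should both be set to "". Since
these are #defines, HTTrack has to be rebuilt afterwards.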
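To illustrate how such an accept list and a pair of transcode tables interact, here is a minimal sketch. It is not the code from htsindex.c: the DEMO_* macros and demo_* functions are hypothetical names, the tables are abbreviated, and accented letters are written as ISO-8859-1 byte escapes so that the file encoding does not matter. It also assumes that rejected bytes are simply dropped, which may differ from what the real indexer does with them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, abbreviated copies of the tables quoted above. */
#define DEMO_ACCEPT "abcdefghijklmnopqrstuvwxyz0123456789-_."
#define DEMO_TRANSCODE_FROM \
  "ABCDEFGHIJKLMNOPQRSTUVWXYZ" \
  "\xE0\xE2\xE4" \
  "\xC0\xC2\xC4" \
  "\xE9\xE8\xEA\xEB" \
  "\xC8\xC8\xCA\xCB"
#define DEMO_TRANSCODE_TO \
  "abcdefghijklmnopqrstuvwxyz" \
  "aaa" \
  "aaa" \
  "eeee" \
  "eeee"

/* Map one byte through the FROM/TO tables; unlisted bytes are returned unchanged. */
static unsigned char demo_transcode(unsigned char c) {
  const char *from = DEMO_TRANSCODE_FROM;
  const char *hit;

  /* The two tables are indexed in parallel, so they must have the same size. */
  assert(sizeof(DEMO_TRANSCODE_FROM) == sizeof(DEMO_TRANSCODE_TO));

  /* strchr() would match the terminating NUL, so exclude it explicitly. */
  hit = (c != 0) ? strchr(from, (int) c) : NULL;
  if (hit == NULL)
    return c;
  return (unsigned char) DEMO_TRANSCODE_TO[hit - from];
}

/* Return non-zero if the byte may appear in a keyword. */
static int demo_accepted(unsigned char c) {
  return c != 0 && strchr(DEMO_ACCEPT, (int) c) != NULL;
}

/* Write the normalized keyword into out (at most out_size - 1 bytes plus NUL).
   Assumption of this sketch: bytes that are not accepted are dropped.
   Returns the length of the result. */
static size_t demo_normalize(const char *in, char *out, size_t out_size) {
  size_t n = 0;

  if (out_size == 0)
    return 0;
  for (; *in != '\0' && n + 1 < out_size; in++) {
    unsigned char c = demo_transcode((unsigned char) *in);
    if (demo_accepted(c))
      out[n++] = (char) c;
  }
  out[n] = '\0';
  return n;
}
```

With these tables, the ISO-8859-1 bytes "Caf\xE9-Bar 2!" normalize to "cafe-bar2", whereas the UTF-8 encoding of the same accented letter (bytes 0xC3 0xA9) matches neither table and is lost, which is why the index cannot hold Unicode text as it stands.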
> 2) Is there a command-line setting to apply
> 'Word database' to other previously mirrored
> sites?
No, but you can activate the option and then use "continue an
interrupted mirror", which should be fast.
> Some of which were mirrored with HTTrack and
> some with others (before HTTrack)?
For those mirrored without httrack: no.