[Warning: Long post]
> > 1) Having looked at index.txt, I see that it
> > is not Unicode. In fact all the characters are
> > ISO-8859-1.
> > Is this a bug, or is it a known design feature?
> This is a limit. The word database is really basic, and
> htsindex.c contains these definitions:
>
> #define KEYW_ACCEPT "abcdefghijklmnopqrstuvwxyz0123456789-_."
> // Convert A to a, and so on.. to avoid case problems in indexing
> // This can be a generic table, containing characters that are
> // in fact not accepted by KEYW_ACCEPT
> #define KEYW_TRANSCODE_FROM ( \
> [snip]
>
> #define KEYW_TRANSCODE_TO ( \
> [snip]
>
> KEYW_ACCEPT should be set to all valid characters for a
> keyword (that is, adding characters 128-255), and
> KEYW_TRANSCODE_FROM/KEYW_TRANSCODE_TO should be set to ""
Right... I can see the intention here is towards
a search engine where case issues and non-ASCII chars
are eliminated from the equation.
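For reference, here is my guess at what that edit to
htsindex.c would look like (untested, and I am only listing
a few of the 128-255 characters; a real edit would spell
out the whole range):

/* my guess at the suggested edit, untested: accept some
   ISO-8859-1 letters directly and turn transcoding off */
#define KEYW_ACCEPT "abcdefghijklmnopqrstuvwxyz0123456789-_." \
                    "\xe0\xe2\xe7\xe8\xe9\xea\xeb\xee\xef\xf4\xf9\xfb\xfc"
#define KEYW_TRANSCODE_FROM ""
#define KEYW_TRANSCODE_TO ""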
But this is how it is described in the WinHTTrack
Website Copier help:
Make a word database
Generate an index.txt database on the top of
the directory.
Very useful for linguistic analysis, this feature
will allow you to list all words of all mirrored
pages in the current project.
With this index file, you will be able to list
which words were detected, and where.
Now, while the technique used for index.txt is probably
adequate for a local search engine (I will explain below
why it might not even be that), it is far from usable
for linguistic analysis.
The reason is that not only does it convert case, it
also replaces a number of characters with others.
OK, I could override them by altering the source code,
but that has two drawbacks:
i) It assumes I can actually compile from source, which
is a major undertaking at times.
ii) Altering these definitions changes index.txt such that
it might no longer be suitable for a search engine.
Might I suggest another, probably simpler approach:
could you split the 'Make a word database' option into
two options instead of the single one it is now, such
that:
-- Create index.txt for a local search engine
-- Create a words.txt (Unicode) for linguistic analysis
The second option should simply produce a file containing
every word HTTrack finds in the HTML files (possibly with
another column showing the file name(s) it was found in).
This would be ideal for linguistic analysis.
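To make the idea concrete, here is a rough standalone
sketch of the kind of pass I mean (my own illustration,
not HTTrack code; the tab-separated word-plus-filename
layout and the lack of HTML tag stripping are just
simplifications):

#include <stdio.h>

int main(int argc, char** argv) {
    /* usage: words file.html [more files...] > words.txt */
    int i;
    for (i = 1; i < argc; i++) {
        FILE* in = fopen(argv[i], "rb");
        char word[256];
        size_t len = 0;
        int c;
        if (in == NULL)
            continue;
        do {
            c = fgetc(in);
            /* only 0x00..0x20 (controls and space) end a word;
               8-bit ISO-8859-1 letters pass through untouched */
            if (c > 0x20 && len < sizeof(word) - 1) {
                word[len++] = (char)c;
            } else if (len > 0) {
                word[len] = '\0';
                printf("%s\t%s\n", word, argv[i]);
                len = 0;
            }
        } while (c != EOF);
        fclose(in);
    }
    return 0;
}

Running every mirrored HTML file through something like
this would already give the raw word list I am after.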
Furthermore, having words.txt also makes it possible to
search for 'exact' Unicode text.
Now, why the algo that produces the current index.txt
is sort of flawed:
First, it seems to consider every char that is not in
KEYW_ACCEPT a delimiter. IMHO, this is not the right way
of doing it, because you end up with a lot of words split
in wrong and meaningless places. It should only treat
chars in the range (single byte, hex) 00..20 (inclusive)
as natural delimiters. Maybe you could also let the user
define a number of other delimiters.
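In code terms, the difference between the two policies is
something like this (my own illustration of the two tests,
not a patch against htsindex.c):

#include <ctype.h>
#include <string.h>

#define KEYW_ACCEPT "abcdefghijklmnopqrstuvwxyz0123456789-_."

/* current behaviour, as I read htsindex.c: anything outside
   KEYW_ACCEPT ends a word, so "qu'est-ce" splits into "qu"
   and "est-ce", and accented letters break words apart */
static int is_delim_current(unsigned char c) {
    return c == '\0' || strchr(KEYW_ACCEPT, tolower(c)) == NULL;
}

/* what I am proposing: only 0x00..0x20 act as natural
   delimiters; any extra user-defined delimiters would be
   checked separately */
static int is_delim_proposed(unsigned char c) {
    return c <= 0x20;
}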
Also, my gut feeling is that character substitution
(transcoding) and case conversion are best handled after
the creation of words.txt. If chars are found that are not
in KEYW_ACCEPT, then the tool should tell the user about
them and ask him to edit them in index.txt (a simple text
editor will do; after all, the user will only be expected
to search and replace a few chars with their acceptable
counterparts).
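As a sketch of that 'tell the user' step, something along
these lines would do (standalone illustration, again not
HTTrack code):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define KEYW_ACCEPT "abcdefghijklmnopqrstuvwxyz0123456789-_."

/* scan the word file on stdin and report every byte that is
   still unacceptable after case folding, so the user knows
   exactly what to search and replace in a text editor */
int main(void) {
    int seen[256] = {0};
    int c;
    while ((c = getchar()) != EOF) {
        int b = tolower(c);
        if (b > 0x20 && strchr(KEYW_ACCEPT, b) == NULL)
            seen[(unsigned char)b] = 1;
    }
    for (c = 0; c < 256; c++)
        if (seen[c])
            printf("character 0x%02X ('%c') is not in KEYW_ACCEPT\n", c, c);
    return 0;
}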
Finally, if you do not agree with any of these, could you
at least alter the text in the help file so that poor souls
like myself do not get their hopes raised?