> I'm interested in building an advanced word database of a
> mirrored site. I checked the option "Make a word database"
> but it only deals with single words.
> Is it possible to list expressions of 2, 3 or even 4 words in
> this database?
No - the word database feature is very basic, and it can handle
neither non-ASCII characters nor multi-word expressions.
The best way would be to plug in a shared library (.so) that
processes all downloaded pages - see examples such as
libtest/callbacks-example-baselinks.c and the process_file function:
EXTERNAL_FUNCTION int process_file(char* html, int len,
                                   char* url_adresse, char* url_fichier) {
  ..
}
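As a rough idea of what such a callback could do with the page buffer, here is a minimal sketch that enumerates every run of n consecutive words. It is not HTTrack code: for_each_ngram, ngram_fn and keep_last are hypothetical names, and the sketch assumes the HTML tags have already been stripped and that the text is plain ASCII (any other byte is treated as a separator).

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_NGRAM 4      /* longest expression, in words */
#define MAX_WORD_LEN 63  /* longer words are truncated */

/* Hypothetical callback type: receives each n-word expression found. */
typedef void (*ngram_fn)(const char *expr, void *arg);

/* Example callback: copies the expression into a caller-supplied buffer
   of at least MAX_NGRAM * (MAX_WORD_LEN + 1) bytes. */
void keep_last(const char *expr, void *arg)
{
    strcpy((char *)arg, expr);
}

/* Split `text` (len bytes, need not be NUL-terminated) on non-alphanumeric
   bytes and report every run of n consecutive words, lowercased and joined
   by single spaces. Returns the number of expressions reported. */
size_t for_each_ngram(const char *text, size_t len, size_t n,
                      ngram_fn fn, void *arg)
{
    char window[MAX_NGRAM][MAX_WORD_LEN + 1]; /* circular buffer of words */
    char expr[MAX_NGRAM * (MAX_WORD_LEN + 1)];
    size_t seen = 0, reported = 0, i = 0;

    if (n < 1 || n > MAX_NGRAM)
        return 0;

    while (i < len) {
        /* Skip separators. */
        while (i < len && !isalnum((unsigned char)text[i]))
            i++;
        if (i >= len)
            break;

        /* Copy one lowercased word into the next window slot. */
        char *w = window[seen % n];
        size_t k = 0;
        while (i < len && isalnum((unsigned char)text[i])) {
            if (k < MAX_WORD_LEN)
                w[k++] = (char)tolower((unsigned char)text[i]);
            i++;
        }
        w[k] = '\0';
        seen++;

        /* Once n words are available, join the last n in order. */
        if (seen >= n) {
            expr[0] = '\0';
            for (size_t j = 0; j < n; j++) {
                if (j > 0)
                    strcat(expr, " ");
                strcat(expr, window[(seen - n + j) % n]);
            }
            if (fn)
                fn(expr, arg);
            reported++;
        }
    }
    return reported;
}
```

Calling it with n = 2, 3 and 4 on each page would give the 2-, 3- and 4-word expressions; the callback is where you would add them to your own database.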
> My second question concerns the same thing but in another
> aspect. Suppose that I know some expressions that are
> keywords for the analysed site.
> Is it possible to list the URL of the pages that best match
> with these expressions? By "match" I mean that I want to
> list the pages that contain these expressions, ordered by
> descending term frequency (occurrences).
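Same advice: use pluggable callbacks. Your process_file function
can count the occurrences of each expression in every page and
record the count per URL, so that the pages can be sorted afterwards.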
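The counting and sorting part might look like the sketch below. Again, none of this is part of HTTrack: struct page_score, count_phrase, by_hits_desc and rank_pages are hypothetical names, and the match is a plain case-insensitive substring search (so "match" would also hit "matches").

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: one entry per downloaded page. */
struct page_score {
    const char *url;  /* page URL */
    size_t hits;      /* occurrences of the expression in that page */
};

/* Count non-overlapping, case-insensitive occurrences of `phrase`
   in `text` (len bytes, need not be NUL-terminated). */
size_t count_phrase(const char *text, size_t len, const char *phrase)
{
    size_t plen = strlen(phrase), count = 0, i = 0;

    if (plen == 0 || plen > len)
        return 0;

    while (i + plen <= len) {
        size_t k = 0;
        while (k < plen &&
               tolower((unsigned char)text[i + k]) ==
               tolower((unsigned char)phrase[k]))
            k++;
        if (k == plen) {
            count++;
            i += plen;  /* skip past this occurrence */
        } else {
            i++;
        }
    }
    return count;
}

/* qsort comparator: most hits first, ties broken by URL. */
int by_hits_desc(const void *a, const void *b)
{
    const struct page_score *pa = a, *pb = b;

    if (pa->hits < pb->hits)
        return 1;
    if (pa->hits > pb->hits)
        return -1;
    return strcmp(pa->url, pb->url);
}

/* Sort the collected pages by descending term frequency. */
void rank_pages(struct page_score *pages, size_t n)
{
    qsort(pages, n, sizeof pages[0], by_hits_desc);
}
```

The idea would be to call count_phrase once per page and per expression while the mirror runs, keep the results in an array, and call rank_pages at the end before printing the URLs.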