HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Index and word database options
Author: Xavier Roche
Date: 01/08/2005 17:02
 
> I'm interested in building an advanced word database of a
> mirrored site. I checked the option "Make a word database"
> but it only deals with single words.
> Is it possible to list expressions of 2, 3 or even 4 words
> in this database?
No - the word database is a very basic feature, and it can
handle neither non-ASCII characters nor multi-word expressions.

The best way would be to plug in a .so (shared library) that
processes all downloaded pages - see examples such as
libtest/callbacks-example-baselinks.c and the process_file
function:

EXTERNAL_FUNCTION int process_file(char* html, int len,
                                   char* url_adresse, char* url_fichier) {
  ..
}
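
As a rough illustration of what the body of such a callback could do for multi-word expressions, here is a minimal sketch that counts every phrase of n consecutive words (n from 1 to 4) in a page buffer. Everything in it (ngram_t, split_words, count_ngrams, ngram_count and the size limits) is a hypothetical helper and not part of HTTrack; it assumes ASCII text, skips anything between '<' and '>', and does not filter script or style content.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical limits for this sketch; not HTTrack constants. */
#define MAX_WORDS 4096
#define MAX_WORD  32
#define MAX_GRAM  (4 * (MAX_WORD + 1))

typedef struct {
  char text[MAX_GRAM];  /* phrase, words joined by single spaces */
  int count;            /* number of occurrences in the page */
} ngram_t;

/* Split buf into lowercase ASCII alphanumeric words, ignoring markup
   between '<' and '>'. Returns the number of words stored. */
static int split_words(const char *buf, int len,
                       char words[][MAX_WORD], int max_words) {
  int n = 0, w = 0, in_tag = 0, i;
  for (i = 0; i <= len; i++) {
    int c = (i < len) ? (unsigned char) buf[i] : ' ';  /* sentinel ends last word */
    if (c == '<') in_tag = 1;
    if (!in_tag && c < 128 && isalnum(c)) {
      if (n < max_words && w < MAX_WORD - 1)
        words[n][w++] = (char) tolower(c);  /* overlong words are truncated */
    } else if (w > 0) {
      words[n][w] = '\0';
      n++;
      w = 0;
    }
    if (c == '>') in_tag = 0;
  }
  return n;
}

/* Count all phrases of n consecutive words (1 <= n <= 4).
   Returns the number of distinct phrases written to out.
   Uses a static buffer, so it is not reentrant. */
int count_ngrams(const char *buf, int len, int n, ngram_t *out, int max_out) {
  static char words[MAX_WORDS][MAX_WORD];
  int nwords, nout = 0, i, j, k;
  if (n < 1 || n > 4) return 0;
  nwords = split_words(buf, len, words, MAX_WORDS);
  for (i = 0; i + n <= nwords; i++) {
    char gram[MAX_GRAM] = "";
    for (j = 0; j < n; j++) {
      if (j > 0) strcat(gram, " ");
      strcat(gram, words[i + j]);
    }
    /* Linear search: simple but quadratic, acceptable for a sketch. */
    for (k = 0; k < nout && strcmp(out[k].text, gram) != 0; k++)
      ;
    if (k < nout) {
      out[k].count++;
    } else if (nout < max_out) {
      strcpy(out[nout].text, gram);
      out[nout].count = 1;
      nout++;
    }
  }
  return nout;
}

/* Look up the count of one phrase; returns 0 if it was not seen. */
int ngram_count(const ngram_t *grams, int ngrams, const char *phrase) {
  int k;
  for (k = 0; k < ngrams; k++)
    if (strcmp(grams[k].text, phrase) == 0) return grams[k].count;
  return 0;
}
```

Inside process_file, one would presumably call count_ngrams(html, len, 2, ...) and append the result to a file of one's own; for a large site a hash table would be a better fit than the linear search used here.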

> My second question concerns the same thing but in another
> aspect. Suppose that I know some expressions that are
> keywords for the analysed site.
> Is it possible to list the URLs of the pages that best
> match these expressions? By "match" I mean that I want to
> list the pages that contain these expressions, ordered by
> descending term frequency (occurrences).

Same advice: use pluggable callbacks.
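
For the ranking itself, a callback could count keyword occurrences per page and sort the pages once the mirror has finished. The following is only a sketch under assumptions: page_score_t, count_occurrences, record_page, sort_ranking and print_ranking are hypothetical names, not HTTrack functions; matching is a case-insensitive search on the raw buffer, so a phrase split by a line break or by markup would be missed.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical limits for this sketch; not HTTrack constants. */
#define MAX_PAGES 1024
#define MAX_URL   256

typedef struct {
  char url[MAX_URL];
  int hits;  /* total keyword occurrences found in the page */
} page_score_t;

static page_score_t scores[MAX_PAGES];
static int nscores = 0;

/* Count case-insensitive, non-overlapping occurrences of phrase
   in buf[0..len). */
int count_occurrences(const char *buf, int len, const char *phrase) {
  int plen = (int) strlen(phrase), hits = 0, i = 0, j;
  if (plen == 0) return 0;
  while (i + plen <= len) {
    for (j = 0; j < plen; j++)
      if (tolower((unsigned char) buf[i + j]) !=
          tolower((unsigned char) phrase[j]))
        break;
    if (j == plen) { hits++; i += plen; } else { i++; }
  }
  return hits;
}

/* Record one page; meant to be called once per page, for example from
   process_file with its html, len and url_adresse arguments.
   Pages without any hit are not stored. */
void record_page(const char *url, const char *buf, int len,
                 const char **phrases, int nphrases) {
  int k, hits = 0;
  if (nscores >= MAX_PAGES) return;
  for (k = 0; k < nphrases; k++)
    hits += count_occurrences(buf, len, phrases[k]);
  if (hits == 0) return;
  strncpy(scores[nscores].url, url, MAX_URL - 1);
  scores[nscores].url[MAX_URL - 1] = '\0';
  scores[nscores].hits = hits;
  nscores++;
}

/* qsort comparator: highest hit count first. */
static int by_hits_desc(const void *a, const void *b) {
  const page_score_t *pa = a, *pb = b;
  return (pb->hits > pa->hits) - (pb->hits < pa->hits);
}

/* Sort the recorded pages by descending term frequency. */
void sort_ranking(void) {
  qsort(scores, (size_t) nscores, sizeof scores[0], by_hits_desc);
}

/* Print "hits  url" lines, best match first. */
void print_ranking(FILE *out) {
  int i;
  sort_ranking();
  for (i = 0; i < nscores; i++)
    fprintf(out, "%6d  %s\n", scores[i].hits, scores[i].url);
}
```

The global table keeps the sketch short but is not thread-safe; a real plug-in would probably guard it with a lock or write one line per page to a log and sort that afterwards.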

 