New feature in test: indexing/linguistic analysis

Subject: New feature in test: indexing/linguistic analysis

Author: Xavier Roche

Date: 10/19/2001 14:19

A new feature, introduced in the 3.00 release (but never 
really tested), makes it possible to crawl several 
websites and index them once the mirror is complete.

The result is a text file listing every 'relevant' 
word and, for each one, the number of hits on each page 
where it occurs (its position), the total number of 
hits, and the per-thousand (%1000) value. 
Example (after crawling www.httrack.com):

..
ability
	1 www.httrack.com/HelpHtml/dev.html
	=1
	(0)
able
	2 www.httrack.com/HelpHtml/fcguide.html
	1 www.httrack.com/HelpHtml/abuse.html
	1 www.httrack.com/HelpHtml/dev.html
	1 www.httrack.com/HelpHtml/step9_opt9.html
	=5
	(0)
about
	7 www.httrack.com/HelpHtml/fcguide.html
	3 www.httrack.com/HelpHtml/faq.html
	3 www.httrack.com/HelpHtml/index-2.html
	3 www.httrack.com/HelpHtml/index.html
	1 www.httrack.com/HelpHtml/abuse.html
	1 www.httrack.com/HelpHtml/contact.html
	1 www.httrack.com/HelpHtml/filters.html
	1 www.httrack.com/HelpHtml/step9_opt9.html
	=20
	(0)
above
	6 www.httrack.com/HelpHtml/fcguide.html
	1 www.httrack.com/HelpHtml/faq.html
	=7
	(0)
abridged
	1 www.httrack.com/HelpHtml/fcguide.html
	=1
	(0)
absence
	1 www.httrack.com/HelpHtml/fcguide.html
	=1
	(0)
..
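
For anyone who wants to post-process the file, here is a minimal Python sketch of a reader for the layout shown above. It is based only on this excerpt, not on htsindex.c: it assumes an unindented line starts a new word, an indented "N page" line gives the hits on one page, "=N" is the total, and "(N)" is the per-thousand value. The names parse_index and IndexEntry are hypothetical and are not part of HTTrack.

```python
from dataclasses import dataclass, field
from typing import Dict

# A few entries copied from the example above (tab-indented detail lines).
SAMPLE = (
    "ability\n"
    "\t1 www.httrack.com/HelpHtml/dev.html\n"
    "\t=1\n"
    "\t(0)\n"
    "able\n"
    "\t2 www.httrack.com/HelpHtml/fcguide.html\n"
    "\t1 www.httrack.com/HelpHtml/abuse.html\n"
    "\t1 www.httrack.com/HelpHtml/dev.html\n"
    "\t1 www.httrack.com/HelpHtml/step9_opt9.html\n"
    "\t=5\n"
    "\t(0)\n"
)


@dataclass
class IndexEntry:
    """One word of the index (hypothetical structure, not HTTrack's)."""
    word: str
    hits: Dict[str, int] = field(default_factory=dict)  # page -> hit count
    total: int = 0          # value of the "=N" line
    per_thousand: int = 0   # value of the "(N)" line (assumed meaning)


def parse_index(text: str) -> Dict[str, IndexEntry]:
    """Parse index.txt-style text into a dict keyed by word."""
    entries: Dict[str, IndexEntry] = {}
    current = None
    for raw in text.splitlines():
        item = raw.strip()
        if not item or item == "..":
            continue  # skip blank lines and the ".." elision markers
        if not raw[0].isspace():
            # Unindented line: a new word starts here.
            current = IndexEntry(word=item)
            entries[item] = current
        elif current is not None:
            if item.startswith("="):
                current.total = int(item[1:])
            elif item.startswith("(") and item.endswith(")"):
                current.per_thousand = int(item[1:-1])
            else:
                # "N page": hit count, whitespace, then the page location.
                count, page = item.split(None, 1)
                current.hits[page] = int(count)
    return entries


if __name__ == "__main__":
    index = parse_index(SAMPLE)
    for entry in index.values():
        print(entry.word, entry.total, sorted(entry.hits))
```

With a real mirror, the same function could be fed the contents of index.txt, for example parse_index(open("index.txt").read()); the resulting dictionary can then be used directly as a hash table keyed by word.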

To activate the feature, first set up a 'regular' 
mirror and limit its scope to HTML files (using 
either filters or the advanced settings). 
Then add the following option (in the Windows version, 
add it after the URLs):
-%I
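
For scripted runs, the steps above could be sketched roughly as follows in Python. This only builds the command line and prints it. The helper name build_httrack_command, the output directory and the example filters ("-*" and "+*.html" as one possible way to keep only HTML pages) are assumptions made for illustration; check them against your own setup. Only the -%I option comes from this post, and it is placed after the URLs as described.

```python
import shlex
from typing import List, Sequence


def build_httrack_command(urls: Sequence[str],
                          output_dir: str,
                          filters: Sequence[str] = ()) -> List[str]:
    """Build an httrack command line with indexing enabled.

    Hypothetical helper: the URLs come first, then the output path (-O),
    then any scan filters, and finally the -%I option from this post.
    """
    return ["httrack", *urls, "-O", output_dir, *filters, "-%I"]


if __name__ == "__main__":
    cmd = build_httrack_command(
        ["http://www.httrack.com/"],
        "httrack-mirror",        # assumed output directory
        ["-*", "+*.html"],       # assumed filters limiting the scope to HTML
    )
    # Print the command instead of running it; to execute it, pass the
    # list to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```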

This generates the index.txt file, which can serve 
various purposes (dictionaries, indexing/hash table 
indexing, linguistic analysis...).
The index routines are located in htsindex.c and 
htsindex.h, and can easily be customized.

Feel free to send any feedback, remarks or bug reports 
about this feature (or any other one :) )!
