HTTrack Website Copier
Free software offline browser - FORUM
Subject: New feature in test: indexing/linguistic analysis
Author: Xavier Roche
Date: 10/19/2001 14:19
 
A new feature, introduced in the 3.00 release (but never 
really tested), lets you crawl several websites and 
index them after the mirror.

The result is a text file listing all 'relevant' 
words, the number of hits, the per-1000 value, and their 
position (the pages where they occur). Example, after 
crawling www.httrack.com:

..
ability
	1 www.httrack.com/HelpHtml/dev.html
	=1
	(0)
able
	2 www.httrack.com/HelpHtml/fcguide.html
	1 www.httrack.com/HelpHtml/abuse.html
	1 www.httrack.com/HelpHtml/dev.html
	1 www.httrack.com/HelpHtml/step9_opt9.html
	=5
	(0)
about
	7 www.httrack.com/HelpHtml/fcguide.html
	3 www.httrack.com/HelpHtml/faq.html
	3 www.httrack.com/HelpHtml/index-2.html
	3 www.httrack.com/HelpHtml/index.html
	1 www.httrack.com/HelpHtml/abuse.html
	1 www.httrack.com/HelpHtml/contact.html
	1 www.httrack.com/HelpHtml/filters.html
	1 www.httrack.com/HelpHtml/step9_opt9.html
	=20
	(0)
above
	6 www.httrack.com/HelpHtml/fcguide.html
	1 www.httrack.com/HelpHtml/faq.html
	=7
	(0)
abridged
	1 www.httrack.com/HelpHtml/fcguide.html
	=1
	(0)
absence
	1 www.httrack.com/HelpHtml/fcguide.html
	=1
	(0)
..
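
For anyone who wants to post-process the file, here is a minimal sketch of a reader for the layout shown in the sample above. It is not part of HTTrack. It assumes what the sample suggests: an unindented line is a word, an indented "N url" line is a per-page hit count, "=N" is the total, and "(N)" is taken to be the per-1000 value. The names WordEntry and parse_index are made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class WordEntry:
    word: str
    pages: list = field(default_factory=list)  # (hits, url) pairs, in file order
    total: int = 0      # value from the "=N" line
    per_mille: int = 0  # value from the "(N)" line (assumed per-1000 figure)


def parse_index(text):
    """Parse index.txt-style text into a dict mapping word -> WordEntry."""
    entries = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped == "..":
            continue  # skip blank lines and the ".." elision markers
        if not line[0].isspace():
            # Unindented line: start of a new word block.
            current = WordEntry(word=stripped)
            entries[current.word] = current
        elif current is None:
            continue  # indented line before any word: nothing to attach it to
        elif stripped.startswith("="):
            current.total = int(stripped[1:])
        elif stripped.startswith("(") and stripped.endswith(")"):
            current.per_mille = int(stripped[1:-1])
        else:
            hits, url = stripped.split(None, 1)
            current.pages.append((int(hits), url))
    return entries


# Two word blocks copied from the sample output above.
SAMPLE = "\n".join([
    "able",
    "\t2 www.httrack.com/HelpHtml/fcguide.html",
    "\t1 www.httrack.com/HelpHtml/abuse.html",
    "\t1 www.httrack.com/HelpHtml/dev.html",
    "\t1 www.httrack.com/HelpHtml/step9_opt9.html",
    "\t=5",
    "\t(0)",
    "above",
    "\t6 www.httrack.com/HelpHtml/fcguide.html",
    "\t1 www.httrack.com/HelpHtml/faq.html",
    "\t=7",
    "\t(0)",
])

index = parse_index(SAMPLE)
for entry in index.values():
    print(entry.word, entry.total, len(entry.pages))
```

If a real index.txt turns out to differ (other separators, extra fields), the branches in parse_index are the place to adjust.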

To activate the feature, first set up a 'regular' 
mirror and limit the scope to HTML files (using 
either filters or advanced settings). 
Then add the following option (in the Windows version, add 
it after the URLs):
-%I
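
If you drive the command-line version from a script, the same setup can be sketched as below. This only assembles the argument list and prints it. Apart from -%I, everything here is an assumption for illustration: the helper name build_httrack_command, the ./mirror output directory passed with -O, and the two example filters meant to keep only HTML pages. Check the filters against your own site before relying on them.

```python
import shlex


def build_httrack_command(urls, output_dir, filters=()):
    """Return an argv list for an indexing mirror.

    The -%I option is placed last, i.e. after the URLs, as described above.
    """
    return ["httrack", *urls, "-O", output_dir, *filters, "-%I"]


# Illustrative values only; the filter patterns are an untested guess at
# "limit the scope to HTML files".
cmd = build_httrack_command(
    ["http://www.httrack.com/"],
    "./mirror",
    filters=["-*", "+www.httrack.com/*.html"],
)

# Print a shell-quoted form for copy/paste; pass `cmd` to subprocess.run()
# to actually launch the mirror.
print(shlex.join(cmd))
```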

This will generate the index.txt file, which can serve various 
purposes (dictionaries, indexing/hash table indexing, 
linguistic analysis...).
The index routines are located in htsindex.c and 
htsindex.h, and can be easily customized.
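
As an illustration of the hash table use, here is a small sketch that turns (word, hits, page) triples into a lookup table and answers multi-word queries. It does not use htsindex.c at all. The functions build_lookup and search and the ranking by summed hits are choices made for this example, and the data is copied from the sample output above.

```python
from collections import defaultdict


def build_lookup(triples):
    """Build a dict mapping word -> {url: hits} from (word, hits, url) triples."""
    table = defaultdict(dict)
    for word, hits, url in triples:
        table[word.lower()][url] = hits
    return dict(table)


def search(table, query):
    """Return the pages containing every query word, best match first."""
    words = query.lower().split()
    if not words or any(w not in table for w in words):
        return []
    # Keep only pages that appear under every word.
    common = set(table[words[0]])
    for w in words[1:]:
        common &= set(table[w])
    # Rank by summed hit count, then by URL so ties are stable.
    scored = [(sum(table[w][u] for w in words), u) for u in common]
    return [u for _, u in sorted(scored, key=lambda p: (-p[0], p[1]))]


BASE = "www.httrack.com/HelpHtml/"
TRIPLES = [
    ("ability", 1, BASE + "dev.html"),
    ("about", 7, BASE + "fcguide.html"),
    ("about", 3, BASE + "faq.html"),
    ("about", 1, BASE + "abuse.html"),
    ("above", 6, BASE + "fcguide.html"),
    ("above", 1, BASE + "faq.html"),
]

table = build_lookup(TRIPLES)
print(search(table, "about above"))
```

A query only matches pages listed under all of its words, so "ability above" returns nothing with this data.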

Feel free to send any feedback, remarks, or bug reports about 
this feature (or any other one :) )!

 