A new feature, introduced in the 3.00 release (but never
really tested), allows you to crawl several websites and
index them after the mirror.
The result is a text file listing all 'relevant'
words, along with the number of hits, the per-1000 value,
and their position (the pages where they occur).
Example (after crawling www.httrack.com):
..
ability
1 www.httrack.com/HelpHtml/dev.html
=1
(0)
able
2 www.httrack.com/HelpHtml/fcguide.html
1 www.httrack.com/HelpHtml/abuse.html
1 www.httrack.com/HelpHtml/dev.html
1 www.httrack.com/HelpHtml/step9_opt9.html
=5
(0)
about
7 www.httrack.com/HelpHtml/fcguide.html
3 www.httrack.com/HelpHtml/faq.html
3 www.httrack.com/HelpHtml/index-2.html
3 www.httrack.com/HelpHtml/index.html
1 www.httrack.com/HelpHtml/abuse.html
1 www.httrack.com/HelpHtml/contact.html
1 www.httrack.com/HelpHtml/filters.html
1 www.httrack.com/HelpHtml/step9_opt9.html
=20
(0)
above
6 www.httrack.com/HelpHtml/fcguide.html
1 www.httrack.com/HelpHtml/faq.html
=7
(0)
abridged
1 www.httrack.com/HelpHtml/fcguide.html
=1
(0)
absence
1 www.httrack.com/HelpHtml/fcguide.html
=1
(0)
..
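For anyone who wants to post-process this file, here is a minimal Python sketch that reads the listing into a dictionary. It is based only on the sample above, not on a documented format: it assumes each entry is a word line, followed by "count page" lines, then an "=N" line (in the sample this equals the sum of the per-page counts) and an "(N)" line (presumably the per-1000 value). The names Entry and parse_index are illustrative and are not part of HTTrack.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Entry:
    """One word of the index, as inferred from the sample listing."""
    word: str
    hits: Dict[str, int] = field(default_factory=dict)  # page -> hit count
    total: int = 0         # value of the "=N" line
    per_thousand: int = 0  # value of the "(N)" line (assumed meaning)


def parse_index(text: str) -> Dict[str, Entry]:
    """Parse index text into {word: Entry}. The layout is an assumption."""
    entries: Dict[str, Entry] = {}
    current: Optional[Entry] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "..":
            continue  # skip blank lines and the ".." elision markers
        parts = line.split(None, 1)
        if current is not None and line.startswith("="):
            current.total = int(line[1:])
        elif current is not None and line.startswith("(") and line.endswith(")"):
            current.per_thousand = int(line[1:-1])
        elif current is not None and len(parts) == 2 and parts[0].isdigit():
            current.hits[parts[1]] = int(parts[0])
        else:
            # Anything else starts a new word entry.
            current = Entry(word=line)
            entries[line] = current
    return entries


# Two entries copied from the sample listing above.
SAMPLE = """\
able
2 www.httrack.com/HelpHtml/fcguide.html
1 www.httrack.com/HelpHtml/abuse.html
1 www.httrack.com/HelpHtml/dev.html
1 www.httrack.com/HelpHtml/step9_opt9.html
=5
(0)
above
6 www.httrack.com/HelpHtml/fcguide.html
1 www.httrack.com/HelpHtml/faq.html
=7
(0)
"""

if __name__ == "__main__":
    index = parse_index(SAMPLE)
    for word, entry in index.items():
        print(word, entry.total, sorted(entry.hits))
```

The resulting dictionary can serve as a simple word-to-pages lookup table. If the real file differs from the sample (for example in how lines are indented or separated), the line tests in parse_index would need adjusting.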
To activate the feature, first set up a 'regular'
mirror and limit the scope to HTML files (using
either filters or the advanced settings).
Then add the following option (in the Windows version,
add the option after the URLs):
-%I
This will generate the index.txt file, which can be used
for various purposes (dictionaries, indexing/hash table
indexing, linguistic analysis...).
The index routines are located in htsindex.c and
htsindex.h, and can be easily customized.
Feel free to send any feedback, remarks, or bug reports
about this feature (or any other one :) )!