HTTrack Website Copier
Free software offline browser - FORUM
Subject: Creating keyword index fails on UTF-8
Author: Mark
Date: 06/27/2009 13:16
 
Hi everyone,

Today I stumbled upon the nice feature for creating a keyword index with
HTTrack.
After a few quick tests I realized that it doesn't seem to work very well.
When mirroring a UTF-8 website, every word is split wherever a non-ASCII
(multi-byte UTF-8) character occurs.

On a typical German web page, this results in many erroneous keywords that are
just garbage. Furthermore, "abc.def" is detected as a single keyword, although
it contains a valid delimiter.
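
To illustrate the kind of splitting I mean, here is a minimal Python sketch.
It is not HTTrack's code; the function names are made up, and the first
function only assumes that the indexer treats nothing but ASCII letters and
digits as word characters. The second shows the result I would expect from a
tokenizer that decodes UTF-8 first and treats "." as a delimiter.

```python
import re

def split_ascii_bytes(data):
    # Assumption: only ASCII letters/digits count as word characters, so each
    # byte of a multi-byte UTF-8 character acts as a delimiter.
    return [w.decode("ascii").lower() for w in re.findall(rb"[A-Za-z0-9]+", data)]

def split_unicode(data):
    # Decode first, then match runs of letters/digits in any script.
    # "[^\W_]" means "a word character other than underscore".
    text = data.decode("utf-8")
    return [w.lower() for w in re.findall(r"[^\W_]+", text)]

page = "Größe und Übersicht, abc.def".encode("utf-8")
print(split_ascii_bytes(page))  # ['gr', 'e', 'und', 'bersicht', 'abc', 'def']
print(split_unicode(page))      # ['größe', 'und', 'übersicht', 'abc', 'def']
```

The fragments "gr", "e" and "bersicht" in the first result are the sort of
garbage keywords that show up in my index.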

A second important point is that all URLs in index.txt are rewritten
to the HTTrack file names (e.g. index.php?x=3 becomes index234x.html), which
makes the whole index.txt unusable for external tools unless you only work
with the mirrored site and not the live one.

An indexing test on www.httrack.com reveals that many keywords are being
ignored. Where can those keywords be configured? For example:
faqs
	ignored (6)
files
	ignored (6)
fixes
	ignored (6)

If you can help me with any of these points, I would appreciate your reply ;)

Thanks in advance,
Mark
 