I'm building a web crawler and using HTTrack to pull local copies of thousands of
smallish websites before doing some fairly heavy parsing of their data. Ultimately
this will enable structured search across these sites.

Similar to some others who have posted in the forum, I need the links and files to
be faithfully reproduced in the relevant sub-directories, WITHOUT the hash that
HTTrack inserts into filenames generated from dynamic URLs (php?, asp?, cfm?, etc.).
The documentation says these are rewritten to avoid collisions between filenames,
but on the sites I'm downloading there is no possibility of collision, since the
filenames always include a unique number after the php?... string.

I've tried many combinations of HTTrack options, but it always seems to rewrite
either the href or the filename, and of course the two have to stay consistent or
the parser will fail to find the content. Since the links I'm parsing will later be
served by a search engine open to internet users, the hrefs need to resolve to the
mirrored site, not to what I have locally on disk.

Given that I'm confident there are no collisions, and that I can state the filetype
of all these pages as html, is there a way to turn off the rewriting in both the
hrefs and the filenames? If not, maybe you could point me to the relevant code and
I can turn it off myself...
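To make the request concrete, here is a rough Python sketch of the naming I'm
after. The example URL and the desired_local_name helper are hypothetical and
only illustrate what I want; this is not a description of HTTrack's actual
rewriting.

    from urllib.parse import urlsplit

    def desired_local_name(url):
        # Keep the original query string in the saved filename instead of a hash,
        # and treat every dynamic page as html (safe for the sites I'm mirroring).
        parts = urlsplit(url)
        name = parts.path.lstrip("/")          # e.g. "news/article.php"
        if parts.query:
            name += "?" + parts.query          # e.g. "news/article.php?id=1234"
        return name + ".html"

    # Desired outcome for a hypothetical page http://example.com/news/article.php?id=1234 :
    #   href left in the mirrored HTML: http://example.com/news/article.php?id=1234
    #   file written to disk:           news/article.php?id=1234.html
    print(desired_local_name("http://example.com/news/article.php?id=1234"))

In other words, the href stays exactly as it is on the live site, and the file on
disk keeps the query string rather than a hash, so my parser can map one to the
other trivially.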
By the way, other than this issue, this is great software! Many thanks.