I've run into a few issues with my current project using httrack
(mirroring an RPG website released under Open Game Content).
First, I've found that I sometimes want the extended link-parsing attempts;
they have been useful in some JavaScript blocks. On the other hand, they go
wrong on some CSS blocks (e.g. google_color_url="[HexVal]" gets turned
into google_color_url="[HexVal].html"). Is there any way to enable this
selectively, or to provide a list of known non-URL string literals or regexps?
(For that matter, the -%P0 flag doesn't seem to change this behavior.)
Second, the documentation on wildcards in the online user guide
(http://www.httrack.com/html/fcguide.html) seems inaccurate. I tried to use
the *[file], *[path], and *[] wildcards, and none of them worked. Are these
still supported?
Third, I'm confused about how filters are applied. To stop getting the
[hexval].html files (even if the archived page still contains the result of
the mis-parse), I tried adding "-*/0066CC.html" as a filter, yet the file kept
being created. (The website I'm copying is 'helpful' and generates a list of
suggestions rather than returning a 404.) Is there a reason this file is
generated even with that filter?
Fourth, is there any way to stop pages from being generated for URLs of the
form [hostname/path]?[querystring]? I tried -%q0, but I am still getting
files with a hash value appended.
I've even tried adding '-[hostname]/*3d59.html' as a filter (many of the query
strings are hashed to 3d59), and the files are still generated.
Fifth, is there a way to change the settings for an already created archive?
For instance, if I discover a new filter I want to add, or a setting I want to
change, after already generating a mirror?
(My command line is "httrack -%q0 -%k -o0 -%P0 [hostname]
'+*.gstatic.com/*.ttf' '+*.gstatic.com/*.css' '-[hostname]/*/tools/'
'-[hostname]/*/extras/*' '-[hostname]/*goog_*.html'
'-[hostname]/*/NodeNotFound*.html' '-*/CSI/index.html' '-*/0066CC.html'
'-*/003965.html' '-*/4E7DBF.html'", where [hostname] is the actual website name.)