HTTrack Website Copier
Free software offline browser - FORUM
Subject: Link parsing ; wildcards; other Q's
Author: Galen
Date: 09/25/2014 06:36
I've run into a few issues while attempting my current project with httrack
(mirroring an RPG website released under Open Game Content).

First, I've discovered I sometimes want the extended attempts at parsing
links; it has been useful in some JavaScript blocks.  On the other hand, it
screws up on some CSS blocks (e.g. google_color_url="[HexVal]" gets turned
into google_color_url="[HexVal].html").  Is there any way to set this somewhat
selectively?  Or provide a list of known non-URL string literals or regexps? 
(For that matter, the -%P0 flag doesn't seem to change this behavior.)
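
To make the request concrete, here is a rough sketch of the sort of exclusion I have in mind, written as a post-processing pass in Python rather than anything HTTrack itself does. The function name, the regexp, and the assumption that the bogus values are always six hex digits followed by .html inside a google_color_* assignment are all mine.

```python
import re

# Assumption: the mis-parsed values are six hex digits with ".html" appended,
# inside a quoted string assigned to a name starting with "google_color_".
_BAD_COLOR = re.compile(
    r'(google_color_\w+\s*=\s*")([0-9A-Fa-f]{6})\.html(")'
)

def strip_bogus_html_suffix(text: str) -> str:
    """Undo '<hex>.html' rewrites in google_color_* assignments only."""
    # Keep the assignment prefix, the hex value, and the closing quote;
    # drop just the ".html" that was appended.
    return _BAD_COLOR.sub(r"\1\2\3", text)
```

Ordinary links such as href="page.html" do not match the pattern and are left alone; something equivalent inside httrack would save me from cleaning up the mirror afterwards.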

Second, the documentation in the online user guide with respect to wildcards
seems inaccurate.  I tried to use the *[file], *[path], and *[] wildcards, and
they simply failed.  Are these still supported?

Third, I'm confused as to how filters are applied.  In an attempt to stop
getting the [hexval].html files, even if the archived webpage had the results
of the mis-parse, I tried adding "-*/0066CC.html" as a filter, and yet I kept
getting the resulting webpage created. (the website I'm copying is 'helpful'
and generates a list of suggestions, rather than providing a 404.)  Is there a
reason this file is generated, even with that filter?

Fourth, is there any way to stop generating pages for
[hostname/path]?[querystring] ?  I tried -%q0, but I still am generating the
files with a hash value appended.
I've even tried adding '-[hostname]/*3d59.html' as a filter (many of the query
strings are hashed to 3d59), and the files are still generated.
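
For what it's worth, this is how I was reading those two filters: as shell-style globs where * matches any run of characters, slashes included. Below is a small Python check of that reading, using fnmatch from the standard library as a stand-in. HTTrack's own matcher may well behave differently, and example.org and the paths are made up for illustration.

```python
from fnmatch import fnmatchcase

def excluded(url: str, filters: list[str]) -> bool:
    """True if any '-pattern' filter matches url under plain glob rules."""
    # Strip the leading '-' and treat the rest as a case-sensitive glob.
    return any(
        fnmatchcase(url, f[1:]) for f in filters if f.startswith("-")
    )

# Hypothetical stand-ins for the filters on my command line.
filters = ["-*/0066CC.html", "-example.org/*3d59.html"]

for url in (
    "example.org/srd/0066CC.html",
    "example.org/srd/spells3d59.html",
    "example.org/srd/spells.html",
):
    print(url, excluded(url, filters))
```

Under that reading the first two URLs are excluded and the third is not, which is why I'm surprised the files are still created.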

Fifth, is there a way to change the settings for an already created archive? 
For instance, if I discover a new filter I want to add, or a setting I want to
change, after already generating a mirror?

(My command line is "httrack -%q0 -%k -o0 -%P0 [hostname]
'+**.ttf' '+**.css' '-[hostname]/*/tools/'
'-[hostname]/*/extras/*' '-[hostname]/*goog_*.html'
'-[hostname]/*/NodeNotFound*.html' '-*/CSI/index.html' '-*/0066CC.html'
'-*/003965.html' '-*/4E7DBF.html'", where hostname is the actual website name.)
