The scan rules aren't exactly regular expressions, and
they're not shell globs either... so what are all the valid
patterns and special characters that httrack uses? I've
gone through the manuals and lots of messages on the
forums. Here's what I've gathered so far:
[0-9] will match a single digit. I have no idea whether it
also works with letters, like [a-zA-Z], and I don't think
you can specify repetition.
* acts like .* in a regular expression. I'm not sure
whether it's greedy or non-greedy, though (it would really
help if I knew for sure).
Appending *[<20>50] to a rule filters content smaller than
20KB and larger than 50KB. Just *[<20] would mean filter
content smaller than 20KB.
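
To make my current mental model concrete, here is a small Python sketch of how I *think* the * and [0-9] parts behave. This is only my guess translated into a regex, not httrack's actual parser; the names rule_to_regex and matches are made up for illustration, and I'm assuming a rule has to match the whole URL. Under that assumption, greedy versus non-greedy should not change a plain match/no-match answer.

```python
import re

def rule_to_regex(rule):
    """Translate a scan-rule-like pattern into a regex.

    Assumption (mine, not confirmed by httrack): '*' means "any run of
    characters", '[0-9]' means "exactly one digit", and everything else
    is a literal character.
    """
    parts = []
    i = 0
    while i < len(rule):
        if rule[i] == "*":
            parts.append(".*")            # wildcard: any characters
            i += 1
        elif rule.startswith("[0-9]", i):
            parts.append("[0-9]")         # a single digit
            i += 5
        else:
            parts.append(re.escape(rule[i]))  # literal character
            i += 1
    return re.compile("".join(parts))

def matches(rule, url):
    """True if the whole URL matches the rule under the model above."""
    return rule_to_regex(rule).fullmatch(url) is not None

print(matches("*.example.com/img[0-9].gif", "www.example.com/img7.gif"))
print(matches("*.example.com/img[0-9].gif", "www.example.com/img12.gif"))
```

If httrack really works this way, the first URL should be accepted by the rule and the second one should not, since [0-9] would only cover one digit.
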
That's about all I've found, which leaves me with some
more questions:
Are there other special characters (like +)? How do I
embed them literally in a scan rule? I thought of using
URL percent-encoding, like "," -> "%2C", but I'm not sure
whether the encoded and literal forms will match each
other (do they?).
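
For reference, this is the encoding I have in mind, shown with Python's standard urllib.parse (the helper name percent_encode is just mine). It only shows what the encoded forms look like; whether httrack treats an encoded character in a scan rule as equal to the literal one is exactly the part I don't know.

```python
from urllib.parse import quote, unquote

def percent_encode(text):
    """Percent-encode every character except letters, digits and '_.-~'."""
    return quote(text, safe="")

# Characters I suspect might be special in a scan rule.
for ch in [",", "+", "*", "[", "]"]:
    print(repr(ch), "->", percent_encode(ch))

# Decoding gives the original character back.
print(unquote("%2C") == ",")
```
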
I did start experimenting to answer my questions, but
after trying some long scan rules and wasting time and
bandwidth, I'm still not sure exactly how these scan rules
are parsed. I tried looking at htsparse.c (169KB), but I'm
not even sure that's the right file :(
So does anyone know exactly how these scan rule patterns
work? Thanks!