> Had a quick look. I would not think the javascript is
> preventing HTTrack from finding the URLs to the movies, but
> I wonder if it's the HTML coding and a parsing problem...
Well, I wouldn't claim to be very familiar with the
internals of httrack's HTML and Javascript parsers, but
I think this site is a good example of how easy it is to
make "interpreting" Javascript really complicated for a
mirror program like httrack. The page defines a JS
function launchit, which is mainly a wrapper for
window.open(). Hence, httrack's Javascript parser will
fail if it simply searches for the string 'window.open',
because the movie URLs are passed to launchit instead.
Another nasty Javascript trick I've seen somewhere is
to use something like
document.write('<a hr' + 'ef="some_url.html">'). I don't
expect httrack to parse all sorts of such weird code
properly.
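To make the point concrete, here is a toy sketch in Python. It is
not httrack's actual parser; the regex, the function name
naive_js_links and the three sample snippets are made up for
illustration. It only shows that a scanner looking for a literal
window.open('...') finds the direct call but misses both the
launchit wrapper and the split-string document.write trick.

```python
import re

# Hypothetical naive scanner: only matches window.open('<literal url>').
WINDOW_OPEN = re.compile(r"window\.open\(\s*['\"]([^'\"]+)['\"]")

def naive_js_links(html):
    """Return URLs found as literal first arguments of window.open()."""
    return WINDOW_OPEN.findall(html)

# Made-up sample snippets, not taken from the site in question.
direct = """<a href="#" onclick="window.open('movie1.html')">1</a>"""
wrapped = """<script>function launchit(u){ window.open(u, 'popup'); }</script>
<a href="javaScript:launchit('movie2.html')">2</a>"""
split = """<script>document.write('<a hr' + 'ef="some_url.html">')</script>"""

print(naive_js_links(direct))   # ['movie1.html']
print(naive_js_links(wrapped))  # [] (the URL is an argument of launchit)
print(naive_js_links(split))    # [] (no window.open, and 'href' is split)
```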
>
> I'll try to add it to my test page soon. Not really much
> you can do about it except maybe adding each movie page (the
> ones that appear in the pop-up windows; not many of them) as
> extra project start URLs.
>
> Abel, what's this httrack-py? :)
It is a little plugin module for httrack which defines most
callbacks as specified in <http://www.httrack.com/html/plug.html>.
The module doesn't do anything useful by itself, but it
lets you write the "real" callbacks in Python. In this case,
one could implement the callbacks preprocess-html and
postprocess-html (new in httrack version 3.33-beta3),
where preprocess-html uses a regex like
r"javaScript:launchit\('(.*?)'\)" to look for links. These
links can then be added to the page as regular
<a href="http://...."> links, and the modified HTML text
is returned by the preprocess-html callback. Once the
httrack core has parsed this modified text,
postprocess-html is called. There, the inserted links can be
read back in order to catch any URL modifications made by
httrack, and the modified URLs can be inserted as arguments
into the launchit calls. Finally, the postprocess-html
callback can remove the inserted simple <a href> links
and return the HTML text.
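
A rough sketch of the two steps as plain Python string functions.
This is only an illustration under some assumptions: the function
names, the marker class "httrack-py-tmp" and the example URL are
invented here, the real callback signatures are not shown, and I
assume httrack leaves the attribute order and quoting of the
inserted links untouched. The str.replace() call merely stands in
for whatever link rewriting httrack does between the two callbacks.

```python
import re

LAUNCHIT = re.compile(r"javaScript:launchit\('(.*?)'\)")
MARK = "httrack-py-tmp"  # hypothetical marker to recognise our own links
TMP_LINK = re.compile(r'<a class="%s" href="(.*?)"></a>\n?' % MARK)

def preprocess_html(html):
    """Append one plain <a href> link per launchit() argument."""
    urls = LAUNCHIT.findall(html)
    extra = "".join('<a class="%s" href="%s"></a>\n' % (MARK, u) for u in urls)
    return html + extra

def postprocess_html(html):
    """Copy the (possibly rewritten) URLs back into launchit(), drop the links."""
    new_urls = iter(TMP_LINK.findall(html))  # same order as inserted
    html = TMP_LINK.sub("", html)            # remove the temporary links

    def repl(match):
        # Fall back to the original argument if a temporary link is missing.
        return "javaScript:launchit('%s')" % next(new_urls, match.group(1))

    return LAUNCHIT.sub(repl, html)

page = '''<a href="javaScript:launchit('http://example.com/m/a.html')">A</a>\n'''
pre = preprocess_html(page)
# Stand-in for httrack's rewriting; only touches real href="http..." attributes.
rewritten = pre.replace('href="http://example.com/m/', 'href="m/')
result = postprocess_html(rewritten)
print(result)  # <a href="javaScript:launchit('m/a.html')">A</a>
```

Matching the temporary links by position keeps the sketch short; a
more careful version would have to cope with httrack reordering or
re-escaping the inserted links.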
Abel