Hi,
I've recently started using httrack and it's by far the best crawler that I've
found, so thank you for releasing it!
I've hit a problem, though, with a (stupidly designed) site whose links, instead
of putting the URL in the href attribute where it belongs, use a JavaScript
onClick event to call a function.
I know it would be too much to expect a complete JavaScript engine inside
httrack... but in this particular case, the function is really simple:
<a href="#" onClick="foo(123)">
<script>
function foo(pagenum) {
document.location = "http://example.com/p/" + pagenum;
}
</script>
httrack spots the static part of the URL but obviously fails to deal with the
dynamic part (it repeatedly tries to download http://example.com/p/).
For now I'm writing a Perl script to assemble the links myself, feed them
back to httrack to fetch all the pages, and then rewrite the static part of the
URL-building function so that it builds a relative URL instead of an absolute one.
(I could alternatively replace the bogus JS links with proper ones, but that
might overstep my duty to preserve the site as accurately as possible to the
original...)
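To illustrate the link-assembly step, here is a rough sketch (in JavaScript for Node rather than Perl, to match the snippet above). It assumes the handler is always called as foo(<digits>) and that the prefix is the one in the example; BASE_URL and extractLinks are names made up for this illustration, not anything httrack provides.

```javascript
// Sketch only: collect the page numbers passed to foo() in onClick
// handlers and build the absolute URLs the site's script would load.
// BASE_URL copies the prefix from the example above (an assumption
// about the real site); extractLinks is a hypothetical helper name.
const BASE_URL = "http://example.com/p/";

function extractLinks(html) {
  // Matches onClick="foo(123)" with either quote style, any letter case.
  const pattern = /onclick\s*=\s*["']\s*foo\(\s*(\d+)\s*\)/gi;
  const urls = new Set(); // a Set drops duplicate links
  for (const match of html.matchAll(pattern)) {
    urls.add(BASE_URL + match[1]);
  }
  return [...urls];
}

// Print one URL per line, ready to be handed to the crawler.
const sample =
  '<a href="#" onClick="foo(123)">one</a> <a href="#" onClick="foo(124)">two</a>';
console.log(extractLinks(sample).join("\n"));
```

The resulting list can then be passed back to httrack as extra start URLs.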
I wonder if it would be possible to handle this within a future version of
httrack -- at least for simple functions which just assemble a URL based on
their arguments, with no processing of those arguments.
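To make "simple functions" concrete, here is a hedged sketch of the kind of pattern matching I have in mind, with no JavaScript engine involved. It only recognises functions whose entire body is a single assignment of "literal prefix" + parameter to document.location, and only resolves calls with a plain numeric argument. findUrlBuilders and resolveCalls are hypothetical names for this illustration and say nothing about how httrack's parser actually works.

```javascript
// Sketch only: find functions of the form
//   function NAME(PARAM) { document.location = "PREFIX" + PARAM; }
// and map NAME -> PREFIX. Anything more complex is ignored.
function findUrlBuilders(source) {
  const def = /function\s+(\w+)\s*\(\s*(\w+)\s*\)\s*\{\s*(?:document|window)\.location(?:\.href)?\s*=\s*(["'])(.*?)\3\s*\+\s*(\w+)\s*;?\s*\}/g;
  const builders = new Map();
  for (const m of source.matchAll(def)) {
    // Accept only if the appended name is the function's own parameter.
    if (m[2] === m[5]) builders.set(m[1], m[4]);
  }
  return builders;
}

// Resolve event handlers such as onClick="foo(123)" against the map.
function resolveCalls(source, builders) {
  const call = /\bon\w+\s*=\s*["']\s*(\w+)\(\s*(\d+)\s*\)/gi;
  const urls = [];
  for (const m of source.matchAll(call)) {
    if (builders.has(m[1])) urls.push(builders.get(m[1]) + m[2]);
  }
  return urls;
}

const page = [
  '<a href="#" onClick="foo(123)">',
  "<script>",
  "function foo(pagenum) {",
  'document.location = "http://example.com/p/" + pagenum;',
  "}",
  "</script>",
].join("\n");
console.log(resolveCalls(page, findUrlBuilders(page)));
```

Functions that do any processing of their arguments would simply fail to match and be left alone, which is the conservative behaviour I would expect.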
I think (unfortunately) there are quite a lot of CMS or database-driven sites
that use this kind of link.
Cheers,
Ben