> I totally agree with you Xavier: it is a very bad idea to
> make such sites.
>
> But why do you think it is so hard to make software that
> crawls JavaScript links? A web browser is able to interpret
> JavaScript, so shouldn't it be possible to crawl such a site?
There are many reasons; here is just one of them:
Some sites use choiceboxes, often combined with a JavaScript
attribute like onclick="..." to let the user select a link
to another page. Choiceboxes are meant to be used in forms,
where the user may have to provide more input than just
selecting an option from the choicebox. Think, for example,
of a form where you have to type in your name, combined
with a choicebox for "Mr/Mrs/Ms".
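Such a navigation choicebox might look something like this
(hypothetical markup, not taken from any real site):

  <form>
    <select onclick="window.location = this.value;">
      <option value="/news.html">News</option>
      <option value="/downloads.html">Downloads</option>
    </select>
  </form>

Here the "link target" only exists as the value attribute
of an option, and the browser only follows it because the
JavaScript in the onclick attribute tells it to.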
Hence a general-purpose mirror program like httrack or a
crawler like those used by Google and its competitors would
need to be able to somehow "interpret" the context of a
choicebox: is it worth trying to execute the JavaScript
code in the "onclick" attribute of a <select> tag, or not?
If you have a halfway reliable general solution for this
problem, I'd bet that you'll get a well-paid job at
Google ;)
If you only want to mirror one or a few web sites with
JavaScript links, the situation is different: reading the
HTML code, you'll most likely find some pattern in the
onclick attributes (or whatever else is used to generate
links with JavaScript code) that you can easily parse to
find the links. Httrack supports this approach with
plugins.
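For illustration, here is a minimal Python sketch of this
site-specific approach (it is not an actual httrack plugin,
and the sample HTML and the regular expressions are
assumptions about what one particular site might use):

  import re

  # Hypothetical page fragment; the onclick patterns are
  # assumptions about one specific site, not a general rule.
  html = '''
  <select onclick="window.location = this.value;">
    <option value="/news.html">News</option>
    <option value="/downloads.html">Downloads</option>
  </select>
  <a href="#" onclick="openPage('/contact.html')">Contact</a>
  '''

  # Site-specific patterns: option values fed to
  # window.location, and URLs passed to an openPage() helper.
  patterns = [
      re.compile(r'<option\s+value="([^"]+)"'),
      re.compile(r"openPage\('([^']+)'\)"),
  ]

  links = []
  for pattern in patterns:
      links.extend(pattern.findall(html))

  print(links)
  # ['/news.html', '/downloads.html', '/contact.html']

The point is that such patterns are trivial to write once
you have looked at the HTML of the site in question, but
they are useless for any other site.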
But remember that a web site can change details of its
JavaScript code at any time in ways that break your parser.
Catching up with such changes is not a big problem if you
deal only with a few sites, but trying to maintain a
general-purpose program like httrack this way would be a
nightmare.
Abel