Re: Can I use httrack to spider IBM WebSphere site

Subject: Re: Can I use httrack to spider IBM WebSphere site

Author: Abel Deuring

Date: 01/09/2005 13:59

> I totally agree with you Xavier: it is a very bad idea to
> make such sites.
> 
> But why do you think it is so hard to make a software that
> crawls javascript links? A web browser is able to interpret
> Javascript so it might be possible to crawl such a site?
There are many reasons; just one of them:

Some sites use choiceboxes, often combined with a javascript 
attribute like onclick="..." to let the user select a link 
to another page. Choiceboxes are supposed to be used in 
forms, where the user may need to provide more input than 
just selecting an option from the choicebox. Think for example 
of a form where you should type in your name, combined
with a choicebox for "Mr/Mrs/Ms". 

Hence a general purpose mirror program like httrack or a 
crawler like those used by Google and its competitors would 
need to be able to somehow "interpret" the context of a 
choicebox: Is it worth a try to execute the javascript 
code of an "onclick" attribute of a <select> tag, or not?
If you have a halfway reliable general solution for this
problem, I'd bet that you'll get a well paid job at 
Google ;)

If you only want to mirror one or a few web sites with 
Javascript links, the situation is different: reading the
HTML code, you'll most likely find some pattern in the 
onclick attributes (or whatever else is used to generate
links with some javascript code) that you can easily
parse to find the links. Httrack supports this approach
with plugins.
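
To illustrate the idea (this is not httrack's plugin interface,
just a standalone Python sketch): assume the site you want to
mirror builds its links with code like location.href='...' in
onclick or onchange attributes. That pattern, the names
LINK_PATTERN, JsLinkExtractor and extract_js_links, and the
example URL are all made up for this example; you would replace
the regular expression with whatever you actually find in the
site's HTML.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed, site-specific pattern: location.href = '...' (or "...").
# Adjust this to whatever the target site really uses.
LINK_PATTERN = re.compile(r"""location\.href\s*=\s*['"]([^'"]+)['"]""")


class JsLinkExtractor(HTMLParser):
    """Collects URLs found in onclick/onchange attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Attribute names arrive lowercased; value may be None.
            if name in ("onclick", "onchange") and value:
                for target in LINK_PATTERN.findall(value):
                    # Resolve relative targets against the page URL.
                    self.links.append(urljoin(self.base_url, target))


def extract_js_links(html, base_url):
    parser = JsLinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

A <select> that is only part of an ordinary form (like the
"Mr/Mrs/Ms" case) has no matching attribute and yields nothing,
which works here only because the pattern was chosen by hand
for one known site.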

But remember that a web site can at any time change some 
details of the javascript code in a way that breaks your parser. 
Catching up with such changes is not a big problem if you 
deal only with a few sites, but trying to maintain a general
purpose program like httrack would be a nightmare.

Abel
