| I think it would be dangerous for a program like HTTrack to have functionality
for following forms, or even "understanding" JavaScript. Sure, there are always
going to be times when it would be great to have, but in general, how could the
software know what's "safe" to follow or parse and what's unsafe?
Forms for subscriptions, myriad 'search this site' forms, sending feedback (via
server-side email), performing editing functions, navigation with variables...
there are a huge number of things like this that a crawler would get stuck on,
or that would make it hit a server pointlessly.
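
To make the "safe vs. unsafe" question concrete, here is a rough Python sketch of the kind of conservative rule a crawler might apply. This is not how HTTrack works; `FormAudit`, `looks_safe_to_follow` and `HARMLESS_FIELDS` are made-up names, and the rule itself (only follow GET forms whose fields are all hidden or submit inputs) is just an assumption for illustration.

```python
from html.parser import HTMLParser

# Assumption: only these field types are treated as harmless to submit blindly.
HARMLESS_FIELDS = {"hidden", "submit"}


class FormAudit(HTMLParser):
    """Collects every <form> with its method, action and field types."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            # HTML defaults to GET when no method is given.
            self._current = {
                "method": (a.get("method") or "get").lower(),
                "action": a.get("action") or "",
                "fields": [],
            }
            self.forms.append(self._current)
        elif self._current is not None:
            if tag == "input":
                # An <input> without a type is a text field.
                self._current["fields"].append((a.get("type") or "text").lower())
            elif tag in ("textarea", "select"):
                self._current["fields"].append(tag)

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def looks_safe_to_follow(form):
    """Hypothetical rule: GET only, and nothing the user would have to fill in."""
    if form["method"] != "get":
        return False
    return all(field in HARMLESS_FIELDS for field in form["fields"])
```

Even a rule this cautious has problems in both directions: it refuses the POST-based "next page" form described below, and it cannot tell whether a GET form triggers something on the server that it shouldn't.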
> However, some issues are still not solved in HTTrack (or in any other crawler
I know of). I don't know how Google does this!
>
> The problem is, how can the crawler follow <form method=post> and some
JavaScript tricks? For example: a web site (such as a forum or BBS) lists lots
of items, and there is a link called "next page" on it. This link does NOT use
something like "list.php?page=2"; instead, it calls a JavaScript function to
set a hidden variable, then posts the form to go to the next page.
>
> How can HTTrack solve this? As far as I know, it is very hard. I am thinking
of using IE's DOM capability to drive the browser and download such pages (see
<http://wtr.rubyforge.org/>). What do you think? | |