> The way HTTrack was used on our site was much like a
> denial of service attack.
Some users do abuse tools like HTTrack, that's a
fact. It is the same with tools like lftp, which
are really great but can be used in a wrong way
(for example through multiple FTP connections). I have set
up some default behaviour to avoid problems (such as a
default of 8 connections and a limit on the number of
connections per second), but these limits can be
overridden, and overriding them can lead to network abuse.
I cannot set hard limits - setting up 32 simultaneous
connections, for example, can be useful for testing
intranet response and performance (I often use
HTTrack to run load tests on the intranet), or for
crawling websites with the bandwidth-limit option (useful
for sites with many links to test: many
simultaneous parsers, but with limited bandwidth to
avoid overload). The same goes for the option that
disables robots.txt: useless for standard spiders, but
useful for forums, when a user wants to keep an
archive of some messages. And the same applies to the
user-agent option: some websites will only serve the
correct content if the user-agent looks like an MSIE 5.5
browser.
The problem comes when ALL these options are combined in
a bad way: many simultaneous connections, bandwidth
limit disabled, robots.txt disabled, connection limit
disabled and so on. It's really impossible to detect a
potentially abusive configuration, which also
depends on the internet connection speed (you won't
overload a website with a 56K modem, even with 32
connections; you will with a T3 line).
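To make the contrast concrete, here is a minimal sketch of launching HTTrack from Python with the safeguards kept in place. The -c, -%c, -A, -s and -F option letters are quoted from memory of the HTTrack documentation, so check them against your local man page; the URL, output directory and limit values are just placeholders:

```python
import subprocess

# A "polite" mirroring job: keep the default connection count, cap the
# connection rate and the total bandwidth, and obey robots.txt.
polite_cmd = [
    "httrack", "http://www.example.com/",  # placeholder URL
    "-O", "/tmp/mirror",                   # placeholder output directory
    "-c8",      # 8 simultaneous connections (the default mentioned above)
    "-%c4",     # at most 4 new connections per second
    "-A25000",  # limit the transfer rate to about 25 KB/s
    "-s2",      # always follow robots.txt rules
]

# The abusive mix described above would be the opposite: dozens of
# connections, no bandwidth cap, robots.txt ignored (-s0), and so on.
subprocess.run(polite_cmd, check=True)
```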
The best way is to warn users not to abuse the
bandwidth, or, more efficiently, to set limits on the
number of requests or on the bandwidth allowed per user
(see mod_bandwidth and mod_throttle for Apache), as is
done for many web servers and FTP servers.
Filtering on the user-agent, or setting robots.txt rules,
will work with all "normal" users - but it won't work for
people who REALLY want to abuse. In case of repeated
abuse, which might be an attack as well, the best way is
to warn the abuser's admin, or to temporarily filter the
IP (a good solution is to count the hits per 30-second
window, and if a limit is reached, temporarily ban the IP
address for 3 minutes).
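A minimal sketch of that counting-and-banning idea follows. The 30-second window and 3-minute ban are the figures from the paragraph above; the hit limit, the data structures and the function name are hypothetical:

```python
import time
from collections import defaultdict

WINDOW = 30          # seconds per counting window (the 30'' above)
HIT_LIMIT = 300      # hypothetical: hits allowed per window before banning
BAN_DURATION = 180   # seconds, i.e. the 3-minute temporary ban

hits = defaultdict(list)   # ip -> timestamps of recent hits
banned_until = {}          # ip -> time at which the ban expires

def allow_request(ip, now=None):
    """Return False if this IP is currently banned, otherwise record the hit."""
    now = time.time() if now is None else now

    # Still banned?
    if banned_until.get(ip, 0) > now:
        return False

    # Keep only the hits inside the current 30-second window.
    recent = [t for t in hits[ip] if now - t < WINDOW]
    recent.append(now)
    hits[ip] = recent

    # Too many hits in the window: ban the IP for 3 minutes.
    if len(recent) > HIT_LIMIT:
        banned_until[ip] = now + BAN_DURATION
        return False
    return True
```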
Again, as said in the FAQs, abuse is a real problem
for tools like offline browsers. We didn't develop
this tool to cause bandwidth abuse or attacks,
nor any other abuse, like the infamous email
grabbers that some nasty companies sell. We are GPL,
free, with the source code available, so we really have
no interest in giving people potentially dangerous
tools.
We tried to disable obviously dangerous options
(setting up multiple proxies for load balancing, an email
catcher, and so on). But some people will always find a
way to misuse whatever tools they get.