I have begun to write a FAQ answer on this problem.
This is only a pre-draft, so do not hesitate to point
out errors or remarks (or grammar/syntax problems - as
you may have noticed, I am not a native English
speaker!)
HTTrack Website Copier FAQ (update - DRAFT)
Q. How to block offline browsers, like HTTrack?
A. This is a complex question, so let's study it.
First, there are several different reasons why you
might want to block offline browsers:
1. Because a large part of your bandwidth is used by
some users, who are slowing down the rest
2. Because of copyright issues (you do not want
people to copy parts of your website)
3. Because of privacy (you do not want email grabbers
to steal all your users' emails)
1. Bandwidth abuse:
Many webmasters are concerned about bandwidth abuse,
even though this problem is caused by
a minority of people. Offline browsing tools, like
HTTrack, can be used in a WRONG way, and
are therefore sometimes considered a potential
danger.
But before concluding that all offline browsers are
BAD, consider this:
students, teachers, IT consultants, websurfers and
many other people who like your website may want to
copy parts of it, for their work or their studies, to
teach or demonstrate to people during school classes
or shows. They might do that because they are
connected through an expensive modem connection,
or because they would like to consult pages while
travelling, or archive sites that may be removed
one day, do some data mining, compile information
("if only I could find this website I saw one day..").
There are many good reasons to mirror websites, and
this helps many good people.
As a webmaster, you might be interested in using such
tools, too: to test broken links, move a website to
another location, check which external links are
present on your website (for legal/content control),
test the webserver response and performance, index
it..
Anyway, bandwidth abuse can be a problem. If your site
is regularly "clobbered" by evil downloaders, you have
various solutions: radical ones, and
intermediate ones. I strongly recommend not using
the radical solutions, because of the previous remarks
(good people often mirror websites).
In general, for all solutions:
the good thing: it will limit the bandwidth abuse
the bad thing: depending on the solution, it will be
either a small constraint, or a fatal nuisance (you'll
get 0 visitors)
or, to be extreme: if you unplug the wire, there will
be no bandwidth abuse
a- Inform people, explain why ("please do not clobber
the bandwidth")
Good: Will work with good people. Many good people
just don't KNOW that they can slow down a network.
Bad: Will **only** work with good people
How to do: Obvious - place a note, a warning, an
article, a drawing, a poem or whatever you want
b- Use a "robots.txt" file
Good: Easy to set up
Bad: Easy to override
How to do: Create a robots.txt file in the top-level
directory, with proper parameters
Example:
User-agent: *
Disallow: /bigfolder
c- Ban the registered User-Agents of offline browsers
Good: Easy to set up
Bad: Radical, and easy to override
How to do: Filter the "User-Agent" HTTP header field
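Example: (a rough sketch in php - the user-agent names
below are only examples, and this header is trivial to
fake anyway)
<?php
// Hypothetical ban list: any User-Agent containing one
// of these substrings gets a 403 error
$banned = array("HTTrack", "Wget", "WebZIP");
$ua = isset($_SERVER['HTTP_USER_AGENT'])
    ? $_SERVER['HTTP_USER_AGENT'] : "";
for ($i = 0; $i < count($banned); $i++) {
    if (stristr($ua, $banned[$i])) {
        header("HTTP/1.0 403 Forbidden");
        exit("Offline browsers are not welcome here");
    }
}
?>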
d- Limit the bandwidth per IP (or per folder)
Good: Efficient
Bad: Multiple users behind a proxy will be slowed
down, and it is not really easy to set up
How to do: Depends on the webserver. Might be done with
low-level IP rules (QoS)
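Example: (a very rough sketch in php - note that this
throttles per connection, not per IP; real per-IP
limits are better done with QoS rules. The file name
is hypothetical.)
<?php
// Serve a large file in small chunks, pausing between
// chunks: roughly 16 KB per second for this connection
$path = "bigfile.zip";
header("Content-Type: application/octet-stream");
header("Content-Length: " . filesize($path));
$fp = fopen($path, "rb");
while (!feof($fp)) {
    echo fread($fp, 16384);
    flush();
    sleep(1);
}
fclose($fp);
?>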
e- Prioritize small files over large files
Good: Efficient if large files are the cause of abuse
Bad: Not always efficient
How to do: Depends on the webserver
f- Ban abuser IPs
Good: Immediate solution
Bad: Annoying to do, useless for dynamic IPs, and not
very user friendly
How to do: Either ban IPs on the firewall, or on the
webserver (see ACLs)
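Example: (a minimal sketch in php - the addresses
below are documentation examples, and a real list
would live in a file or database)
<?php
// Reject requests coming from hard-coded abuser IPs
$banned = array("192.0.2.17", "198.51.100.4");
if (in_array($_SERVER['REMOTE_ADDR'], $banned)) {
    header("HTTP/1.0 403 Forbidden");
    exit;
}
?>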
g- Limit abusers' IPs
Good: Intermediate and immediate solution
Bad: Annoying to do, useless for dynamic IPs, and
annoying to maintain..
How to do: Use router QoS (fair queuing), or
webserver options
h- Use technical tricks (like javascript) to hide URLs
Good: Efficient
Bad: The most efficient tricks will also make your
website heavy and not user-friendly (and
therefore less attractive, even for surfing users).
Remember: clients or visitors might want to consult
your website offline. Advanced users will also still
be able to note the URLs and catch them. It will not
work on non-javascript browsers, and it will not work
if the user clicks 50 times and puts the downloads in
the background with a standard browser
How to do: Most offline browsers (I would say all, but
let's say most) are unable to "understand"
javascript/java properly. Reason: it is very tricky to
handle!
Example:
You can replace:
<a href="bigfile.zip">Foo</a>
by:
<script language="javascript">
<!--
// Write the link in pieces, so that the URL never
// appears verbatim in the HTML source
document.write('<a h' + 're' + 'f="');
document.write('bigfile' + '.' + 'zip">');
// -->
</script>
Foo
</a>
You can also use java-based applets. I would say that
this is the "best of the horrors": a big, fat, slow,
buggy java applet. Avoid!
i- Use technical tricks to lag offline browsers
Good: Efficient
Bad: Can be avoided by advanced users, annoying to
maintain, AND potentially worse than the disease
(cgi's often take some CPU). It will not
work if the user clicks 50 times and puts the
downloads in the background with a standard browser
How to do: Create fake empty links that point to
cgi's, with long delays
Example: Use things like
<a href="slow.cgi?p=12786549"><nothing></a>
(example in php:)
<?php
// Waste the downloader's time: about one minute of
// delay, sent one space at a time
for ($i = 0; $i < 10; $i++) {
    sleep(6);
    echo " ";
    flush();
}
?>
j- Use technical tricks to temporarily ban IPs
Good: Efficient
Bad: Radical (your site will only be available online,
for all users), not easy to set up
How to do: Create fake links with "killing" targets
Example: Use things like
<a href="killme.cgi"><nothing></a>
(again an example in php:)
<?php
// Of course, "add_temp_firewall_rule" has to be
// written..
add_temp_firewall_rule($_SERVER['REMOTE_ADDR'], "30s");
?>
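One possible (purely hypothetical) way to write it,
without giving the webserver any firewall privileges:
append the offending IP to a spool file, and let a
privileged cron job (not shown here) insert the real
firewall rule and remove it after the delay:
<?php
// Hypothetical helper: record the IP and ban duration;
// a separate privileged job applies the actual rule
function add_temp_firewall_rule($ip, $delay) {
    $fp = fopen("/var/tmp/banned-ips.txt", "a");
    fwrite($fp, $ip . " " . $delay . "\n");
    fclose($fp);
}
?>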
2. Copyright issues
You do not want people to "steal" your website, or
even copy parts of it. First, stealing a website does
not require an offline browser. Second, direct
(and credited) copying is sometimes better than
disguised plagiarism. Besides, several previous
remarks are also relevant here: the more protected
your website is, the less attractive it may become.
And there is no perfect solution, either. A webmaster
asked me one day to give him a solution to prevent any
copying of his website. Not only with offline
browsers, but also with "save as", cut and paste,
print.. and print screen. I replied that it was not
possible, especially for the print screen - and that
another potential threat was the evil photographer.
Maybe with a "this document will self-destruct in 5
seconds.." or by shooting users after they have
consulted the document.
More seriously, once a document is placed on a
website, there will always be a risk of copying (or
plagiarism).
To limit the risk, solutions a- and h- from the
"bandwidth abuse" section above can be used
3. Privacy
This might be related to section 2, but the greatest
risk is probably email grabbers. A solution can be to
use javascript to hide emails.
Good: Efficient
Bad: Will not work on non-javascript browsers
How to do: Use javascript to build mailto: links
Example: (in javascript)
<script language="javascript">
<!--
// Build the mailto: link at display time, so that the
// address never appears verbatim in the HTML source
function FOS(host, nom, query, info) {
  var s = "mail";
  if (info == "") info = nom + "@" + host;
  document.write("<a href='" + s + "to:" + nom + "@"
    + host + query + "'>" + info + "</a>");
}
FOS('mycompany.com', 'smith', '?subject=Hi, John',
  'Click here to email me!')
// -->
</script>