HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Any reliable way to block HTTrack
Author: Xavier Roche
Date: 03/27/2001 21:40
 
I have begun to write a FAQ answer on this problem.
This is only a pre-draft, so do not hesitate to point
out errors or remarks (or grammar/syntax problems: as
you may have noticed, I am not a native English speaker!)



HTTrack Website Copier FAQ (update - DRAFT)

Q. How can I block offline browsers, like HTTrack?
A. This is a complex question, so let's study it.

First, there are several different reasons for wanting to do this.
Why do you want to block offline browsers?

1. Because a large part of your bandwidth is used by
a few users, who are slowing down the rest
2. Because of copyright questions (you do not want
people to copy parts of your website)
3. Because of privacy (you do not want email grabbers
to steal all your users' email addresses)


1. Bandwidth abuse:

Many webmasters are concerned about bandwidth abuse,
even if this problem is caused by
a minority of people. Offline browser tools, like
HTTrack, can be used in a WRONG way, and
are therefore sometimes considered a potential
danger.
But before deciding that all offline browsers are BAD,
consider this:
students, teachers, IT consultants, websurfers and
many other people who like your website may want to copy
parts of it for their work or their studies, or to teach
or demonstrate to people in class or at
shows. They might do that because they are connected
through an expensive modem connection,
or because they would like to consult pages while
travelling, or to archive sites that may be removed
one day, to do some data mining, or to compile information
("if only I could find that website I saw one day..").
There are many good reasons to mirror websites, and
this helps many good people.
As a webmaster, you might be interested in using such
tools, too: to test for broken links, move a website to
another location, check which external links are put
on your website for legal/content control,
test the webserver's response and performance, index
it..

Anyway, bandwidth abuse can be a problem. If your site
is regularly "clobbered" by evil downloaders, you have
various solutions, some radical and some
intermediate. I strongly recommend not using
radical solutions, because of the previous remarks
(good people often mirror websites).

In general, for all solutions:
the good thing: it will limit the bandwidth abuse
the bad thing: depending on the solution, it will be
either a small constraint or a fatal nuisance (you'll
get 0 visitors)
or, to take it to the extreme: if you unplug the wire, there will
be no bandwidth abuse

a- Inform people, and explain why ("please do not clobber
the bandwidth")
Good: Will work with good people. Many good people
just don't KNOW that they can slow down a network.
Bad: Will **only** work with good people
How to do: Obvious - place a note, a warning, an
article, a drawing, a poem or whatever you want

b- Use a "robots.txt" file
Good: Easy to set up
Bad: Easy to override
How to do: Create a robots.txt file in the top-level
directory of the site, with the proper parameters
Example:
	User-agent: *
	Disallow: /bigfolder
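
To see why this is easy to override: the check is done entirely by the
client. A polite robot reads the file and skips the listed paths, and an
impolite one simply does not. Here is a minimal sketch of that client-side
check, in javascript. The function names are made up for this example, and
only "User-agent: *" and "Disallow:" lines are understood, so take it as an
illustration and not as a complete parser.

```javascript
// Hypothetical helper: collect the Disallow prefixes that apply to all robots.
// Simplification: only "User-agent: *" groups and "Disallow:" lines are handled.
function parseDisallowRules(robotsTxt) {
  const rules = [];
  let appliesToAll = false;
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // strip comments
    if (line === "") continue;
    const sep = line.indexOf(":");
    if (sep < 0) continue; // not a "field: value" line
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === "user-agent") {
      appliesToAll = (value === "*");
    } else if (field === "disallow" && appliesToAll && value !== "") {
      rules.push(value);
    }
  }
  return rules;
}

// A path is allowed unless it starts with one of the disallowed prefixes.
function isAllowed(path, rules) {
  return !rules.some((prefix) => path.startsWith(prefix));
}
```

With the example file above, isAllowed("/bigfolder/file.zip", rules) is
false and isAllowed("/index.html", rules) is true, but only for robots
that bother to ask.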

c- Ban the User-agents of known offline browsers
Good: Easy to set up
Bad: Radical, and easy to override
How to do: Filter on the "User-agent" HTTP header field
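
A minimal sketch of such a filter, in javascript. The pattern list and the
function name are assumptions made for this example, not an official list;
how you plug the check into your webserver depends on the webserver. Since
the user can change the header freely, this is a courtesy filter at best.

```javascript
// Illustrative list only: substrings that some offline browsers may send
// by default. Any user can change the header, so do not rely on it.
const BLOCKED_AGENT_PATTERNS = ["httrack", "wget"];

// Returns true when the User-agent header value matches a blocked pattern.
function isBlockedUserAgent(userAgent) {
  if (!userAgent) return false; // missing header: let the request through
  const ua = userAgent.toLowerCase();
  return BLOCKED_AGENT_PATTERNS.some((pattern) => ua.includes(pattern));
}
```

A request whose header contains "HTTrack" would be refused, and an ordinary
browser header would pass.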

d- Limit the bandwidth per IP (or per folder)
Good: Efficient
Bad: Multiple users behind the same proxy will be slowed down;
not really easy to set up
How to do: Depends on the webserver. Might be done with
low-level IP rules (QoS)
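
If the webserver lets you hook your own logic in, one possible way to do
this is a "token bucket" per IP. The sketch below (in javascript) is only an
illustration of the idea: the class name, the parameters and the numbers are
all made up for this example, and a real setup would more likely use the
webserver's or the router's own options.

```javascript
// Per-IP token bucket: each IP may send `burstBytes` at once, and is then
// refilled at `bytesPerSecond`. All names and numbers are illustrative.
class PerIpLimiter {
  constructor(bytesPerSecond, burstBytes) {
    this.rate = bytesPerSecond;
    this.burst = burstBytes;
    this.buckets = new Map(); // ip -> { tokens, last }
  }

  // Returns true if `bytes` may be sent to `ip` at time `nowMs`.
  allow(ip, bytes, nowMs = Date.now()) {
    let bucket = this.buckets.get(ip);
    if (!bucket) {
      bucket = { tokens: this.burst, last: nowMs };
      this.buckets.set(ip, bucket);
    }
    // Refill according to the time elapsed since the last call.
    const elapsedSec = Math.max(0, nowMs - bucket.last) / 1000;
    bucket.tokens = Math.min(this.burst, bucket.tokens + elapsedSec * this.rate);
    bucket.last = nowMs;
    if (bytes > bucket.tokens) return false; // over quota: delay or refuse
    bucket.tokens -= bytes;
    return true;
  }
}
```

Note that the map grows with every new IP, so a real version would have to
forget idle entries. Users behind the same proxy share one bucket, which is
exactly the drawback mentioned above.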

e- Prioritize small files over large files
Good: Efficient if large files are the cause of the abuse
Bad: Not always efficient
How to do: Depends on the webserver

f- Ban abusers' IPs
Good: Immediate solution
Bad: Annoying to do, useless against dynamic IPs, and not
very user-friendly
How to do: Ban the IPs either on the firewall or on the
webserver (see ACLs)
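
Firewalls and webserver ACLs have their own syntax for this. Only to show
what such a rule checks, here is a small javascript sketch that tests an
IPv4 address against a ban list of single addresses and address blocks. The
function names and the list are made up for this example, and the addresses
are taken from ranges reserved for documentation.

```javascript
// Convert dotted IPv4 text to a number (0 .. 2^32-1), or null if malformed.
function ipv4ToInt(ip) {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let n = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    n = n * 256 + octet;
  }
  return n;
}

// True when `ip` falls inside `cidr`, e.g. "192.0.2.0/24" or a single address.
function inCidr(ip, cidr) {
  const [base, bitsText = "32"] = cidr.split("/");
  if (!/^\d{1,2}$/.test(bitsText)) return false;
  const bits = Number(bitsText);
  const ipNum = ipv4ToInt(ip);
  const baseNum = ipv4ToInt(base);
  if (ipNum === null || baseNum === null || bits > 32) return false;
  const size = 2 ** (32 - bits); // number of addresses in the block
  return Math.floor(ipNum / size) === Math.floor(baseNum / size);
}

// Hypothetical ban list (documentation-only address ranges).
const BANNED = ["192.0.2.0/24", "198.51.100.7"];

function isBanned(ip) {
  return BANNED.some((entry) => inCidr(ip, entry));
}
```

Banning a whole block catches an abuser who reconnects with a neighbouring
address, but it also catches every innocent user in that block.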

g- Limit abusers' IPs
Good: Intermediate and immediate solution
Bad: Annoying to do, useless against dynamic IPs, and
annoying to maintain..
How to do: Use router QoS (fair queuing), or
webserver options

h- Use technical tricks (like javascript) to hide URLs
Good: Efficient
Bad: The most efficient tricks will also make your
website heavy and not user-friendly (and
therefore less attractive, even for surfing users).
Remember: clients or visitors might want to consult
your website offline. Advanced users will also
still be able to note the URLs and fetch them. It will not
work on non-javascript browsers. It will not stop a
user who clicks 50 times and puts the downloads in the
background with a standard browser
How to do: Most offline browsers (I would say all, but
let's say most) are unable to "understand"
javascript/java properly. Reason: very tricky to
handle!
Example: 
You can replace:
	<a href="bigfile.zip">Foo</a>
by:
	<script language="javascript">
	<!--
	  document.write('<a h' + 're' + 'f="');
	  document.write('bigfile' + '.' + 'zip">');
	// -->
	</script>
	Foo
	</a>

You can also use java-based applets. I would say that
this is the "best of the horrors": a big, fat, slow,
buggy java applet. Avoid!

i- Use technical tricks to lag offline browsers
Good: Efficient
Bad: Can be avoided by advanced users, annoying to
maintain, AND potentially worse than the disease
(cgi's often take some CPU time). It will not
stop a user who clicks 50 times and puts the downloads in the
background with a standard browser
How to do: Create fake empty links that point to
cgi's with long delays
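Example: Use things like <a href="slow.cgi?p=12786549"><nothing></a> (example
in php:)
	<?php
	// Send one space every 6 seconds, for about one minute in total
	for ($i = 0; $i < 10; $i++) {
		sleep(6);
		echo " ";
		flush(); // push the byte out now, so the connection stays busy
	}
	?>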

j- Use technical tricks to temporarily ban IPs
Good: Efficient
Bad: Radical (your site will only be available online,
for all users), not easy to set up
How to do: Create fake links with "killing" targets
Example: Use things like <a href="killme.cgi"><nothing></a> (again an example in
php:)
	<?php
		// Of course, "add_temp_firewall_rule" has to be written..
		add_temp_firewall_rule($_SERVER['REMOTE_ADDR'], "30s");
	?>
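
The "add_temp_firewall_rule" function above is left to be written, and a
real one would talk to your firewall. As a stand-in, here is a javascript
sketch of the bookkeeping only: a table that remembers each banned IP
together with the time at which the ban ends. The class and method names
are made up for this example, and it works at the application level, not in
the firewall.

```javascript
// In-memory stand-in for the hypothetical add_temp_firewall_rule():
// remembers each banned IP together with the time its ban ends.
class TempBanList {
  constructor() {
    this.expiry = new Map(); // ip -> timestamp (ms) at which the ban ends
  }

  // Ban `ip` for `seconds`, counted from `nowMs`.
  ban(ip, seconds, nowMs = Date.now()) {
    this.expiry.set(ip, nowMs + seconds * 1000);
  }

  // True while the ban is still running; expired entries are forgotten.
  isBanned(ip, nowMs = Date.now()) {
    const until = this.expiry.get(ip);
    if (until === undefined) return false;
    if (nowMs >= until) {
      this.expiry.delete(ip);
      return false;
    }
    return true;
  }
}
```

The "killing" target would call ban() with the client's IP, and every other
page would check isBanned() before answering. Keeping the ban short (30
seconds in the example above) limits the damage if a normal visitor follows
the fake link by accident.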


2. Copyright issues

You do not want people to "steal" your website, or
even copy parts of it. First, stealing a website does
not require an offline browser. Second, direct
(and credited) copying is sometimes better than disguised
plagiarism. Besides, several of the previous remarks also
apply here: the more protected your website is,
the less attractive it will potentially be. There
is no perfect solution, either. A webmaster once asked me
to give him a solution to prevent any copying of his website:
not only by offline browsers, but also by "save as",
cut and paste, print.. and print screen. I replied
that it was not possible, especially for print
screen - and
that another potential threat was the evil
photographer. Maybe with a "this document will
self-destruct in 5 seconds.."
or by shooting users after they have consulted the document.
More seriously, once a document is placed on a
website, there will always be a risk of copying (or
plagiarism).

To limit the risk, solutions a- and h- from the
"bandwidth abuse" section can be used.


3. Privacy

This might be related to section 2, but the greatest risk
is probably email grabbers. A solution can be to use
javascript to hide email addresses.
Good: Efficient
Bad: Will not work on non-javascript browsers
How to do: Use javascript to build mailto: links
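Example: (in javascript)
	<script language="javascript">
	<!--
	// Writes a mailto: link, so that the complete address never
	// appears as plain text in the page source
	function FOS(host,nom,info) {
	  var s;
	  if (info == "") info=nom+"@"+host;
	  s="mail";
	  document.write("<a href='"+s+"to:"+nom+"@"+host+"'>"+info+"</a>");
	}
	// The "?subject=" part must come after the host name
	FOS('mycompany.com?subject=Hi, John','smith','Click here to email me!')
	// -->
	</script>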






 