Using HTTrack I was blocked by a web-site:
<http://senseis.xmp.net/>
On the web-site they explain the reasons:
they check the referrer, they use trap links, and they want
a wait period between requests.
Does HTTrack send the referrer? Is it possible to configure an automatic
wait period between requests?
How can I enable HTTrack to mirror this web-site?
I attached the content of the access blocked message.
account37
------------------------------------------------------------
Access Blocked
------------------------------------------------------------
Keywords: SL description
Sensei's Library tries to protect itself from dumb
mirroring scripts that issue some thousand requests within
minutes bringing our server to its knees.
A first measure is to block access to any function other
than viewing a page if there is no referrer information
present [1]. What does this mean?
Every time you click on a link your browser sends a request
to our server to get the desired page and pictures. This
request not only contains the pagename which you would like
to see, but also which page you are coming from (referrer
information).
However, mirroring scripts don't send this information. So
checking for the referrer information is an easy way to
distinguish between scripts and regular browsers.
If there is no referrer information, then everything but
viewing a page is blocked (e.g. diff, edit, save, search,
pageinfo, ...).
If you get the "AccessBlocked" message as a regular user,
then either you are not using a standard browser, or you have
configured your browser not to send the referrer
information, or a proxy you are using is removing this
information. Solution: change your settings or set a cookie
[1].
As the referrer information has to originate from within SL
it is no longer possible to link to diff, pageinfo, etc.
from other websites. Note: you can still link to pages
themselves.
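
The referrer rule described above can be pictured roughly as follows. This is only a sketch of the behaviour as the page describes it, not Sensei's Library's actual code: the function name `is_allowed`, the action strings, and the way cookies are passed in are all assumptions made for illustration. The host name comes from the URL in the post and the cookie name from footnote [1].

```python
from urllib.parse import urlparse

SITE_HOST = "senseis.xmp.net"  # host taken from the URL quoted in the post
PREFS_COOKIE = "SLPrefs"       # cookie name taken from footnote [1]


def is_allowed(action, referrer=None, cookies=None):
    """Return True if a request would pass the referrer check (hypothetical helper)."""
    # Plain page views are never blocked by this check.
    if action == "view":
        return True
    # Footnote [1]: the check is skipped when the SLPrefs cookie is set.
    if cookies and PREFS_COOKIE in cookies:
        return True
    # Any other function needs referrer information ...
    if not referrer:
        return False
    # ... and that referrer has to originate from within the site itself.
    return urlparse(referrer).hostname == SITE_HOST
```

Under these assumptions, a diff request with no referrer is refused, while the same request arriving from a page on senseis.xmp.net, or carrying the SLPrefs cookie, goes through.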
As a second measure, if the misbehaving script insists on
requesting such pages over and over, it will be dynamically
added to a block list for 48 hours. There's also a trap
link on the pages for scripts to follow. Users should not
normally be able to see this link. (You can see it in the
source, but don't try it out or your address will be
blocked for 48 hours. Really. We mean it.)
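
A timed block list with a trap link could look something like the sketch below. It is an illustration under stated assumptions, not the site's implementation: the class `BlockList`, the function `handle_request`, and the trap path `/trap` are hypothetical names. Only the 48-hour duration comes from the page. The "over and over" trigger is left out because the page gives no threshold for it.

```python
import time

BLOCK_SECONDS = 48 * 60 * 60  # 48 hours, as stated on the page


class BlockList:
    """Remembers blocked addresses until their block expires (hypothetical)."""

    def __init__(self, clock=time.time):
        self._clock = clock          # injectable so the expiry can be tested
        self._blocked_until = {}     # address -> expiry timestamp

    def block(self, address):
        self._blocked_until[address] = self._clock() + BLOCK_SECONDS

    def is_blocked(self, address):
        expiry = self._blocked_until.get(address)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            # The block has run out; forget the address.
            del self._blocked_until[address]
            return False
        return True


def handle_request(block_list, address, path, trap_path="/trap"):
    """Return "blocked" or "ok" for one request; trap_path is an assumed name."""
    if block_list.is_blocked(address):
        return "blocked"
    if path == trap_path:
        # Only a script would follow the hidden trap link.
        block_list.block(address)
        return "blocked"
    return "ok"
```

With this sketch, a single request for the trap path blocks every later request from the same address until 48 hours have passed.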
The above measures should shield SL from the most
offensive scripts. What if you would still like to
mirror/download SL? Use a friendly script such as wget,
which obeys robots.txt. Or download a ready-packed snapshot
at SLSnapshot. If you use wget, don't forget to specify a
wait period between the requests (at least "-w 3"). Yes, it
will take some hours, but that way our server will still be
accessible to others as well. If this advice isn't followed,
we may think of even more restrictive measures. You have
been warned.
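
For readers who want to see what "friendly" means in practice, here is a minimal sketch of a mirroring loop that obeys robots.txt, sends a referrer, and waits between requests. It is not how wget or HTTrack work internally; `polite_mirror`, `http_fetch`, and the user-agent string are hypothetical names, and the 3-second pause simply mirrors the "-w 3" minimum mentioned above.

```python
import time
import urllib.request
from urllib.robotparser import RobotFileParser

WAIT_SECONDS = 3  # mirrors the "-w 3" minimum suggested above


def http_fetch(url, referrer=None):
    """Fetch one URL, sending a Referer header when a referrer is known."""
    headers = {"Referer": referrer} if referrer else {}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return response.read()


def polite_mirror(urls, robots_lines, fetch=http_fetch,
                  user_agent="example-mirror", sleep=time.sleep):
    """Fetch every URL that robots.txt allows, pausing between requests."""
    rules = RobotFileParser()
    rules.parse(robots_lines)  # robots.txt content, already split into lines
    pages = {}
    previous = None
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # robots.txt disallows this URL, so skip it
        if previous is not None:
            sleep(WAIT_SECONDS)  # wait period between requests
        # The previously fetched page is sent as the referrer.
        pages[url] = fetch(url, referrer=previous)
        previous = url
    return pages
```

The `fetch` and `sleep` parameters are injectable so the loop can be tried out without touching a real server. Fetching the pages one at a time with a fixed pause is what keeps the load on the server low.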
Contact ArnoHollosi or MortenPahle if you have further
questions.
[1] The referrer check is circumvented if you have the
SLPrefs cookie set (e.g. Mozilla currently doesn't send
referrer information if you open a page in a new window).
------------------------------------------------------------
Gorobei: Dumb question: you do have a robots.txt file to
keep well-behaved spiders from hammering the site?
Arno: Yes, we do: <http://senseis.xmp.net/robots.txt> But
the dumb scripts used recently don't obey robots.txt.
------------------------------------------------------------
I'm getting 'Access blocked because of missing referrer
information' if I try to use 'open in new window' to get at
a diff page. Would it perhaps be possible to allow access
to agents which send an SLPrefs cookie, even in the absence
of referrer information? (Using oldish Mozilla (0.9.1) on
GNU/Linux.)
--Matthew Woodcraft
Added a bypass of the check for users who have the cookie set. I
verified this with Mozilla 0.9.4; the bug is still there.
It has been known to the Mozilla team as bug #48902 for
over a year now. It seems that the next build will contain
a fix. We will see.... --Arno
Bill Spight: I am writing this using Internet Explorer. For
some reason this morning my Netscape does not allow me to
edit pages on SL. I do not know of any changes that might
have caused that. Using Netscape, my user name does not
show up, either, just a '-'. I tried resetting my User
Preferences, but just got the Access Blocked message when I
submitted them. I usually use Netscape, so I would
appreciate any help in getting it to work on SL again.
Thanks. :-)
Arno: I could not verify this behaviour with Mozilla 0.9.9
(Windows) nor with Netscape 4.7 (Linux). Do you still have
this problem? The server logs don't show anything
suspicious....
------------------------------------------------------------
Access Blocked last edited by ArnoHollosi (80.109.254.29)
on April 9, 2002 - 21:10