> I recently copied a website that is a password-protected
> message forum. I used the Capture URL feature to record my
> username and password. After completing the mirror, many,
> but not all, of the messages that I had posted on the site
> were deleted from the website, but were intact on my local
> copy that I had just made.
> Is there a setting or filter that I should have used to
> prevent this?
This is definitely a design bug on the server: regular URLs
(generating GET requests) should never have side-effects on
the database. In particular, "delete" or "move" actions
should always be triggered by POSTed forms, so that regular
crawlers cannot mess up the forum when running.
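As a rough illustration, here is a minimal WSGI sketch of that
safe pattern (a hypothetical handler, not your forum's actual
code): destructive actions are refused unless they arrive as
POST requests, so a crawler issuing plain GETs can never
trigger them.

from urllib.parse import parse_qs

def forum_app(environ, start_response):
    # Hypothetical forum.cgi handler (all names are assumptions).
    # keep_blank_values=True so bare flags like "?delete&id=1234" survive.
    params = parse_qs(environ.get("QUERY_STRING", ""), keep_blank_values=True)
    if "delete" in params or "reply" in params:
        if environ["REQUEST_METHOD"] != "POST":
            # Crawlers only send GETs, so they can never reach the action.
            start_response("405 Method Not Allowed",
                           [("Content-Type", "text/plain")])
            return [b"This action requires a POSTed form.\n"]
        # ... perform the delete/reply against the database here ...
    # Plain GETs only ever render pages, never modify anything.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Regular forum page.\n"]

if __name__ == "__main__":
    from wsgiref.simple_server import make_server
    make_server("localhost", 8000, forum_app).serve_forever()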
Anyway, the mandatory analysis to be done BEFORE crawling a
forum is to list the kinds of links that make up the forum
(see the sketch after the list below), such as:
- links to display a regular page, such as
<http://www.example.com/myforum/forum.cgi?id=1234&foobar=cherry>
- links to display the next or previous page relative to
a regular page, such as
<http://www.example.com/myforum/forum.cgi?id=1234&foobar=cherry&next>
which is, in this example, identical to:
<http://www.example.com/myforum/forum.cgi?id=1235&foobar=cherry>
- links that trigger an action, such as delete or reply:
<http://www.example.com/myforum/forum.cgi?delete&id=1234>
or
<http://www.example.com/myforum/forum.cgi?reply&id=1234>
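To enumerate these links from a page you have already saved,
here is a quick hedged sketch (the file name and the keyword
list are assumptions to adapt to your forum):

from html.parser import HTMLParser

class LinkLister(HTMLParser):
    # Collects every href found in <a> tags.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

ACTION_WORDS = ("delete", "reply", "next", "previous")

with open("forum_page.html", encoding="utf-8", errors="replace") as f:
    lister = LinkLister()
    lister.feed(f.read())

for href in sorted(set(lister.links)):
    kind = "action/duplicate" if any(w in href for w in ACTION_WORDS) else "regular page"
    print(f"{kind:16} {href}")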
Here, you'll have to use scan rules such as:
-* +www.example.com/myforum/forum.cgi*
-www.example.com/myforum/forum.cgi*delete*
-www.example.com/myforum/forum.cgi*reply*
-www.example.com/myforum/forum.cgi*next*
-www.example.com/myforum/forum.cgi*previous*
These rules keep all regular forum pages, but exclude the
"delete"/"reply" action links, as well as the "previous" and
"next" pages, which would otherwise cause every page to be
fetched in three identical versions (its own id, the previous
page's "&next" link, and the next page's "&previous" link),
wasting three times the bandwidth.
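If you want to sanity-check the rules before launching the
mirror, here is a toy re-implementation of the filter
matching in Python (not HTTrack's actual engine; it assumes
the usual behaviour where the last matching rule wins):

from fnmatch import fnmatchcase

SCAN_RULES = [
    ("-", "*"),
    ("+", "www.example.com/myforum/forum.cgi*"),
    ("-", "www.example.com/myforum/forum.cgi*delete*"),
    ("-", "www.example.com/myforum/forum.cgi*reply*"),
    ("-", "www.example.com/myforum/forum.cgi*next*"),
    ("-", "www.example.com/myforum/forum.cgi*previous*"),
]

def allowed(url):
    # The last rule whose pattern matches decides the verdict.
    verdict = False
    for sign, pattern in SCAN_RULES:
        if fnmatchcase(url, pattern):
            verdict = (sign == "+")
    return verdict

for url in (
    "www.example.com/myforum/forum.cgi?id=1234&foobar=cherry",
    "www.example.com/myforum/forum.cgi?delete&id=1234",
    "www.example.com/myforum/forum.cgi?id=1234&foobar=cherry&next",
):
    print("+" if allowed(url) else "-", url)

Only the first URL should print with a "+", confirming that
the action links and the duplicate next/previous pages are
filtered out.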
You can also, optionally, include images or related files
that could be located outside the forum:
+*.gif +*.jpg +*.png +*.css +*.js