> threads, as ignoring the robots.txt file and so on,
Good, because <http://a.aaaarg.org/robots.txt> blocks all bots.
> result is the same as without capturing the url:
> the whole site seems to work, but the final link
Yes, it appears the entire "front end" of the site (or at least the links I
clicked) is accessible without a login UNTIL you get to, for example:
<http://a.aaaarg.org/text/7479/attention-aaaargorg-administrator>
At that point, it appears the "discussion" pages require the login.
> the whole site seems to work, but the final link
> from the introduction page to the real pdf file of
> the text only leads to the login page.
Sorry, but I looked all over for an "introduction page" and was unable to find
one. Thus this "final link" to the "real pdf" eluded me as well :)
HOWEVER...
This is what I noticed:
Similar to the link above, if you go to-
<http://a.aaaarg.org/text/2289/chapter-3-verification>
(as well as some of the other pages I browsed)
you will find a "TEXT" link highlighted. In this example, the link points to-
<http://a.aaaarg.org/node/2289/download>
and, as you know (or so I gathered from your posts), brings up a PDF of the text
in question. The problem lies in the redirection of this link.
In your other post, one of the filters you listed is
+http://a.aaaarg.org/files/textz/*
(N.B.: the "http://" part is not to be used in filters, regardless...)
Unfortunately for you, the "download" link above does not redirect to anything
on <http://a.aaaarg.org/>; for whatever reason, it drops the hostname ("a")
and links to-
<http://aaaaarg.org/files/textz/2007/10/verification.pdf>
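
To make the mismatch concrete, here is a minimal Python sketch that compares the two URLs quoted above with the filter prefix from your other post. Nothing is fetched; the helper names (`host`, `under_prefix`) are made up for illustration and are not part of HTTrack.

```python
from urllib.parse import urlsplit

# URLs copied from this thread; nothing is downloaded here.
DOWNLOAD_LINK = "http://a.aaaarg.org/node/2289/download"
REDIRECT_TARGET = "http://aaaaarg.org/files/textz/2007/10/verification.pdf"

# The filter from the earlier post, written without "http://" and without
# the trailing "*", so it can be used as a plain prefix.
FILTER_PREFIX = "a.aaaarg.org/files/textz/"


def host(url):
    """Return the hostname part of a URL (hypothetical helper)."""
    return urlsplit(url).hostname


def under_prefix(url, prefix):
    """True if host + path of the URL starts with the given prefix."""
    parts = urlsplit(url)
    return (parts.netloc + parts.path).startswith(prefix)


print(host(DOWNLOAD_LINK))                           # a.aaaarg.org
print(host(REDIRECT_TARGET))                         # aaaaarg.org
print(under_prefix(REDIRECT_TARGET, FILTER_PREFIX))  # False
```

The PDF lives on a different host than the page that links to it, so a filter tied to the `a.` host cannot cover it.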
I think to get the most complete mirror of the site, I would use these
filters/scan rules:
-* +*.aaaarg.org/* -*logout*
and set HTTrack to get "near" files, i.e.
Options->Links->Get non-HTML files related...
leaving the Spider set to "no robots.txt rules"
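
If it helps to see how those scan rules play out, below is a rough Python sketch. It is only an approximation: it assumes rules are checked in order with the last matching rule winning, and it uses Python's `fnmatchcase` as a stand-in for HTTrack's own wildcard matching, which may differ in detail. The `allowed` function and the logout URL are hypothetical, purely for illustration.

```python
from fnmatch import fnmatchcase

# The scan rules suggested above, in the order given.
RULES = ["-*", "+*.aaaarg.org/*", "-*logout*"]


def strip_scheme(url):
    """Drop a leading "http://" so the URL looks like what filters see."""
    return url.split("://", 1)[-1]


def allowed(url, rules=RULES):
    """Approximate the rules: later matching rules override earlier ones.

    Returns True (accepted), False (rejected), or None if no rule matched.
    """
    target = strip_scheme(url)
    verdict = None
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if fnmatchcase(target, pattern):
            verdict = (sign == "+")
    return verdict


print(allowed("http://a.aaaarg.org/text/2289/chapter-3-verification"))     # True
print(allowed("http://a.aaaarg.org/node/2289/download"))                   # True
print(allowed("http://a.aaaarg.org/user/logout"))  # hypothetical URL      # False
print(allowed("http://aaaaarg.org/files/textz/2007/10/verification.pdf"))  # False
```

Under this approximation, the PDF on the other host is not matched by the "+" rule, so picking it up would rest on the "near" files option rather than on the filters.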
At least methinks that (and what you have already done with the CatchURL,
etc.) is the place to start!
HTH,
~B^D