HTTrack Website Copier
Free software offline browser - FORUM
Subject: hostname dropped from starting URL = external site
Author: Bandit
Date: 01/13/2010 04:02
 
> threads, as ignoring the robots.txt file and so on,

Good, because <http://a.aaaarg.org/robots.txt> blocks all bots.


> result is the same as without capturing the url:
> the whole site seems to work, but the final link

Yes, it appears the entire "front end" of the site (or at least the links I
clicked) is accessible without a login UNTIL you reach a page such as:
 <http://a.aaaarg.org/text/7479/attention-aaaargorg-administrator> 
At that point it appears the "discussion" pages require a login.


> the whole site seems to work, but the final link
> from the introduction page to the real pdf file of
> the text only leads to the login page.

Sorry, but I looked all over for an "introduction page" and was unable to find
one.  So this "final link" to the "real pdf" eluded me as well :)

HOWEVER...
This is what I noticed:

Similar to the link above, if you go to
 <http://a.aaaarg.org/text/2289/chapter-3-verification> 
(or some of the other pages I browsed)
you will find a highlighted "TEXT" link.  In this example, the link points to

 <http://a.aaaarg.org/node/2289/download> 
which, as I gather from your posts, brings up a PDF of the text in question.
The problem comes from where this link redirects.

In your other post, one of the filters you listed is
 +http://a.aaaarg.org/files/textz/* 
(N.B.: the "http://" part should not be used in a filter in any case.)
Unfortunately for you, the "download" link above does not forward to anything
on <http://a.aaaarg.org/>.  For whatever reason, it drops the hostname ("a")
and redirects to
 <http://aaaaarg.org/files/textz/2007/10/verification.pdf> 
so that filter never matches the PDF, and HTTrack sees it as an external site.
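If it helps to see the mismatch spelled out, here is a rough Python sketch. It is not HTTrack code: the helper names are made up for illustration, and Python's fnmatchcase is only a stand-in for HTTrack's own wildcard matching, which may behave differently in detail. The pattern is your filter with the "+" and "http://" stripped.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# URLs quoted in this thread.
START_URL = "http://a.aaaarg.org/"
DOWNLOAD_LINK = "http://a.aaaarg.org/node/2289/download"
REDIRECT_TARGET = "http://aaaaarg.org/files/textz/2007/10/verification.pdf"

# The filter from the earlier post, without the "+" sign and "http://" prefix.
FILTER_PATTERN = "a.aaaarg.org/files/textz/*"


def host_and_path(url: str) -> str:
    """Return 'host/path' with the scheme stripped (hypothetical helper)."""
    parts = urlsplit(url)
    return f"{parts.hostname}{parts.path}"


def same_host(url_a: str, url_b: str) -> bool:
    """True when both URLs point at exactly the same hostname."""
    return urlsplit(url_a).hostname == urlsplit(url_b).hostname


def roughly_matches(pattern: str, url: str) -> bool:
    """Approximate a wildcard filter check; NOT HTTrack's real matcher."""
    return fnmatchcase(host_and_path(url), pattern)


if __name__ == "__main__":
    print("download link on start host:", same_host(START_URL, DOWNLOAD_LINK))
    print("redirect target on start host:", same_host(START_URL, REDIRECT_TARGET))
    print("filter matches redirect target:",
          roughly_matches(FILTER_PATTERN, REDIRECT_TARGET))
```

The "download" link sits on the starting host, but the place it redirects to does not, so a filter pinned to a.aaaarg.org has nothing to match.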

To get the most complete mirror of the site, I would use these
filters/scan rules:
-* +*.aaaarg.org/* -*logout*
and set HTTrack to get "near" files, i.e.
Options->Links->Get non-HTML files related...
while leaving the Spider set to "no robots.txt rules".
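To show how I read those three scan rules, here is a small Python sketch. It assumes that rules are applied in order and a later matching rule overrides an earlier one, which is my understanding of HTTrack's behaviour. It again uses fnmatchcase as an approximation of HTTrack's wildcards, and the function name and the logout/example.com addresses are made up for the illustration.

```python
from fnmatch import fnmatchcase

# Scan rules quoted from the post, in the order given.
SCAN_RULES = ("-*", "+*.aaaarg.org/*", "-*logout*")


def is_accepted(address: str, rules=SCAN_RULES) -> bool:
    """Apply +/- rules in order; the last rule that matches decides.

    'address' is a URL without the "http://" prefix.
    """
    accepted = False  # assumption: nothing is accepted until a '+' rule matches
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if fnmatchcase(address, pattern):
            accepted = (sign == "+")
    return accepted


if __name__ == "__main__":
    for address in (
        "a.aaaarg.org/text/2289/chapter-3-verification",  # page from this thread
        "a.aaaarg.org/logout",         # hypothetical logout address
        "www.example.com/index.html",  # hypothetical unrelated site
    ):
        print(address, "->", "accept" if is_accepted(address) else "reject")
```

Read this way, the first rule rejects everything, the second lets any subdomain of aaaarg.org back in, and the third keeps the crawler away from anything containing "logout" so the session is not dropped mid-mirror.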

At least methinks that (and what you have already done with the CatchURL,
etc.) is the place to start!
HTH,
~B^D
 