HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Search Intensive Sites
Author: Jim Rems
Date: 08/26/2008 07:17
 
> > I spent the better part of the day trying to make
> > this work but couldn't.  I get the page I want, but
> > not the files associated with the record
> > descriptions.
> 
> Does the first page mirrored show the results page?
Yes.  I ran a test with a limit of 100 using the site description above.  The
page displays correctly, but when I click on an image link or a record link,
it takes me to an ARC Time-Out page.

> what files aren't you getting?
The linked image files (GIF).  The scan rules are set to get *gif, *jpg, etc.

> What does the log say? Did you set the log to
> debug?
I don't know if the log is set to debug (I'm basically using the defaults,
except where you recommended otherwise).  After some experimenting, the log
file generally shows no errors.

I've tried a number of default mirrors, but the result is always the same: the
ARC Time-Out page.  I captured the URL for the Hierarchy tab, which is the page
that displays correctly.

Here is what I did:

1. Go to the National Archives site (arcweb.archives.gov).
2. Select Digital Copies.
3. Set the limit to 100.
4. Search for: Ansel Adams.
5. Set up CatchURL.
6. Click the Hierarchy tab.
7. The URL is inserted into HTTrack.
8. Continue with HTTrack.

I tried several mirror depths, from 1 to 5, for both internal and external
links.  An internal depth of 2 and an external depth of 0 work best.
Lastly, the log suggested turning off the robots.txt rules.
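
The settings above map roughly onto an httrack command line.  The following is
a minimal sketch in Python that only assembles and prints such a command; it
does not run HTTrack.  The helper name build_httrack_command, the output
directory, and the placeholder start URL are hypothetical (the captured
Hierarchy tab URL is not given in this post).  The flags used are the
documented httrack options -O (output path), -rN (mirror depth), -%eN
(external links depth), -s0 (never follow robots.txt), and -Z (debug log); the
"+*.gif" and "+*.jpg" filter spellings are an assumption about how the scan
rules were entered.

```python
import shlex


def build_httrack_command(start_url, output_dir, depth=2, ext_depth=0,
                          filters=("+*.gif", "+*.jpg"),
                          obey_robots=False, debug_log=True):
    """Assemble an httrack argument list from the settings in this post.

    Hypothetical helper for illustration only; it builds the command
    but never executes it.
    """
    args = ["httrack", start_url, "-O", output_dir]
    args.append(f"-r{depth}")       # internal mirror depth (2 worked best)
    args.append(f"-%e{ext_depth}")  # external links depth (0 worked best)
    if not obey_robots:
        args.append("-s0")          # never follow robots.txt rules
    if debug_log:
        args.append("-Z")           # write a debug-level log
    args.extend(filters)            # scan rules for the image files
    return args


# Placeholder only: substitute the URL captured for the Hierarchy tab.
START_URL = "<captured Hierarchy tab URL>"

command = build_httrack_command(START_URL, "ansel-adams-mirror")
print(shlex.join(command))
```

Printing the command first makes it easy to compare against the options set in
the GUI before starting a mirror, and the debug log switch should answer the
question of whether the log is set to debug.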

Thanks again for your help.



 