HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: how to config to download Interrupt List web site
Author: Carl Johnson
Date: 10/18/2009 01:57
 
Solved it by looking at the log file; I noticed this:

16:54:55 Info:  Note: due to www.ctyme.com remote robots.txt rules, links
begining with these path will be forbidden: /doc/, /intr/, /linuxdoc/,
/molaw/, /mai/, /wsdocs/, /javadoc/, /perkel/, /sex/, /graphics/, /reality/,
/webx.cgi/, /webx.cgi, /webx, /pics, /cgi (see in the options to disable
this)
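
As far as I can tell, that message means www.ctyme.com serves a robots.txt file
from its document root marking those paths as off-limits to crawlers. I haven't
fetched the actual file, but going by the log it presumably contains entries
along these lines:

  User-agent: *
  Disallow: /doc/
  Disallow: /intr/
  Disallow: /linuxdoc/
  Disallow: /webx
  Disallow: /pics
  Disallow: /cgi
  (...and so on for the other paths listed in the log)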


So I went into "Set options"->Spider, changed Spider from "follow robots.txt
rules" to "no robots.txt rules", and it downloaded the intr/*.htm files. 
However, every file had a bunch of garbage data before the actual data.
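
(Side note: if anyone drives httrack from the command line rather than the
WinHTTrack GUI, the equivalent of that Spider setting appears to be the -sN
switch, where -s0 means never follow robots.txt rules -- that's my reading of
the option list, so double-check it. For example:

  httrack "http://www.ctyme.com/intr/" -O ./rbil -s0

where ./rbil is just a made-up output directory.)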

How can I get the files without the garbage data?  And is my understanding of
robots.txt right -- does the web site publish a robots.txt file that tells
crawlers to stay out of the paths above, and that's why access to those files
was denied?
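
For anyone who wants to check what the site's robots.txt actually blocks, here
is a minimal Python sketch using the standard library's robotparser (the
robots.txt URL follows from the log above; the specific .htm filename is a
hypothetical example, not a page I verified):

  from urllib.robotparser import RobotFileParser

  # Point the parser at the site's robots.txt (location implied by the log).
  rp = RobotFileParser()
  rp.set_url("http://www.ctyme.com/robots.txt")
  rp.read()  # download and parse the rules

  # Ask whether an arbitrary crawler ("*") may fetch a page under /intr/.
  allowed = rp.can_fetch("*", "http://www.ctyme.com/intr/example.htm")
  print("allowed" if allowed else "forbidden by robots.txt")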
 