HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: how to config to download Interrupt List web site
Author: Carl Johnson
Date: 10/18/2009 01:57
 
Solved it by looking at the log file, where I noticed this:

16:54:55 Info:  Note: due to www.ctyme.com remote robots.txt rules, links
begining with these path will be forbidden: /doc/, /intr/, /linuxdoc/,
/molaw/, /mai/, /wsdocs/, /javadoc/, /perkel/, /sex/, /graphics/, /reality/,
/webx.cgi/, /webx.cgi, /webx, /pics, /cgi (see in the options to disable
this)


So I went into "Set options"->Spider and changed the Spider setting from
"follow robots.txt rules" to "no robots.txt rules", and it downloaded the
intr/*.htm files.
However, every downloaded file had a bunch of garbage data before the actual
page content.

How can I get the files without that garbage?  What is robots.txt about?
Does the web site have a robots.txt file that denies access to the paths
listed above?
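
From what I've read, robots.txt is a plain-text file at the site root that
lists path prefixes well-behaved crawlers are asked to skip, and HTTrack
honors it by default. Assuming that's right, a quick check with Python's
standard urllib.robotparser (the /intr/example.htm page below is a made-up
path under the forbidden /intr/ prefix) would be:

    from urllib.robotparser import RobotFileParser

    # Fetch and parse the site's robots.txt rules.
    rp = RobotFileParser("http://www.ctyme.com/robots.txt")
    rp.read()

    # Ask whether any user agent ("*") may fetch a page under /intr/;
    # prints False if the /intr/ prefix is disallowed for all crawlers.
    print(rp.can_fetch("*", "http://www.ctyme.com/intr/example.htm"))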
 


All articles

Subject                                                 Author         Date
how to config to download Interrupt List web site       Carl Johnson   10/18/2009 01:29
Re: how to config to download Interrupt List web site   Carl Johnson   10/18/2009 01:57
Re: how to config to download Interrupt List web site   Carl Johnson   10/18/2009 02:04




