HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: how to config to download Interrupt List web site
Author: Carl Johnson
Date: 10/18/2009 01:57
 
Solved it by looking at the log file, where I noticed this:

16:54:55 Info:  Note: due to www.ctyme.com remote robots.txt rules, links
begining with these path will be forbidden: /doc/, /intr/, /linuxdoc/,
/molaw/, /mai/, /wsdocs/, /javadoc/, /perkel/, /sex/, /graphics/, /reality/,
/webx.cgi/, /webx.cgi, /webx, /pics, /cgi (see in the options to disable
this)


So I went into "Set options"->Spider and changed the Spider setting from
"follow robots.txt rules" to "no robots.txt rules", and it downloaded the
intr/*.htm files.
However, every downloaded file had a bunch of garbage data before the actual
page content.

How can I get the files without that garbage?  What is robots.txt about?
Does the web site have a robots.txt file that denies access to the paths
listed above?
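
From what I've read, robots.txt is a plain-text file at the site root that
lists path prefixes well-behaved crawlers are asked to skip, and HTTrack
honors it by default. Assuming that's right, a quick check with Python's
standard urllib.robotparser (the /intr/example.htm page below is a made-up
path under the forbidden /intr/ prefix) would be:

    from urllib.robotparser import RobotFileParser

    # Fetch and parse the site's robots.txt rules.
    rp = RobotFileParser("http://www.ctyme.com/robots.txt")
    rp.read()

    # Ask whether any user agent ("*") may fetch a page under /intr/;
    # prints False if the /intr/ prefix is disallowed for all crawlers.
    print(rp.can_fetch("*", "http://www.ctyme.com/intr/example.htm"))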
 


All articles

Subject                                                 Author         Date
how to config to download Interrupt List web site       Carl Johnson   10/18/2009 01:29
Re: how to config to download Interrupt List web site   Carl Johnson   10/18/2009 01:57
Re: how to config to download Interrupt List web site   Carl Johnson   10/18/2009 02:04




