HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Robots.txt - Can't download specific files
Author: Xavier Roche
Date: 11/24/2013 14:11
 
> I'd like to download *.mrc files on this url :
> <http://www.librairiedialogues.fr/ws/book/>
> For example, this is an url with a mrc file :
> <http://www.librairiedialogues.fr/ws/book/9782265093577/unimarc_utf-8/>

HTTrack cannot list "directories" such as
<http://www.librairiedialogues.fr/ws/book/>, because the remote server does not
allow it. The only way for HTTrack to detect links is therefore to "crawl"
HTML pages and enumerate all the links found inside them.
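
To illustrate what "enumerating links" means, here is a minimal Python sketch.
It is not HTTrack's actual parser, only an approximation of the idea: read a
page, collect every href/src value, and resolve it against the page URL. The
names LinkCollector and extract_links, and the sample HTML, are made up for
this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs found in href/src attributes (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in `html`, as absolute URLs, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Made-up page content, for illustration only.
sample = '<a href="9782265093577/unimarc_utf-8/">record</a> <img src="/img/logo.png">'
links = extract_links(sample, "http://www.librairiedialogues.fr/ws/book/")
```

A crawler repeats this on every page it fetches. If no fetched page links to a
given file, the crawler never learns that the file exists.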

You probably need to crawl the HTML pages, which seems difficult in this case
because they do not have clear naming (i.e. no .html extension).

You could try to crawl the site and, in the scan rules, exclude everything that
is not an HTML or mrc file.

For example (Options / Scan rules):
-*.gif -*.jpg -*.jpeg -*.png -*.css -*.js
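
The effect of these rules can be sketched roughly in Python. This is only an
approximation, not HTTrack's own matching code: it assumes that a leading "-"
excludes, a leading "+" includes, that the last matching rule wins, and that
"*" behaves like a shell wildcard. The function name is_allowed is
hypothetical.

```python
from fnmatch import fnmatchcase

# The scan rules from the example above, one per entry.
RULES = ["-*.gif", "-*.jpg", "-*.jpeg", "-*.png", "-*.css", "-*.js"]


def is_allowed(url, rules, default=True):
    """Approximate scan-rule check: the last rule that matches decides."""
    allowed = default
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        # Assumption: compare case-insensitively by lowercasing both sides.
        if fnmatchcase(url.lower(), pattern.lower()):
            allowed = (sign == "+")
    return allowed


image_ok = is_allowed("http://www.librairiedialogues.fr/img/logo.png", RULES)
record_ok = is_allowed(
    "http://www.librairiedialogues.fr/ws/book/9782265093577/unimarc_utf-8/", RULES
)
```

With these rules the image is skipped, while the record URL is kept because
none of the exclusion patterns matches it.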

 


All articles

Subject Author Date
Robots.txt - Can't download specific files 11/21/2013 18:40
Re: Robots.txt - Can't download specific files 11/24/2013 14:11



