Re: Robots.txt - Can't download specific files - HTTrack Website Copier Forum

Subject: Re: Robots.txt - Can't download specific files

Author: Xavier Roche

Date: 11/24/2013 14:11

> I'd like to download *.mrc files from this URL:
> <http://www.librairiedialogues.fr/ws/book/>
> For example, this is a URL with an mrc file:
> <http://www.librairiedialogues.fr/ws/book/9782265093577/unimarc_utf-8/>

HTTrack cannot list "directories" such as
<http://www.librairiedialogues.fr/ws/book/>, because the remote server does not
allow directory listing. The only way for HTTrack to detect links is therefore
to "crawl" HTML pages and enumerate the links found inside them.
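
To make the idea of "enumerating links inside a page" concrete, here is a minimal Python sketch. It is not HTTrack's actual parser, only an illustration of the principle; the example.com address and the sample HTML are made up for the demonstration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs found in href/src attributes of one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return every link found in `html`, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page content and base URL, for illustration only.
sample_html = '<a href="/ws/book/123/unimarc_utf-8/">record</a> <img src="cover.jpg">'
links = extract_links("http://example.com/shop/", sample_html)
for link in links:
    print(link)
```

A crawler repeats this on every page it downloads, so a file that is never linked from any crawled page can never be discovered this way.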

You will probably need to crawl the HTML pages, which looks difficult in this
case because they have no clear naming (i.e. no .html extension).

You could try to crawl the whole site and, in the scan rules, exclude
everything that is neither an HTML page nor an mrc file.

For example: (Options / Scan rules)
-*.gif -*.jpg -*.jpeg -*.png -*.css -*.js
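
To see roughly which URLs these rules would let through, here is a small Python sketch. It only approximates HTTrack's filter matching: it uses Python's fnmatch wildcards instead of HTTrack's own pattern syntax, it assumes a link is accepted when no rule matches, and it applies the rules in order so that the last matching rule wins. The is_allowed helper and the example.com URLs are hypothetical.

```python
from fnmatch import fnmatchcase

# The scan rules from the example above, one rule per entry.
RULES = ["-*.gif", "-*.jpg", "-*.jpeg", "-*.png", "-*.css", "-*.js"]


def is_allowed(url, rules, default=True):
    """Approximate a scan-rule check: the last matching rule decides."""
    allowed = default  # assumption: no matching rule means "accept"
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if fnmatchcase(url, pattern):
            allowed = (sign == "+")
    return allowed


# Hypothetical URLs, for illustration only.
for url in (
    "http://example.com/img/logo.png",
    "http://example.com/ws/book/123/unimarc_utf-8/",
    "http://example.com/static/site.css",
):
    print(url, "->", "keep" if is_allowed(url, RULES) else "skip")
```

With these rules, pages without a file extension (such as the book records) are still crawled, while images, stylesheets and scripts are skipped.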

