HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: site crawling and specific files only download
Author: William Roeder
Date: 05/02/2012 20:02
 
> I have a list of websites which I wish to crawl and
> download specific file types only.
Unless you already have the URLs of the specific files, you MUST let it
spider the site to find them. You must get the HTML.
> 
> In the Action I am using "Get separated files"
1) That action is for downloading specific file URLs you already have. If
you want all matching files, you MUST download the site.

2) Always post the actual command line used (or line two of the log file),
not what you think you did.

> my filters box is currently set up like this
> +*.png +*.gif +*.jpg *.pdf *.txt *.doc *.docx
3) That says: add all .png, .gif, and .jpg files, and treat *.pdf *.txt
*.doc *.docx as (bogus) site URLs, because they are missing the leading +.

> +www.*.com/*.html +*.zip +*.pdf
> +www.*.co.uk/*.html +*.zip +*.pdf
> +www.*.net/*.html +*.zip +*.pdf
> +www.*.org/*.html +*.zip +*.pdf
4) Saying +*.zip (or +*.pdf) more than once changes nothing.

5) The default is to mirror all files on the starting site, so your filters
do nothing except let it escape to all .com/.net/.org html files.
6) Pages ending in .asp, .cgi, or .htm, or even those without an extension
at all, are not matched by *.html.
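To illustrate point 6: HTTrack's filters use simple shell-style wildcards, so a sketch with Python's fnmatch (which behaves the same way for a plain * pattern; the file names are made-up examples) shows what *.html misses:

```python
from fnmatch import fnmatch

# Link-carrying pages a crawl may encounter; only one matches "*.html".
pages = ["index.html", "page.asp", "script.cgi", "old.htm", "article"]
matched = [p for p in pages if fnmatch(p, "*.html")]
print(matched)  # → ['index.html']
```

Everything else — .asp, .cgi, .htm, and extension-less pages — slips past the filter, which is why filtering on html extensions alone blocks the spider from following those links.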

7) Accept nothing but html and your specific extensions:
-mime:* +mime:text/html +*.png +*.gif +*.jpg +*.pdf +*.txt +*.doc +*.docx
+*.zip
8) Put the starting URLs in the box and set action=download.
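Steps 7 and 8 translate to a command line roughly like this (a sketch: www.example.com stands in for your starting URLs, and ./mirror is an assumed output directory; the filters are quoted so the shell does not expand them):

```shell
httrack "http://www.example.com/" -O ./mirror \
    "-mime:*" "+mime:text/html" \
    "+*.png" "+*.gif" "+*.jpg" "+*.pdf" "+*.txt" \
    "+*.doc" "+*.docx" "+*.zip"
```

The mime filters let the spider fetch every html page (so it can find links) while the extension filters keep only the file types you want.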
 