HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: site crawling and specific files only download
Author: William Roeder
Date: 05/02/2012 20:02
 
> I have a list of websites which I wish to crawl and
> download specific file types only.
Unless you already have the URLs of the specific files, you MUST let it
spider the site to find them. You must get the HTML.
> 
> In the Action I am using "Get separated files"
1) That action is for downloading specific file URLs you already have. If
you want all matching files, you MUST download the site.

2) Always post the actual command line used (or line two of the log file),
not what you think you did.

> my filters box is currently set up like this
> +*.png +*.gif +*.jpg *.pdf *.txt *.doc *.docx
3) That says: add all .png, .gif, and .jpg files, and treat *.pdf *.txt
*.doc *.docx as (bogus) site URLs, because they are missing the leading +.

> +www.*.com/*.html +*.zip +*.pdf
> +www.*.co.uk/*.html +*.zip +*.pdf
> +www.*.net/*.html +*.zip +*.pdf
> +www.*.org/*.html +*.zip +*.pdf
4) Saying +*.zip (or +*.pdf) more than once changes nothing.

5) The default is to mirror all files on the starting site, so your filters
do nothing except let it escape to all .com/.net/.org html files.
6) Pages ending in .asp, .cgi, or .htm, or even those without an extension
at all, are not matched by *.html.
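To illustrate point 6: HTTrack's filters use simple shell-style wildcards, so a sketch with Python's fnmatch (which behaves the same way for a plain * pattern; the file names are made-up examples) shows what *.html misses:

```python
from fnmatch import fnmatch

# Link-carrying pages a crawl may encounter; only one matches "*.html".
pages = ["index.html", "page.asp", "script.cgi", "old.htm", "article"]
matched = [p for p in pages if fnmatch(p, "*.html")]
print(matched)  # → ['index.html']
```

Everything else — .asp, .cgi, .htm, and extension-less pages — slips past the filter, which is why filtering on html extensions alone blocks the spider from following those links.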

7) Accept nothing but html and your specific extensions:
-mime:* +mime:text/html +*.png +*.gif +*.jpg +*.pdf +*.txt +*.doc +*.docx
+*.zip
8) Put the starting URLs in the box and set action=download.
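Steps 7 and 8 translate to a command line roughly like this (a sketch: www.example.com stands in for your starting URLs, and ./mirror is an assumed output directory; the filters are quoted so the shell does not expand them):

```shell
httrack "http://www.example.com/" -O ./mirror \
    "-mime:*" "+mime:text/html" \
    "+*.png" "+*.gif" "+*.jpg" "+*.pdf" "+*.txt" \
    "+*.doc" "+*.docx" "+*.zip"
```

The mime filters let the spider fetch every html page (so it can find links) while the extension filters keep only the file types you want.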
 