Re: Restricting harvests to certain file types - HTTrack Website Copier Forum

Subject: Re: Restricting harvests to certain file types

Author: Leto

Date: 01/06/2003 22:20

> Hello, I want to restrict Win HTTrack to only download PDF, 
> DOC and XLS files from a number of websites. I've been 
> through past forum discussions and have been experimenting 
> with scan rules such as -* +*.htm +*.html +*.asp +*.php 
> +*.pdf +*.doc +.xls However, I can't seem to exclude HTML 
> or HTM files. I wish to restrict my impact on host servers 
> (I have limits set) and only download PDF, DOC and XLS 
> files. When I try to exclude HTML and HTM files I get 
> nothing. Can I only download these files and exclude all 
> HTML? Does anybody have any scan rule suggestions? Many 
> thanks, dnt

G'day dnt -- it's been a while ;)

If all the files you want to download (PDF, DOC, etc) are linked from a single
page, then filters like

-* +*.pdf +*.doc +*.xls

would work, because you are telling HTTrack not to follow links beyond your
starting URL except to files of those types.

But when the website spans multiple pages, you DO need to allow HTTrack to
follow links on to those other pages. So if the pages are HTML, you add
+*.htm +*.html to the filters.

There is no way around this -- HTTrack needs to go to the pages to find all
the files you want.
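
To see why the HTML rules are needed, here is a rough Python sketch of how a
list of scan rules decides whether a link is followed. It is only a simplified
model, not HTTrack's real matcher: it assumes shell-style * wildcards, assumes
a later matching rule overrides an earlier one, and treats a URL that matches
no rule as rejected. The function name is_allowed and the example.com URLs are
made up for illustration.

```python
from fnmatch import fnmatchcase


def is_allowed(url, rules):
    """Return True if the last rule matching `url` is a '+' rule.

    Simplified model (an assumption, not HTTrack's actual code): each rule
    is '+pattern' or '-pattern', and a later match overrides an earlier one.
    """
    allowed = False  # assumption: a URL matching no rule is rejected
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if fnmatchcase(url, pattern):
            allowed = (sign == "+")
    return allowed


files_only = ["-*", "+*.pdf", "+*.doc", "+*.xls"]
with_pages = files_only + ["+*.htm", "+*.html"]

page = "http://example.com/reports/index.html"  # hypothetical URLs
doc = "http://example.com/reports/q1.pdf"

print(is_allowed(doc, files_only))   # True: documents are accepted
print(is_allowed(page, files_only))  # False: the page linking to them is not
print(is_allowed(page, with_pages))  # True: the crawl can reach the page
```

With files_only, any page other than the starting URL is rejected, so the
documents linked from those pages are never discovered.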

One feature you could use, though, is build structure.  You could tell the
program to put HTML files in one folder and everything else in another folder. 
When the capture is complete, the files will be nicely separated.
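
If you would rather pick the documents out yourself once the capture is done,
a few lines of Python can list them. This is just a sketch and not something
HTTrack provides: the function name wanted_files and the set of extensions are
my own choices, and you would pass in whatever folder your project was saved
to.

```python
from pathlib import Path

# Extensions to keep; adjust to taste (hypothetical choice for this example).
WANTED = {".pdf", ".doc", ".xls"}


def wanted_files(mirror_root):
    """Return sorted paths under mirror_root whose extension is in WANTED."""
    root = Path(mirror_root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in WANTED
    )
```

Calling wanted_files on your project folder returns the PDF, DOC and XLS files
and skips the HTML pages, so you can copy or archive just those.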
