> Hello, I want to restrict Win HTTrack to only download PDF,
> DOC and XLS files from a number of websites. I've been
> through past forum discussions and have been experimenting
> with scan rules such as -* +*.htm +*.html +*.asp +*.php
> +*.pdf +*.doc +.xls However, I can't seem to exclude HTML
> or HTM files. I wish to restrict my impact on host servers
> (I have limits set) and only download PDF, DOC and XLS
> files. When I try to exclude HTML and HTM files I get
> nothing. Can I only download these files and exclude all
> HTML? Does anybody have any scan rule suggestions? Many
> thanks, dnt
G'day dnt -- it's been a while ;)
If all the files you want to download (PDF, DOC, etc.) are linked from a single
page, then filters like
-* +*.pdf +*.doc +*.xls
would work, because you are telling HTTrack not to follow any links from your
starting URL except those that point to the listed file types.
But when the website has multiple pages, you DO need to allow HTTrack to follow
the links on to those other pages. So if the pages are HTML, you add
+*.htm +*.html (and likewise +*.asp or +*.php if the site uses those).
There is no way around this -- HTTrack needs to go to the pages to find all
the files you want.
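
To see why the page rules are needed, here is a rough Python sketch of how a
rule list gets applied. It is only an illustration, not HTTrack's actual code:
it assumes rules are checked in order with the last matching rule winning, it
uses Python's fnmatch wildcards as a stand-in for HTTrack's pattern syntax, and
the function name and example.com URLs are made up.

```python
from fnmatch import fnmatchcase


def is_allowed(url, rules):
    """Return True if the last rule matching `url` is a '+' rule.

    Assumption: rules are applied in order and the last match wins.
    A URL that matches no rule is treated as rejected here, purely
    to keep the sketch simple.
    """
    allowed = False
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        # Compare in lower case so that .PDF and .pdf behave the same.
        if fnmatchcase(url.lower(), pattern.lower()):
            allowed = (sign == "+")
    return allowed


DOCS_ONLY = ["-*", "+*.pdf", "+*.doc", "+*.xls"]
WITH_PAGES = ["-*", "+*.htm", "+*.html", "+*.pdf", "+*.doc", "+*.xls"]

# Hypothetical URLs, for illustration only.
page = "www.example.com/reports/index.html"
doc = "www.example.com/reports/2009.pdf"

for name, rules in (("docs only", DOCS_ONLY), ("with pages", WITH_PAGES)):
    print(name, "| page:", is_allowed(page, rules), "| doc:", is_allowed(doc, rules))
```

With the docs-only list the page itself is rejected, so any document that is
only linked from that page is never discovered, even though its own URL would
pass the filter.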
One feature you could use, though, is the build structure option. You could tell
the program to put HTML files in one folder and everything else in another
folder. When the capture is complete, the pages and the documents will be
nicely separated.
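
If you would rather pull the documents out yourself once the capture has
finished, a small script can do it. This is just a sketch and not part of
HTTrack: the function name and folder arguments are hypothetical, and it
assumes the target folder sits outside the mirror folder.

```python
import shutil
from pathlib import Path

# File types to keep; adjust to taste.
WANTED = {".pdf", ".doc", ".xls"}


def collect_documents(mirror_dir, target_dir):
    """Copy wanted document types out of a finished mirror.

    Returns the list of copied destination paths. Files with the same
    name in different folders overwrite each other in this simple
    version, and target_dir is assumed to be outside mirror_dir.
    """
    mirror, target = Path(mirror_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(mirror.rglob("*")):
        if path.is_file() and path.suffix.lower() in WANTED:
            dest = target / path.name
            shutil.copy2(path, dest)
            copied.append(dest)
    return copied
```

You would call it as collect_documents("path/to/mirror", "path/to/documents")
with your own folders. The HTML pages stay behind in the mirror folder and can
be deleted afterwards.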