> > I would like to just crawl the website, not download
> > everything, but I don't see how to do this. Can
> > anyone advise, please? TIA.
>
> What do you mean by crawling? Only HTML (/PHP, etc.)
> pages?
> Use something like
> Scan rules =>
> -* +www.yoursite.com/*.html +www.yoursite.com/*.php
> +www.yoursite.com/*.asp +www.yoursite.com/*/
>
Like that, but not exactly, for our website. A quick look shows me a number of
links that go to pages that don't have any extension, like
www.oursite.com/news or www.oursite.com/resources. If I want to use HTTrack,
I may have to develop a long list of exclusions instead: pdf, zip, MS Office
extensions, images (jpg/tiff/map/geotiff), ESRI data files, etc. I also
noticed while watching the crawl session (--spider option) that a good number
of files were taking a while to download.
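
If exclusions do turn out to be the way to go, something along these lines
might be a starting point. This is only a sketch: the URL is our placeholder
site, the ./mirror output path is arbitrary, and the shapefile extensions
(.shp/.shx/.dbf) are just my guess at what the ESRI data files would be:

httrack "http://www.oursite.com/" -O ./mirror \
    "+www.oursite.com/*" \
    "-*.pdf" "-*.zip" \
    "-*.doc" "-*.docx" "-*.xls" "-*.xlsx" "-*.ppt" "-*.pptx" \
    "-*.jpg" "-*.jpeg" "-*.tif" "-*.tiff" \
    "-*.shp" "-*.shx" "-*.dbf"

The "+www.oursite.com/*" keeps the crawl on the site (including the
extensionless /news and /resources pages), while the "-" rules skip the
heavyweight file types rather than trying to enumerate every page type
to include.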