HTTrack Website Copier
Free software offline browser - FORUM
Subject: Stopping crawl in specific directory
Author: Sparklepaws
Date: 03/28/2018 13:40
 
Hey guys. These days, with the rise of sites that use clean, extension-less URLs,
it's hard to find a "normal" site whose pages end in a clear file extension
(e.g. .htm, .html). Instead, most URLs are left open-ended, such as
www.foo.com/bar/foobar/. Reddit is an actual example of a site structured this way.

This presents a tough issue for HTTrack, since its filters need a file
extension to match against; without one, you end up crawling everything. For example:

+www.foo.com/bar/foobar/*
(Saves the extension-less pages as index.html files, which is good, but it also
crawls DEEP).

+www.foo.com/bar/foobar/*.html
(Doesn't work, because technically the page isn't an .htm, .html, or .shtml file).
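
From what I can tell from the filter documentation, the bracket wildcards might
be the missing piece: *[name] is supposed to match a single name with no further
slashes. So something like this might stop the crawl one level down (untested on
my end):

+www.foo.com/bar/foobar/*[name]
(Should match www.foo.com/bar/foobar/page but not
www.foo.com/bar/foobar/deeper/page, if I'm reading the docs right).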


Is there any way, besides setting an External Depth, to stop HTTrack from
crawling beyond a certain point in a path?
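
For reference, the full command line I've been testing with looks roughly like
this (the URL and output folder are just placeholders):

httrack "https://www.foo.com/bar/foobar/" -O ./mirror "-*" "+www.foo.com/bar/foobar/*[name]"

The "-*" rule excludes everything first, so only links matching the explicit +
filter should get followed.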
 