Hey guys. These days, with the rise of sites that use clean, extensionless URLs, it's hard to find a "normal" page ending in a clear format (i.e. .htm or .html). Instead, most URLs are left open-ended, such as www.foo.com/bar/foobar/. Reddit is a real example of a site built this way.
This presents a tough issue for HTTrack, since its scan rules need a file type to match against; otherwise you have to crawl everything. For example:
+www.foo.com/bar/foobar/*
(Turns any extensionless pages into index.html files, which is good, but it also crawls DEEP).
+www.foo.com/bar/foobar/*.html
(Doesn't work, because technically the page isn't an .htm, .html, or .shtml file).
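To make that concrete, here is roughly how a rule like the first one gets passed on the command line; the URL, output folder, and exact quoting are placeholders, so treat this as a sketch rather than a tested command:
httrack "http://www.foo.com/bar/foobar/" -O "./mirror" "+www.foo.com/bar/foobar/*"
(In WinHTTrack the same filters go into the Scan rules box under Set options, if I remember right.)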
Is there any way, besides setting an External Depth, to stop HTTrack from crawling beyond a certain point in a path?