> Is it possible to exclude the end of each page of a
> website from link-extraction?
No such capability.
> The reason is that the end of each page includes
> index pages which cannot be filtered out:
> Crawling over the index pages (see example: page4,
> page5) increases the amount of uninteresting data
> dramatically.
Filter out what you do not want.
-* +*/podo/*
Even if it spiders the index pages it won't go there.
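
To show why that pair of rules keeps the crawl inside /podo/, here is a rough Python sketch. It assumes the rules are read left to right and that the last rule matching a URL decides, and it uses the standard fnmatch module as a stand-in for the crawler's own wildcard matching. The allowed() helper and the index/page4 URL are made up for illustration.

```python
from fnmatch import fnmatchcase

def allowed(url, rules):
    """Return True if the last rule matching url is a '+' rule."""
    verdict = True  # assumption: a link is followed when no rule matches
    for rule in rules.split():
        sign, pattern = rule[0], rule[1:]
        if fnmatchcase(url, pattern):
            verdict = (sign == "+")  # a later match overrides an earlier one
    return verdict

RULES = "-* +*/podo/*"

print(allowed("www.site.org/podo/Page1", RULES))   # True
print(allowed("www.site.org/index/page4", RULES))  # False (hypothetical URL)
```

The first rule rejects everything and the second re-admits only URLs containing /podo/, so a link to an index page is dropped even when it is found on a page that was fetched.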
> <href="www.site.org/podo/Page1">
> <href="www.site.org/podo/Page2">
> <href="www.site.org/podo/Page3">
Alternatively, if all the pages are reachable from the starting URL, you can
set the depth so that stuff beyond the index pages isn't allowed.
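
The depth idea can be sketched the same way. This is only an illustration of a depth-limited breadth-first crawl, not the crawler's actual code: the link graph, the page names and the crawl() helper are hypothetical, and how a real tool counts depth (whether the start page is level 0 or 1) is something to check in its documentation.

```python
from collections import deque

def crawl(start, links, max_depth):
    """Breadth-first walk that follows links at most max_depth hops from start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # page is kept, but links found on it are not followed
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen

# Hypothetical link graph: start -> content pages -> index pages -> everything else
LINKS = {
    "start": ["podo/Page1", "podo/Page2", "podo/Page3"],
    "podo/Page1": ["index/page4"],
    "podo/Page2": ["index/page5"],
    "index/page4": ["other/far"],
}

reached = crawl("start", LINKS, max_depth=2)
print(sorted(reached))
```

With a limit of 2 hops the index pages are still reached, but nothing linked from them is. Lowering the limit to 1 would stop at the content pages themselves.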