Re: Exclude the end of each page from link-extraction?

Author: William Roeder

Date: 07/22/2009 01:11

> Is it possible to exclude the end of each page of a
> website from link-extraction?
No such capability.

> The reason is that the end of the pages include
> index pages which cannot be filtered out:
> Crawling over the index pages (see example: page4,
> page5) increases the amount of uninteresting data
> dramatically.
Filter out what you do not want:
-* +*/podo/*
Even if it spiders the index pages, it won't go anywhere outside /podo/.
> <href="www.site.org/podo/Page1">
> <href="www.site.org/podo/Page2">
> <href="www.site.org/podo/Page3">
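
To see why that pair of rules behaves this way, here is a rough Python sketch of the matching logic. It is not HTTrack's code: it assumes rules are checked in order with the last matching rule winning, and it approximates the * wildcard with fnmatch. The function name allowed and the index URL are made up for illustration.

```python
from fnmatch import fnmatchcase

# Scan rules in the order given in the post:
# exclude everything, then re-allow anything under /podo/.
RULES = ["-*", "+*/podo/*"]

def allowed(url, rules=RULES, default=True):
    """Return True if url would be kept (assumption: last matching rule decides)."""
    verdict = default
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        # fnmatchcase is only an approximation of the '*' wildcard
        if fnmatchcase(url, pattern):
            verdict = (sign == "+")
    return verdict

if __name__ == "__main__":
    # The second URL is a hypothetical index page, not taken from the thread.
    for url in ["www.site.org/podo/Page1", "www.site.org/index/page4"]:
        print(url, "->", "keep" if allowed(url) else "skip")
```

The order matters in this sketch: with the two rules swapped, the trailing -* would override the +*/podo/* rule and nothing would be kept.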

Alternatively, if all the pages are reachable from the starting URL, you can
set the depth so that anything beyond the index pages isn't allowed.
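
As an illustration of the depth idea only, here is a small Python sketch of a depth-limited crawl over a made-up link graph. The graph, the crawl function, and the choice to count the starting page as level 1 are all assumptions for the example; how HTTrack itself counts levels is not modeled here.

```python
from collections import deque

# Hypothetical link graph: the start page links straight to the wanted
# pages and to one index page; the index page leads to unwanted data.
LINKS = {
    "start": ["podo/Page1", "podo/Page2", "podo/Page3", "index/page4"],
    "index/page4": ["junk/a"],
}

def crawl(start, max_depth, links=LINKS):
    """Breadth-first crawl that stops following links at max_depth levels."""
    seen = {start}
    queue = deque([(start, 1)])  # assumption: the start page is level 1
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # fetched this page, but do not follow its links
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order

if __name__ == "__main__":
    print(crawl("start", 2))
```

With a limit of 2 the index page itself is still fetched, but the page behind it is never reached. This only helps when the wanted pages sit no deeper than the index pages.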
