> I tried to download the Support Knowledge Base
> from the URL you recommended. There are 972
> questions and answers, and each page includes
> several links to other pages such as Login,
> Search, Answer, etc. There were so many links
> that only 10MB of the 70MB I downloaded was
> what I wanted. More than 70% of the HTML pages
> were repeated and redundant.
Generally, using filters is the answer to this problem.
For example, suppose you want to catch all links that look like:
www.foo.com/bar.cgi?page=78
but you also get duplicate files from links which look
like:
www.foo.com/bar.cgi?page=78&next
Then, just add the following filter:
-www.foo.com/bar.cgi?*&next*
You'll have to adjust this example to fit your needs, of
course -- but you get the idea here.
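
For instance, assuming you are using the command-line client
(httrack) and a hypothetical starting URL, you could combine a
keep filter with the exclude filter above. This is a sketch to
adapt, not the exact command for your site:

  httrack "http://www.foo.com/bar.cgi?page=1" -O ./kb-mirror \
      "+www.foo.com/bar.cgi?page=*" \
      "-www.foo.com/bar.cgi?*&next*"

The "+" filter keeps the question pages you want, while the "-"
filter drops the duplicate "&next" variants before they are
downloaded. The quotes keep the shell from expanding the "*"
and "&" characters.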