I have a requirement to get the content from the sites that
we have partnered with. There are around 100 different
sites to crawl. I need to crawl each web site and get just
the articles. The article content is in two formats: on some
sites the articles are in HTML format, and on other sites the
articles are in PDF format. The number of articles from each
site varies from 200K to 500K.
On some sites the link to the PDF file does not end with .pdf;
the link is similar to this: <web root>/ViewPdf?artid=1000.
When the user clicks on the link, it launches Adobe Reader
with the PDF file.
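For those links I am thinking I could look at the Content-Type
response header instead of the file extension to tell that the
response is a PDF. This is only a rough sketch of the idea
(the URL below is just an example, not one of the real sites):

import requests

def is_pdf(url):
    # A HEAD request returns only the headers, so the PDF itself
    # is not downloaded just to find out what it is.
    resp = requests.head(url, allow_redirects=True, timeout=30)
    content_type = resp.headers.get("Content-Type", "").lower()
    return "application/pdf" in content_type

# e.g. is_pdf("https://example-partner-site.com/ViewPdf?artid=1000")

Some servers do not answer HEAD requests properly, so I guess
I would have to fall back to a GET request in those cases.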
As you know, saving the whole site is not a good solution
because these are pretty big sites.
Is there any way I can scan the whole site and get just the
articles that I need? Something like the rough sketch below
is what I have in mind.
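Just to show what I mean, this is a rough sketch of the kind of
crawl I am thinking about: it only keeps links that look like
articles instead of saving every page. The start URL and the
artid= pattern are made up; each real site would need its own
pattern:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

START_URL = "https://example-partner-site.com/"  # made-up start page
ARTICLE_MARKER = "artid="                        # made-up article URL pattern

def crawl(start_url, max_pages=1000):
    # Breadth-first crawl that collects article links instead of
    # saving the whole site.
    seen, queue, articles = set(), [start_url], set()
    domain = urlparse(start_url).netloc
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        resp = requests.get(url, timeout=30, stream=True)
        if "text/html" not in resp.headers.get("Content-Type", ""):
            resp.close()  # not an HTML page (for example a PDF), skip parsing
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc != domain:
                continue                # stay on the partner site
            if ARTICLE_MARKER in link:
                articles.add(link)      # looks like an article, keep it
            elif link not in seen:
                queue.append(link)      # keep crawling navigation pages
    return articles

# e.g. article_links = crawl(START_URL)

Each site would need its own article pattern, and I would add
delays so I do not overload the partner sites.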
This is a very good tool, the best I have found so far. I
appreciate your help.
Thank you,
Ravi