Extracting text only from an entire website - HTTrack Website Copier Forum

Subject: Extracting text only from an entire website

Author: Jay Bolt

Date: 10/01/2015 20:35

I am completing a Master's degree in Linguistics and wish to create a corpus
from all the text on a company's website. The website runs to over 300 pages,
so I need to scrape the text from every page (anything between the
<body></body> tags is fine) and dump it all into a single .txt file or
similar.

Can HTTrack do this and, if so, how?
Your help is greatly appreciated.

Thanks

Jay
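
The extraction step described above could be sketched roughly as follows. This is only a minimal sketch: it assumes the site has already been mirrored to a local folder of .html files (a website copier such as HTTrack produces a mirror of that kind), and it uses only Python's standard library. The names BodyTextExtractor, extract_body_text and build_corpus, and the folder and file names in the usage note, are hypothetical and chosen for illustration.

```python
import os
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects text found inside <body>, skipping <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_body = False   # True while the parser is inside <body>
        self.skip_depth = 0    # > 0 while inside <script> or <style>
        self.chunks = []       # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and self.skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_body_text(html):
    """Return the visible text of one HTML page, one fragment per line."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def build_corpus(mirror_dir, output_path):
    """Append the body text of every .html/.htm file under mirror_dir
    to a single text file. Returns the number of pages that had text."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for root, dirs, files in os.walk(mirror_dir):
            dirs.sort()  # visit folders in a stable order
            for name in sorted(files):
                if not name.lower().endswith((".html", ".htm")):
                    continue
                path = os.path.join(root, name)
                # Assumption: pages are UTF-8; undecodable bytes are replaced.
                with open(path, encoding="utf-8", errors="replace") as f:
                    text = extract_body_text(f.read())
                if text:
                    out.write(text + "\n\n")
                    count += 1
    return count
```

With a mirror in a folder called mirror, calling build_corpus("mirror", "corpus.txt") would write one block of text per page, separated by blank lines. Navigation menus and footers inside <body> are kept as well, so a corpus built this way may still need cleaning before analysis.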
