HTTrack Website Copier
Free software offline browser - FORUM
Subject: Extracting text only from an entire website
Author: Jay Bolt
Date: 10/01/2015 20:35
 
I am completing a Master's degree in Linguistics and wish to create a corpus
from all the text on a company's website. The website runs to over 300 pages,
so I need to scrape the text from every page - anything between the
<body></body> tags is fine - and dump it all into a single .txt file or
similar.

Can HTTrack do this and, if so, how could I achieve this?
Your help is greatly appreciated.

Thanks

Jay
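
For illustration, the extraction step described above could be sketched as follows. This is a rough sketch, not a confirmed HTTrack feature: it assumes the pages have already been saved to a local folder (for example as an HTTrack mirror), that the files end in .htm or .html, and that they can be read as UTF-8. The names BodyTextExtractor, body_text and build_corpus are made up for this example. It uses only the Python standard library.

```python
from html.parser import HTMLParser
from pathlib import Path


class BodyTextExtractor(HTMLParser):
    """Collects text found inside <body>, skipping <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.chunks.append(text)


def body_text(html):
    """Return the body text of one HTML document as a single line."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_corpus(mirror_dir, out_file):
    """Write the body text of every .htm/.html file under mirror_dir
    into out_file, one page per paragraph. Returns the page count."""
    pages = sorted(p for p in Path(mirror_dir).rglob("*.htm*") if p.is_file())
    with open(out_file, "w", encoding="utf-8") as out:
        for page in pages:
            # Assumption: pages are UTF-8; undecodable bytes are replaced.
            html = page.read_text(encoding="utf-8", errors="replace")
            out.write(body_text(html) + "\n\n")
    return len(pages)
```

Calling build_corpus("path/to/mirror", "corpus.txt") (both paths are placeholders) would then produce one text file for the whole site. Text fragments are joined with spaces, so a word split by inline markup would come out as two tokens; for corpus work that may need adjusting.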
 