HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Extracting text only from an entire website
Author: patricio
Date: 10/08/2015 00:36
 
You could try grabbing everything and then filtering for the text. There's a
Python package called 'textract' that can dump text from many file types, and
with a simple script you can create a text counterpart to every non-text file
(e.g., for every FOO.pdf you get a FOO.pdf.txt).
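
Something along these lines might do it. This is only a rough sketch, not a tested recipe: it assumes textract is installed (pip install textract) and that the mirror already sits in a local folder. The names write_text_counterparts and TEXT_EXTENSIONS are made up for illustration, and the list of extensions to skip is my own guess at what counts as "already text".

```python
import os
from pathlib import Path

# Extensions treated as already-text and skipped.
# This set is an assumption; adjust it for your own mirror.
TEXT_EXTENSIONS = {".txt", ".html", ".htm", ".css", ".js"}


def textract_extract(path):
    """Extract text with textract (imported lazily, so the walker loads without it)."""
    import textract  # third-party package: pip install textract

    # textract.process returns bytes; decode them to a str.
    return textract.process(str(path)).decode("utf-8", errors="replace")


def write_text_counterparts(root, extract=textract_extract):
    """Walk `root` and write FOO.ext.txt next to every non-text FOO.ext.

    `extract` is any callable taking a path and returning a str. It defaults
    to textract, but can be swapped for another extractor.
    Returns the list of .txt files that were written.
    """
    written = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            src = Path(dirpath) / name
            if src.suffix.lower() in TEXT_EXTENSIONS:
                continue  # already text, nothing to do
            dst = src.with_name(src.name + ".txt")
            if dst.exists():
                continue  # counterpart left over from an earlier run
            try:
                text = extract(src)
            except Exception as exc:
                # Unsupported or broken file: report it and keep going.
                print(f"skipped {src}: {exc}")
                continue
            dst.write_text(text, encoding="utf-8")
            written.append(dst)
    return written
```

You would point it at the mirror folder, e.g. write_text_counterparts("path/to/mirror"), and then collect the *.txt files. Files that the extractor can't handle are reported and skipped rather than stopping the run.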

Why do you want to pull only text?
 

