Re: Extracting text only from an entire website

Subject: Re: Extracting text only from an entire website

Author: patricio

Date: 10/08/2015 00:36

You could try grabbing everything and then filtering for the text. There's a
Python package called 'textract' that can dump text from many file types, and
with a simple script you can create a text counterpart to every non-text file
(e.g., for every FOO.pdf you get a FOO.pdf.txt).
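
Something like the following is what I have in mind. It's only a rough sketch: the function names, the skip list, and the directory layout are my own placeholders, and it assumes you've already downloaded the site into a local folder. The one real textract call is textract.process(), which returns the extracted text as bytes.

```python
import os
from pathlib import Path

# Files that are already plain text get no counterpart (assumed skip list).
TEXT_SUFFIXES = {".txt"}


def default_extract(path):
    """Extract text with textract (third-party: pip install textract)."""
    import textract  # imported lazily so the rest works without it installed
    return textract.process(str(path))  # returns bytes


def dump_text_counterparts(root, extract=default_extract):
    """Walk `root` and write NAME.EXT.txt next to every non-text file.

    `extract` is any callable that takes a path and returns bytes.
    Returns the list of counterpart files that were written.
    """
    written = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            src = Path(dirpath) / name
            if src.suffix.lower() in TEXT_SUFFIXES:
                continue  # already text, including counterparts from earlier runs
            dst = src.with_name(src.name + ".txt")  # FOO.pdf -> FOO.pdf.txt
            try:
                data = extract(src)
            except Exception as exc:
                # Unsupported or broken file: report it and keep going.
                print(f"skipped {src}: {exc}")
                continue
            dst.write_bytes(data)
            written.append(dst)
    return written
```

You'd call it as dump_text_counterparts("mirror") where "mirror" is whatever folder holds your copy of the site. Files textract can't handle are just reported and skipped, so one bad file doesn't stop the run.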

Why do you want to pull only text?

All articles

Subject	Author	Date
Extracting text only from an entire website	Jay Bolt	10/01/2015 20:35
Re: Extracting text only from an entire website	patricio	10/08/2015 00:36