Subject: Re: Can't scrape text only from Mic.com
Author: Nancy, fun with Data
Date: 08/27/2021 05:12
I looked through the log files, and for the majority of the article pages it
says
"Warning: could not detect encoding for:
<https://www.mic.com/p/this-couples-gender-reveal-stunt-started-a-deadly-wildfire-theyre-now-facing-20-years-in-jail-82561477>"
So any time HTTrack cannot detect the encoding for a page, it just keeps it as a
TMP file and deletes it at the end. That's strange, because when I interrupted a
previous scan it turned a LOT of those TMP files into HTML files. So I think
there's a glitch in HTTrack, which is over 10 years old now. The devs keep
saying "those are temp files, they're supposed to be deleted", but it's not true.
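
For anyone who wants to check their own log for the same warning, here is a rough Python sketch that pulls out every URL that follows a "could not detect encoding for:" message. The function names are made up for illustration, and the pattern is only an assumption based on the warning quoted above (URL either on the same line or on the next one, with or without angle brackets); adjust it if your log looks different.

```python
import re
from collections import Counter

# Assumed shape of the warning, taken from the message quoted in this post.
# \s* also matches a newline, so the URL may sit on the following line.
WARNING = re.compile(r"could not detect encoding for:\s*<?([^\s<>]+)>?")


def undetected_encoding_urls(log_text):
    """Return every URL named in a 'could not detect encoding' warning."""
    return WARNING.findall(log_text)


def report(log_path):
    """Print how many pages hit the warning, then each distinct URL."""
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        urls = undetected_encoding_urls(handle.read())
    counts = Counter(urls)
    print(f"{len(urls)} warnings across {len(counts)} distinct URLs")
    for url, hits in counts.most_common():
        print(f"{hits:4d}  {url}")
```

Calling report() with the path to the log in the mirror folder (I believe it is named hts-log.txt, but check your own project) gives a list you can compare against the TMP files left behind after an interrupted scan.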