Hello,
I have a fairly simple problem, but it has been giving me headaches for days
now.
I have a number of websites to crawl, and I only want to download the text,
no binary files. First I tried this filter list:
-* +*htm* +*cgi* +*asp* +*php* +*jsp* +*xml* +*dhm* +*xhtm*
That works fine, except that content files can have any extension, or none
at all. So now I'm trying this instead:
-u1 -mime:*/* +mime:text/html
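In case the exact invocation matters, I'm running it roughly like this (the URL and output directory below are placeholders, not my real ones):

```shell
# Hypothetical invocation sketch; example.com and ./mirror are placeholders.
# -u1 asks HTTrack to check the document type when the extension is unknown,
# and the quoted MIME filters reject everything except text/html.
httrack "http://example.com/" -O ./mirror -u1 "-mime:*/*" "+mime:text/html"
```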
It's great, but I get a lot of ".delayed" files for all the
images/pdf/zip/... files, and I really don't want them.
Please advise me.
PS: I'm using v3.4
Best regards!