| PS
To expand on the last couple of paragraphs: I know I could exclude all HTML and
then whitelist particular sites (like my target site), but (a) this won't work
with Wiki*edia.org (unless there is a scan filter for MIME type, which I can't
find documented), and (b) I should be able to grab the HTML from external sites
when they are directly linked, without fearing that the crawl will then
continue through the entire site from there. | |