Oh, you're right, it isn't a HEAD.  Rusty me :)
You're right that for a single-site crawl/recrawl the MD5 check is
no better or worse than a URL+size comparison.  The use case I have in mind is
different though -- I expect to be encountering the same image across many
sites (such as many news sites using the same graphic in a story), and I want
to be able to detect the image quickly and not waste bandwidth on downloading
it many times.

The approach you suggest of downloading some bytes then deciding if a
disconnect is appropriate -- is this possible to implement as a plugin, and if
so at which phase?
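
To make the idea concrete, here is a rough sketch of the logic I have in mind, in plain Python rather than as an actual plugin. Everything in it is hypothetical: the function name `fetch_unless_seen`, the `seen_prefixes` set, and the 4096-byte prefix size are my own placeholders, and `stream` stands in for whatever connection object the fetcher really exposes.

```python
import hashlib
import io

# Assumed prefix size; the right value would need tuning.
PREFIX_BYTES = 4096


def fetch_unless_seen(stream, seen_prefixes, prefix_bytes=PREFIX_BYTES):
    """Read a prefix of the body and stop early if its MD5 is already known.

    Returns the full body for new content, or None when the prefix digest
    has been seen before (the caller would disconnect at that point).
    """
    # Note: a real socket may return fewer bytes than requested per read;
    # an in-memory stream such as io.BytesIO does not have that problem.
    prefix = stream.read(prefix_bytes)
    digest = hashlib.md5(prefix).hexdigest()

    if digest in seen_prefixes:
        # Probable duplicate: skip the rest of the download.
        return None

    seen_prefixes.add(digest)
    # New content: read whatever remains of the body.
    return prefix + stream.read()
```

A matching prefix is only a hint, not proof, since two different images could share their first few KB, so a real implementation would probably also compare something like Content-Length before disconnecting.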
Thanks!