| Having moved again, I don't have all my stuff with me, but looking through
what I do have, I find lines like the following in new.txt:
20:38:56 57335/57335 ---M-- 200 added ('OK') text/html
date:Mon,%2002%20Aug%202010%2008:25:52%20GMT
<http://animals.howstuffworks.com/insects/fig-wasp.htm>
E:/I/Escape/animals.howstuffworks.com/insects/fig-wasp.htm (from
<http://auto.howstuffworks.com/stirling-engine.htm>)
But the downloaded copy of <http://auto.howstuffworks.com/stirling-engine.htm>
doesn't have a link to this page, even though new.txt pretends it does. (I
removed all /" *\+ *"/ from the page so as to check obfuscated links as
well.)
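
For what it's worth, the check I describe above can be sketched roughly as
follows in Python. This is only an illustration under my own assumptions: the
function names are made up, the regular expression mirrors the /" *\+ *"/
removal, and it only looks for the full URL or its host-relative path, so
links obfuscated in other ways would still be missed.

```python
import re

# Mirrors the /" *\+ *"/ substitution: drops "..." + "..." string joins
# so that a URL split across concatenated literals becomes contiguous.
CONCAT = re.compile(r'" *\+ *"')


def strip_concatenations(html: str) -> str:
    """Return the page text with quote-plus-quote joins removed."""
    return CONCAT.sub("", html)


def page_mentions(html: str, target_url: str) -> bool:
    """Hypothetical check: does the page contain the target URL at all?

    Looks for the absolute URL, and also for its host-relative path
    (e.g. /insects/fig-wasp.htm), after removing string concatenations.
    """
    cleaned = strip_concatenations(html)
    path = re.sub(r"^[a-z]+://[^/]+", "", target_url)
    if target_url in cleaned:
        return True
    # Ignore trivial paths such as "/" that would match almost any page.
    return len(path) > 1 and path in cleaned
```

Running page_mentions() on the saved copy of the referring page, with the URL
that the log line says was found there, gives a quick yes/no on whether the
"(from ...)" attribution is even plausible.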
I discovered ocw.mit.edu because, some years ago, another download with
external=1 ended up copying that whole site (which is huge) in my absence.
That was also when I became aware of the implications of this bug.
Because of the misattribution in new.txt I haven't been able to track down the
precise point where that spider escaped; the howstuffworks case is the
smallest example I have found so far. Sorry it isn't cleaner. | |