> > I've been analyzing the new.txt generated from my large
> > crawl of epanorama and have a question...
> > Why does www.hut.fi/~then/circuits/covox.zip get listed in
> > new.txt twice, but with different local file names
> > (covox.zip, covox-2.zip)?
>
> In fact there are two links: the 'driver' one and
> the 'driver files' one. The latter has two different links:
> <http://www.hut.fi/Misc/Electronics/circuits/covox.zip>
> <http://www.epanorama.net/circuits/covox.zip>
>
> And there is also the first one,
> <http://www.hut.fi/~then/circuits/covox.zip>
>
> I assume this is the reason why the engine detected a
> collision somewhere; but I can't really guess, as I don't
> have the full logs.
The full logs are here, along with some related files:
<http://kazemizadeh.net/httrack/epanorama.com>
The file new-diff.xls is a Microsoft Excel document into
which I loaded and parsed new.txt. It is a subset of
new.txt: every row where the MIME type matches the file
type was removed. The remaining rows contain some false
positives (text/html with an htm extension, etc.), but they
are my starting point for determining which links may be
problematic (like the false 404 pages when accessing a
gif/jpg file). If you want to look at the formulas, they
are in columns G, H, and L.
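For anyone who would rather not open the spreadsheet, the same kind of filter can be roughly sketched in Python. This is not the formula from new-diff.xls, only an illustration of the idea. It assumes the MIME type and URL have already been pulled out of each new.txt row as a pair; the EXPECTED_MIME table and the suspicious_rows name are made up for this example. Listing htm alongside html in the table avoids the text/html-with-htm false positive mentioned above.

```python
import os
from urllib.parse import urlsplit

# Hypothetical, deliberately small extension -> MIME table (an assumption,
# not taken from HTTrack); extend it to cover the types in your crawl.
EXPECTED_MIME = {
    ".htm": "text/html",
    ".html": "text/html",
    ".gif": "image/gif",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".zip": "application/zip",
}


def suspicious_rows(rows):
    """Return the (mime_type, url) pairs whose MIME type disagrees
    with what the URL's file extension would suggest.

    `rows` is an iterable of (mime_type, url) pairs that were already
    extracted from new.txt by whatever parsing step you prefer.
    """
    flagged = []
    for mime, url in rows:
        # Take the extension from the URL path only, ignoring any query string.
        ext = os.path.splitext(urlsplit(url).path)[1].lower()
        expected = EXPECTED_MIME.get(ext)
        # Drop parameters such as "; charset=..." before comparing.
        reported = mime.split(";")[0].strip().lower()
        # Extensions missing from the table are skipped, not flagged.
        if expected is not None and reported != expected:
            flagged.append((mime, url))
    return flagged
```

A gif or jpg URL reported as text/html (the false 404 page case) would be flagged by this, while an htm page reported as text/html would not.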