HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: minor new.txt technicality (FYI)...
Author: Haudy Kazemi
Date: 01/06/2003 03:20
 
> > I've been analyzing the new.txt generated from my large
> > crawl of epanorama and have a question...
> > Why does www.hut.fi/~then/circuits/covox.zip get listed in
> > new.txt twice, but with different local file names
> > (covox.zip, covox-2.zip)?
>
> In fact there are two links: the 'driver' one and
> the 'driver files' one. The latter has two different links:
> <http://www.hut.fi/Misc/Electronics/circuits/covox.zip>
> <http://www.epanorama.net/circuits/covox.zip>
>
> And there is also the first one,
> <http://www.hut.fi/~then/circuits/covox.zip>
>
> I assume this is the reason why the engine detected a
> collision somewhere ; but can't really guess as I don't
> have the full logs

The full logs are here along with some related files:
<http://kazemizadeh.net/httrack/epanorama.com>

The file new-diff.xls is a Microsoft Excel document in which
I loaded and parsed new.txt. It holds a subset of new.txt:
every entry whose MIME type matched its file type was removed.
The remaining entries include some false positives (text/html
with an .htm extension, etc.), but it is my starting point for
determining which links may be problematic (like the false 404
pages returned when accessing a gif/jpg file). If you want to
look at the formulas, they are in columns G, H, and L.
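
For anyone who would rather script this than use Excel, here is a rough Python sketch of the same filtering idea. It is not a translation of the spreadsheet formulas. It assumes the MIME type and local file name have already been pulled out of each new.txt entry, and the extension table, the function name find_mismatches, and the sample rows are all made up for illustration. Mapping .htm to text/html avoids the false positive mentioned above.

```python
import os

# Illustrative extension-to-MIME table (an assumption, not taken from HTTrack);
# extend it to cover the file types in a real crawl.
EXPECTED_MIME = {
    ".htm": "text/html",
    ".html": "text/html",
    ".gif": "image/gif",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".zip": "application/zip",
}


def find_mismatches(rows):
    """Return the (mime, local_file) rows whose MIME type does not match
    the type implied by the local file's extension."""
    mismatches = []
    for mime, local_file in rows:
        ext = os.path.splitext(local_file)[1].lower()
        expected = EXPECTED_MIME.get(ext)
        # Extensions missing from the table are kept for manual review.
        if expected is None or mime.strip().lower() != expected:
            mismatches.append((mime, local_file))
    return mismatches


# Hypothetical sample rows, not real entries from the crawl.
rows = [
    ("application/zip", "circuits/covox.zip"),
    ("text/html", "images/logo.gif"),  # e.g. an error page served for a gif
    ("text/html", "index.htm"),
]

for mime, local_file in find_mismatches(rows):
    print(mime, local_file)
```

Only the gif served as text/html is reported here, which is the kind of entry that would point to a false 404 page.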
 


All articles

Subject                                  Author        Date
minor new.txt technicality (FYI)...      Haudy Kazemi  01/05/2003 10:16
Re: minor new.txt technicality (FYI)...  Xavier Roche  01/05/2003 15:21
Re: minor new.txt technicality (FYI)...  Haudy Kazemi  01/06/2003 03:20
Re: minor new.txt technicality (FYI)...  Xavier Roche  01/07/2003 21:34



