HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: minor new.txt technicality (FYI)...
Author: Haudy Kazemi
Date: 01/06/2003 03:20
 
> > I've been analyzing the new.txt generated from my large
> > crawl of epanorama and have a question...
> > Why does www.hut.fi/~then/circuits/covox.zip get listed in
> > new.txt twice, but with different local file names
> > (covox.zip, covox-2.zip)?
>
> In fact there are two links: the 'driver' one and
> the 'driver files' one. The latter has two different links:
> <http://www.hut.fi/Misc/Electronics/circuits/covox.zip>
> <http://www.epanorama.net/circuits/covox.zip>
> 
> And there is also the first one, 
> <http://www.hut.fi/~then/circuits/covox.zip>
> 
> I assume this is the reason why the engine detected a
> collision somewhere, but I can't really guess as I don't
> have the full logs.
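
For anyone who wants to check their own new.txt for this kind of double listing, here is a rough sketch. It assumes the rows have already been parsed into (URL, local file name) pairs; the function name and the third sample row are made up for illustration, and this is not how the HTTrack engine itself detects collisions.

```python
from collections import defaultdict


def find_duplicate_urls(records):
    """Return {url: [local names]} for URLs saved under more than one local name.

    `records` is an iterable of (url, local_name) pairs, assumed to have been
    extracted from new.txt beforehand (the parsing step is not shown here).
    """
    names_by_url = defaultdict(list)
    for url, local_name in records:
        # Keep each distinct local name once, in order of first appearance.
        if local_name not in names_by_url[url]:
            names_by_url[url].append(local_name)
    return {url: names for url, names in names_by_url.items() if len(names) > 1}


# The first two rows mirror the case in this thread; the third is hypothetical.
sample = [
    ("www.hut.fi/~then/circuits/covox.zip", "covox.zip"),
    ("www.hut.fi/~then/circuits/covox.zip", "covox-2.zip"),
    ("www.epanorama.net/circuits/covox.zip", "covox.zip"),
]
duplicates = find_duplicate_urls(sample)
for url, names in duplicates.items():
    print(url, "->", ", ".join(names))
```

Only the URL that appears with two different local names is reported; a URL that is listed once is left out.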

The full logs are here along with some related files:
<http://kazemizadeh.net/httrack/epanorama.com>

The file new-diff.xls is a Microsoft Excel document in which
I loaded and parsed new.txt.  It contains a subset of
new.txt: all rows where the MIME type matches the file type
were removed.  The remaining rows include some false
positives (text/html with an .htm extension, etc.), but they
are my starting point for determining which links may be
problematic (such as the false 404 pages returned when
accessing a gif/jpg file).  If you want to look at the
formulas, they are in columns G, H, and L.
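
The same filtering can be sketched outside Excel. This is not a translation of the spreadsheet formulas; it is a minimal Python sketch that assumes the rows have already been parsed into (MIME type, local file name) pairs. The extension table and the sample rows are hypothetical and would need extending for a real crawl.

```python
import posixpath

# Hypothetical extension-to-MIME table; extend it for a real crawl.
EXPECTED_MIME = {
    ".html": "text/html",
    ".htm": "text/html",
    ".gif": "image/gif",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".zip": "application/zip",
}


def mime_mismatches(records):
    """Return the (mime, local_name) pairs whose MIME type does not match
    the type expected for the local file's extension.

    Unknown extensions are reported too, so they can be reviewed by hand.
    """
    mismatches = []
    for mime, local_name in records:
        ext = posixpath.splitext(local_name)[1].lower()
        expected = EXPECTED_MIME.get(ext)
        if expected != mime.lower():
            mismatches.append((mime, local_name))
    return mismatches


# Illustrative rows only, not taken from the real new.txt.
sample_rows = [
    ("text/html", "index.htm"),   # matches via the table, so it is dropped
    ("image/gif", "logo.gif"),    # matches, dropped
    ("text/html", "photo.jpg"),   # an HTML error page saved under an image name
]
suspects = mime_mismatches(sample_rows)
print(suspects)
```

Because .htm is mapped to text/html in the table, that particular false positive drops out, while an HTML page saved under a jpg name (the false 404 case) stays in the list.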
 





