HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: minor new.txt technicality (FYI)...
Author: Haudy Kazemi
Date: 01/06/2003 03:20
> > I've been analyzing the new.txt generated from my 
> > crawl of epanorama and have a question...
> > Why does get 
> in 
> > new.txt twice, but with different local file names 
> > (,> 
> In fact there are two links: the 'driver' one and 
> the 'driver files' one. The latter has two different 
> <>
> <>
> And there is also the first one, 
> <>
> I assume this is the reason why the engine detected a 
> collision somewhere, but I can't really guess as I don't 
> have the full logs.

The full logs are here along with some related files:

The file new-diff.xls is a Microsoft Excel document into 
which I loaded and parsed new.txt.  It holds a subset of 
new.txt: every entry where the MIME type matches the file 
type has been removed.  The remaining entries include some 
false positives (text/html with an htm extension, etc.), 
but they are my starting point for determining which links 
may be problematic (like the false 404 pages returned when 
accessing a gif/jpg file).  If you want to look at the 
formulas, they are in columns G, H, and L.
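
For anyone who would rather not use a spreadsheet, the same 
filtering idea can be sketched in a few lines of Python.  This is 
only a rough illustration, not the formulas from new-diff.xls: it 
assumes the MIME type and local file name have already been 
extracted from each new.txt entry as a pair, and the extension 
table EXT_TO_MIME and the function name find_mismatches are 
hypothetical names made up for this example.

```python
import os

# Hypothetical, deliberately small extension table; extend as needed.
EXT_TO_MIME = {
    ".htm": "text/html",
    ".html": "text/html",
    ".gif": "image/gif",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".css": "text/css",
}


def find_mismatches(rows):
    """Return the (mime, local_file) pairs whose reported MIME type
    disagrees with the type implied by the local file extension."""
    mismatches = []
    for mime, local_file in rows:
        ext = os.path.splitext(local_file)[1].lower()
        expected = EXT_TO_MIME.get(ext)  # None for unknown extensions
        # Drop parameters such as "; charset=..." before comparing.
        reported = mime.split(";")[0].strip().lower()
        # Unknown extensions never match, so they are flagged for review.
        if expected != reported:
            mismatches.append((mime, local_file))
    return mismatches


# Made-up sample entries, not taken from the epanorama crawl.
sample = [
    ("text/html", "index.html"),
    ("image/gif", "logo.gif"),
    ("text/html", "photo.jpg"),  # e.g. an error page served for an image
    ("text/html; charset=iso-8859-1", "page.htm"),
]
suspects = find_mismatches(sample)
```

Mapping both htm and html to text/html avoids the false positives 
mentioned above, so what is left is closer to the list of links 
that actually deserve a second look.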
