> I'm getting almost two complete copies of the same
> site, because some of the code refers to:
> <http://foo.com>
> while other parts refer to:
> <http://foo.companyname.com>
> Is there a way to rewrite this to only capture one
> copy of the site?
No - the engine cannot 'know' that a website X is
identical to a website Y. The only thing you can try
is to download ONE site and exclude the other, using
filters such as:
-* +foo.com/*
and specifying only foo.com as the starting URL.
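As a sketch (assuming the HTTrack command-line client; the output
directory "./foo-mirror" is a hypothetical name), the filters above
would be passed like this:

```shell
# Mirror only foo.com: "-*" excludes everything by default,
# then "+foo.com/*" re-includes URLs on foo.com itself.
httrack "http://foo.com/" -O "./foo-mirror" "-*" "+foo.com/*"
```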
You can then wipe all <http://foo.companyname.com>
occurrences with a script similar to:
find ./ -type f -name "*.html" -exec sh -c \
  "sed 's|http://foo\.companyname\.com||g' {} > _tmp; mv _tmp {}" \;
But this may cause broken links in some cases
(for example, www.foo.com/~bar/ is generally replaced
locally by www.foo.com/_bar/).
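A less fragile alternative (a sketch, not specific to HTTrack): instead
of deleting the duplicate hostname, rewrite it to the canonical one, so
every link still points at a URL that exists. The demo directory and
file below are hypothetical, and `sed -i` assumes GNU sed:

```shell
# Hypothetical demo: rewrite the duplicate hostname to the canonical one
# instead of deleting it outright.
mkdir -p /tmp/mirror-demo
cat > /tmp/mirror-demo/index.html <<'EOF'
<a href="http://foo.companyname.com/~bar/page.html">link</a>
EOF

# Replace the duplicate host with the canonical host in every .html file.
# Using '|' as the sed delimiter avoids escaping the slashes in the URL.
find /tmp/mirror-demo -type f -name "*.html" \
    -exec sed -i 's|http://foo\.companyname\.com|http://foo.com|g' {} +

cat /tmp/mirror-demo/index.html
# -> <a href="http://foo.com/~bar/page.html">link</a>
```

Because the resulting links are absolute URLs to foo.com rather than
empty strings, they stay valid even where the local filename differs
from the original path.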