> I'm getting almost two complete copies of the same
> site, because some of the code refers to:
> <http://foo.com>
> while other parts refer to:
> <http://foo.companyname.com>
> Is there a way to rewrite this to only capture one
> copy of the site?
No - the engine cannot 'know' that a website X is
identical to a website Y. The only thing you can try
is to download ONE site and exclude the other, using
filters such as:
-* +foo.com/*
and specifying only foo.com as the starting URL.
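As a sketch (assuming the HTTrack command-line client; the output
directory "./foo-mirror" is a hypothetical name), the filters above
would be passed like this:

```shell
# Mirror only foo.com: "-*" excludes everything by default,
# then "+foo.com/*" re-includes URLs on foo.com itself.
httrack "http://foo.com/" -O "./foo-mirror" "-*" "+foo.com/*"
```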
You can then wipe all <http://foo.companyname.com>
occurrences with a script similar to:
find ./ -type f -name "*.html" -exec sh -c \
  "sed 's|http://foo\.companyname\.com||g' {} > _tmp; mv _tmp {}" \;
But this may cause broken links in some cases
(for example, www.foo.com/~bar/ is generally replaced
locally by www.foo.com/_bar/).
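A less fragile alternative (a sketch, not specific to HTTrack): instead
of deleting the duplicate hostname, rewrite it to the canonical one, so
every link still points at a URL that exists. The demo directory and
file below are hypothetical, and `sed -i` assumes GNU sed:

```shell
# Hypothetical demo: rewrite the duplicate hostname to the canonical one
# instead of deleting it outright.
mkdir -p /tmp/mirror-demo
cat > /tmp/mirror-demo/index.html <<'EOF'
<a href="http://foo.companyname.com/~bar/page.html">link</a>
EOF

# Replace the duplicate host with the canonical host in every .html file.
# Using '|' as the sed delimiter avoids escaping the slashes in the URL.
find /tmp/mirror-demo -type f -name "*.html" \
    -exec sed -i 's|http://foo\.companyname\.com|http://foo.com|g' {} +

cat /tmp/mirror-demo/index.html
# -> <a href="http://foo.com/~bar/page.html">link</a>
```

Because the resulting links are absolute URLs to foo.com rather than
empty strings, they stay valid even where the local filename differs
from the original path.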