I'm looking for a solution to this problem:
Some sites refer to themselves in their internal links
as "www.greatsite.com" in some locations,
and "greatsite.com" in other locations. The way their
webserver is configured, you can access all
their pages with either URL.
The problem is in mirroring these sites:
www.greatsite.com is considered different from
greatsite.com, with the result of the website being
copied twice, even if all the content is the same.
So what I'm asking is: is there a way to tell HTTrack
that "www.greatsite.com = greatsite.com" so it can
consolidate the copy into one directory, downloading
the links only once. Most of the time I've seen this
problem I've simply told HTTrack to grab both URLs,
but now I'm trying to get a large site (I've capped
HTTrack at 10KB/sec with 1 connection) and telling it
to use both URLs isn't going to be very feasible.
Is there a way to tell HTTrack's URL rewriting engine
about things like this? Is there a way to do this
with an external program?
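
To make it concrete, here is a rough sketch in Python of
the kind of mapping I mean. This is only an illustration
of the idea, not anything HTTrack provides: the function
name canonical_url is made up, and it assumes that
dropping a leading "www." is always safe for the site in
question.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Map www.host and host to the same URL (hypothetical helper)."""
    parts = urlsplit(url)
    # .hostname is already lowercased by urlsplit
    host = parts.hostname or ""
    # Assumption: the "www." alias serves exactly the same content
    if host.startswith("www."):
        host = host[len("www."):]
    # Keep an explicit port if there was one; any user:password part is dropped
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    for link in (
        "http://www.greatsite.com/docs/index.html",
        "http://greatsite.com/docs/index.html",
    ):
        print(canonical_url(link))
```

With something like this, both forms of a link would end
up under a single "greatsite.com" directory and each page
would only need to be fetched once.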