Hi,
This isn't an HTTrack-specific question, but I'd like some ideas for
determining whether two pages are the same.
For example, <http://www.website.com> => <http://www.website.com/index.php>
There are links on the page returned from www.website.com that link to
www.website.com/index.php
There are also dynamic links on this page that reflect the URL it was
requested under, e.g. a href="/#jjj" => a href="/index.php#jjj"
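One cheap first step for the case above would be to normalise URLs before comparing them. This is only a rough sketch: `canonical_url` and the `DEFAULT_DOCS` list are names I've made up for illustration, it assumes absolute URLs, and it assumes the server really does serve index.php for "/", which isn't guaranteed on every site.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: file names the server treats as the directory default.
# This list is a guess and would need adjusting per site.
DEFAULT_DOCS = {"index.php", "index.html", "index.htm"}


def canonical_url(url):
    """Return a normalised form of an absolute URL for duplicate checks."""
    parts = urlsplit(url)
    path = parts.path or "/"  # "http://host" and "http://host/" are the same
    head, _, tail = path.rpartition("/")
    if tail.lower() in DEFAULT_DOCS:
        path = head + "/"  # "/index.php" -> "/", "/a/index.php" -> "/a/"
    # Drop the fragment (#jjj): it is never sent to the server.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


if __name__ == "__main__":
    print(canonical_url("http://www.website.com"))
    print(canonical_url("http://www.website.com/index.php#jjj"))
```

With this, both URLs from the example reduce to the same string, so they could share one key in a "seen" set.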
So far I've just decided to work off the page title and give the user the
option of detecting duplicates by this criterion. But I'd like to be
a little smarter.
Avoiding duplicates is important for my purposes, as I am generating content
from the site structure and can't have any double-ups.
I could parse the content and generate a score somehow; there are a few
discussions out there...
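To make the scoring idea concrete, here is a minimal sketch of one possible approach: compare only the visible text of two pages using word shingles and Jaccard similarity. It's one technique among several, not necessarily what those discussions recommend. The function names and the shingle size of 3 are my own assumptions, and the naive text extraction also picks up script and style contents.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags and attributes such as href."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_words(html):
    """Lower-cased words from the page text (naive: includes script/style)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"\w+", " ".join(parser.chunks).lower())


def shingles(words, size=3):
    """Set of overlapping word tuples of the given size."""
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(html_a, html_b, size=3):
    """Jaccard similarity of the two pages' shingle sets, from 0.0 to 1.0."""
    a = shingles(visible_words(html_a), size)
    b = shingles(visible_words(html_b), size)
    if not a and not b:
        return 1.0  # two empty pages count as identical
    return len(a & b) / len(a | b)
```

Because link targets live in attributes rather than in the text, two copies of a page that differ only in href="/#jjj" versus href="/index.php#jjj" should score 1.0. The threshold for calling two pages duplicates (say 0.9) would be another setting to expose to the user.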
Any comments?