Deciding two pages are the same - HTTrack Website Copier Forum

Subject: Deciding two pages are the same

Author: Tim S

Date: 09/19/2010 05:39

Hi,

This isn't an HTTrack-specific question, but I'd like some ideas for
determining whether two pages are the same.

For example, <http://www.website.com> => <http://www.website.com/index.php>

The page returned from www.website.com contains links to
www.website.com/index.php.

There are also dynamic links on this page that reflect the URL of the page,
e.g. a href="/#jjj" => a href="/index.php#jjj", so the two copies are not
byte-for-byte identical.
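
One possible way to line these links up is to reduce every URL to a canonical
form before comparing. The sketch below is only an illustration: the function
name canonical_url and the DEFAULT_DOCS list are made up for this example, and
treating index.php as the directory's default document is an assumption about
the server, not something the site guarantees.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Assumed default-document names; a given server may be configured differently.
DEFAULT_DOCS = {"index.php", "index.html", "index.htm"}


def canonical_url(base, href):
    """Resolve href against base, drop the fragment, and strip a trailing
    default-document name so "/" and "/index.php" compare equal."""
    scheme, netloc, path, query, _fragment = urlsplit(urljoin(base, href))
    head, _, tail = path.rpartition("/")
    if tail.lower() in DEFAULT_DOCS:
        path = head + "/"
    # The query string is kept: index.php?page=2 may well be a different page.
    return urlunsplit((scheme.lower(), netloc.lower(), path or "/", query, ""))


a = canonical_url("http://www.website.com", "/#jjj")
b = canonical_url("http://www.website.com/index.php", "/index.php#jjj")
print(a == b)
```

This only catches duplicates that differ by fragment or default-document name;
anything else still needs a look at the content itself.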

So far I've just decided to work off the page title to move forward, and to
give the user the option of basing duplicate detection on this criterion. But
I'd like to be a little smarter.

Avoiding duplicates is important for my purposes, as I am generating content
from a site structure, so there can be no double-ups.

I could parse the content and generate a similarity score somehow; there are a
few discussions out there...
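
As a rough sketch of what such a score could look like (not a settled
approach), the code below strips the markup, splits the visible text into
overlapping three-word "shingles", and reports the Jaccard overlap of the two
sets. The names shingles and similarity, the shingle size k=3, and any cut-off
for calling two pages "the same" are assumptions picked for illustration.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of script and style tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def shingles(html, k=3):
    """Return the set of k-word tuples found in the page's visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"\w+", " ".join(parser.parts).lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(html_a, html_b, k=3):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    a, b = shingles(html_a, k), shingles(html_b, k)
    if not a and not b:
        return 1.0  # two pages with no visible text are treated as equal
    return len(a & b) / len(a | b)
```

Because attribute values such as href are never part of the visible text, two
copies of a page that differ only in their self-referencing links would score
1.0 here. Pages with rotating content (dates, ads, counters) would score a bit
lower, so the threshold would need tuning per site.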

Any comments?
