HTTrack Website Copier
Free software offline browser - FORUM
Subject: Deciding two pages are the same
Author: Tim S
Date: 09/19/2010 05:39
 
Hi,

This isn't an HTTrack-specific question, but I'd like some ideas for
determining that two pages are the same.

For example, <http://www.website.com> => <http://www.website.com/index.php>

There are links on the page returned from www.website.com that link to
www.website.com/index.php

There are also dynamic links on this page that reflect the URL the page was
requested under, e.g. a href="/#jjj" => a href="/index.php#jjj"
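
One way I could neutralise those self-referencing links before comparing
anything is to canonicalise every href first. This is only a rough sketch: the
INDEX_NAMES list is my assumption about which file names the server treats as
the default document, and canonical_url / normalize_links are illustrative
names, not anything HTTrack provides.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: the server returns one of these for a bare directory request.
INDEX_NAMES = ("index.php", "index.html", "index.htm")

HREF_RE = re.compile(r'href="([^"]*)"', re.IGNORECASE)


def canonical_url(url, keep_fragment=False):
    """Collapse '.../index.php' style paths onto the bare directory path."""
    parts = urlsplit(url)
    path = parts.path
    # 'http://host' and 'http://host/' address the same resource.
    if parts.netloc and not path:
        path = "/"
    head, sep, tail = path.rpartition("/")
    # Bare relative names without a slash are left alone.
    if sep and tail.lower() in INDEX_NAMES:
        path = head + "/"
    fragment = parts.fragment if keep_fragment else ""
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, fragment)
    )


def normalize_links(html):
    """Rewrite every double-quoted href in the markup to its canonical form."""
    return HREF_RE.sub(
        lambda m: 'href="%s"' % canonical_url(m.group(1), keep_fragment=True),
        html,
    )
```

With this, the two example URLs map to the same key, and the two link variants
produce identical markup, so a plain comparison (or a hash) of the normalised
pages would already catch this particular case.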

So far I've just decided to work off the page title to move forward, and to
give the user the option of basing duplicate detection on this criterion. But
I'd like to be a little smarter.
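
For reference, the title-based approach amounts to something like the sketch
below. It assumes the pages are already downloaded into a dict of URL => HTML;
page_title and group_by_title are just illustrative names.

```python
from collections import defaultdict
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_title(html):
    """Return the page title, lower-cased with whitespace collapsed."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title.split()).lower()


def group_by_title(pages):
    """pages maps URL -> HTML; the result maps title -> list of URLs."""
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[page_title(html)].append(url)
    return dict(groups)
```

Any group with more than one URL is a candidate duplicate. The obvious
weakness is that unrelated pages sharing a generic title end up in the same
group.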

Avoiding duplicates is important for my purposes, as I am generating content
from the site structure and can't have double-ups.

I could parse the content and generate a similarity score somehow; there are
a few discussions out there...
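
To make the "score" idea concrete, here is a minimal sketch of one option:
strip the markup, break the visible text into overlapping word triples
("shingles"), and take the Jaccard overlap of the two sets. The shingle size
of 3 and any cut-off for calling two pages "the same" are assumptions that
would need tuning, and the function names are illustrative.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible words, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.words.extend(data.lower().split())


def shingles(html, size=3):
    """Return the set of overlapping word tuples in the page's visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    words = parser.words
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(html_a, html_b, size=3):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    a, b = shingles(html_a, size), shingles(html_b, size)
    if not a and not b:
        return 1.0  # two pages with no visible text are treated as identical
    return len(a & b) / len(a | b)
```

Because attribute values such as href never reach the word list, two pages
that differ only in their self-referencing links score 1.0, while pages with
different text score lower.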

Any comments?
 





