Re: Deciding two pages are the same - HTTrack Website Copier Forum

Subject: Re: Deciding two pages are the same

Author: William Roeder

Date: 09/19/2010 15:49

> some ideas for determining that two pages are the
> same.
> 
> For example, <http://www.website.com> =>
> <http://www.website.com/index.php>
There's no way to know before downloading; the two URLs can return different pages.

> There are links on the page returned from
> www.website.com that link to
> www.website.com/index.php
The site should have redirected one to the other. Most sites don't bother.

> There are dynamic links on this page that reflect
> the URL of the page e.g. a href="/#jjj" => a
> href="/index.php#jjj"
Internal links should be relative so no modifications are required.
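
As a rough illustration of the link handling (a sketch only, not how HTTrack does it): Python's standard urllib.parse can resolve an href against the page it came from and drop the #fragment, which is never sent to the server. The helper name resolve is made up for this example.

```python
from urllib.parse import urljoin, urldefrag

def resolve(base_url, href):
    """Resolve href against the page URL and drop any #fragment.

    Hypothetical helper for illustration only.
    """
    absolute = urljoin(base_url, href)
    url, _fragment = urldefrag(absolute)
    return url

# The two example links differ only in path once the fragment is gone.
print(resolve("http://www.website.com/", "/#jjj"))
print(resolve("http://www.website.com/index.php", "/index.php#jjj"))
```

The results are still two distinct URLs (/ and /index.php), so this step alone does not show they are the same page; it only removes the fragment from the comparison.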

> So far I've just decided to work off the page Title
> to move forward and give the user the option of
> basing duplicates on this criteria.  But I'd like to
> be a little smarter.
There's no easy way. Timestamps may be off, and internal metadata may differ
without being visible on the page.
Even page?arg1=x&arg2=y can differ from page?arg2=y&arg1=x.
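
If you want something a little smarter than the title, one option after both pages are downloaded is to compare a hash of the body. This is a minimal sketch under my own assumptions, not anything HTTrack provides: it strips HTML comments and collapses whitespace before hashing, and the names fingerprint and same_page are invented here. It will still report pages as different if they embed timestamps or other per-request data in the markup.

```python
import hashlib
import re

def fingerprint(html):
    """Hash the page body after light normalization (an assumed heuristic)."""
    # Drop HTML comments, which often carry invisible per-request metadata.
    without_comments = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    # Collapse runs of whitespace so formatting differences do not matter.
    normalized = re.sub(r"\s+", " ", without_comments).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def same_page(html_a, html_b):
    """Treat two downloaded bodies as duplicates if their fingerprints match."""
    return fingerprint(html_a) == fingerprint(html_b)
```

Because it compares content and not URLs, it makes no guess about whether reordered query arguments are equivalent; each variant still has to be fetched.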
