HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Deciding two pages are the same
Author: William Roeder
Date: 09/19/2010 15:49
 
> some ideas for determining that two pages are the
> same.
> 
> For example, <http://www.website.com> =>
> <http://www.website.com/index.php>
There's no way to know before downloading; the two URLs can return different pages.
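
One rough way to compare the two after downloading is to hash the bodies. This is only a sketch, not something HTTrack does: the names fingerprint and same_page are made up for illustration, and collapsing whitespace before hashing is an assumption about what counts as "the same".

```python
import hashlib
import re


def fingerprint(body: bytes) -> str:
    """Return a SHA-256 hex digest of the body with whitespace runs collapsed."""
    # Collapse every run of whitespace to one space so that trivial
    # formatting differences do not change the digest (an assumption).
    normalized = re.sub(rb"\s+", b" ", body).strip()
    return hashlib.sha256(normalized).hexdigest()


def same_page(body_a: bytes, body_b: bytes) -> bool:
    """Treat two downloaded bodies as the same page if their digests match."""
    return fingerprint(body_a) == fingerprint(body_b)
```

Any real difference in the bytes (a time stamp, a session token in a link) still makes the digests differ, so a match is strong evidence but a mismatch proves little.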

> There are links on the page returned from
> www.website.com that link to
> www.website.com/index.php
The site should have redirected one to the other. Most sites don't bother.

> There are dynamic links on this page that reflect
> the URL of the page e.g. a href="/#jjj" => a
> href="/index.php#jjj"
Internal links should be relative, so no modifications are required.
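
To see the difference, here is a small sketch using Python's urllib.parse. The helper name targets_same_document is hypothetical. A bare "#jjj" resolves against whatever URL the page was fetched from, while "/#jjj" always points at the site root.

```python
from urllib.parse import urljoin, urldefrag


def targets_same_document(page_url: str, href: str) -> bool:
    """Return True if href, resolved against page_url, points back at that page."""
    # Resolve the link the way a browser would, then drop the #fragment.
    resolved = urldefrag(urljoin(page_url, href)).url
    return resolved == urldefrag(page_url).url


# A fragment-only link stays on the page under either URL.
print(targets_same_document("http://www.website.com/", "#jjj"))
print(targets_same_document("http://www.website.com/index.php", "#jjj"))
# A root-relative link leaves index.php and goes to "/".
print(targets_same_document("http://www.website.com/index.php", "/#jjj"))
```

With fragment-only links the two copies would carry identical markup, which is what makes them comparable without rewriting anything.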

> So far I've just decided to work off the page Title
> to move forward and give the user the option of
> basing duplicates on this criteria.  But I'd like to
> be a little smarter.
There's no easy way. Time stamps may differ between otherwise identical copies, and internal
metadata may differ without being visible on the page.
Even page?arg1=x&arg2=y can differ from page?arg2=y&arg1=x.
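
If you still want to group likely duplicates, something like the following could produce a key for candidates. It is a sketch under assumptions: candidate_key is a made-up name, and sorting the query arguments is a heuristic only, since a server is free to treat the two orderings differently. Confirm any match by comparing the downloaded content.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def candidate_key(url: str) -> str:
    """Build a grouping key: lowercase host, sorted query arguments, no fragment."""
    parts = urlsplit(url)
    # Sort (name, value) pairs so argument order does not affect the key.
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(pairs), "")
    )


a = candidate_key("http://www.website.com/page?arg1=x&arg2=y")
b = candidate_key("http://www.website.com/page?arg2=y&arg1=x")
print(a == b)  # same key, so the pair is only a candidate duplicate
```

Two URLs with the same key are worth comparing; they are not guaranteed to be the same page.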
 