> some ideas for determining that two pages are the
> same.
>
> For example, <http://www.website.com> =>
> <http://www.website.com/index.php>
There's no way to know before downloading; the two can be different pages.
> There are links on the page returned from
> www.website.com that link to
> www.website.com/index.php
The site should have redirected one to the other. Most sites don't bother.
> There are dynamic links on this page that reflect
> the URL of the page e.g. a href="/#jjj" => a
> href="/index.php#jjj"
Internal links should be relative so no modifications are required.
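If you do need absolute URLs for comparison, the standard library can resolve each link against the page it was found on. This is only a rough sketch: resolve_link is a name made up for this example, and dropping the #fragment assumes the fragment never changes what the server returns.

```python
from urllib.parse import urljoin, urldefrag

def resolve_link(page_url: str, href: str) -> str:
    """Resolve href against the page it appeared on and drop any #fragment."""
    absolute = urljoin(page_url, href)
    url, _fragment = urldefrag(absolute)
    return url

# The two links from the question still resolve to different URLs.
print(resolve_link("http://www.website.com", "/#jjj"))
print(resolve_link("http://www.website.com/index.php", "/index.php#jjj"))
```

This removes the #jjj noise but does not tell you whether / and /index.php are the same page. That still needs a download.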
> So far I've just decided to work off the page Title
> to move forward and give the user the option of
> basing duplicates on this criteria. But I'd like to
> be a little smarter.
No easy way. Time stamps may be off, and internal metadata may differ without
being visible on the page.
Even page?arg1=x&arg2=y can differ from page?arg2=y&arg1=x.
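One heuristic that goes a bit further than the title is to fingerprint the visible text after downloading. The sketch below is an assumption-laden example, not a reliable method: candidate_key and body_fingerprint are hypothetical names, the regex tag stripping is crude, and sorting the query arguments only produces a hint, since a server is free to treat the two orders differently.

```python
import hashlib
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def candidate_key(url: str) -> str:
    """Sort query args and drop the fragment. A hint only: order may matter to the server."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path or "/", query, ""))

def body_fingerprint(html: str) -> str:
    """Hash the visible text only, ignoring comments, scripts, styles and markup."""
    # Drop comments and script/style blocks, which often carry hidden metadata.
    text = re.sub(r"(?is)<!--.*?-->|<(script|style)\b.*?</\1>", " ", html)
    # Drop the remaining tags, then collapse whitespace and case.
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    text = " ".join(text.split()).lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Two downloads that differ only in a hidden comment compare as equal.
page_a = "<html><!-- generated 10:01 --><body><p>Hello  world</p></body></html>"
page_b = "<html><!-- generated 10:02 --><body><p>Hello world</p></body></html>"
print(body_fingerprint(page_a) == body_fingerprint(page_b))
```

Use candidate_key only to decide which URLs are worth comparing, and body_fingerprint to compare what actually came back. A time stamp printed in the visible text will still make two otherwise identical pages look different.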