Re: avoid scanning multiple copies of same file

Subject: Re: avoid scanning multiple copies of same file

Author: Xavier Roche

Date: 11/30/2013 15:06

> I have been trying to download a 'wiki' as well as
> several forum websites. In all cases the download
> seems endless, with multiple copies of the same
> file(s) being created.

Unfortunately, HTTrack cannot "guess" that the links actually point to the same
content. It cannot "collate" links either, but you can exclude the additional
links from the download using scan rules.

> It seems that the "index.php?title=X" part leads
> HTTrack to create separate html files. Is there any
> way by using either filters, options or both, to
> force HTTrack to only do one copy of each file it
> finds, rather than multiples? Thanks in advance.

It seems that you are crawling diffs, history pages, etc. These have different
URLs and different content, so each one is saved as a separate file.

You may, however, exclude them, for example by using the following scan rules
(Options / Scan Rules):

-*action=* -*diff=*
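
To get a rough feel for which links these two rules would skip, here is a small Python sketch. It is only an approximation: it uses Python's fnmatch wildcards rather than HTTrack's own filter engine, it assumes case-sensitive matching, and the example.org URLs and the is_excluded helper are made up for illustration.

```python
from fnmatch import fnmatchcase

# Exclusion patterns from the scan rules above, without the leading "-".
EXCLUDE_PATTERNS = ["*action=*", "*diff=*"]


def is_excluded(url, patterns=EXCLUDE_PATTERNS):
    """Return True if any exclusion pattern matches the URL.

    Approximation only: fnmatch wildcards stand in for HTTrack's matcher.
    """
    return any(fnmatchcase(url, pattern) for pattern in patterns)


# Hypothetical wiki URLs, for illustration only.
urls = [
    "example.org/index.php?title=X",
    "example.org/index.php?title=X&action=history",
    "example.org/index.php?title=X&action=edit",
    "example.org/index.php?title=X&diff=123&oldid=122",
]

for url in urls:
    print("skip" if is_excluded(url) else "keep", url)
```

In this sketch only the plain "index.php?title=X" page is kept; the history, edit and diff variants are skipped, which is the intent of the rules above.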

