> If a site is mirrored over multiple hosts (e.g.
> www.foo.com and www.bah.com), what would the difference
> be between supplying the two URLs at the beginning (e.g.
> httrack http://www.foo.com http://www.bah.com ...
> [plus options, filters, etc.]) and supplying one URL at
> the beginning and the other as an accepting host filter?
No difference, IF there is at least one link on the
first website which refers to the second.
In fact, when you enter a URL, the engine does the
following:
- adds the URL to the list of URLs to mirror
(stack), exactly as if it had discovered a link on an
HTML page
- adds a default filter +<URL>* before all other filters
(BEFORE is important: if you specified '-*' as a filter
to forbid everything, this will not change anything,
because in '+<URL>* -*' the last filter takes
priority)
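To make the filter ordering concrete, here is a small Python model of the two steps above. It is only a sketch and not the engine's real code: the function names `build_filters` and `is_accepted` are made up for illustration, Python's `fnmatch` wildcards stand in for the real filter syntax, and rejecting a URL that matches no filter is an assumption.

```python
from fnmatch import fnmatchcase


def build_filters(start_urls, user_filters):
    """Put one default '+<URL>*' filter per start URL BEFORE the user's filters."""
    defaults = ["+" + url + "*" for url in start_urls]
    return defaults + list(user_filters)


def is_accepted(url, filters, default=False):
    """Walk the filters in order; the LAST one that matches decides.

    `default` is the verdict when nothing matches (an assumption of this model).
    """
    verdict = default
    for rule in filters:
        sign, pattern = rule[0], rule[1:]
        if fnmatchcase(url, pattern):
            verdict = (sign == "+")
    return verdict


# '-*' comes after the default filter, so it still forbids everything.
forbid_all = build_filters(["www.foo.com/"], ["-*"])

# One start URL plus the second host as an accepting filter.
two_hosts = build_filters(["www.foo.com/"], ["+www.bah.com/*"])
```

In this model `is_accepted("www.foo.com/index.html", forbid_all)` is false, while `two_hosts` accepts pages on both www.foo.com and www.bah.com. The second host is still only reached if some page on the first one links to it.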
> The reason I ask is that we have just made a website
> unhappy by hitting them a fair bit due to an infinite
> loop we got stuck in (hitting the same page a lot),
> which had a multiple host setup as described above.
Argh.. this may be due to a CGI script or something similar. In
this case, adding a filter like '-www.foo.com/*/cgi-bin/*'
is a good idea. Alternatively, set a depth limit which will
block loops (example: depth=20, large enough for most
sites and small enough to limit loops).
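As a rough illustration of why both remedies stop such a loop, here is a toy Python crawler. Everything in it is hypothetical: `crawl`, `toy_site` and the looping cgi-bin URL are invented for the example, and counting the start page as depth 1 is an assumption rather than a statement about how the engine counts.

```python
from collections import deque


def crawl(start, get_links, max_depth, excluded=lambda url: False):
    """Breadth-first crawl that never follows links from pages at max_depth."""
    seen = {start}
    queue = deque([(start, 1)])  # the start page counts as depth 1 (assumption)
    fetched = []
    while queue:
        url, depth = queue.popleft()
        fetched.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: do not follow this page's links
        for link in get_links(url):
            if link not in seen and not excluded(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return fetched


def toy_site(url):
    """Hypothetical site: the home page links to a script that links to itself forever."""
    if url == "www.foo.com/":
        return ["www.foo.com/cgi-bin/loop?p=1"]
    prefix, n = url.rsplit("=", 1)
    return [f"{prefix}={int(n) + 1}"]  # always a new URL, so 'seen' alone never stops it


# Remedy 1: a depth limit caps the damage at 20 pages.
limited = crawl("www.foo.com/", toy_site, max_depth=20)

# Remedy 2: excluding the script means it is never requested at all.
filtered = crawl("www.foo.com/", toy_site, max_depth=20,
                 excluded=lambda url: "/cgi-bin/" in url)
```

Here `limited` contains 20 URLs and `filtered` contains only the home page. Without either remedy the toy crawl would never end, because every page of the script produces a URL that has not been seen before.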