| I am starting a new thread so that this is easier to read.
The "+" filter does not work as expected.
For instance, <http://ext-web-apps.library.mun.ca> (a) has only 4 links, which
lead to 3 other sites:
<http://ext-web-apps.library.mun.ca/mrc_psc/> (still in a)
<http://code.google.com/p/capline-opac/> (b)
<http://capelin.library.mun.ca/> (c)
<http://weblogs.library.mun.ca/blogs/> (d)
With the command
httrack http://ext-web-apps.library.mun.ca/ -O "/tmp/q" -Q "-*" "+*library.mun.ca/*"
it first crawls (a) and finds (b), (c) and (d). Site (b) is not under
*library.mun.ca, so it is excluded, and the remaining 3 URLs are put into
new.lst, which is fine. Within the next few runs, (c) discovers
<http://www.library.mun.ca> (e), which also matches *library.mun.ca/*, and the
problem is that (e) is fetched too. Then (e) may discover
xyz.library.mun.ca (f), (g), ..., which ends up covering almost every subdomain
of *library.mun.ca/*. This is not expected.
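To illustrate why every subdomain gets through, here is a rough Python sketch of the two scan rules. It is only an approximation: it uses Python's fnmatch wildcards rather than HTTrack's own matcher, and it assumes that the last matching rule wins. The function name allowed is made up for this example.

```python
from fnmatch import fnmatchcase

# The two scan rules from the command, in order: "-" excludes, "+" includes.
RULES = ["-*", "+*library.mun.ca/*"]

def allowed(url, rules=RULES):
    """Approximate the filter decision (assumption: the last matching rule wins)."""
    verdict = False
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if fnmatchcase(url, pattern):
            verdict = (sign == "+")
    return verdict

for url in [
    "http://ext-web-apps.library.mun.ca/mrc_psc/",  # (a)
    "http://code.google.com/p/capline-opac/",       # (b)
    "http://capelin.library.mun.ca/",               # (c)
    "http://weblogs.library.mun.ca/blogs/",         # (d)
    "http://www.library.mun.ca/",                   # (e)
]:
    print(url, allowed(url))
```

Under these assumptions only (b) is rejected. The pattern has no way to tell (e) apart from (a), because the leading * matches any host prefix.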
I believe the problem is that all URLs go into new.lst, which does not
record where each link came from. It may be better to put only the URLs from
the source site (a) into new.lst, and the URLs of other sites such as (c) and
(d) into another file (e.g. support.lst). Every URL in support.lst would be
retrieved 1 level deep only and then be done. In this way the crawl focuses on
new.lst, i.e. site (a), which is the site we are interested in.
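The idea can be sketched in Python as follows. This is only a toy model of the proposal, not how HTTrack works internally: the LINKS dictionary is a made-up link graph standing in for real pages, support.lst is the hypothetical second file suggested above, and the crawl function and its suffix parameter are invented for this example.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical link graph standing in for the real pages.
LINKS = {
    "http://ext-web-apps.library.mun.ca/": [
        "http://ext-web-apps.library.mun.ca/mrc_psc/",
        "http://code.google.com/p/capline-opac/",
        "http://capelin.library.mun.ca/",
        "http://weblogs.library.mun.ca/blogs/",
    ],
    "http://capelin.library.mun.ca/": ["http://www.library.mun.ca/"],
    "http://www.library.mun.ca/": ["http://xyz.library.mun.ca/"],
}

def crawl(start, suffix="library.mun.ca"):
    """Follow links only on the start host; fetch other matching hosts once."""
    start_host = urlsplit(start).hostname
    primary = deque([start])  # plays the role of new.lst
    support = []              # plays the role of the proposed support.lst
    seen = {start}
    fetched = []
    while primary:
        url = primary.popleft()
        fetched.append(url)
        for link in LINKS.get(url, []):
            host = urlsplit(link).hostname
            if link in seen or not host.endswith(suffix):
                continue  # already queued, or rejected by the filter
            seen.add(link)
            if host == start_host:
                primary.append(link)  # same site: keep crawling from it
            else:
                support.append(link)  # other site: fetch it, but go no further
    # Support URLs are fetched once; links found on them are never followed.
    fetched.extend(support)
    return fetched

print(crawl("http://ext-web-apps.library.mun.ca/"))
```

In this model (c) and (d) are still downloaded, but (e) and everything behind it are never reached, because links found on pages from support.lst are not queued.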
I hope this can be fixed in the next release. | |