HTTrack Website Copier
Free software offline browser - FORUM
Subject: Possible wrong logic in filter
Author: Jing
Date: 04/25/2012 11:42
 
I am starting a new thread so that it is easier to read.

The + operation does not work as expected.

For instance, <http://ext-web-apps.library.mun.ca> (a) has only 4 links, to 3
other sites:

<http://ext-web-apps.library.mun.ca/mrc_psc/> (still in a)
<http://code.google.com/p/capline-opac/> (b)
<http://capelin.library.mun.ca/> (c)
<http://weblogs.library.mun.ca/blogs/> (d)


For the command

httrack http://ext-web-apps.library.mun.ca/ -O "/tmp/q" -Q "-*" "+*library.mun.ca/*"

It first crawls (a) and finds (b), (c) and (d). Site (b) is not under
*library.mun.ca, so it is excluded. It then puts 3 sites into new.lst, which is fine.

Over the next few passes, (c) will lead to
<http://www.library.mun.ca> (e), which also matches *library.mun.ca/*; the
problem is that it gets fetched too. Then (e) may lead to
xyz.library.mun.ca (f), (g), ..., which ends up covering almost every subdomain
matching *library.mun.ca/*. This is not expected.
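
To make the matching concrete, here is a minimal Python sketch of how the two filters apply to the sites above. It is only an approximation: it uses Python's fnmatch as a stand-in for HTTrack's wildcard syntax, assumes that the last matching rule wins, and tests addresses without the http:// prefix. The is_allowed helper, the trailing slashes and the xyz address are illustrative, not part of HTTrack.

```python
from fnmatch import fnmatchcase

# Filters from the command above, in the order given.
# Assumption: when several rules match, the last one wins.
FILTERS = ["-*", "+*library.mun.ca/*"]

def is_allowed(address, filters=FILTERS):
    """Return True if the last filter matching `address` is a '+' rule."""
    allowed = False  # simplification: nothing matched means not fetched
    for rule in filters:
        sign, pattern = rule[0], rule[1:]
        if fnmatchcase(address, pattern):
            allowed = (sign == "+")
    return allowed

# Addresses from the post, written without the scheme (an assumption about
# what the filter is compared against); trailing slashes added where missing.
ADDRESSES = {
    "a": "ext-web-apps.library.mun.ca/mrc_psc/",
    "b": "code.google.com/p/capline-opac/",
    "c": "capelin.library.mun.ca/",
    "d": "weblogs.library.mun.ca/blogs/",
    "e": "www.library.mun.ca/",
    "f": "xyz.library.mun.ca/",
}

results = {label: is_allowed(addr) for label, addr in ADDRESSES.items()}
for label, ok in results.items():
    print(label, ADDRESSES[label], "fetch" if ok else "skip")
```

Under these assumptions only (b) is skipped. The pattern alone cannot tell (a) apart from (e) or (f), since it carries no information about which page a link was found on.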

I believe the problem is that all URLs are put into new.lst, which does not
record where each link came from. It may be better to put only the URLs from
the source site (a) into new.lst, and those of other sites such as (c) and (d)
into another file (e.g. support.lst). Every URL in support.lst would be
retrieved 1 level deep and no further. This way the crawl focuses on new.lst,
i.e. site (a), which is the site we are interested in.
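
The two-list idea can be sketched roughly as follows. This is a hypothetical illustration in Python, not HTTrack's actual code: the LINKS table stands in for real pages, the xyz address is a placeholder, and new_lst / support_lst simply mirror the file names suggested above.

```python
from collections import deque
from fnmatch import fnmatchcase

# Hypothetical link graph standing in for the real pages (no scheme).
LINKS = {
    "ext-web-apps.library.mun.ca/": [
        "ext-web-apps.library.mun.ca/mrc_psc/",
        "code.google.com/p/capline-opac/",
        "capelin.library.mun.ca/",
        "weblogs.library.mun.ca/blogs/",
    ],
    "capelin.library.mun.ca/": ["www.library.mun.ca/"],
    "www.library.mun.ca/": ["xyz.library.mun.ca/"],
}

def host(address):
    """Host part of a scheme-less address."""
    return address.split("/", 1)[0]

def crawl(start, pattern="*library.mun.ca/*"):
    start_host = host(start)
    new_lst = deque([start])  # pages on the starting site: links are followed
    support_lst = []          # pages on other sites: fetched, never expanded
    fetched = []
    seen = {start}
    while new_lst:
        page = new_lst.popleft()
        fetched.append(page)
        for link in LINKS.get(page, []):
            if link in seen or not fnmatchcase(link, pattern):
                continue
            seen.add(link)
            if host(link) == start_host:
                new_lst.append(link)
            else:
                support_lst.append(link)
    # Support pages are retrieved one level deep; their own links are ignored.
    fetched.extend(support_lst)
    return fetched

fetched = crawl("ext-web-apps.library.mun.ca/")
print(fetched)
```

With this toy graph, (c) and (d) are still retrieved, but because their links are never followed, (e) and the other subdomains are not reached.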

I hope this can be fixed in the next release.
 