>Please explain to me the reason for
>this strange behaviour of your program.
You're not setting it up properly; just relying on a start page forces the
program to ASSUME hundreds of things, and with that specific site it didn't
guess exactly what you wanted. (RTFM) The default assumptions are for very
simple sites or very small subsections of simpler sites. They just didn't work
for you.
But here is a quick rundown.
First, don't include the protocol, so your start page should just be
homepage.divms.uiowa.edu/~jones/
It should be the ONLY one; adding the second made HTTrack make a whole host of
other assumptions.
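(That is, the start page box holds just that one bare address: no http:// in
front, and no second entry for the www.cs.uiowa.edu one.)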
Go to Set Options -> Scan Rules
Delete everything there (those are the assumptions)
First add the line
-*
This tells it to reject everything from everywhere, including Wikipedia.
Then add rules to allow the content you want
+homepage.divms.uiowa.edu/~jones/*
+www.cs.uiowa.edu/~jones/*
Now it will only get files that are in those two URL "directories" or lower
(i.e. www.cs.uiowa.edu/~jones/voting/...)
You may need to add more include filters (more + lines) if you find things
missing.
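If you're using the command-line httrack instead of the Windows GUI, the same
setup looks roughly like this (the output folder name ./jones-mirror is just an
example, and the quotes keep the shell from touching the filters):

  httrack homepage.divms.uiowa.edu/~jones/ -O ./jones-mirror "-*" "+homepage.divms.uiowa.edu/~jones/*" "+www.cs.uiowa.edu/~jones/*"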
Since your only issue is the multiple wiki links (Commons, Wikimedia,
Wikipedia), you could leave it all as is and just add one rule/filter to block
any wiki-based site:
-*wiki*
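That one pattern rejects any URL with "wiki" anywhere in it, so links like
these all get blocked (just examples of the kind of URL it covers):
  en.wikipedia.org/wiki/...
  commons.wikimedia.org/wiki/...
  upload.wikimedia.org/...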
Your site downloaded for me, set up as you originally did it with only the
added -*wiki* filter, in about 2 hours.
It also grabbed
collections.museumvictoria.com.au
patentimages.storage.googleapis.com
www.chilton-computing.org.uk
www.ricomputermuseum.org
uiowa.edu
but each of those was less than 1 MB
Hope this gets you on your way