HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Unnecessary Wikimedia files copied with the site
Author: Matt
Date: 07/23/2017 21:30
 
>Please explain me what's the reason of
>this strange behaviour of your program.

You're not setting it up properly; by relying on just a start page, you're
forcing the program to ASSUME hundreds of things, and so, in your case, with
that specific site, it didn't guess exactly what you wanted. (RTFM) The default
assumptions are for very simple sites or very small subsections of simpler
sites. It just didn't work for you.

But here is a quick rundown.

First, don't include the protocol.
So your start page should just be
  homepage.divms.uiowa.edu/~jones/

It should be ONLY one; adding the second made HTTrack make a whole host of
other assumptions.

Go to Set_Options -> Scan_Rules  
Delete everything there (those are the Assumptions)

Then add the line

-*

This tells it to reject everything from everywhere, including Wikipedia.

Then add rules to allow the content you want:

+homepage.divms.uiowa.edu/~jones/*
+www.cs.uiowa.edu/~jones/*

Now it will only get files that are in those two URL "directories" or lower
(i.e. www.cs.uiowa.edu/~jones/voting/...)
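If you're on the command-line version of HTTrack rather than the GUI, the same setup can be sketched roughly like this (the output folder name here is my own choice, not anything HTTrack requires):

```shell
# A sketch of the GUI steps above for the httrack CLI:
# -O sets the output folder; the quoted "+"/"-" arguments are the
# scan rules. "-*" rejects everything first, then the "+" rules
# re-allow only the two ~jones/ directory trees we actually want.
httrack "homepage.divms.uiowa.edu/~jones/" \
  -O "./jones-mirror" \
  "-*" \
  "+homepage.divms.uiowa.edu/~jones/*" \
  "+www.cs.uiowa.edu/~jones/*"
```

Note that rule order matters: later filters override earlier ones, which is why "-*" has to come before the "+" rules.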

You may need to add more include filters (+stuff) if you find stuff missing.



Since your only issue is the multiple wiki links (commons, media, Wikipedia),
you could leave everything as is and just add a rule/filter to block any
wiki-based site:


-*wiki*


Your site downloaded for me, set up as you originally did it with only the
added filter -*wiki*, in about 2 hours.
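On the command line, that simpler alternative would look something like this (again a sketch; the output folder name is illustrative, and you'd keep whatever start pages you already had):

```shell
# Keep the default scan rules and just block any URL containing "wiki",
# which covers commons.wikimedia.org, upload.wikimedia.org, wikipedia.org, etc.
httrack "homepage.divms.uiowa.edu/~jones/" \
  -O "./jones-mirror" \
  "-*wiki*"
```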

It also grabbed
collections.museumvictoria.com.au
patentimages.storage.googleapis.com
www.chilton-computing.org.uk
www.ricomputermuseum.org
uiowa.edu

but each of those was less than 1 MB.

Hope this gets you on your way.


 