HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Stop httrack from downloading the whole internet
Author: Gabriele
Date: 01/10/2020 23:57
 
In case anyone ever sees this: mirroring wiki pages requires a *lot* of
filtering to remove all the special pages. For example, I used these rules:
-*&action=*
-*?action=*
-*?title=Special:*
-*&title=Special:*
-*&diff=*
-*?diff=*
-*&oldid=*
-*?oldid=*
-*&limit=*
-*?limit=*
-*&printable=yes*
-*/Special:*
-*/User_talk:*

for a 2014 mirror of wiki.gentoo.org (I probably have something more recent,
but I'm not going to search).
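To see what rules like these actually catch, here is a rough sanity check that treats the scan rules as plain shell globs. This is a simplification (HTTrack's real matcher is more elaborate), and the URLs are just hypothetical examples:

```shell
#!/bin/sh
# Rough check of the scan rules above, treating each one as a plain shell
# glob (a simplification of HTTrack's actual matcher).
set -f   # disable pathname expansion so the rule globs stay literal

RULES='*&action=* *?action=* *?title=Special:* *&title=Special:*
*&diff=* *?diff=* *&oldid=* *?oldid=* *&limit=* *?limit=*
*&printable=yes* */Special:* */User_talk:*'

# blocked URL: succeeds if any rule glob matches the URL.
blocked() {
  url=$1
  for pat in $RULES; do
    case "$url" in
      $pat) return 0 ;;   # unquoted on purpose: $pat is a glob pattern
    esac
  done
  return 1
}

blocked "https://wiki.gentoo.org/index.php?title=Special:RecentChanges" \
  && echo "filtered" || echo "kept"
blocked "https://wiki.gentoo.org/wiki/Handbook:AMD64" \
  && echo "filtered" || echo "kept"
```

Run it with sh: the first URL is caught (the `*?title=Special:*` rule), the second one passes.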

But above all, on Wikipedia especially, there are (or there used to be)
horrible traps: URLs that look like images (ending in .png, for example) are
actually HTML pages full of links that lead into the whole encyclopedia, and
there were similar problems with the style sheets (.css files), if I'm not
mistaken.

So you ABSOLUTELY have to remove the usual +* filters (e.g. +*.png +*.gif
+*.jpg +*.css +*.js).
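Putting the two points together, a full command line might look like the sketch below. The target URL and output directory are placeholders, and the command is echoed rather than executed so the sketch is safe to run:

```shell
#!/bin/sh
# Sketch of an invocation: the special-page rules from the post, and
# deliberately NO blanket +*.png/+*.gif/... allow rules, so disguised
# "image" URLs cannot drag in the whole site. Echoed, not executed.
set -f   # keep the filter globs from expanding against local files

FILTERS='-*&action=* -*?action=* -*?title=Special:* -*&title=Special:*
-*&diff=* -*?diff=* -*&oldid=* -*?oldid=* -*&limit=* -*?limit=*
-*&printable=yes* -*/Special:* -*/User_talk:*'

echo httrack "https://wiki.gentoo.org/" -O ./gentoo-wiki $FILTERS
```

Drop the leading `echo` to actually start the mirror; -O sets the local output path.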

Even after that, in most cases it takes (or took, this project is dead) a lot
of tweaks and iterations, while staying constantly on the lookout in case
httrack has started downloading all the knowledge of the universe.
 