HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Stop httrack from downloading the whole internet
Author: Gabriele
Date: 01/10/2020 23:57
 
In case anyone ever sees this: mirroring wiki pages requires a *lot* of
filtering to remove all the special pages. For example, I used these rules:
-*&action=*
-*?action=*
-*?title=Special:*
-*&title=Special:*
-*&diff=*
-*?diff=*
-*&oldid=*
-*?oldid=*
-*&limit=*
-*?limit=*
-*&printable=yes*
-*/Special:*
-*/User_talk:*

for a 2014 mirror of wiki.gentoo.org (I probably have something more recent,
but I'm not going to search).
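To see what rules like these actually catch, here is a rough sanity check that treats the scan rules as plain shell globs. This is a simplification (HTTrack's real matcher is more elaborate), and the URLs are just hypothetical examples:

```shell
#!/bin/sh
# Rough check of the scan rules above, treating each one as a plain shell
# glob (a simplification of HTTrack's actual matcher).
set -f   # disable pathname expansion so the rule globs stay literal

RULES='*&action=* *?action=* *?title=Special:* *&title=Special:*
*&diff=* *?diff=* *&oldid=* *?oldid=* *&limit=* *?limit=*
*&printable=yes* */Special:* */User_talk:*'

# blocked URL: succeeds if any rule glob matches the URL.
blocked() {
  url=$1
  for pat in $RULES; do
    case "$url" in
      $pat) return 0 ;;   # unquoted on purpose: $pat is a glob pattern
    esac
  done
  return 1
}

blocked "https://wiki.gentoo.org/index.php?title=Special:RecentChanges" \
  && echo "filtered" || echo "kept"
blocked "https://wiki.gentoo.org/wiki/Handbook:AMD64" \
  && echo "filtered" || echo "kept"
```

Run it with sh: the first URL is caught (the `*?title=Special:*` rule), the second one passes.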

But above all, on Wikipedia especially, there are (or there used to be)
horrible traps: URLs that look like images (ending in .png, for example) are
actually HTML pages full of links that lead into the whole encyclopedia, and
there were similar problems with the style sheets (.css files), if I'm not
mistaken.

So you ABSOLUTELY have to remove the usual +* filters (e.g. +*.png +*.gif
+*.jpg +*.css +*.js).
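Putting the two points together, a full command line might look like the sketch below. The target URL and output directory are placeholders, and the command is echoed rather than executed so the sketch is safe to run:

```shell
#!/bin/sh
# Sketch of an invocation: the special-page rules from the post, and
# deliberately NO blanket +*.png/+*.gif/... allow rules, so disguised
# "image" URLs cannot drag in the whole site. Echoed, not executed.
set -f   # keep the filter globs from expanding against local files

FILTERS='-*&action=* -*?action=* -*?title=Special:* -*&title=Special:*
-*&diff=* -*?diff=* -*&oldid=* -*?oldid=* -*&limit=* -*?limit=*
-*&printable=yes* -*/Special:* -*/User_talk:*'

echo httrack "https://wiki.gentoo.org/" -O ./gentoo-wiki $FILTERS
```

Drop the leading `echo` to actually start the mirror; -O sets the local output path.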

Even after that, in most cases it takes (or took, this project is dead) a lot
of tweaks and iterations, while staying constantly on the lookout in case
httrack has started downloading all the knowledge of the universe.
 