HTTrack Website Copier
Free software offline browser - FORUM
Subject: Problems crawling mediawiki
Author: bob
Date: 02/01/2008 00:28

First, congratulations on this nice tool; I haven't seen another one that handles
CGI-generated content this well.

I am trying to crawl a MediaWiki site (using the current Windows version, 3.42). The
problem is that user policies are in place which require a login via a POST form
before the pages in question can be viewed.

The ways I tried to do this were:

1) Log in with Perl and save the cookie in a Netscape-format cookie file. Run the
download (which mirrors only the "You need to log in to see this" pages). Copy the
Perl-generated cookies.txt over the one in the project directory and crawl again. No
success. By the way, the login itself works correctly, since I can browse the pages with Perl.
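In case it helps anyone reproducing approach 1), here is a minimal sketch (in Python rather than Perl) of writing a Netscape-format cookies.txt like the one HTTrack reads from the project directory. The cookie name "wiki_session", the value, and the domain "example.org" are placeholders, not the real MediaWiki values:

```python
# Sketch: write a Netscape-format cookies.txt of the kind HTTrack
# expects in the project directory. The cookie name "wiki_session"
# and the domain "example.org" are placeholders.
from http.cookiejar import Cookie, MozillaCookieJar

jar = MozillaCookieJar("cookies.txt")
session_cookie = Cookie(
    version=0, name="wiki_session", value="abc123",
    port=None, port_specified=False,
    domain=".example.org", domain_specified=True, domain_initial_dot=True,
    path="/", path_specified=True,
    secure=False, expires=2000000000, discard=False,
    comment=None, comment_url=None, rest={},
)
jar.set_cookie(session_cookie)
# ignore_discard/ignore_expires keep session cookies in the saved file,
# which matters here because MediaWiki login cookies are session cookies
jar.save(ignore_discard=True, ignore_expires=True)
```

The resulting file starts with the "# Netscape HTTP Cookie File" header followed by one tab-separated line per cookie, which is the format HTTrack parses.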

2) Using --catch-url. This seems to work (as long as there is still a non-HTTPS
version), but it only captures the "successful log in" page. If I try to mirror
from there, I only get the "You need to log in to see this" pages again. I added
-*logout* so the logout link is not crawled. Still no success.
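With either approach, a quick way to check whether the session cookie actually took effect is to scan the mirrored files for the login-required message. A small sketch along those lines (the marker string and the mirror directory name are assumptions; adjust them to the wiki's actual wording and your project path):

```python
# Sketch: scan a mirror directory for pages that still show the
# "You need to log in" message, i.e. pages fetched without a valid
# session. The marker text and directory name are placeholders.
import pathlib

MARKER = "You need to log in"             # assumed wording of the wiki's message
MIRROR_DIR = pathlib.Path("wiki-mirror")  # assumed HTTrack project directory

def login_required_pages(root: pathlib.Path) -> list[pathlib.Path]:
    """Return mirrored HTML files that still contain the login marker."""
    hits = []
    for page in root.rglob("*.html"):
        text = page.read_text(encoding="utf-8", errors="replace")
        if MARKER in text:
            hits.append(page)
    return hits

if __name__ == "__main__" and MIRROR_DIR.is_dir():
    for page in login_required_pages(MIRROR_DIR):
        print("still behind login:", page)
```

If this reports every mirrored page, the cookie was never sent; if only a few pages show up, the filters (e.g. -*logout*) or per-page permissions are the more likely culprit.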

What is the right way to tackle this problem? Any suggestions?
PS: Yes, I do ignore robots.txt (with permission).
