Problems crawling mediawiki - HTTrack Website Copier Forum

Subject: Problems crawling mediawiki

Author: bob

Date: 02/01/2008 00:28

Hi,

First, congratulations on the nice tool; I haven't seen another one that
handles CGI-generated content this well.

I am trying to crawl a MediaWiki site (using the current Windows version,
3.42). The problem is that user policies are in place that require a login via
a POST form before the pages in question can be viewed.

The ways I tried to do this were:

1) Log in with Perl and save the cookie in a Netscape cookie file. Run the
download (which only mirrors the "You need to log in to see this" pages). Copy
the Perl-generated cookie.txt over the one in the project directory and crawl
again. No success. By the way, the login itself works correctly, since I can
browse the pages with Perl.

2) Using --catch-url. This seems to work (as long as a non-HTTPS version is
still available) but only gets the "successful log in" page. If I try to mirror
from there, I only get the "You need to log in to see this" pages again. I
added -*logout* so that the logout link is not crawled. Still no success.
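
For reference, the cookie-file step in approach 1 can be sketched in Python rather than Perl. This is only an illustration of writing a Netscape-format cookie file with the standard library. The host name, cookie name, cookie value, and the output file name are all placeholders made up for the example, and it makes no claim about whether HTTrack will accept the resulting file.

```python
import http.cookiejar
import os
import tempfile
import time


def make_cookie(domain, name, value, path="/", lifetime_s=3600):
    """Build a cookie by hand. All values passed in here are placeholders."""
    return http.cookiejar.Cookie(
        version=0,
        name=name,
        value=value,
        port=None,
        port_specified=False,
        domain=domain,
        domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path=path,
        path_specified=True,
        secure=False,
        expires=int(time.time()) + lifetime_s,  # absolute expiry, in epoch seconds
        discard=False,
        comment=None,
        comment_url=None,
        rest={},
    )


def write_netscape_cookies(filename, cookies):
    """Write the given cookies to `filename` in Netscape cookie-file format."""
    jar = http.cookiejar.MozillaCookieJar(filename)
    for cookie in cookies:
        jar.set_cookie(cookie)
    # Keep session cookies and already-expired cookies in the file as well.
    jar.save(ignore_discard=True, ignore_expires=True)
    return jar


if __name__ == "__main__":
    # Hypothetical session cookie; a real wiki uses its own cookie names.
    session = make_cookie("wiki.example.org", "wiki_session", "abc123")
    # Assumed file name; use whatever file the project directory actually contains.
    target = os.path.join(tempfile.mkdtemp(), "cookies.txt")
    write_netscape_cookies(target, [session])
    with open(target) as fh:
        print(fh.read())
```

Each cookie line in the file is tab-separated: domain, subdomain flag, path, secure flag, expiry, name, value. Comparing a file written this way with the one the crawler itself creates in the project directory may show whether the domain or path fields differ.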

What is the right way to tackle this problem? Any suggestions?
-bob
PS: Yes, I do ignore the robots.txt (with permission)
