HTTrack Website Copier
Free software offline browser - FORUM
Subject: Big difficulties on a big site!
Author: Having Problems!
Date: 01/25/2012 17:06
 
I'm trying to archive a giant website (20 million pages) and running into massive
difficulties.
 
All I'm after is the .html.
 
When attempting this with HTTrack, I split the site's URL list, which I'd
generated with a script, into 200 .txt files of 100,000 URLs each, and I've been
feeding those files into HTTrack.
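
For reference, the split step is roughly the sketch below (the file names are
just placeholders for what my actual script uses):

# Carve the master URL list into chunks of 100,000 URLs each
# (about 200 files for the full 20 million). File names are placeholders.
CHUNK_SIZE = 100000

with open("all_urls.txt") as src:
    urls = [line.strip() for line in src if line.strip()]

for i in range(0, len(urls), CHUNK_SIZE):
    with open("urls_%03d.txt" % (i // CHUNK_SIZE), "w") as out:
        out.write("\n".join(urls[i:i + CHUNK_SIZE]) + "\n")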
 
The main problem is speed: HTTrack is just too slow running a single connection,
and on large lists it refuses to open more than one even if you specify a high
connection count and use the unrestricted options.

This method also only lets you specify one proxy, so I've tried running multiple
instances of the program; the trouble is that this eats a lot of memory and CPU
and ultimately crashes the computer (even on a new Alienware I've just bought!).
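
To be concrete, by "multiple instances" I mean something along these lines. The
proxy addresses, list names, and output directories are made up, and the HTTrack
flags are just how I understand them (%L to read start URLs from a file, -P for
the proxy, -p1 for HTML only, -O for the output directory):

# Sketch: launch one HTTrack process per URL list, each through its own proxy.
# Proxies and paths below are placeholders.
import subprocess

proxies = ["10.0.0.1:8080", "10.0.0.2:8080"]   # placeholder proxies
lists   = ["urls_000.txt", "urls_001.txt"]     # chunks from the split above

procs = []
for urllist, proxy in zip(lists, proxies):
    cmd = [
        "httrack",
        "-%L", urllist,    # read start URLs from this file
        "-P", proxy,       # route this instance through its own proxy
        "-p1",             # save HTML files only
        "-O", "mirror_" + urllist.replace(".txt", ""),  # separate output dir
    ]
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()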

I've also attempted wget, without success, as it's just way too slow.

FlashGet, which is slightly faster and can use multiple proxies for multiple
connections (a nice touch), worked okay; the only problem is that any batch
download of more than 5k files crashes the program.

I was also considering trying GetRight, together with the following program to
generate the links (hopefully avoiding the crashes):
<http://www.bluechillies.com/download/5180.html>. Unfortunately it's 10 years
old and has disappeared from the internet completely.

This has become a real annoyance and I'm not sure where to go with it :(
 