HTTrack Website Copier
Free software offline browser - FORUM
Subject: vBulletin Topic Copies
Author: Tyler
Date: 02/04/2011 15:35
 
I'm trying to rip copies of select threads on the Rift MMO beta forums. Some of
these threads are invaluable to my research, but they won't exist online
forever. My guess is Rift will nuke the beta forums and start clean with the
game launch on 3/1. As a result, I'm trying to preserve some of the more
interesting discussions so that I can analyze them later.

As a test case, I am trying to rip a copy of the following forum topic.

<http://forums.riftgame.com/showthread.php?49092-The-dumbing-down-and-over-simplification-of-todays-MMO-s....-why>

Note: I have also used <http://forums.riftgame.com/showthread.php?49092> as my
URL to scrape.

I have had the most success with the following set of scraping filters. 

-*
+http://forums.riftgame.com/showthread.php?*49092*
+*.png +*.gif +*.jpg +*.css +*.js 

When I run this rule set, I get a great scrape of the first 2 pages of the
topic. However, by page 3 the pagination breaks and it falls apart. By "break"
I mean that it looks like HTTrack copies the page, but the page is no longer
formatted properly.

Properly Formatted Page (1-2)
<http://i459.photobucket.com/albums/qq313/xythian/proper_format.png>

Not Properly Formatted Page (3+)
<http://i459.photobucket.com/albums/qq313/xythian/bad_format.png>

My guess is that, because the page navigation changes slightly with each page,
HTTrack is having difficulty following the link structure to the end.
However, this is standard for any multi-page thread, so I'm a bit lost.
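
One way to sanity-check that guess is to test candidate URLs against the filter list offline. The sketch below is only an approximation of HTTrack's scan rules, not its actual implementation: it assumes `*` matches any run of characters, everything else is literal, and the last matching rule wins. The function names, the `/page3` URL and the `css.php` URL are hypothetical examples, not taken from the forum or the log.

```python
import re


def pattern_to_regex(pattern):
    """Turn a wildcard pattern into a regex: '*' matches anything, the rest is literal."""
    parts = [re.escape(piece) for piece in pattern.split("*")]
    return re.compile("^" + ".*".join(parts) + "$")


def is_allowed(url, filters, default=False):
    """Apply +/- rules in order; the last rule that matches decides (assumed behaviour)."""
    allowed = default
    for rule in filters:
        sign, pattern = rule[0], rule[1:]
        if pattern_to_regex(pattern).match(url):
            allowed = (sign == "+")
    return allowed


# The filter set from this post, one rule per entry.
FILTERS = [
    "-*",
    "+http://forums.riftgame.com/showthread.php?*49092*",
    "+*.png", "+*.gif", "+*.jpg", "+*.css", "+*.js",
]

CANDIDATES = [
    "http://forums.riftgame.com/showthread.php?49092",
    # Hypothetical later-page URL; the real pagination scheme may differ.
    "http://forums.riftgame.com/showthread.php?49092-Some-Title/page3",
    "http://forums.riftgame.com/images/misc/quote-left.png",
    # Hypothetical script-generated stylesheet URL with no .css extension.
    "http://forums.riftgame.com/css.php?styleid=1",
]

if __name__ == "__main__":
    for url in CANDIDATES:
        verdict = "accepted" if is_allowed(url, FILTERS) else "rejected"
        print(verdict, url)
```

Under these assumptions the first three candidates are accepted and the last is rejected, so a resource served from a URL without one of the listed extensions would be skipped. Whether the later pages actually reference such URLs is something to confirm in their HTML source.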

Any ideas on how to compensate for this issue? Or might it be something else
entirely?

Here's the HTS log of the most successful scrape yet.

HTTrack3.43-12+htsswf+htsjava launched on Fri, 04 Feb 2011 01:25:44 at
<http://forums.riftgame.com/showthread.php?49092> -*
+http://forums.riftgame.com/showthread.php?*49092* +*.png +*.gif +*.jpg +*.css
+*.js

(winhttrack -qwC2%Ps2u1%s%uN0%I0p3DaK0H0%kf2A25000%f#f -F "Mozilla/4.5
(compatible; HTTrack 3.0x; Windows 98)" -%F "<!-- Mirrored from %s%s by
HTTrack Website Copier/3.x [XR&CO'2010], %s -->" -%l "en, en, *"
<http://forums.riftgame.com/showthread.php?49092> -O1 "C:\My Web Sites\Rift
Forums Single Topic Test 4" -*
+http://forums.riftgame.com/showthread.php?*49092* +*.png +*.gif +*.jpg +*.css
+*.js )

I also get a number of image/button related errors.

E.g.

01:26:19	Info: 	engine: warning: entry cleaned up, but no trace on heap:
forums.riftgame.com/images/_custom/buttons/collapse_40b.png (C:/My Web
Sites/Rift Forums Single Topic Test
4/forums.riftgame.com/images/_custom/buttons/collapse_40b.png)

01:26:19	Info: 	engine: warning: entry cleaned up, but no trace on heap:
forums.riftgame.com/images/misc/quote-left.png (C:/My Web Sites/Rift Forums
Single Topic Test 4/forums.riftgame.com/images/misc/quote-left.png)

01:26:20	Info: 	engine: warning: entry cleaned up, but no trace on heap:
forums.riftgame.com/images/editor/separator.gif (C:/My Web Sites/Rift Forums
Single Topic Test 4/forums.riftgame.com/images/editor/separator.gif)

01:26:21	Info: 	engine: warning: entry cleaned up, but no trace on heap:
forums.riftgame.com/images/_custom/buttons/thread-bg.gif (C:/My Web Sites/Rift
Forums Single Topic Test
4/forums.riftgame.com/images/_custom/buttons/thread-bg.gif)
 