I'm trying to rip copies of select threads on the Rift MMO beta forums. Some of
these threads are invaluable to my research, but they won't exist online
forever. My guess is Rift will nuke the beta forums and start clean with the
game launch on 3/1. As a result, I'm trying to preserve some of the more
interesting discussions so that I can analyze them later.
As a test case, I am trying to rip a copy of the following forum topic.
<http://forums.riftgame.com/showthread.php?49092-The-dumbing-down-and-over-simplification-of-todays-MMO-s....-why>
Note: I have also used <http://forums.riftgame.com/showthread.php?49092> as my
URL to scrape.
I have had the most success with the following set of scraping filters.
-*
+http://forums.riftgame.com/showthread.php?*49092*
+*.png +*.gif +*.jpg +*.css +*.js
When I run this rule set, I get a great scrape of the first 2 pages of the
topic. However, by page 3 the pagination breaks and it falls apart. By "break"
I mean that it looks like HTTrack copies the page, but the page is no longer
formatted properly.
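As a sanity check on the filters themselves, here is a minimal Python sketch that approximates how the rule set above would classify a URL. It is a simplification, not HTTrack's actual matcher: it assumes `*` is the only wildcard, that rules are read in order, and that the last matching rule wins. The `/page3` URL in the usage note is a hypothetical example of a later-page link, not one copied from the forum.

```python
import re

def rule_to_regex(pattern):
    # Escape the whole pattern, then turn each escaped '*' back into "match anything".
    return re.compile(re.escape(pattern).replace(r"\*", ".*"))

def parse_rules(text):
    # Each whitespace-separated token is '+pattern' (include) or '-pattern' (exclude).
    rules = []
    for token in text.split():
        sign, pattern = token[0], token[1:]
        rules.append((sign == "+", rule_to_regex(pattern)))
    return rules

def is_allowed(url, rules, default=True):
    # Assumption: rules are scanned first to last and a later match overrides an earlier one.
    allowed = default
    for include, regex in rules:
        if regex.fullmatch(url):
            allowed = include
    return allowed

RULES = parse_rules("""
-*
+http://forums.riftgame.com/showthread.php?*49092*
+*.png +*.gif +*.jpg +*.css +*.js
""")
```

For example, `is_allowed("http://forums.riftgame.com/showthread.php?49092-Some-Title/page3", RULES)` evaluates to `True` under these assumptions, while a URL for a different thread ID evaluates to `False`. If later-page links really do have that shape, the filters would admit them, which would point toward something other than the rule set.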
Properly Formatted Page (1-2)
<http://i459.photobucket.com/albums/qq313/xythian/proper_format.png>
Not Properly Formatted Page (3+)
<http://i459.photobucket.com/albums/qq313/xythian/bad_format.png>
My guess is that, because the page navigation changes slightly with each page,
HTTrack is having difficulty following the link structure to the end.
However, this is standard for any multi-page thread, so I'm a bit lost.
Any ideas on how to compensate for this issue? Or might it be something else
entirely?
Here's the HTS-log of the most successful scrape yet.
HTTrack3.43-12+htsswf+htsjava launched on Fri, 04 Feb 2011 01:25:44 at
<http://forums.riftgame.com/showthread.php?49092> -*
+http://forums.riftgame.com/showthread.php?*49092* +*.png +*.gif +*.jpg +*.css
+*.js
(winhttrack -qwC2%Ps2u1%s%uN0%I0p3DaK0H0%kf2A25000%f#f -F "Mozilla/4.5
(compatible; HTTrack 3.0x; Windows 98)" -%F "<!-- Mirrored from %s%s by
HTTrack Website Copier/3.x [XR&CO'2010], %s -->" -%l "en, en, *"
<http://forums.riftgame.com/showthread.php?49092> -O1 "C:\My Web Sites\Rift
Forums Single Topic Test 4" -*
+http://forums.riftgame.com/showthread.php?*49092* +*.png +*.gif +*.jpg +*.css
+*.js )
I also get a number of image/button related errors.
E.g.
01:26:19 Info: engine: warning: entry cleaned up, but no trace on heap:
forums.riftgame.com/images/_custom/buttons/collapse_40b.png (C:/My Web
Sites/Rift Forums Single Topic Test
4/forums.riftgame.com/images/_custom/buttons/collapse_40b.png)
01:26:19 Info: engine: warning: entry cleaned up, but no trace on heap:
forums.riftgame.com/images/misc/quote-left.png (C:/My Web Sites/Rift Forums
Single Topic Test 4/forums.riftgame.com/images/misc/quote-left.png)
01:26:20 Info: engine: warning: entry cleaned up, but no trace on heap:
forums.riftgame.com/images/editor/separator.gif (C:/My Web Sites/Rift Forums
Single Topic Test 4/forums.riftgame.com/images/editor/separator.gif)
01:26:21 Info: engine: warning: entry cleaned up, but no trace on heap:
forums.riftgame.com/images/_custom/buttons/thread-bg.gif (C:/My Web Sites/Rift
Forums Single Topic Test
4/forums.riftgame.com/images/_custom/buttons/thread-bg.gif)
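To see which assets these warnings cover, a small Python sketch like the following could pull the distinct resource paths out of the log text. It assumes each warning sits on a single line in the actual log file (the entries above are wrapped for display) and that the resource path contains no spaces; `cleaned_up_entries` is a made-up helper name.

```python
import re

# Captures the first whitespace-free token after the warning text (the resource path).
WARNING = re.compile(r"entry cleaned up, but no trace on heap:\s+(\S+)")

def cleaned_up_entries(log_text):
    """Return the distinct resource paths named in the warnings, in first-seen order."""
    seen = []
    for line in log_text.splitlines():
        match = WARNING.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen

SAMPLE = (
    "01:26:19 Info: engine: warning: entry cleaned up, but no trace on heap: "
    "forums.riftgame.com/images/misc/quote-left.png (C:/My Web Sites/x/quote-left.png)\n"
    "01:26:20 Info: engine: warning: entry cleaned up, but no trace on heap: "
    "forums.riftgame.com/images/editor/separator.gif (C:/My Web Sites/x/separator.gif)\n"
    "01:26:21 Info: engine: warning: entry cleaned up, but no trace on heap: "
    "forums.riftgame.com/images/misc/quote-left.png (C:/My Web Sites/x/quote-left.png)\n"
)
```

Calling `cleaned_up_entries(SAMPLE)` returns the two distinct paths in the sample; the local paths in `SAMPLE` are shortened placeholders, not the real directory.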