On 23/11/2007, in the message <http://forum.httrack.com/readmsg/17201/index.html>,
Donner Kebab asked for help getting a message board download working. I was
browsing, clicked on that topic by chance, and noticed that today he said
the problem is still unsolved. Since messages added under an old topic don't
seem to bump the topic and don't even show up unless you click on the topic,
I'll start a new one with my solution. I was able to successfully download the
test PHP messageboard URL he gave.
When I have mirrored messageboards with httrack, the #1 key to success has been
getting the login working. Specifically, that means giving httrack either the
login credential cookie (if the board remembers with a persistent cookie that
you are logged in) or the session cookie (if you need to log in every time).
If the login is persistent, it's easy: copy your Firefox cookies.txt to the
appropriate place where httrack picks it up. (If you don't have Firefox, get
it, if only to make things like this easier.)
If the login is per-session, you need to log in manually so the board sets the
session cookie. Unfortunately, that session cookie is never saved to
cookies.txt unless you have a Firefox addon that lets you manipulate
cookies. I use an addon named "View Cookies CS", which lets you make any
cookie persistent by adding a year to its expiration, so even a session cookie
can be saved to cookies.txt.
In the case of Donner Kebab's problem, the test messageboard set a session
cookie. I logged in, used the addon to make the session cookie stay valid for
a year, exited Firefox, and copied cookies.txt to where httrack would pick it
up and send the cookie with its requests.
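If you don't have such an addon, the same effect can in principle be had by writing the cookie row yourself. Here is a minimal Python sketch, assuming the usual Netscape-style cookies.txt layout (seven tab-separated fields per row) and assuming httrack accepts a row built this way. The function name, the domain board.example, and the cookie name/value are made up for illustration; use whatever your browser shows for the real session cookie.

```python
import time

ONE_YEAR = 365 * 24 * 60 * 60  # seconds


def cookie_row(domain, name, value, path="/", secure=False, now=None):
    """Build one Netscape-style cookies.txt row that expires a year from now.

    Field order: domain, subdomain flag, path, secure flag, expiry, name, value.
    """
    now = int(time.time()) if now is None else now
    # A leading dot on the domain conventionally means "also valid for subdomains".
    subdomains = "TRUE" if domain.startswith(".") else "FALSE"
    secure_flag = "TRUE" if secure else "FALSE"
    expiry = str(now + ONE_YEAR)
    return "\t".join([domain, subdomains, path, secure_flag, expiry, name, value])


if __name__ == "__main__":
    # Hypothetical values; copy the real ones from your logged-in browser session.
    print(cookie_row("board.example", "PHPSESSID", "abc123"))
```

Append the printed row to the cookies.txt you hand to httrack. The session still has to be alive on the server side, so log in first and don't log out before mirroring.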
The #2 problem with downloading messageboards: not excluding links which do
things you don't want.
Like logout! You MUST look for a logout link, examine its URL, and exclude the
part of the logout URL that makes it unique. Otherwise, as you spider the
site, you will log yourself out.
For messageboards, there are a lot of other links you need to exclude, because
they serve no purpose when you are mirroring a site. Examples: reply, edit,
delete, mark read, follow topic, new topic, print view, search, control
center, preferences, report. You get the idea. You need to examine these links
to see what makes each URL unique and then exclude it. Otherwise you will get
a lot of junk you don't need, bloating the download, or even worse, you will
change the messageboard just by spidering it (e.g. delete message). (If you
are going to periodically download the board, you'll want to get rid of other
stuff which causes duplicate downloads, such as next/previous, but that's
trickier to do and beyond the scope of this post.)
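To make the exclusion idea concrete, here is a small Python sketch that turns a list of URL fragments into httrack-style exclusion filters of the form -*fragment* and lets you dry-run them against sample URLs before starting a mirror. The fragments and the board.example URLs are hypothetical; every board software names these actions differently, so take the real ones from the links on the board you are mirroring. The dry-run uses Python's fnmatch, which only approximates httrack's own wildcard matching.

```python
from fnmatch import fnmatchcase

# Hypothetical URL fragments; replace with what the board's links really contain.
EXCLUDE_FRAGMENTS = ["action=logout", "mode=reply", "mode=edit", "mode=delete"]


def build_filters(fragments):
    """Return one httrack-style exclusion filter ("-*fragment*") per fragment."""
    return [f"-*{fragment}*" for fragment in fragments]


def is_excluded(url, filters):
    """Approximate check: would any exclusion filter match this URL?

    Fragments containing ?, [ or ] would be treated as wildcards by fnmatch,
    so keep them to plain text such as "action=logout".
    """
    patterns = [f[1:] for f in filters if f.startswith("-")]
    return any(fnmatchcase(url, pattern) for pattern in patterns)


if __name__ == "__main__":
    filters = build_filters(EXCLUDE_FRAGMENTS)
    print("\n".join(filters))  # paste these into the scan rules / filters box
    for url in ["http://board.example/index.php?action=logout&sid=1",
                "http://board.example/viewtopic.php?t=5"]:
        print(url, "excluded" if is_excluded(url, filters) else "kept")
```

Checking a handful of real links this way is cheap compared to finding out halfway through a crawl that you logged yourself out.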
Key debugging tip: I use WinHTTrack and always turn on the request/response
log: "Set Options->Experts Only->Activate Debugging mode (winhttrack.log)".
(Oddly, the resulting file is not winhttrack.log; it's hts-ioinfo.txt.) You
can examine every request to and response from the server to verify things
like cookies.
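The log gets big quickly, so a filter helps. This is a rough Python sketch that pulls out only the lines mentioning the Cookie: or Set-Cookie: headers. It deliberately assumes nothing about the layout of hts-ioinfo.txt beyond it being a text file that contains the HTTP headers; the function name is my own.

```python
def cookie_header_lines(lines):
    """Return (line_number, text) for every line that mentions a cookie header.

    The substring "cookie:" matches both "Cookie:" (sent by httrack) and
    "Set-Cookie:" (sent by the server), in any letter case.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if "cookie:" in line.lower():
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # Path is an assumption: the file sits in the project folder of the mirror.
    with open("hts-ioinfo.txt", encoding="utf-8", errors="replace") as log:
        for number, text in cookie_header_lines(log):
            print(f"{number}: {text}")
```

If no Cookie: line shows up in the requests, httrack is not picking up your cookies.txt, and there is no point looking at filters until that is fixed.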
I have downloaded messageboards and blogs (e.g. at blogger.com) with httrack,
and my #1 piece of advice is: examine the site carefully, including
javascript-created URLs, to figure out what links you need to exclude. If you
don't, you usually get multiple copies of the same thing, or screw up the
download completely.
Hope this helps :)