HTTrack Website Copier
Free software offline browser - FORUM
Subject: HTTrack seems to be spidering whole site
Author: Chris
Date: 11/27/2013 11:24
 
I am trying to save posts from a particular subforum on a vBulletin-based
message board. I gave HTTrack a specific start URL (the root of the subforum I
want to download) and added a scan rule limiting it to URLs containing that
subforum's string, yet it seems to spider across the entire site (potentially
hundreds of thousands of pages if I let it run). I don't want that. I ONLY
want it to scan and download URLs matching the string.
Here is a semi-obfuscated example of the structure:

-Subforum thread listing: <http://www.forum.com/forum/saveme-282/> [the forum's
numerical ID is suffixed to its name]
-Subforum thread listing page 2: <http://www.forum.com/forum/saveme-2-282/>
-Thread in subforum: <http://www.forum.com/forum/saveme/nicethread.html> [the
numerical forum ID is omitted]

So I want it to start at the thread listing (/saveme-282/) and proceed to
drill down both into the threads listed on that page (/saveme/nicethread.html)
and subsequent pages (/saveme-2-282/, /saveme-3-282/, etc.) and the threads
listed on those pages (also in the format of /saveme/nicethread2.html and
/saveme/nicethread378.html).

If HTTrack followed only links to pages whose URL contains
*www.forum.com/forum/saveme*, this would have exactly the desired effect, as
every page I want to download can be reached by following only links that
match that string.
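To make the intended behaviour concrete, here is a minimal Python sketch of the match described above. It uses Python's fnmatch module as a stand-in for HTTrack's wildcard matching, which is an assumption (HTTrack's own matcher may differ in details), and the helper name wanted() is made up for illustration.

```python
from fnmatch import fnmatchcase

# Pattern taken from the scan rule in this post.
PATTERN = "*www.forum.com/forum/saveme*"


def wanted(url: str) -> bool:
    """Hypothetical helper: True if the URL matches the subforum pattern.

    fnmatchcase treats '*' as "any run of characters", including '/'.
    This only approximates HTTrack's wildcard rules.
    """
    return fnmatchcase(url, PATTERN)


if __name__ == "__main__":
    examples = [
        "http://www.forum.com/forum/saveme-282/",
        "http://www.forum.com/forum/saveme-2-282/",
        "http://www.forum.com/forum/saveme/nicethread.html",
        "http://www.forum.com/forum/dontsaveme-389",
        "http://www.forum.com/forum/member/unimportantperson.html",
    ]
    for url in examples:
        print("follow" if wanted(url) else "skip  ", url)
```

Under this approximation the three saveme URLs are followed and the other two are skipped, which is the behaviour being asked for.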

But instead, I see HTTrack starting to grab
www.forum.com/forum/dontsaveme-389,
www.forum.com/forum/dontsaveme/unrelatedthread.html,
www.forum.com/forum/member/unimportantperson.html, and other links completely
outside of the specified structure.

Here is the link I'm starting the project with:

<http://www.forum.com/forum/saveme-282/>

And here are the contents of my Scan Rules filter:

+*.css +*.js -ad.doubleclick.net/* -mime:application/foobar
+*www.forum.com/forum/saveme*
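One possible reading of these rules can be sketched in Python. The sketch rests on three assumptions that are not confirmed anywhere in this post: rules are applied in order with the last matching rule winning, a URL matched by no rule falls back to a default of "follow anything on the start host", and fnmatch is close enough to HTTrack's wildcards. The function decide() is hypothetical, and the mime: rule is skipped because it needs a response header.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# The scan rules from this post, split on whitespace.
RULES = (
    "+*.css +*.js -ad.doubleclick.net/* -mime:application/foobar "
    "+*www.forum.com/forum/saveme*"
).split()


def decide(url, rules, start_host="www.forum.com"):
    """Hypothetical model of rule evaluation; not HTTrack's real code."""
    # Assumed default: same-host links are followed unless a rule overrides.
    verdict = urlsplit(url).hostname == start_host
    bare = url.split("://", 1)[-1]  # URL without the scheme
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if pattern.startswith("mime:"):
            continue  # MIME rules cannot be judged from the URL alone
        if fnmatchcase(url, pattern) or fnmatchcase(bare, pattern):
            verdict = sign == "+"  # assumed: the last matching rule wins
    return verdict


if __name__ == "__main__":
    stray = "http://www.forum.com/forum/dontsaveme-389"
    print(decide(stray, RULES))           # no rule matches, default applies
    print(decide(stray, ["-*"] + RULES))  # a leading -* excludes it first
```

If those assumptions hold, the unwanted URL is never matched by any rule, so the default lets it through, and a leading `-*` in front of the existing rules would exclude everything that a later `+` rule does not re-admit.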

How can I configure HTTrack so it only follows the links I want and not every
link on the site?
Thoughts?
 