HTTrack Website Copier
Free software offline browser - FORUM
Subject: Resolving all links on secondary internal pages
Author: Optical
Date: 04/27/2018 08:54
 
Hello!
First of all, I apologize if this has been asked before; I searched thoroughly
but could not find terms that led me to an answer.

To the matter at hand: because of an upcoming closure, I am trying to store
offline a classic vBulletin-type forum that is not JavaScript-heavy. It holds a
staggering 1 million threads or so, roughly 15 million posts.

I'm using the HTTrack UI, with security limits disabled, and the following
settings for the copy:
Download all sites -> no proxy, scan rules: media files and +site_link/*, no
limits, 20 connections / 3 retries, attempt to detect links, get non-HTML
files, default build, default spider, no MIME, default browser ID, default
log, default expert settings.
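
For anyone who wants to reproduce this from the command line, here is a rough sketch of how the settings above might map onto HTTrack CLI options. The mapping is my own reading of the options, not something confirmed by the GUI: -c20 for the connections, -R3 for the retries, -%P for "attempt to detect links", -n for "get non html files". The output folder name is hypothetical, site_link is the same placeholder as above, and the media-file scan rules are left out. The script only builds and prints the command; it does not run HTTrack.

```python
import shlex

# Placeholder from the post; the real forum URL is not given.
SITE = "site_link"
# Hypothetical output folder, chosen only for illustration.
OUTPUT_DIR = "./forum-mirror"


def build_httrack_command(site, output_dir):
    """Return an argument list approximating the GUI settings listed above."""
    return [
        "httrack", site,
        "-O", output_dir,             # where the mirror is written
        f"+{site}/*",                 # scan rule: accept everything under the site
        "-c20",                       # 20 simultaneous connections
        "-R3",                        # 3 retries
        "-%P",                        # extended parsing (assumed match for "attempt to detect links")
        "-n",                         # also fetch non-HTML files "near" a page
        "--disable-security-limits",  # lift the built-in bandwidth/connection caps
    ]


command = build_httrack_command(SITE, OUTPUT_DIR)
# shlex.join quotes the scan rule so the shell does not expand the "*".
print(shlex.join(command))
```

The printed line can be pasted into a shell once site_link is replaced with the real address.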

My problem is this: site_link (the root) is saved along with every link
detected on it, but pages further down the branch are only partly saved. For
example, a thread reached from a secondary page is saved only with the pages
its pager links by default: pages 1, 2, 3, 4 of the thread, then "....", then
n-3, n-2, n-1, n. None of the pages between the first few and the last few get
saved. It's as if HTTrack, once it is on page 2, does not go on to save page 5;
that is, it does not seem to recursively follow <a> links on secondary,
tertiary, etc. pages.
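
To make the expectation concrete, here is a small sketch of what I would expect a recursive crawl to do with such a pager. It is not HTTrack's actual algorithm, and the pager layout is an assumption (each page links to the first page, the last page, and its neighbours within 3 pages); pager_links and crawl are hypothetical helper names. With no depth limit every page is reached, while stopping after one hop from page 1 leaves only the first few pages and the last one.

```python
from collections import deque


def pager_links(page, last_page, window=3):
    """Pages linked from `page` under the assumed pager layout."""
    links = {1, last_page}
    for p in range(page - window, page + window + 1):
        if 1 <= p <= last_page:
            links.add(p)
    links.discard(page)  # a page does not need to link to itself
    return links


def crawl(last_page, max_depth=None, window=3):
    """Breadth-first crawl from page 1; max_depth=None means no depth limit."""
    seen = {1}
    queue = deque([(1, 0)])
    while queue:
        page, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # depth limit reached: do not follow links from this page
        for nxt in pager_links(page, last_page, window):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen


full = crawl(last_page=30)
shallow = crawl(last_page=30, max_depth=1)
print(sorted(shallow))  # [1, 2, 3, 4, 30]
print(len(full))        # 30
```

The shallow result looks a lot like what I am getting, which is why I suspect the links on the inner pages are not being followed.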

My questions are:
1. Is there any way I can make the tool discover and save every link deep
down the tree? I know I will get more than a million threads/files.
2. Should I use the CLI? I don't think it's a resource problem, as the machine
it's running on is powerful (high-end PC + gigabit connection).
3. Am I wrong to assume that HTTrack can discover links on secondary,
tertiary, etc. pages and recursively cover everything?
Thank you! This tool will allow us to save a 20-year-old forum that is being
closed without being archived.
 