HTTrack Website Copier
Free software offline browser - FORUM
Subject: Can't mirror websites that use absolute links
Author: Elliander
Date: 03/17/2023 21:12
 
First, I'd like to say that I found this program to be very interesting, and I
especially like the options that limit things to avoid causing problems for
servers. That said...

I'm trying to mirror a few old web comic websites so that I can read on a
slower device offline, but found it impossible to do.

So, for example, if I try mirroring the following website:

<https://www.nuklearpower.com/8-bit-theater/>

Without using any custom settings, it creates a mirror that is less than 3 MB
in size and includes only a single page. This seems to be because each and
every link uses "absolute coordinates" (a full URL) rather than the "relative
coordinates" that most websites tend to use.
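
To illustrate what I mean, here is a small Python sketch of my own. It is not
taken from HTTrack, and the archive path is a made-up example rather than a
link copied from the site. An absolute link and a relative link can resolve to
the very same URL on the same host, so I would have expected both to count as
internal:

```python
from urllib.parse import urljoin, urlparse

# The page I start the mirror from.
page = "https://www.nuklearpower.com/8-bit-theater/"

# Hypothetical links as they might appear in the page's HTML.
absolute_link = "https://www.nuklearpower.com/archive/"
relative_link = "../archive/"

# Resolve both against the page URL, the way a browser would.
resolved_absolute = urljoin(page, absolute_link)
resolved_relative = urljoin(page, relative_link)

# Both forms point at the same target, on the same host as the page.
same_target = resolved_absolute == resolved_relative
same_host = urlparse(resolved_absolute).hostname == urlparse(page).hostname
print(same_target, same_host)
```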

So, I went into the options, and made the following changes:

First, I set the limit on active connections to 3. I also set "Max connections
/ seconds" to 3 to further prevent server strain. Finally, I set "Maximum
external depth" to 3. I figured that would work since, through the archive
page, every page SHOULD be accessible within 3 links.

Unfortunately, after downloading just 3 pages, it started downloading the
first few pages of other comics, with the last downloaded page of each
containing a link to the original website instead. It then started downloading
a few pages of every random affiliate website and anything else the site
links to.

It's been running for 15 hours and 36 minutes so far, with 5.26 GiB downloaded
(which is a good speed to respect the domains in question), but I barely got
anything from the website I want. Looking in the folder, it has already
downloaded from 707 websites, including comics I never heard of and random
advertisers. A real problem is that since it linked to Google somewhere, and
since Google links to pretty much everywhere, the entire internet would be
downloaded if I let it finish.

Meanwhile I tried the same thing with a few other sites with similar results.
I ultimately had to just stop downloading since I wasn't getting anywhere. I'd
rather not have to download the entire internet just to get a single domain.

So, I have to ask, is there any way around this? I see, for example, that
under "Scan Rules" I can click "Include Link(s)" and set the entire domain
like so:

+*[name].www.nuklearpower.com/*

However, there doesn't seem to be a way to exclude anything not on a single
domain. Regardless, I tried using that and then set "external connections"
back to blank and it just completed within seconds just like it did without
any options set.

So how exactly do I mirror an entire website that treats every link as an
external link, while restricting the mirror to only the files located within a
single domain and/or subdomain? How do I download just one domain and not the
entire internet?
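
To make the behaviour I'm hoping for concrete, here is a rough Python sketch.
It is only my own illustration, not HTTrack's actual logic; the function name
`in_scope` and the sample links are hypothetical. A link stays in the mirror
only if its host is the wanted domain or one of its subdomains, no matter
whether the link was written as a full URL:

```python
from urllib.parse import urlparse


def in_scope(url: str, domain: str) -> bool:
    """Return True if url's host is `domain` or a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    domain = domain.lower()
    # Exact match, or a host that ends with ".<domain>" (a real subdomain).
    return host == domain or host.endswith("." + domain)


# Hypothetical links of the kind found on a comic page.
links = [
    "https://www.nuklearpower.com/8-bit-theater/",
    "https://nuklearpower.com/archive/",
    "https://www.google.com/search?q=comics",
    "https://notnuklearpower.com/",
]

# Keep only the links that belong to the domain being mirrored.
wanted = [u for u in links if in_scope(u, "nuklearpower.com")]
print(wanted)
```

The check is on the host name rather than on how the link is written, which
is the distinction I can't seem to get from the options I tried.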

If there is no way to do this could you add an "exclude" rule to exclude all
domains that are not in the domain include list?
Thanks!

 