HTTrack Website Copier
Free software offline browser - FORUM
Subject: Cache checks at remote-web-throttling speeds?
Author: Naisse
Date: 02/22/2021 02:51
 
PROBLEM: HTTrack is re-checking the existence of files that are already on my
hard drive at the throttled speed it's meant to use so as not to overload the
remote server. It has spent the last 9 HOURS printing "engine: warning:
temporary file XXXXXXthreads/461564.html.tmp already exists" and is only at
about line 9938 of my 260000-line input link file. At this rate it will take
forever. Is there any way of speeding it up?

So far the only thing I can think of is disabling the speed limits, but then
there's no way to reinstate them midway through, once I actually start talking
to the remote site!
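
The closest thing I can picture is to stop the engine cleanly once the cache
pass is over (a single Ctrl+C seems to let it finish pending transfers, or
maybe the lock files I ask about at the end) and then relaunch it with the
limits back in place, counting on -i and -C1 to carry on from the cache. I
have no idea whether httrack actually honours different limits on a continued
run, so the line below (with -A and -%c values picked out of thin air) is only
a guess:

  # after stopping the unthrottled run: continue the mirror with polite limits again
  httrack --list LINKLIST.txt -O /home/$USER/archive/ -i -C1 -A25000 -%c2 [...same filters as the full command below...]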


COMMAND: The command line looks something like

httrack --list LINKLIST.txt -O /home/$USER/archive/ \
  --near -i --retries=5 -C1 -s0 -f -D -a -r 15 -R3 -H0 -X0 \
  -c12 -%c14 --max-pause=5200000000 \
  -F "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0" \
  -p7 -%H -#L2000000 \
  -*archive.nanowrimo.org/sign_out* -*archive.nanowrimo.org*deletion_request* \
  +*cloudfront.net* +mime:text/html \
  -*region*affiliation -*nanomail/new* \
  -*archive.nanowrimo.org*forums*mark_as_read -*forum*comment*reply* \
  -*thread*watch -*forum*thread*flag -*forum*comment*flag \
  -*participants*buddyship -*forums/*threads/new -*forums*threads*unread_comment \
  -*archive.nanowrimo.org/sign_out* -*archive.nanowrimo.org*deletion_request*

The -c12 -%c14 are being ignored (clamped by the built-in security limits, I
presume), so I guess I need to look into --disable-security-limits if I'm to
use my only "solution"... Is there a better way?
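
If --disable-security-limits really is the way to get -c12 -%c14 honoured
during the cache-checking phase, I imagine the fast first pass would look
roughly like this (untested, and the flag sounds scary enough that I'd rather
someone confirm it first):

  # unthrottled pass, meant to run only until the local cache checks are done
  httrack --list LINKLIST.txt -O /home/$USER/archive/ -i -C1 --disable-security-limits -c12 -%c14 [...same user-agent and filters as above...]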

MORE INFO: I'm trying to mirror a forum that's going to be taken down soon. 

On my first run, HTTrack picked up a lot of near-links (avatars, signatures,
etc.) but never got around to all the threads (which are, after all, the most
important part); it went down a rabbit hole, and I don't think I let it finish
its transfers properly, so the result is a bit of a mess.

So I've used wget and a lot of parsing to build a list of all the forum boards
and threads, thinking that such a list would get httrack to fetch the
important bits _first_ and then, if time permits, pick up the extras like
images (which are mostly hosted elsewhere, so they won't be affected by the
shutdown).
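
For what it's worth, the list-building step was roughly along these lines (the
URL pattern, depth and file names are just illustrative, not my exact script):

  # crawl the board index pages and log every URL wget encounters
  wget --spider -r -l3 -o spider.log "https://archive.nanowrimo.org/forums/"
  # pull the board/thread links out of the log into httrack's input list
  grep -Eo 'https?://archive\.nanowrimo\.org/forums[^ "]*' spider.log > LINKLIST.txt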

Result: a link list that's about 260000 lines long. Which is fine, if a bit
daunting to look at, but I had to stop httrack (being new to this kind of
parsing, I'd left some duplicates in the file and needed to weed them out) and
restart it... and now the problem is more than cosmetic :D
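
(The dedup itself was nothing fancy; assuming exact duplicate lines, something
order-preserving like this does the job and keeps the "important threads
first" ordering intact:)

  # drop repeated lines, keeping the first occurrence and the original order
  awk '!seen[$0]++' LINKLIST.txt > LINKLIST.dedup.txt && mv LINKLIST.dedup.txt LINKLIST.txt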

Finding out about -C1, which makes httrack trust the cache instead of going by
If-Modified-Since (which lies in this case), keeps it from duplicating every
page, but it doesn't help the speed.
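
For anyone else landing here, my reading of the cache levels from
"httrack --help" (please correct me if I've misread it) is:

  -C0  no cache
  -C1  cache is prioritary (trust what has already been mirrored)
  -C2  test update before (re-check each file with the server first)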



Btw, it's frightfully hard to come across
<http://www.httrack.com/html/fcguide.html>... and even that doesn't give any
information on hts-paused.lock or the use of hts-stop.lock to pause the
download... Any chance of getting that info into the FAQ or something? And of
actually making "The tutorial written by Fred Cohen" on the FAQ page a real
link? ;)
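
For the record, my current working assumption about those lock files
(unverified, which is exactly why I'd love to see it documented) is something
like this:

  # ask the running engine to pause: drop the lock file into the project directory
  touch /home/$USER/archive/hts-stop.lock
  # httrack then (I think) creates hts-paused.lock while it waits;
  # removing the lock file(s) should let it carry on
  rm /home/$USER/archive/hts-stop.lock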
 