Hi,
I've had some success with httrack, many thanks <applauds>
But I first got bitten by the default behaviour...
From the log:
HTTrack3.44-5+libhtsjava.so.2
httrack -C1 -%v -f2 -R5 http://manga.animea.net/kekkaishi.html -O httrack -*
+http://manga.animea.net/kekkaishi-chapter*.html*[] +*jpg +*png +*gif
This downloads ~6500 8k html files and the same number of images, 50-150k
each.
(Well, for some reason a number of those files were missing after the first
run, and I reissued 1-2 plain httrack commands from the download folder to
get everything. By the way, I'd like to know whether that is any different
from running httrack --update.)
I quickly discovered that several hundred image files were corrupted. Some
could not be opened at all, and this was quite easy to fix, using a tool (I
used feh) to remove unloadable images, then relaunching httrack; but the
majority were normal loadable images, except that only the top part of each
was intact, and it's a pain to check that manually, delete, relaunch and
recheck.
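For what it's worth, that kind of check could probably be scripted instead of done by eye. Below is a minimal sketch, not something I actually ran on the mirror: it assumes a truncated download simply lacks the normal end-of-file marker (the JPEG EOI bytes, the PNG IEND chunk, the GIF trailer byte). The names looks_complete and find_truncated are made up for this example, and the heuristic will miss files that are damaged in the middle.

```python
from pathlib import Path

# File signatures and end markers for the three image types mirrored here.
JPEG_START, JPEG_END = b"\xff\xd8", b"\xff\xd9"
PNG_START, PNG_END = b"\x89PNG\r\n\x1a\n", b"IEND\xaeB`\x82"
GIF_STARTS, GIF_END = (b"GIF87a", b"GIF89a"), b";"


def looks_complete(data: bytes) -> bool:
    """Heuristic: True if data starts AND ends like a whole JPEG, PNG or GIF."""
    if data.startswith(JPEG_START):
        return data.endswith(JPEG_END)
    if data.startswith(PNG_START):
        return data.endswith(PNG_END)
    if data.startswith(GIF_STARTS):
        return data.endswith(GIF_END)
    # Unknown signature (empty file, HTML error page, ...): treat as suspect.
    return False


def find_truncated(folder: str):
    """Yield paths of image files under folder that fail the heuristic."""
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.suffix.lower() in {".jpg", ".jpeg", ".png", ".gif"}:
            if not looks_complete(path.read_bytes()):
                yield path


if __name__ == "__main__":
    # "httrack" is the output folder passed to -O in the commands of this post.
    for suspect in find_truncated("httrack"):
        print(suspect)
```

The printed files could then be deleted before relaunching httrack, as in the manual procedure.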
Fortunately I did not have to go through that manual checking, as it occurred
to me that I never had this problem while normally browsing the site, and that
perhaps the site does not behave correctly when a request takes time to
fulfill: usually one gets to view the image in 1-2 seconds, while httrack
typically downloads it at under 8 kb/s (since there are 2-3 other active
connections), and thus needs 5-20 seconds to get it, depending on the size.
While browsing the site it does happen that a transfer hangs after an image
has only been partially displayed (in which case it's almost always quicker to
just refresh the page), so perhaps the problem arises when partial transfers
are resumed by httrack.
Anyway, I thought I'd try to get httrack to mimic normal browsing more closely:
no limit on the transfer rate, but just 1 transfer slot and just 1 connection
per second.
httrack -C1 -%! -A1000000 -%c1 -c1 -%v -R5
http://manga.animea.net/kekkaishi.html -O httrack -*
+http://manga.animea.net/kekkaishi-chapter*.html*[] +*jpg +*png +*gif
(By the way, it's a pity that the log doesn't also show the long option names.
My command line was a lot clearer; I have a file where I save this kind of
thing:
httrack --cache=1 --disable-security-limits --max-rate=1000000
--connection-per-second=1 --sockets=1 --display --retries=5
)
Guess what? On the first try, NO corrupted image at all (well, except for a
few that were already corrupted on the server).
(Also for some reason I did get all the files on the first run)
But still, httrack's behaviour is not exactly what I'd wish: transfer rate
slows down to 8kb/s when downloading 8k files, and skyrockets to more than
100kb/s when downloading big files in a row. Average was 50kb/s.
In retrospect I'm surprised at the default behaviour of httrack, since sites
are designed to be browsed, and there's no end-user limit set on any given
transfer during normal browsing.
So why impose limits once a transfer has started? Of course DoS attacks must be
avoided at all costs. But why not say something like: start a new transfer if
the average transfer rate over the last 10s was less than some reasonable
limit, and this limit would not be crossed even if all the open slots finished
in the next second?
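To make the idea a bit more concrete, here is a rough sketch of the rule as I picture it. This is only my own reading of the suggestion, not how httrack works internally: the function name, the parameters and the choice to average the worst case over the same 10-second window are all assumptions, and rates are in bytes per second.

```python
def may_start_transfer(bytes_in_window: int,
                       open_remaining_bytes: int,
                       limit_bps: float,
                       window_s: float = 10.0) -> bool:
    """Decide whether one more transfer may be started.

    bytes_in_window      -- bytes received during the last window_s seconds
    open_remaining_bytes -- bytes still expected on the transfers already open
    limit_bps            -- the "reasonable limit" on the average rate (bytes/s)
    """
    # Condition 1: the recent average rate is below the limit.
    current_rate = bytes_in_window / window_s
    if current_rate >= limit_bps:
        return False

    # Condition 2 (assumed reading): even if every open transfer delivered
    # all of its remaining bytes right away, the windowed average would
    # still not exceed the limit.
    worst_case_rate = (bytes_in_window + open_remaining_bytes) / window_s
    return worst_case_rate <= limit_bps
```

For example, with a 50,000 bytes/s limit, 100,000 bytes received in the last 10s and 50,000 bytes still pending, a new transfer would be allowed; with 400,000 bytes received and 200,000 pending, it would not.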
Well, perhaps I just stumbled on the rare site that misbehaves with default
httrack...
Just my 2 cents. Sorry this was so long.