Hello everyone,
I'm trying to download certain websites that generate large numbers of garbage
files: for example, tens of .html files in a single directory that are exactly
identical (and lack any real content) except for the filename and a random
number that is part of one external URL inside them. Or files from a big
catalog that are nearly the same.
Since the filenames differ, they are technically different files, so httrack
downloads them all, as expected.
My question is: is there any way (an advanced feature, or something similar)
to make httrack "learn" or notice when to give up downloading files from a
directory, based on the (almost) nonexistent differences between the files
already downloaded? Maybe some sort of (near) duplicate content detection
that it could use to stop downloading.
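To make it clearer what I mean by (near) duplicate detection, this is roughly
the check I end up doing by hand after a mirror finishes. It's just a minimal
Python sketch, not anything httrack does itself; the mirror directory name and
the URL-stripping regex are only assumptions for illustration.

    import collections
    import hashlib
    import pathlib
    import re

    # Hypothetical path to the finished mirror; adjust to the real output directory.
    MIRROR_DIR = pathlib.Path("www.josecer.es")

    # Strip URLs before hashing, since the random number lives inside one external URL.
    URL_RE = re.compile(rb'https?://[^\s"<>]+')

    groups = collections.defaultdict(list)
    for page in MIRROR_DIR.rglob("*.html"):
        normalized = URL_RE.sub(b"", page.read_bytes())
        groups[hashlib.sha256(normalized).hexdigest()].append(page)

    # Any group with more than one file is a set of (near) duplicate pages.
    for files in groups.values():
        if len(files) > 1:
            print(f"{len(files)} near-identical pages, e.g. {files[0]}")

Something along those lines, but applied while crawling so the download can be
cut short, is what I'm hoping already exists.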
One site where this is happening is this one:
www.josecer.es
I'm using these arguments (Linux CLI version):
-w -m10485760 -E7200 -A250000 -c4 -%P -%p -s2 -%k -%s -%u -Z -I0 -D -%H -N0
-K0 -*.exe
Thank you for your time.
jorchube