HTTrack Website Copier
Free software offline browser - FORUM
Subject: Avoid garbage content w/o depth or timeout limit
Author: jorchube
Date: 10/25/2012 17:42
 
Hello everyone,

I'm trying to download certain web pages that generate large amounts of
garbage files: for example, tens of .html files in a directory that are
exactly identical (and lack any real content) except for the filename and a
random number inside one external URL, or files from a big catalog that are
nearly the same.

Since the filenames differ, HTTrack treats them as different files and
downloads them, as expected.
My question is: is there any way (an advanced feature, or something similar)
to make HTTrack "learn" or notice when to give up downloading files from a
directory, based on the (almost) nonexistent differences between the files
already downloaded? Maybe some sort of (near) duplicate content detection
could be used to stop downloading, along the lines of the sketch at the end
of this post.

One web page where this is happening is this one:

www.josecer.es

using these arguments (Linux CLI version):

-w -m10485760 -E7200 -A250000 -c4 -%P -%p -s2 -%k -%s -%u -Z -I0 -D -%H -N0
-K0 -*.exe
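
To make clearer what I mean by "near duplicate content detection", here is a
rough post-download cleanup sketch in Python. It is only an illustration of
the idea, not an HTTrack feature; the mirror directory path and the
URL-normalizing regex are assumptions of mine, not anything HTTrack provides:

    # Rough post-download cleanup: flag near-duplicate .html files in a mirror.
    # The directory name and the URL regex are assumptions for illustration only.
    import hashlib
    import re
    from pathlib import Path

    MIRROR_DIR = Path("websites/www.josecer.es")   # hypothetical mirror location
    URL_RE = re.compile(rb'https?://[^"\'\s<>]+')  # crude URL matcher

    def fingerprint(path: Path) -> str:
        """Hash the file contents with every URL blanked out, so pages that
        differ only by a random number inside one external URL collapse to
        the same fingerprint."""
        data = path.read_bytes()
        normalized = URL_RE.sub(b"", data)
        return hashlib.sha256(normalized).hexdigest()

    def find_near_duplicates(root: Path):
        seen: dict[str, Path] = {}
        duplicates: list[Path] = []
        for html_file in sorted(root.rglob("*.html")):
            fp = fingerprint(html_file)
            if fp in seen:
                duplicates.append(html_file)  # same content as an earlier file
            else:
                seen[fp] = html_file
        return duplicates

    if __name__ == "__main__":
        for dup in find_near_duplicates(MIRROR_DIR):
            print(f"near-duplicate: {dup}")
            # dup.unlink()  # uncomment to actually delete the duplicates

Of course this only cleans up after the fact; what I'm really asking is
whether HTTrack itself can notice this and stop requesting such files in the
first place.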


Thank you for your time.
jorchube
 