HTTrack Website Copier
Free software offline browser - FORUM
Subject: Avoid garbage content w/o depth or timeout limit
Author: jorchube
Date: 10/25/2012 17:42
 
Hello everyone,

I'm trying to download certain web pages that generate large amounts of
garbage files: for example, tens of .html files in a directory that are
exactly identical (and lack any real content) except for the filename and a
random number inside one external URL, or files from a big catalog that are
nearly the same.

Since the filenames differ, HTTrack treats them as different files and
downloads them, as expected.
My question is: is there any way (an advanced feature, or something similar)
to make HTTrack "learn" or notice when to give up downloading files from a
directory, based on the (almost) nonexistent differences between the files
already downloaded? Maybe some sort of (near) duplicate content detection
could be used to stop downloading, along the lines of the sketch at the end
of this post.

One web page where this is happening is this one:

www.josecer.es

using these arguments (Linux CLI version):

-w -m10485760 -E7200 -A250000 -c4 -%P -%p -s2 -%k -%s -%u -Z -I0 -D -%H -N0
-K0 -*.exe
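
To make clearer what I mean by "near duplicate content detection", here is a
rough post-download cleanup sketch in Python. It is only an illustration of
the idea, not an HTTrack feature; the mirror directory path and the
URL-normalizing regex are assumptions of mine, not anything HTTrack provides:

    # Rough post-download cleanup: flag near-duplicate .html files in a mirror.
    # The directory name and the URL regex are assumptions for illustration only.
    import hashlib
    import re
    from pathlib import Path

    MIRROR_DIR = Path("websites/www.josecer.es")   # hypothetical mirror location
    URL_RE = re.compile(rb'https?://[^"\'\s<>]+')  # crude URL matcher

    def fingerprint(path: Path) -> str:
        """Hash the file contents with every URL blanked out, so pages that
        differ only by a random number inside one external URL collapse to
        the same fingerprint."""
        data = path.read_bytes()
        normalized = URL_RE.sub(b"", data)
        return hashlib.sha256(normalized).hexdigest()

    def find_near_duplicates(root: Path):
        seen: dict[str, Path] = {}
        duplicates: list[Path] = []
        for html_file in sorted(root.rglob("*.html")):
            fp = fingerprint(html_file)
            if fp in seen:
                duplicates.append(html_file)  # same content as an earlier file
            else:
                seen[fp] = html_file
        return duplicates

    if __name__ == "__main__":
        for dup in find_near_duplicates(MIRROR_DIR):
            print(f"near-duplicate: {dup}")
            # dup.unlink()  # uncomment to actually delete the duplicates

Of course this only cleans up after the fact; what I'm really asking is
whether HTTrack itself can notice this and stop requesting such files in the
first place.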


Thank you for your time.
jorchube
 