| One reason why I regularly interrupt my downloads is that they tend to escape
and download parts of the web I am not interested in. In such cases I need to
add more filters to prevent HTTrack from mirroring the whole web.
Here is an example of what I mean (first line of doit.log):
-qw%e1C2%Pns0%s%uN0%I0p3DaK0H0%kf2A25000%f#f -F "Mozilla/4.5 (compatible;
HTTrack 3.0x; Windows 98)" -%F "<!-- Mirrored from %s%s by HTTrack Website
Copier/3.x [XR&CO'2010], %s -->" -%l "nl, en, *"
<http://auto.howstuffworks.com/stirling-engine.htm> -O1 "E:\\I\\Escape"
-auto.howstuffworks.com/* +auto.howstuffworks.com/stirling-engine*
This is supposed to grab the 5 files about Stirling engines, plus all the HTML
they point to (external depth=1), plus all non-HTML files any of these pages
points to. But in reality it mirrors all of *.howstuffworks.com (where * is anything other than auto).
According to new.txt, all the hundreds of links it downloads are "(from
<http://auto.howstuffworks.com/stirling-engine.htm>)", but it is easy to check
that most aren't. (WinHTTrack also shows that most are captured while scanning
pages other than the start page.) | |
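
As a rough sketch of how the two filters on the command line above would be expected to apply, the Python snippet below walks the rules in order and lets a later matching rule override an earlier one. It is a simplified model, not HTTrack's actual matcher. It assumes that the last matching rule wins and uses Python's `fnmatch` wildcards as a stand-in for HTTrack's own pattern syntax. The helper name `classify` and the two extra test URLs are hypothetical.

```python
from fnmatch import fnmatchcase

# The two scan rules from the command line above, in the order given.
FILTERS = [
    "-auto.howstuffworks.com/*",
    "+auto.howstuffworks.com/stirling-engine*",
]


def classify(url, filters=FILTERS):
    """Return '+', '-', or None when no rule matches (hypothetical helper)."""
    address = url.split("://", 1)[-1]  # compare without the scheme
    verdict = None
    for rule in filters:
        sign, pattern = rule[0], rule[1:]
        if fnmatchcase(address, pattern):
            # Assumption: a later matching rule overrides an earlier one.
            verdict = sign
    return verdict


for url in (
    "http://auto.howstuffworks.com/stirling-engine.htm",
    "http://auto.howstuffworks.com/engine.htm",         # hypothetical URL
    "http://science.howstuffworks.com/index.htm",       # hypothetical URL
):
    print(classify(url), url)
```

Under these assumptions, pages on other howstuffworks hosts match neither rule, so whether they are fetched would be decided by the remaining options rather than by these two filters. That would be consistent with the escape described above, though it does not by itself explain it.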