HTTrack Website Copier
Free software offline browser - FORUM
Subject: Bug? Escaping spider.
Author: Biep
Date: 08/02/2010 20:56
 
One reason why I regularly interrupt my downloads is that they tend to escape
and download parts of the web I am not interested in.  In such cases I need to
add additional filters to prevent HTTrack from mirroring the whole web.

Here is an example of what I mean (first line of doit.log):

-qw%e1C2%Pns0%s%uN0%I0p3DaK0H0%kf2A25000%f#f -F "Mozilla/4.5 (compatible;
HTTrack 3.0x; Windows 98)" -%F "<!-- Mirrored from %s%s by HTTrack Website
Copier/3.x [XR&CO'2010], %s -->" -%l "nl, en, *"
http://auto.howstuffworks.com/stirling-engine.htm -O1 "E:\\I\\Escape"
-auto.howstuffworks.com/* +auto.howstuffworks.com/stirling-engine*

This is supposed to grab the 5 files about Stirling engines, plus all the HTML
they point to (external depth=1), plus all non-HTML any of these pages points
to.  But in reality it mirrors all of *.howstuffworks.com (where * ~= auto).
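
The intended effect of the two filters can be sketched in Python.  This is
only an illustration, not HTTrack's actual matching code: it assumes that
"*" matches any run of characters and that a later matching filter overrides
an earlier one.  The function name and the test URLs are made up for the
example.

```python
from fnmatch import fnmatchcase

# The two filters from the command line above, in the order given.
FILTERS = [
    "-auto.howstuffworks.com/*",
    "+auto.howstuffworks.com/stirling-engine*",
]


def filter_verdict(url, filters=FILTERS):
    """Return '+', '-', or None (no filter matched) for a scheme-less URL.

    Assumption: when several filters match, the last one in the list wins.
    """
    verdict = None
    for rule in filters:
        sign, pattern = rule[0], rule[1:]
        if fnmatchcase(url, pattern):
            verdict = sign
    return verdict


if __name__ == "__main__":
    # Hypothetical URLs, chosen only to exercise each case.
    for url in (
        "auto.howstuffworks.com/stirling-engine1.htm",
        "auto.howstuffworks.com/engine.htm",
        "www.howstuffworks.com/index.htm",
    ):
        print(url, filter_verdict(url))
```

Under these assumptions the stirling-engine pages are accepted, the rest of
auto.howstuffworks.com is rejected, and hosts other than auto match neither
filter, so what happens to them depends on the other options rather than on
these two rules.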

According to new.txt all the hundreds of links it downloads are "(from
http://auto.howstuffworks.com/stirling-engine.htm)", but it is easy to check
that most aren't.  (WinHTTrack also shows that most are captured while scanning
pages other than the start page.)
 



