HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: httrack 'escaping' from expected crawling area,XML
Author: Haudy Kazemi
Date: 12/31/2002 01:34
 
> Okay problem detected and fixed. The problem is the 'near'
> option, in a particular case:
> - a non-html file is detected (here, a txt file)
> - the 'near' hack forced the download of the file
> - this file redirects to a regular html file (here, a fake
> 404 error page)

<snip>

> I have now setup the 'near' option so that it uses a depth
> of 1 (download and parse, but don't download anything)
>
> > Does that include removing the tabs and spaces too?
> No - control chars, that is, < 32, excluding the space
> character :)
>
> Okay, I'll try to release a beta-6 soon on httrack.com (in
> few minutes) which includes the 'near' fix.

Hello,

I just "continued" the mirroring of this project, and it 
completed successfully (AFAIK) except for one 
endless/infinite loop on an image (actually a 404 error) 
URL:
www.licensing.philips.com/partner/data/images/sm-duck.gif

This URL is parsed by Httrack and becomes:
www.licensing.philips.com/partner/data/images/images/sm-duck.gif
www.licensing.philips.com/partner/data/images/images/images/sm-duck.gif
etc.
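
One possible explanation for the growing path, offered as an assumption rather than a confirmed diagnosis: if the server answers the image URL with an HTML error page that contains a relative link such as images/sm-duck.gif (a hypothetical value; the actual markup of the error page is not shown here), then ordinary relative-URL resolution adds one more images/ segment on every pass. A minimal Python sketch of that resolution, using urllib.parse.urljoin:

```python
from urllib.parse import urljoin

# Hypothetical relative link assumed to appear in the fake 404 page.
RELATIVE_LINK = "images/sm-duck.gif"

def follow(start_url, hops):
    """Resolve RELATIVE_LINK against each page in turn, as a crawler would."""
    urls = [start_url]
    for _ in range(hops):
        # Each new URL becomes the base for the next resolution.
        urls.append(urljoin(urls[-1], RELATIVE_LINK))
    return urls

chain = follow("http://www.licensing.philips.com/partner/data/sl00811.pdf", 3)
for url in chain:
    print(url)
```

Under this assumption the chain starting at the .pdf never leads back to the .pdf itself, only to ever-deeper image URLs.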

The first two lines related to this URL from new.txt are:

16:58:27	19952/19952	-R-MC-	200	added ('OK')	text/html	date:Mon,%2030%20Dec%202002%2022:52:32%20GMT	www.licensing.philips.com/partner/data/images/sm-duck.gif	I:/web-archive%20problematic/www.epanorama.net%2020021228/www.licensing.philips.com/partner/data/images/sm-duck.gif	(from www.licensing.philips.com/partner/data/sl00811.pdf)

17:09:36	19973/19973	-R-MC-	200	added ('OK')	text/html	date:Mon,%2030%20Dec%202002%2023:03:40%20GMT	www.licensing.philips.com/partner/data/images/images/sm-duck.gif	I:/web-archive%20problematic/www.epanorama.net%2020021228/www.licensing.philips.com/partner/data/images/images/sm-duck.gif	(from www.licensing.philips.com/partner/data/images/sm-duck.gif)
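
To trace such a chain through new.txt, each entry can be split on tabs. The sketch below assumes the field order seen in the two entries above (time, size, flags, status, message, MIME type, date, URL, local file, referrer). That layout is inferred from these entries only, not taken from HTTrack documentation, and the name parse_entry is hypothetical.

```python
def parse_entry(line):
    """Split one tab-separated new.txt entry into the fields of interest."""
    fields = line.rstrip("\n").split("\t")
    referrer = fields[9]
    # The referrer appears as "(from <url>)"; strip the wrapper if present.
    if referrer.startswith("(from ") and referrer.endswith(")"):
        referrer = referrer[len("(from "):-1]
    return {
        "status": int(fields[3]),
        "mime": fields[5],
        "url": fields[7],
        "referrer": referrer,
    }

# First entry from above, rebuilt field by field for readability.
SAMPLE = "\t".join([
    "16:58:27",
    "19952/19952",
    "-R-MC-",
    "200",
    "added ('OK')",
    "text/html",
    "date:Mon,%2030%20Dec%202002%2022:52:32%20GMT",
    "www.licensing.philips.com/partner/data/images/sm-duck.gif",
    "I:/web-archive%20problematic/www.epanorama.net%2020021228/"
    "www.licensing.philips.com/partner/data/images/sm-duck.gif",
    "(from www.licensing.philips.com/partner/data/sl00811.pdf)",
])

entry = parse_entry(SAMPLE)
print(entry)
```

A .gif URL recorded with status 200 and MIME type text/html, as here, indicates that the server returned an HTML page in place of the image.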

Other files downloaded from this site did not cause an 
infinite loop, even though they are also served as the same 
404 error page as the image URL, for example:
www.licensing.philips.com/partner/data/sl00811.pdf

I was able to stop the infinite loop by hitting 'skip' once 
I noticed what was happening.  Is the problem with 
httrack's handling of the page or with a 'broken' server 
that gives false responses?  It surprises me that the .pdf 
link did not loop while the .gif did.

Anyway, if it helps, I've posted the new.txt, 
winprofile.ini, and broken pdf and gif to:
<http://kazemizadeh.net/httrack/epanorama.com/>

-Haudy Kazemi
 