Re: httrack 'escaping' from expected crawling area,XML

Subject: Re: httrack 'escaping' from expected crawling area,XML

Author: Xavier Roche

Date: 12/29/2002 19:27

> It looks like the Error 404 is parsed at the 07:26:43 
> 26127/26127 event where the 
> wholesaleproducts.com/index.html is crawled.  I don't 
> think that is supposed to happen.

Okay, problem detected and fixed. The problem is the "near" 
option, in one particular case:
- a non-html file is detected (here, a txt file)
- the "near" hack forces the download of the file
- this file redirects to a regular html file (here, a fake 
404 error page)

The problem is that the "near" hack triggers the download 
of the txt file with the default depth (9999). For a regular 
non-html file, this is not a problem. But in case of a 
redirect, the "child" inherits a depth value of N-1 (9998), 
and this causes a massive download.

I have now set up the "near" option so that it uses a depth 
of 1 (download and parse, but don't download anything further).
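
To illustrate why the inherited depth matters, here is a rough toy 
model in Python. It is not HTTrack's actual code: the site map, the 
file names and the crawl() function are all made up for the example, 
and a redirect is simply treated as a link to one child. It only 
shows how a child that inherits N-1 keeps the crawl going when N is 
9999, and stops it when N is 1.

```python
from collections import deque

# Hypothetical toy site: each URL maps to the URLs it leads to.
# A redirect is modelled as a single outgoing link.
SITE = {
    "notes.txt": ["fake404.html"],     # non-html file that redirects
    "fake404.html": ["index.html"],    # fake 404 page linking to the site root
    "index.html": ["a.html", "b.html"],
    "a.html": [],
    "b.html": [],
}

def crawl(start, depth):
    """Return the URLs fetched when `start` is queued with `depth`."""
    fetched = []
    seen = {start}
    queue = deque([(start, depth)])
    while queue:
        url, d = queue.popleft()
        if d < 1:
            continue  # no depth left: do not fetch this one
        fetched.append(url)
        for child in SITE.get(url, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, d - 1))  # the child inherits N-1
    return fetched

before_fix = crawl("notes.txt", 9999)  # "near" file queued with default depth
after_fix = crawl("notes.txt", 1)      # "near" file queued with depth 1
print(len(before_fix), "files fetched with depth 9999")
print(len(after_fix), "file fetched with depth 1")
```

In this model the depth-9999 run walks through the fake 404 page into 
the rest of the site, while the depth-1 run fetches only the txt file.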

> Does that include removing the tabs and spaces too?
No - control chars, that is, < 32, excluding the space 
character :)
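
A literal reading of that rule (drop every character with a code 
below 32, keep the space and everything above it) could be sketched 
like this in Python. The function name is made up for the example 
and this is not taken from the HTTrack source. Note that a tab is 
code 9, so it is below 32 too; the sketch removes it, and whether 
HTTrack treats tabs specially is not covered here.

```python
def strip_control_chars(text: str) -> str:
    """Drop characters with a code point below 32; keep space (32) and above."""
    return "".join(ch for ch in text if ord(ch) >= 32)

sample = "a b\r\nc\x07d"
cleaned = strip_control_chars(sample)
print(repr(cleaned))
```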

Okay, I'll try to release a beta-6 soon on httrack.com (in 
a few minutes) which includes the "near" fix.

