> It looks like the Error 404 is parsed at the 07:26:43
> 26127/26127 event where the
> wholesaleproducts.com/index.html is crawled. I don't think
> that is supposed to happen.
Okay, problem detected and fixed. The problem is the "near"
option, in one particular case:
- a non-html file is detected (here, a txt file)
- the "near" hack forces the download of the file
- this file redirects to a regular html file (here, a fake
404 error page)
The problem is that the "near" hack triggers the download
of the txt file with the default depth (9999). For regular
non-html files, this is not a problem. But in case of a
redirect, the "child" inherits a depth value of N-1 (9998),
and this causes a massive download.
I have now set up the "near" option so that it uses a depth
of 1 (download and parse, but don't download anything else).
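To illustrate the depth inheritance described above, here is a minimal toy model in Python. It is not HTTrack's actual code: the `crawl` function, the `SITE` map and the file names are all hypothetical, and the redirect is simply modelled as a single link from the txt file to the fake 404 page. It only shows why a start depth of 9999 lets the crawl spread, while a start depth of 1 stops at the file itself.

```python
from collections import deque

# Hypothetical toy site: each key lists the URLs reachable from it.
# "notes.txt" stands for the non-html file that redirects to a fake 404 page.
SITE = {
    "notes.txt": ["fake404.html"],          # the redirect, modelled as one link
    "fake404.html": ["a.html", "b.html"],   # regular html page with links
    "a.html": ["c.html"],
}

def crawl(links, start, depth):
    """Toy breadth-first crawl: a child inherits the parent's depth minus 1."""
    fetched = []
    seen = {start}
    queue = deque([(start, depth)])
    while queue:
        url, d = queue.popleft()
        fetched.append(url)
        if d <= 1:
            continue  # depth 1: download and parse, but follow nothing further
        for child in links.get(url, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, d - 1))
    return fetched

# Default depth: the redirect target inherits 9998 and the crawl spreads.
print(crawl(SITE, "notes.txt", 9999))
# Depth 1: only the txt file itself is fetched.
print(crawl(SITE, "notes.txt", 1))
```

In this sketch the depth-1 case never reaches the redirect target at all, which is an assumption about how "don't download anything else" plays out, not a statement about the real engine.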
> Does that include removing the tabs and spaces too?
No - control chars, that is, < 32, excluding the space
character :)
Okay, I'll try to release a beta-6 soon on httrack.com (in
a few minutes), which includes the "near" fix.