Re: httrack 'escaping' from expected crawling area,XML

Subject: Re: httrack 'escaping' from expected crawling area,XML

Author: Xavier Roche

Date: 12/29/2002 15:41

> I then looked at the saved pages from the first link, 
> www.wholesaleproducts.com and also the hts-log.txt to see 
> if I could identify where Httrack first wandered into that 
> site from.  I couldn't find the relevant info this way. 

Look in hts-cache/new.txt and search for the 
wholesaleproducts thing. You should have in the (from) 
field (at the end of the line) the originating URL.
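As a rough illustration of that lookup, here is a minimal Python sketch. It assumes that each line of new.txt is tab-separated and that the last field holds the originating URL, possibly wrapped as "(from ...)"; the exact layout may differ between HTTrack versions, and the function name find_origins is made up for this example.

```python
def find_origins(lines, needle):
    """Return (log_line, origin) pairs for every line that mentions `needle`.

    Assumption: fields are tab-separated and the last field is the
    originating URL, optionally wrapped as "(from <url>)".
    """
    hits = []
    for line in lines:
        line = line.rstrip("\r\n")
        if needle not in line:
            continue
        origin = line.split("\t")[-1].strip()
        # Unwrap the assumed "(from ...)" decoration if it is present.
        if origin.startswith("(from ") and origin.endswith(")"):
            origin = origin[len("(from "):-1]
        hits.append((line, origin))
    return hits
```

Calling it on the lines of hts-cache/new.txt with "wholesaleproducts" as the needle returns hits in file order, so the first pair should be the first time the engine touched that site.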

Two possible sources of error, I think:
1. the external parser (swf), as the depth test is not performed 
there (this is a bug, I will fix it soon)
2. maybe a bug in the "near" hack or in the filter system 
(not very probable, though)

> I didn't see anything unusual about it like XML.  There 
> were links in it to the main page and some of the cgi links 
> that were growing infinitely.

Which one(s)? It might be interesting to see what the first 
hit is in the new.txt tracking file.

> In regards to the epcos.* problems
> A full text search for epcos.de had zero hits in my 

Same remark: can you check the new.txt file? I did not see anything strange
in the html file.

> there are  several links to epcos here, 
> perhaps httrack's PDF module 
> is confused?
Nope - pdf files aren't parsed at all

> www.epanorama.net/links/magazine.html  (some epcos links)

Yes, with an .xml extension; but the XML file should be 
treated as a regular binary file (not parsed).

> www.epanorama.net/links/surge.html  (some epcos links)

The 
<http://www.epcos.com/excelon/servlet/excelon/components_magazine/xml/content_e.xml?xslsheet=components_magazine:/xsl/artikel.xsl&an=8&number=165&bereich=Company>
link seems to cause a timeout (also in IE).

> <BR/><h4><a href='&#10;			
> The local copy processed by Httrack is written like this 
> with many underscores '_' replacing spaces/tabs/etc in

Right - in the future I will remove explicit (&#10;) 
control chars; but it is rather strange that the URLs 
contain control characters anyway.
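To show the kind of cleanup meant here, a small Python sketch follows. It is not HTTrack's code, only an illustration: it decodes character references such as &#10; in an href value and then drops any control characters before the link is used. The function name clean_href and the example URL are made up for this example.

```python
import html


def clean_href(raw):
    """Decode character references, then remove control characters."""
    decoded = html.unescape(raw)  # "&#10;" becomes a newline character
    # Drop ASCII control characters (code points below 32, and DEL).
    kept = "".join(ch for ch in decoded if ord(ch) >= 32 and ord(ch) != 127)
    return kept.strip()
```

With this, an href such as '&#10;' followed by tabs and then http://www.example.com/page.html is reduced to the bare URL instead of being saved with underscores in place of the control characters.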

> Finally, I know I can use filters to exclude the epcos.* 
> and wholesaleproducts.com sites, but I think they should 
> have been automatically excluded as part of the default 
> behavior.

Absolutely - the only thing is to find the reason why the 
engine crawls them :)

>  I've seen this behavior before, but this is the 
> first time I've dug into it this deeply to identify the 
> cause.

So let's dig :)
