httrack 'escaping' from expected crawling area,XML

Subject: httrack 'escaping' from expected crawling area,XML

Author: Haudy Kazemi

Date: 12/29/2002 00:24

Hello,

Recently I began trying to use WinHTTrack to mirror a 
project (using bandwidth and connection limiters) that 
consists of two interrelated sites:
<http://www.epanorama.net/>
<http://www.hut.fi/Misc/Electronics/>

I have "Get non-html files related to html page" enabled, 
no depth limits, default scan filters turned on (all 3 
check boxes).

After WinHTTrack had crawled for a while (I was periodically 
monitoring its progress for problems like infinite URL 
loops, huge forums, and endless links), it looked like 
HTTrack was downloading much more than it should have been 
from a few specific sites.

The sites that had 'too much' being downloaded from them 
and looked like places where Httrack was 'escaping' 
or 'running away' or 'getting lost' were:
www.wholesaleproducts.com
www.epcos.de
www.epcos.com

By 'escaping' and 'getting lost' I mean that Httrack begins 
crawling areas that I think were incorrectly detected and 
then added to httrack's download list.

I then looked at the saved pages from the first site, 
www.wholesaleproducts.com, and also at hts-log.txt to see 
if I could identify where HTTrack first wandered into that 
site from.  I couldn't find the relevant info this way.  I 
then did a full text search of all the files HTTrack had 
downloaded, searching for wholesaleproducts.com.  The only 
hit I got from inside one of the original project sites was:
<http://www.epanorama.net/links/videosignal.html>
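
A minimal sketch of that kind of full text search, assuming Python is available. The function name find_mentions and the folder name "mirror" are hypothetical and not part of HTTrack; the folder stands in for wherever the project was saved.

```python
import os


def find_mentions(root, needle):
    """Return sorted paths under root whose contents contain needle.

    The comparison is case-insensitive and done on raw bytes, so the
    encoding of the saved pages does not matter.
    """
    needle_bytes = needle.lower().encode("ascii")
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    data = fh.read()
            except OSError:
                continue  # unreadable file: skip it
            if needle_bytes in data.lower():
                hits.append(path)
    return sorted(hits)


if __name__ == "__main__":
    # "mirror" is a hypothetical project folder name.
    for path in find_mentions("mirror", "wholesaleproducts.com"):
        print(path)
```

Hits inside the folders of the original project sites are the interesting ones, since they show where the crawler could have picked up the foreign host.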

The source of that page had a link:
  <LI><A HREF=http://www.wholesaleproducts.com/a-bestch.txt">How does NTSC 4.43 play on PAL TV ?</A>

I found no other references to wholesaleproducts.com 
anywhere in the original project sites.
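
One thing that stands out in that link is that the HREF value has a closing quote but no opening one. Whether this is what confused HTTrack is only a guess, but the sketch below shows how a tolerant link extractor could cope with such a value. The regular expression and the name extract_hrefs are illustrative assumptions, not HTTrack's actual parser.

```python
import re

# href=VALUE where VALUE is double-quoted, single-quoted, or unquoted.
HREF_RE = re.compile(
    r"""href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)


def extract_hrefs(html):
    """Return href values, tolerating a stray quote on unquoted values."""
    links = []
    for double_quoted, single_quoted, bare in HREF_RE.findall(html):
        if bare:
            # Unquoted value: drop any stray leading/trailing quote marks.
            links.append(bare.strip("\"'"))
        else:
            links.append(double_quoted or single_quoted)
    return links


if __name__ == "__main__":
    line = ('<LI><A HREF=http://www.wholesaleproducts.com/a-bestch.txt">'
            'How does NTSC 4.43 play on PAL TV ?</A>')
    print(extract_hrefs(line))
```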

I next tried opening this link and found that the file is 
missing; the server answers with a 404 page:
<http://www.wholesaleproducts.com/error404.html>

I looked at the source of the 404 page.  It starts with:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
I didn't see anything unusual in it, such as XML.  It 
contained links to the main page and to some of the cgi 
links that were growing infinitely.

I looked at the source of the local copy of that 404 error 
page, as saved by winhttrack.  It was significantly 
different in content from the source of the 404 you see in 
the browser.  It was saved locally as:
"www.wholesaleproducts.com\a-bestch.txt"

----------------------------------
Regarding the epcos.* problems:
A full text search for epcos.de had zero hits in my 
original project sites, although it had a few in some of 
the epcos.com pages.  The full text search for epcos.com 
showed me a few links to epcos.com from these pages:
www.epanorama.net/links/companies.html  (both web and local 
copies had a normal A HREF)
www.epanorama.net/links/componentinfo.html  (there are 
several links to epcos here; perhaps HTTrack's PDF module 
is confused?  Or is it XML that's messing things up?)
www.epanorama.net/links/magazine.html  (some epcos links)
www.epanorama.net/links/surge.html  (some epcos links)


----------------------------------------------------
Another thing: there seems to be some mishandling of
<A HREF>s on XML-"enhanced" pages such as:
<http://www.epanorama.net/index2.php?section=documents&index=surge>
<http://www.epanorama.net/index2.php?section=documents&index=audio>

The source of the web version has the A HREFs section 
written like this:

              <td> 
          	<?xml version="1.0" encoding="UTF-8"?>
<BR/><h4><a href="&#10;			
	documents/surge/surge_ac.html">Mains transient 
surge suppression</a></h4>Mains transient surge 
suppression<h4><a href="&#10;			
	documents/surge/surgeratings.html">Surge Suppressor 
Specification recommendations</a></h4>Surge Suppressor 
Specification recommendations<h4><a href="&#10;		
	
	documents/surge/surgesuppres.html">Transient 
Voltage Suppression Devices</a></h4>Transient Voltage 
Suppression Devices<h4><a href="&#10;			
	documents/surge/telesurge.html">Telephone line 
surge protection</a></h4>Telephone line surge protection 
      		</td>
      


When you look at the original web version in Internet 
Explorer, the links work normally, without any extra 
URL-breaking characters.  In the local copy processed by 
HTTrack, however, underscores '_' replace the newline 
references, spaces and tabs of the web copy:

	        <td> 
          	<?xml version="1.0" encoding="UTF-8"?>
<BR/><h4><a href="_____documents/surge/surge_ac.html">Mains 
transient surge suppression</a></h4>Mains transient surge 
suppression<h4><a 
href="_____documents/surge/surgeratings.html">Surge 
Suppressor Specification recommendations</a></h4>Surge 
Suppressor Specification recommendations<h4><a 
href="_____documents/surge/surgesuppres.html">Transient 
Voltage Suppression Devices</a></h4>Transient Voltage 
Suppression Devices<h4><a 
href="_____documents/surge/telesurge.html">Telephone line 
surge protection</a></h4>Telephone line surge protection 
      		</td>

Is this an XML-induced problem?  Does fixing it require 
adding XML support to WinHTTrack, or can it be handled the 
way Javascript is?  Does anyone know of any XML-to-HTML 
converter proxies meant for older browsers without XML 
support?  (If there are, I could have HTTrack use one to 
get around the XML problems.)

Additional info regarding XML and those &#10;'s in the 
A HREFs: I don't know what they do, and I had a hard time 
finding info about them.  The closest thing I found was at 
<http://www.w3.org/2000/08/lb2/>, where &#10;'s are used, 
but not necessarily in A HREFs.
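
For what it's worth, &#10; is the numeric character reference for a line feed (U+000A), so those HREF values simply begin with a newline followed by tabs. A browser-like client would decode the reference and trim the surrounding whitespace before resolving the link. The sketch below illustrates that idea in Python; resolve_href is a hypothetical helper, not something HTTrack provides, and the page URL is the one quoted earlier in this post.

```python
from html import unescape
from urllib.parse import urljoin


def resolve_href(base_url, raw_href):
    """Decode character references, trim whitespace, resolve against base."""
    decoded = unescape(raw_href)            # "&#10;" becomes "\n"
    cleaned = decoded.strip(" \t\r\n\f")    # drop leading/trailing whitespace
    return urljoin(base_url, cleaned)


if __name__ == "__main__":
    page = "http://www.epanorama.net/index2.php?section=documents&index=surge"
    raw = "&#10;\t\t\t\tdocuments/surge/surge_ac.html"
    print(resolve_href(page, raw))
```

Turning the whitespace into underscores instead, as in the local copy quoted above, changes the path itself, which would explain why those links no longer match the real file names.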

Finally, I know I can use filters to exclude the epcos.* 
and wholesaleproducts.com sites, but I think they should 
have been automatically excluded as part of the default 
behavior.  I've seen this behavior before, but this is the 
first time I've dug into it this deeply to identify the 
cause.
