Hello,
Recently I began using WinHTTrack to mirror a project
(with bandwidth and connection limiters) that consists of
two interrelated sites:
<http://www.epanorama.net/>
<http://www.hut.fi/Misc/Electronics/>
I have "Get non-html files related to html page" enabled,
no depth limits, default scan filters turned on (all 3
check boxes).
After WinHTTrack had crawled for a while (I was periodically
monitoring its progress for problems like infinite URL
loops, huge forums, and endless links), it looked like
HTTrack was downloading much more than it should have been
from a few specific sites.
The sites that had 'too much' being downloaded from them
and looked like places where Httrack was 'escaping'
or 'running away' or 'getting lost' were:
www.wholesaleproducts.com
www.epcos.de
www.epcos.com
By 'escaping' and 'getting lost' I mean that Httrack begins
crawling areas that I think were incorrectly detected and
then added to httrack's download list.
I then looked at the saved pages from the first site,
www.wholesaleproducts.com, and also at hts-log.txt to see
if I could identify where Httrack first wandered into that
site from. I couldn't find the relevant info this way. I
then did a full text search of all the files httrack had
downloaded, searching for wholesaleproducts.com. The only
hit I got from inside one of the original project sites was:
<http://www.epanorama.net/links/videosignal.html>
The source of that page had a link:
<LI><A HREF=http://www.wholesaleproducts.com/a-bestch.txt">How does NTSC 4.43 play on PAL TV ?</A>
I found no other references to wholesaleproducts.com
anywhere in the original project sites.
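For anyone who wants to repeat the full text search, here is a rough Python sketch of what I mean. It is not part of HTTrack; the function name `find_references`, the extension list, and the mirror folder path are just my assumptions for illustration. It walks the mirror folder and reports every file and line number where the given string appears, ignoring case:

```python
import os

def find_references(mirror_root, needle, extensions=(".html", ".htm", ".txt")):
    """Return (path, line_number) pairs for lines containing `needle`.

    Hypothetical helper: searches raw bytes so that odd encodings in
    mirrored pages do not abort the scan.
    """
    hits = []
    needle_bytes = needle.lower().encode("ascii")
    for dirpath, _dirnames, filenames in os.walk(mirror_root):
        for name in filenames:
            # Only look at text-like files; skip images, PDFs, etc.
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    for lineno, line in enumerate(handle, start=1):
                        if needle_bytes in line.lower():
                            hits.append((path, lineno))
            except OSError:
                # Unreadable file: skip it rather than stop the search.
                continue
    return hits
```

Calling `find_references(r"C:\My Web Sites\epanorama", "wholesaleproducts.com")` (the path is made up) would list the pages that mention the site, which is how a single referring page can be tracked down.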
I next tried accessing this link and found that the file is
missing; the server shows a 404 page instead:
<http://www.wholesaleproducts.com/error404.html>
I looked at the source of the 404 page; it starts with:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0
Transitional//EN">
I didn't see anything unusual in it, such as XML. There
were links in it to the main page and to some of the cgi
links that were growing infinitely.
I looked at the source of the local copy of that 404 error
page, as saved by winhttrack. It was significantly
different in content from the source of the 404 you see in
the browser. It was saved locally as:
"www.wholesaleproducts.com\a-bestch.txt"
----------------------------------
Regarding the epcos.* problems:
A full text search for epcos.de had zero hits in my
original project sites, although it had a few in some of
the epcos.com sites. The full text search for epcos.com
showed me a few links to epcos.com from these pages:
www.epanorama.net/links/companies.html (both web and local
copies had a normal A HREF)
www.epanorama.net/links/componentinfo.html (there are
several links to epcos here, perhaps httrack's PDF module
is confused? Or is it XML that's messing things up?)
www.epanorama.net/links/magazine.html (some epcos links)
www.epanorama.net/links/surge.html (some epcos links)
----------------------------------------------------
Another thing: there seems to be some mishandling of
<A HREF>s on XML "enhanced" pages such as:
<http://www.epanorama.net/index2.php?section=documents&index=surge>
<http://www.epanorama.net/index2.php?section=documents&index=audio>
The source of the web version has the A HREFs section
written like this:
<td>
<?xml version="1.0" encoding="UTF-8"?>
<BR/><h4><a href="
documents/surge/surge_ac.html">Mains transient
surge suppression</a></h4>Mains transient surge
suppression<h4><a href="
documents/surge/surgeratings.html">Surge Suppressor
Specification recommendations</a></h4>Surge Suppressor
Specification recommendations<h4><a href="
documents/surge/surgesuppres.html">Transient
Voltage Suppression Devices</a></h4>Transient Voltage
Suppression Devices<h4><a href="
documents/surge/telesurge.html">Telephone line
surge protection</a></h4>Telephone line surge protection
</td>
When you look at the original web version in Internet
Explorer, the links work normally, without any extra
URL-breaking characters. The local copy processed by
Httrack, however, is written like this, with underscores
'_' replacing the spaces/tabs/etc. of the web copy:
<td>
<?xml version="1.0" encoding="UTF-8"?>
<BR/><h4><a href="_____documents/surge/surge_ac.html">Mains
transient surge suppression</a></h4>Mains transient surge
suppression<h4><a
href="_____documents/surge/surgeratings.html">Surge
Suppressor Specification recommendations</a></h4>Surge
Suppressor Specification recommendations<h4><a
href="_____documents/surge/surgesuppres.html">Transient
Voltage Suppression Devices</a></h4>Transient Voltage
Suppression Devices<h4><a
href="_____documents/surge/telesurge.html">Telephone line
surge protection</a></h4>Telephone line surge protection
</td>
Is this an XML-induced problem? Does fixing this problem
require adding XML support to WinHttrack, or can it be
handled like the JavaScript? Does anyone know if there are
any XML-to-HTML converter proxies meant for older browsers
without XML support? (If there are, I could have Httrack
use one to get around the XML problems.)
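To illustrate what I think the browser is doing with these links, here is a small Python sketch. It is only my guess at the behavior, not HTTrack's code. I am assuming the characters in front of each URL are a line feed and tabs written as numeric character references (`&#10;` and `&#9;`); the sample markup and the names `HrefCollector` and `resolve_links` are made up for the example. The parser decodes the references, the leading and trailing whitespace is stripped, and only then is the link resolved against the page URL:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class HrefCollector(HTMLParser):
    """Collect raw href values from <a> tags (character references decoded)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)

def resolve_links(base_url, markup):
    """Strip surrounding whitespace from each href, then resolve it."""
    parser = HrefCollector()
    parser.feed(markup)
    parser.close()
    return [urljoin(base_url, raw.strip()) for raw in parser.hrefs]

# Assumed form of the original markup: newline + tabs encoded as references.
sample = ('<h4><a href="&#10;&#9;&#9;&#9;&#9;documents/surge/surge_ac.html">'
          'Mains transient surge suppression</a></h4>')
base = "http://www.epanorama.net/index2.php?section=documents&index=surge"

print(resolve_links(base, sample))
```

If the whitespace is replaced with underscores instead of being stripped, the resulting name no longer matches the real `documents/surge/surge_ac.html` path, which would explain the broken links in the local copy.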
Additional info regarding XML and those line-break
character references at the start of the A HREF values: I
don't know what they do, and I had a hard time finding info
about them. The closest thing I found was at:
<http://www.w3.org/2000/08/lb2/> where similar character
references are used, but not necessarily in A HREFs.
Finally, I know I can use filters to exclude the epcos.*
and wholesaleproducts.com sites, but I think they should
have been automatically excluded as part of the default
behavior. I've seen this behavior before, but this is the
first time I've dug into it this deeply to identify the
cause.