> I then looked at the saved pages from the first link,
> www.wholesaleproducts.com and also the hts-log.txt to see
> if I could identify where Httrack first wandered into that
> site from. I couldn't find the relevant info this way.
Look in hts-cache/new.txt and search for wholesaleproducts.
The (from) field at the end of the matching line should give
the originating URL.
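The lookup can be scripted. This is only a rough sketch: it assumes the fields in new.txt are tab-separated and that the (from) field is the last one on each line, as described above. The function name `find_origin` is made up for illustration.

```python
def find_origin(lines, needle):
    """Return the last field of the first line that mentions `needle`.

    Assumption: fields are tab-separated and the (from) field comes last.
    This is a plain substring search, so a match in any field counts.
    """
    for line in lines:
        if needle in line:
            fields = line.rstrip("\r\n").split("\t")
            return fields[-1]
    return None  # needle never appears in the log
```

To use it on a real mirror, open the file and pass the file object, for example `find_origin(open("hts-cache/new.txt", errors="replace"), "wholesaleproducts")`. Because lines are read in order, the first match is also the first hit in the tracking file.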
Two possible sources of error, I think:
1. external parser (swf), as the depth test is not issued
(this is a bug, I will fix it soon)
2. maybe a bug in the "near" hack or in the filter system
(not very probable, though)
> I didn't see anything unusual about it like XML. There
> were links in it to the main page and some of the cgi links
> that were growing infinitely.
Which one(s)? It might be interesting to see what the first
hit is in the new.txt tracking file.
> In regards to the epcos.* problems
> A full text search for epcos.de had zero hits in my
Same remark: can you check the new.txt file? I did not see anything
strange in the html file.
> there are several links to epcos here,
> perhaps httrack's PDF module
> is confused?
Nope - pdf files aren't parsed at all
> www.epanorama.net/links/magazine.html (some epcos links)
Yes, with an .xml extension; but the XML file should be
treated as a regular binary file (not parsed)
> www.epanorama.net/links/surge.html (some epcos links)
The link
<http://www.epcos.com/excelon/servlet/excelon/components_magazine/xml/content_e.xml?xslsheet=components_magazine:/xsl/artikel.xsl&an=8&number=165&bereich=Company>
seems to cause a timeout (also in IE)
> <BR/><h4><a href='
> The local copy processed by Httrack is written like this
> with many underscores '_' replacing spaces/tabs/etc in
Right - in the future I will remove explicit ( )
control chars; but it is rather strange that the URLs
contain control characters anyway
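To illustrate what removing control characters from a URL could look like, as opposed to replacing them with underscores, here is a small sketch. It is not HTTrack's code; the function name `strip_control_chars` is hypothetical, and it treats ASCII codes below 0x20 plus DEL (0x7F) as control characters.

```python
def strip_control_chars(url):
    """Drop ASCII control characters (codes < 0x20 and DEL) from a URL.

    Illustrative only: ordinary spaces (0x20) are kept, since this
    targets tabs, CR, LF and similar characters.
    """
    return "".join(ch for ch in url if ord(ch) >= 0x20 and ord(ch) != 0x7F)
```

A link written across several lines in the HTML source, with embedded CR/LF or tabs, would then collapse back to a single clean URL.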
> Finally, I know I can use filters to exclude the epcos.*
> and wholesaleproducts.com sites, but I think they should
> have been automatically excluded as part of the default
> behavior.
Absolutely - the only thing is to find the reason why the
engine crawls them :)
> I've seen this behavior before, but this is the
> first time I've dug into it this deeply to identify the
> cause.
So let's dig :)