> > I then looked at the saved pages from the first link,
> > www.wholesaleproducts.com, and also at hts-log.txt to see
> > if I could identify where Httrack first wandered into that
> > site from. I couldn't find the relevant info this way.
>
> Look in hts-cache/new.txt and search for the
> wholesaleproducts thing. You should have in the (from)
> field (at the end of the line) the originating URL.
Here are the relevant lines from new.txt:
06:45:05 287/-1 ---MC- 302 error ('Found')
text/html date:Sat,%2028%20Dec%202002%
2012:45:04%20GMT www.wholesaleproducts.com/a-
bestch.txt I:/web-archive%
20problematic/www.epanorama.net%
2020021228/www.wholesaleproducts.com/a-bestch.txt
(from www.epanorama.net/links/videosignal.html)
06:56:59 2341/2341 ---M-- 200 added ('OK')
text/html etag:%22e033-925-3b1d3af4%22
www.wholesaleproducts.com/error404.html I:/web-
archive%20problematic/www.epanorama.net%
2020021228/www.wholesaleproducts.com/error404.html
(from www.wholesaleproducts.com/a-bestch.txt)
07:26:42 34/34 ---M-- 200 added ('OK')
image/gif etag:%22d9a0-22-36b3633b%22
www.wholesaleproducts.com/white1.gif I:/web-
archive%20problematic/www.epanorama.net%
2020021228/www.wholesaleproducts.com/white1.gif (from
www.wholesaleproducts.com/error404.html)
07:26:42 9433/9433 ---M-- 200 added ('OK')
image/gif etag:%22d460-24d9-3a1b2e8c%22
www.wholesaleproducts.com/malllogo.gif I:/web-
archive%20problematic/www.epanorama.net%
2020021228/www.wholesaleproducts.com/malllogo.gif
(from www.wholesaleproducts.com/error404.html)
07:26:43 3087/3087 ---M-- 200 added ('OK')
text/html etag:%22d93b-c0f-3b11c66b%22
www.wholesaleproducts.com/terms.html I:/web-
archive%20problematic/www.epanorama.net%
2020021228/www.wholesaleproducts.com/terms.html (from
www.wholesaleproducts.com/error404.html)
07:26:43 26127/26127 ---M-- 200 added ('OK')
text/html etag:%22d3c9-660f-3d97c808%22
www.wholesaleproducts.com/ I:/web-archive%
20problematic/www.epanorama.net%
2020021228/www.wholesaleproducts.com/index.html (from
www.wholesaleproducts.com/error404.html)
07:26:45 9596/9596 ---MC- 200 added ('OK')
text/html date:Sat,%2028%20Dec%202002%
2013:26:42%20GMT
www.wholesaleproducts.com/bin/ccdbdis.pl?merchant=wholesale I:/web-archive%
20problematic/www.epanorama.net%
2020021228/www.wholesaleproducts.com/bin/ccdbdis5836.html
(from www.wholesaleproducts.com/error404.html)
09:23:00 907/907 ---M-- 200 added ('OK')
text/html etag:%22d92d-38b-390f80ba%22
www.wholesaleproducts.com/subscribemsg.html
I:/web-archive%20problematic/www.epanorama.net%
2020021228/www.wholesaleproducts.com/subscribemsg.html
(from www.wholesaleproducts.com/terms.html)
09:23:00 3046/3046 ---M-- 200 added ('OK')
image/jpeg etag:%22c074-be6-3d48654f%22
www.wholesaleproducts.com/BGbluemarble.jpg
I:/web-archive%20problematic/www.epanorama.net%
2020021228/www.wholesaleproducts.com/BGbluemarble.jpg
(from www.wholesaleproducts.com/)
etc.etc.etc.
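For anyone else chasing this kind of thing, here is a rough Python sketch of how the (from) field could be followed backwards automatically. It is only an illustration, not part of HTTrack: the function names are made up, and it assumes each new.txt entry sits on one physical line (unlike the wrapped excerpts above), that the URL and the local file name are the last two whitespace-separated fields before "(from ...)", and that neither contains a literal space.

```python
import re

# Assumption: every entry ends with "(from <referring URL>)".
FROM_RE = re.compile(r"\(from\s+(.*)\)\s*$")


def parse_entry(line):
    """Return (url, referrer) for one new.txt entry, or None if it does not fit."""
    match = FROM_RE.search(line)
    if match is None:
        return None
    fields = line[:match.start()].split()
    if len(fields) < 2:
        return None
    # The last field before "(from" is the local file; the one before it is the URL.
    return fields[-2], match.group(1).strip()


def first_referrers(lines):
    """Map each URL to the referrer recorded on its first appearance."""
    refs = {}
    for line in lines:
        parsed = parse_entry(line)
        if parsed is not None:
            url, referrer = parsed
            refs.setdefault(url, referrer)
    return refs


def trace_back(refs, url):
    """Follow referrers from url until one has no entry of its own (or a loop)."""
    chain = [url]
    seen = {url}
    while url in refs:
        url = refs[url]
        if url in seen:
            break
        chain.append(url)
        seen.add(url)
    return chain
```

Feeding the lines of hts-cache/new.txt to first_referrers() and then calling trace_back() on any wholesaleproducts URL should list the pages that led to it, ending at the page on the mirrored site where the crawl left.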
>
> Two possible sources of error, I think:
> 1. external parser (swf), as the depth test is not issued
> (this is a bug, I will fix it soon)
> 2. maybe a bug in the 'near' hack or in the filter system
> (not very probable, though)
It looks like the Error 404 is parsed at the 07:26:43
26127/26127 event where the
wholesaleproducts.com/index.html is crawled. I don't think
that is supposed to happen. As there are no SWF files
here, I presume it is in the near function as you suggest.
> > I didn't see anything unusual about it like XML. There
> > were links in it to the main page and some of the cgi
> > links that were growing infinitely.
>
> Which one(s)? Might be interesting to see what is the
> first hit in the new.txt tracking file
The problematic cgi was:
www.wholesaleproducts.com/bin/ccdbdis.pl?(various
parameters). Also, I looked at the source of the saved
local copy and it looks like winhttrack was NOT adding its
usual timestamp and id at the top of the HTML.
Some examples:
09:24:02 9596/9596 ---MC- 200 added ('OK')
text/html date:Sat,%2028%20Dec%202002%
2015:23:51%20GMT
www.wholesaleproducts.com/bin/ccdbdis.pl?merchant=wholesale&action=&ItemID=
I:/web-archive%
20problematic/www.epanorama.net%
2020021228/www.wholesaleproducts.com/bin/ccdbdis0542.html
(from www.wholesaleproducts.com/bin/ccdbdis.pl?merchant=wholesale)
09:23:54 34117/34117 ---MC- 200 added ('OK')
text/html date:Sat,%2028%20Dec%202002%
2015:23:38%20GMT
www.wholesaleproducts.com/bin/ccdbdis.pl?merchant=wholesale&action=category_front_list&ItemID=&Category=Bretford%20Multimedi&SubCategory1=Connections%20Accessor
I:/web-archive%20problematic/www.epanorama.net%2020021228/www.wholesaleproducts.com/bin/ccdbdis432f.html
(from www.wholesaleproducts.com/)
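A quick way to see which script is multiplying is to count how many distinct query strings were fetched per script path. The following is just a sketch with a made-up helper name; it takes a plain list of URLs (for example the URL field pulled out of new.txt) and knows nothing about HTTrack itself.

```python
from collections import Counter


def count_query_variants(urls):
    """Count distinct query-string variants per script path (the part before '?')."""
    seen = set()
    counts = Counter()
    for url in urls:
        if url in seen:
            continue  # the same URL fetched twice is not a new variant
        seen.add(url)
        base, sep, _query = url.partition("?")
        if sep:  # only URLs that actually carry a query string
            counts[base] += 1
    return counts
```

Calling counts.most_common(5) on the result should put a runaway script such as bin/ccdbdis.pl at the top.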
>
> > In regards to the epcos.* problems
> > A full text search for epcos.de had zero hits in my
>
> Same remark: can you check the new.txt file?
> I did not see any strange things in the html file.
The first few epcos.* links in new.txt are:
04:40:36 61364/61364 ---MC- 200 added ('OK')
text/html date:Sat,%2028%20Dec%202002%
2010:40:33%20GMT
www.epcos.com/excelon/servlet/excelon/components_magazine/xml/content_e.xml?xslsheet=components_magazine:/xsl/artikel.xsl&an=8&number=167&bereich=Applications
I:/web-archive%20problematic/www.epanorama.net%2020021228/www.epcos.com/excelon/servlet/excelon/components_magazine/xml/content_e5669.xml
(from www.epanorama.net/links/componentinfo.html)
04:41:00 61172/61172 ---M-- 200 added ('OK')
application/pdf etag:%224145a-eef4-3ba62a91%22
www.epcos.com/inf/80/ds/e0000005.pdf I:/web-
archive%20problematic/www.epanorama.net%
2020021228/www.epcos.com/inf/80/ds/e0000005.pdf (from
www.epanorama.net/links/componentinfo.html)
04:41:49 316079/316079 ---M-- 200 added ('OK')
application/pdf etag:%2241457-4d2af-3ba62a91%22
www.epcos.com/inf/80/ds/e0000002.pdf I:/web-
archive%20problematic/www.epanorama.net%
2020021228/www.epcos.com/inf/80/ds/e0000002.pdf (from
www.epanorama.net/links/componentinfo.html)
04:57:25 48423/48423 ---MC- 200 added ('OK')
text/html date:Sat,%2028%20Dec%202002%
2010:57:22%20GMT
www.epcos.com/excelon/servlet/excelon/components_magazine/xml/content_e.xml?xslsheet=components_magazine:/xsl/index.xsl
I:/web-archive%20problematic/www.epanorama.net%2020021228/www.epcos.com/excelon/servlet/excelon/components_magazine/xml/content_ebd3c.xml
(from www.epanorama.net/links/magazine.html)
06:16:20 56122/56122 ---MC- 200 added ('OK')
text/html date:Sat,%2028%20Dec%202002%
2012:16:17%20GMT
www.epcos.com/excelon/servlet/excelon/components_magazine/xml/content_e.xml?xslsheet=components_magazine:/xsl/artikel.xsl&an=8&number=165&bereich=Company
I:/web-archive%20problematic/www.epanorama.net%2020021228/www.epcos.com/excelon/servlet/excelon/components_magazine/xml/content_e8734.xml
(from www.epanorama.net/links/surge.html)
06:16:46 1833/1833 ---M-- 200 added ('OK')
application/x-javascript etag:%228e945-729-3c2c667c%22
www.epcos.com/share/all/js/browser.js
I:/web-archive%20problematic/www.epanorama.net%2020021228/www.epcos.com/share/all/js/browser.js
(from www.epcos.com/excelon/servlet/excelon/components_magazine/xml/content_e.xml?xslsheet=components_magazine:/xsl/artikel.xsl&an=8&number=167&bereich=Applications)
06:16:47 10784/10784 ---M-- 200 added ('OK')
application/x-javascript etag:%228e790-2a20-3decb3bb%22
www.epcos.com/share/all/js/epcos_main.js
I:/web-archive%20problematic/www.epanorama.net%2020021228/www.epcos.com/share/all/js/epcos_main.js
(from www.epcos.com/excelon/servlet/excelon/components_magazine/xml/content_e.xml?xslsheet=components_magazine:/xsl/artikel.xsl&an=8&number=167&bereich=Applications)
06:16:48 138/138 ---M-- 200 added ('OK')
application/x-javascript etag:%224d556-8a-3c2c678c%22
www.epcos.com/web/components_magazine/js/components.js
I:/web-archive%20problematic/www.epanorama.net%2020021228/www.epcos.com/web/components_magazine/js/components.js
(from www.epcos.com/excelon/servlet/excelon/components_magazine/xml/content_e.xml?xslsheet=components_magazine:/xsl/artikel.xsl&an=8&number=167&bereich=Applications)
At this point numerous additional epcos links are added
similar to the ones above. You can download the whole
new.txt file and a few of the other html pages from
wholesaleproducts.com here:
<http://kazemizadeh.net/httrack/epanorama.com/>
(if you want access to the incompletely mirrored copy of
this site let me know).
> > there are several links to epcos here,
> > perhaps httrack's PDF module
> > is confused?
> Nope - pdf files aren't parsed at all
>
> > www.epanorama.net/links/magazine.html (some epcos links)
>
> Yes with .xml extension ; but the XML file should be
> treated as regular binary file (not parsed)
I'd say some parsing is happening, otherwise I see little
reason for the epcos folders to fill up with xml files, and
for there to be entries in new.txt showing the XML files as
the source link.
>
> > www.epanorama.net/links/surge.html (some epcos links)
>
> The
> <http://www.epcos.com/excelon/servlet/excelon/components_magazine/xml/content_e.xml?xslsheet=components_magazine:/xsl/artikel.xsl&an=8&number=165&bereich=Company>
> link seems to cause a timeout (also in IE)
>
> > <BR/><h4><a href='
> > The local copy processed by Httrack is written like this
> > with many underscores '_' replacing spaces/tabs/etc in
>
> Right - in the future I will remove explicit ( )
> control chars; but it is rather strange that the URLs
> contain ctrl characters anyway
Does that include removing the tabs and spaces too? I
think always removing tabs and spaces will be problematic
for URLs that require them in the form of escaped '%20's.
Making this work right might need some minor XML parsing or
a way to figure out whether the spaces ought to be stripped
or converted to '%20's. Perhaps the solution is to only
strip out spaces/tabs/etc from the beginning of a URL.
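To make the idea concrete, here is a small Python sketch of one possible policy. It is not what HTTrack does; the function name is made up, and the choice to drop embedded tabs and line breaks while keeping embedded spaces as '%20' is my own assumption about what would be safest.

```python
# Characters 0x00..0x20: the C0 control characters plus the space.
_EDGE_CHARS = "".join(chr(code) for code in range(0x21))


def clean_href(raw):
    """Trim control chars/spaces at both ends; keep inner spaces as %20."""
    trimmed = raw.strip(_EDGE_CHARS)
    pieces = []
    for ch in trimmed:
        if ch in "\t\r\n":
            continue  # line wrapping inside the attribute value, not part of the URL
        if ch == " ":
            pieces.append("%20")  # an embedded space is significant, so escape it
        else:
            pieces.append(ch)
    return "".join(pieces)
```

With this, an href that starts with a newline and some tabs comes out clean, while a parameter such as Category=Bretford Multimedi keeps its space as '%20' instead of turning into an underscore.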
> > Finally, I know I can use filters to exclude the epcos.*
> > and wholesaleproducts.com sites, but I think they should
> > have been automatically excluded as part of the default
> > behavior.
>
> Absolutely - the only thing is to find the reason why the
> engine crawls them :)
>
> > I've seen this behavior before, but this is the
> > first time I've dug into it this deeply to identify the
> > cause.
>
> So let's dig :)
The steam-shovel is standing by... ;)