HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: httrack 'escaping' from expected crawling area,XML
Author: Haudy Kazemi
Date: 12/29/2002 17:36
 
> > I then looked at the saved pages from the first link, 
> > www.wholesaleproducts.com and also the hts-log.txt to see 
> > if I could identify where Httrack first wandered into that 
> > site from.  I couldn't find the relevant info this way. 
> 
> Look in hts-cache/new.txt and search for the 
> wholesaleproducts thing. You should have in the (from) 
> field (at the end of the line) the originating URL.

Here are the relevant lines from new.txt:
06:45:05	287/-1	---MC-	302	error ('Found')	text/html	date:Sat,%2028%20Dec%202002%2012:45:04%20GMT	www.wholesaleproducts.com/a-bestch.txt	I:/web-archive%20problematic/www.epanorama.net%2020021228/www.wholesaleproducts.com/a-bestch.txt	(from www.epanorama.net/links/videosignal.html)

06:56:59	2341/2341	---M--	200	added ('OK')	text/html	etag:%22e033-925-3b1d3af4%22	www.wholesaleproducts.com/error404.html	I:/web-archive%20problematic/www.epanorama.net%2020021228/www.wholesaleproducts.com/error404.html	(from www.wholesaleproducts.com/a-bestch.txt)

07:26:42	34/34	---M--	200	added ('OK')	image/gif	etag:%22d9a0-22-36b3633b%22	www.wholesaleproducts.com/white1.gif	I:/web-archive%20problematic/www.epanorama.net%2020021228/www.wholesaleproducts.com/white1.gif	(from www.wholesaleproducts.com/error404.html)

07:26:42	9433/9433	---M--	200	added ('OK')	image/gif	etag:%22d460-24d9-3a1b2e8c%22	www.wholesaleproducts.com/malllogo.gif	I:/web-archive%20problematic/www.epanorama.net%2020021228/www.wholesaleproducts.com/malllogo.gif	(from www.wholesaleproducts.com/error404.html)

07:26:43	3087/3087	---M--	200	added ('OK')	text/html	etag:%22d93b-c0f-3b11c66b%22	www.wholesaleproducts.com/terms.html	I:/web-archive%20problematic/www.epanorama.net%2020021228/www.wholesaleproducts.com/terms.html	(from www.wholesaleproducts.com/error404.html)

07:26:43	26127/26127	---M--	200	added ('OK')	text/html	etag:%22d3c9-660f-3d97c808%22	www.wholesaleproducts.com/	I:/web-archive%20problematic/www.epanorama.net%2020021228/www.wholesaleproducts.com/index.html	(from www.wholesaleproducts.com/error404.html)

07:26:45	9596/9596	---MC-	200	added ('OK')	text/html	date:Sat,%2028%20Dec%202002%2013:26:42%20GMT	www.wholesaleproducts.com/bin/ccdbdis.pl?merchant=wholesale	I:/web-archive%20problematic/www.epanorama.net%2020021228/www.wholesaleproducts.com/bin/ccdbdis5836.html	(from www.wholesaleproducts.com/error404.html)

09:23:00	907/907	---M--	200	added ('OK')	text/html	etag:%22d92d-38b-390f80ba%22	www.wholesaleproducts.com/subscribemsg.html	I:/web-archive%20problematic/www.epanorama.net%2020021228/www.wholesaleproducts.com/subscribemsg.html	(from www.wholesaleproducts.com/terms.html)

09:23:00	3046/3046	---M--	200	added ('OK')	image/jpeg	etag:%22c074-be6-3d48654f%22	www.wholesaleproducts.com/BGbluemarble.jpg	I:/web-archive%20problematic/www.epanorama.net%2020021228/www.wholesaleproducts.com/BGbluemarble.jpg	(from www.wholesaleproducts.com/)

etc.etc.etc.
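
For anyone who wants to follow the (from) fields mechanically rather than by eye, here is a minimal Python sketch. It assumes each new.txt entry is a single tab-separated line whose last three fields are the URL, the local file name and the "(from ...)" referrer, which matches the entries quoted above once any line wrapping is undone. parse_entry and trace are hypothetical helper names, not part of HTTrack, and the sample lines are three of the entries above with the etag/date and local-file fields shortened to "-".

```python
import re

FROM_RE = re.compile(r"\(from (.*)\)")

def parse_entry(line):
    """Return (url, referrer) for one new.txt line, or None if it does not fit."""
    fields = [f.strip() for f in line.split("\t") if f.strip()]
    if len(fields) < 3:
        return None
    match = FROM_RE.fullmatch(fields[-1])
    if match is None:
        return None
    # Assumed layout: ..., URL, local file, "(from referrer)"
    return fields[-3], match.group(1).strip()

def trace(entries, url):
    """Follow referrers from url back toward the page the crawl came from."""
    referrer_of = {}
    for entry_url, referrer in entries:
        referrer_of.setdefault(entry_url, referrer)  # keep the first hit per URL
    chain = [url]
    while True:
        referrer = referrer_of.get(chain[-1])
        if not referrer or referrer in chain:  # no known source, or a loop
            return chain
        chain.append(referrer)

# Shortened copies of three entries from the log above ("-" replaces
# the etag/date field and the local file name).
SAMPLE = [
    "06:45:05\t287/-1\t---MC-\t302\terror ('Found')\ttext/html\t-\t"
    "www.wholesaleproducts.com/a-bestch.txt\t-\t"
    "(from www.epanorama.net/links/videosignal.html)",
    "06:56:59\t2341/2341\t---M--\t200\tadded ('OK')\ttext/html\t-\t"
    "www.wholesaleproducts.com/error404.html\t-\t"
    "(from www.wholesaleproducts.com/a-bestch.txt)",
    "07:26:43\t3087/3087\t---M--\t200\tadded ('OK')\ttext/html\t-\t"
    "www.wholesaleproducts.com/terms.html\t-\t"
    "(from www.wholesaleproducts.com/error404.html)",
]

entries = [e for e in map(parse_entry, SAMPLE) if e]
print(" <- ".join(trace(entries, "www.wholesaleproducts.com/terms.html")))
```

Run on the three sample lines, this prints the chain from terms.html back through error404.html and a-bestch.txt to www.epanorama.net/links/videosignal.html, i.e. the page on the mirrored site where the crawl left the expected area.
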
> 
> Two sources of errors, I think:
> 1. external parser (swf), as the depth test is not issued 
> (this is a bug, I will fix it soon)
> 2. maybe a bug in the 'near' hack or in the filter system 
> (not very probable, though)

It looks like the Error 404 page (error404.html) is being 
parsed: the 07:26:43 26127/26127 event, where 
wholesaleproducts.com/index.html is crawled, lists it as the 
source.  I don't think that is supposed to happen.  As there 
are no SWF files here, I presume it is in the 'near' function 
as you suggest.

> > I didn't see anything unusual about it like XML.  There 
> > were links in it to the main page and some of the cgi links 
> > that were growing infinitely.
> 
> Which one(s)? Might be interesting to see what is the first 
> hit in the new.txt tracking file

The problematic cgi was:
www.wholesaleproducts.com/bin/ccdbdis.pl?(various parameters).
Also, I looked at the source of the saved local copy and it 
looks like WinHTTrack was NOT adding its usual timestamp and 
id at the top of the HTML.

Some examples:
09:24:02	9596/9596	---MC-	200	added ('OK')	text/html	date:Sat,%2028%20Dec%202002%2015:23:51%20GMT	www.wholesaleproducts.com/bin/ccdbdis.pl?merchant=wholesale&action=&ItemID=	I:/web-archive%20problematic/www.epanorama.net%2020021228/www.wholesaleproducts.com/bin/ccdbdis0542.html	(from www.wholesaleproducts.com/bin/ccdbdis.pl?merchant=wholesale)

09:23:54	34117/34117	---MC-	200	added ('OK')	text/html	date:Sat,%2028%20Dec%202002%2015:23:38%20GMT	www.wholesaleproducts.com/bin/ccdbdis.pl?merchant=wholesale&action=category_front_list&ItemID=&Category=Bretford%20Multimedi&SubCategory1=Connections%20Accessor	I:/web-archive%20problematic/www.epanorama.net%2020021228/www.wholesaleproducts.com/bin/ccdbdis432f.html	(from www.wholesaleproducts.com/)
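
One way to spot this kind of runaway CGI in a large new.txt is to count how many distinct query strings were fetched for each script. Below is a minimal Python sketch of that idea. It assumes the URLs have already been pulled out of the log as scheme-less strings, the way they appear in new.txt; query_variants and suspects are hypothetical helper names, and the threshold of 3 is arbitrary.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def query_variants(urls):
    """Map each host+path to the set of distinct query strings seen for it."""
    variants = defaultdict(set)
    for url in urls:
        parts = urlsplit("//" + url)  # new.txt URLs carry no scheme
        variants[parts.netloc + parts.path].add(parts.query)
    return variants

def suspects(urls, threshold=3):
    """Return {host+path: variant count} for pages fetched with many queries."""
    return {
        page: len(queries)
        for page, queries in query_variants(urls).items()
        if len(queries) >= threshold
    }

# URLs taken from the new.txt entries quoted in this post.
urls = [
    "www.wholesaleproducts.com/terms.html",
    "www.wholesaleproducts.com/bin/ccdbdis.pl?merchant=wholesale",
    "www.wholesaleproducts.com/bin/ccdbdis.pl?merchant=wholesale&action=&ItemID=",
    "www.wholesaleproducts.com/bin/ccdbdis.pl?merchant=wholesale"
    "&action=category_front_list&ItemID=&Category=Bretford%20Multimedi"
    "&SubCategory1=Connections%20Accessor",
]

print(suspects(urls))  # {'www.wholesaleproducts.com/bin/ccdbdis.pl': 3}
```

On a full log the count for ccdbdis.pl would presumably be far higher than for any ordinary page, which makes the growing script easy to pick out.
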

> 
> > In regards to the epcos.* problems
> > A full text search for epcos.de had zero hits in my 
> 
> Same remark: can you check the new.txt file?
> I did not see any strange things in the html file.

The first few epcos.* links in new.txt are:

04:40:36	61364/61364	---MC-	200	added ('OK')	text/html	date:Sat,%2028%20Dec%202002%2010:40:33%20GMT	www.epcos.com/excelon/servlet/excelon/components_magazine/xml/content_e.xml?xslsheet=components_magazine:/xsl/artikel.xsl&an=8&number=167&bereich=Applications	I:/web-archive%20problematic/www.epanorama.net%2020021228/www.epcos.com/excelon/servlet/excelon/components_magazine/xml/content_e5669.xml	(from www.epanorama.net/links/componentinfo.html)

04:41:00	61172/61172	---M--	200	added ('OK')	application/pdf	etag:%224145a-eef4-3ba62a91%22	www.epcos.com/inf/80/ds/e0000005.pdf	I:/web-archive%20problematic/www.epanorama.net%2020021228/www.epcos.com/inf/80/ds/e0000005.pdf	(from www.epanorama.net/links/componentinfo.html)

04:41:49	316079/316079	---M--	200	added ('OK')	application/pdf	etag:%2241457-4d2af-3ba62a91%22	www.epcos.com/inf/80/ds/e0000002.pdf	I:/web-archive%20problematic/www.epanorama.net%2020021228/www.epcos.com/inf/80/ds/e0000002.pdf	(from www.epanorama.net/links/componentinfo.html)

04:57:25	48423/48423	---MC-	200	added ('OK')	text/html	date:Sat,%2028%20Dec%202002%2010:57:22%20GMT	www.epcos.com/excelon/servlet/excelon/components_magazine/xml/content_e.xml?xslsheet=components_magazine:/xsl/index.xsl	I:/web-archive%20problematic/www.epanorama.net%2020021228/www.epcos.com/excelon/servlet/excelon/components_magazine/xml/content_ebd3c.xml	(from www.epanorama.net/links/magazine.html)

06:16:20	56122/56122	---MC-	200	added ('OK')	text/html	date:Sat,%2028%20Dec%202002%2012:16:17%20GMT	www.epcos.com/excelon/servlet/excelon/components_magazine/xml/content_e.xml?xslsheet=components_magazine:/xsl/artikel.xsl&an=8&number=165&bereich=Company	I:/web-archive%20problematic/www.epanorama.net%2020021228/www.epcos.com/excelon/servlet/excelon/components_magazine/xml/content_e8734.xml	(from www.epanorama.net/links/surge.html)

06:16:46	1833/1833	---M--	200	added ('OK')	application/x-javascript	etag:%228e945-729-3c2c667c%22	www.epcos.com/share/all/js/browser.js	I:/web-archive%20problematic/www.epanorama.net%2020021228/www.epcos.com/share/all/js/browser.js	(from www.epcos.com/excelon/servlet/excelon/components_magazine/xml/content_e.xml?xslsheet=components_magazine:/xsl/artikel.xsl&an=8&number=167&bereich=Applications)

06:16:47	10784/10784	---M--	200	added ('OK')	application/x-javascript	etag:%228e790-2a20-3decb3bb%22	www.epcos.com/share/all/js/epcos_main.js	I:/web-archive%20problematic/www.epanorama.net%2020021228/www.epcos.com/share/all/js/epcos_main.js	(from www.epcos.com/excelon/servlet/excelon/components_magazine/xml/content_e.xml?xslsheet=components_magazine:/xsl/artikel.xsl&an=8&number=167&bereich=Applications)

06:16:48	138/138	---M--	200	added ('OK')	application/x-javascript	etag:%224d556-8a-3c2c678c%22	www.epcos.com/web/components_magazine/js/components.js	I:/web-archive%20problematic/www.epanorama.net%2020021228/www.epcos.com/web/components_magazine/js/components.js	(from www.epcos.com/excelon/servlet/excelon/components_magazine/xml/content_e.xml?xslsheet=components_magazine:/xsl/artikel.xsl&an=8&number=167&bereich=Applications)

At this point numerous additional epcos links are added 
similar to the ones above.  You can download the whole 
new.txt file and a few of the other html pages from 
wholesaleproducts.com here:
<http://kazemizadeh.net/httrack/epanorama.com/>
(if you want access to the incompletely mirrored copy of 
this site let me know).

> > there are several links to epcos here, 
> > perhaps httrack's PDF module is confused?
> 
> Nope - pdf files aren't parsed at all
> 
> > www.epanorama.net/links/magazine.html  (some epcos links)
> 
> Yes with .xml extension; but the XML file should be 
> treated as regular binary file (not parsed)

I'd say some parsing is happening, otherwise I see little 
reason for the epcos folders to fill up with xml files, and 
for there to be entries in new.txt showing the XML files as 
the source link.
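
A quick way to check this against the log is to collect every "(from ...)" referrer and count the ones whose path ends in .xml. If the XML pages really were treated as opaque binary files, one would expect that count to be zero. Here is a rough Python sketch of the check. It assumes referrers never contain whitespace or ')' (new.txt escapes spaces as %20), so any whitespace introduced by line wrapping can simply be removed; xml_referrers is a made-up name.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

FROM_RE = re.compile(r"\(from\s+([^)]*)\)")

def xml_referrers(log_text):
    """Count '(from ...)' referrers whose path ends in .xml."""
    counts = Counter()
    for raw in FROM_RE.findall(log_text):
        ref = "".join(raw.split())         # undo any line wrapping
        path = urlsplit("//" + ref).path   # new.txt URLs have no scheme
        if path.lower().endswith(".xml"):
            counts[ref] += 1
    return counts
```

Calling xml_referrers on the contents of hts-cache/new.txt and printing counts.most_common() would list each XML page that acted as a link source, together with the number of URLs discovered from it.
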

> 
> > www.epanorama.net/links/surge.html  (some epcos links)
> 
> The 
> <http://www.epcos.com/excelon/servlet/excelon/components_magazine/xml/content_e.xml?xslsheet=components_magazine:/xsl/artikel.xsl&an=8&number=165&bereich=Company>
> link seems to cause a timeout (also in IE)
> 
> > <BR/><h4><a href='&#10;			
> > The local copy processed by Httrack is written like this 
> > with many underscores '_' replacing spaces/tabs/etc in
> 
> Right - I will remove in the future explicit (&#10;) 
> control chars; but it is rather strange that the urls 
> contain ctrl characters anyway

Does that include removing the tabs and spaces too?  I 
think always removing tabs and spaces will be problematic 
for URLs that require them in the form of escaped '%20's.  
Making this work right might need some minor XML parsing or 
a way to figure out whether the spaces ought to be stripped 
or converted to '%20's.  Perhaps the solution is to only 
strip out spaces/tabs/etc from the beginning of a URL.
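
To make the suggestion concrete, here is a minimal Python sketch of that kind of clean-up. HTTrack itself is written in C, so this only illustrates the idea and is not its actual code. clean_href is a hypothetical name, and the policy shown (trim whitespace at both ends, drop embedded tabs and newlines, escape embedded spaces as %20) is just one possible choice, not something settled in this thread.

```python
import html
import re

def clean_href(raw):
    """Normalise a raw href attribute value before resolving it."""
    url = html.unescape(raw)            # "&#10;" becomes a real newline
    url = url.strip(" \t\r\n\f")        # drop leading/trailing whitespace
    url = re.sub(r"[\t\r\n]", "", url)  # drop tabs/newlines inside the URL
    return url.replace(" ", "%20")      # keep inner spaces, but escaped

# www.example.com is a placeholder host, not a URL from the mirrored site.
print(clean_href("&#10;\t\t\thttp://www.example.com/some page.html"))
# http://www.example.com/some%20page.html
```

With this policy the leading &#10; and tabs disappear instead of turning into underscores, while a space that is genuinely part of the path survives as %20.
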

> > Finally, I know I can use filters to exclude the epcos.* 
> > and wholesaleproducts.com sites, but I think they should 
> > have been automatically excluded as part of the default 
> > behavior.
> 
> Absolutely - the only thing is to find the reason why the 
> engine crawls them :)
> 
> >  I've seen this behavior before, but this is the 
> > first time I've dug into it this deeply to identify the 
> > cause.
> 
> So let's dig :)

The steam-shovel is standing by... ;)
 