HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Repost : crawling order (can't see response)
Author: Xavier Roche
Date: 08/12/2002 07:27
 
[darn, the database is really fked up..]

> When httrack pulls the first file, does it then traverse 
> the tree on the web site visiting all the top level links 
> first, then second level links, or does it follow all the 
> way down the tree first until it reaches the boundary of 
> the site, then comes back up one level.

The crawler descends the site layer by layer, that is, 
breadth-first: it takes ALL links that can be reached with 
"one mouse click" from the primary URLs (the addresses you 
typed to crawl), then all links that can be reached with 
"two mouse clicks", and so on.
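
To make the "one click, then two clicks" order concrete, here is a minimal sketch of a layer-by-layer traversal. It is not HTTrack's actual code: the function name crawl_order and the links dictionary (a stand-in for fetching and parsing real pages) are assumptions made up for this illustration.

```python
from collections import deque

def crawl_order(start_urls, links):
    """Return (url, clicks) pairs in layer-by-layer (breadth-first) order.

    `links` is a hypothetical dict mapping each URL to the URLs it links to;
    a real crawler would download and parse each page instead.
    """
    start_urls = list(dict.fromkeys(start_urls))  # drop duplicates, keep order
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    order = []
    while queue:
        url, clicks = queue.popleft()  # oldest entry first: finish a layer before the next
        order.append((url, clicks))
        for target in links.get(url, []):
            if target not in seen:  # links back to already-known pages are skipped
                seen.add(target)
                queue.append((target, clicks + 1))
    return order

# Toy site: "/a" has a "top" link back to "/", and "/a/1" links across to "/b".
site = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/"],
    "/b": ["/b/1"],
    "/a/1": ["/b"],
}
print(crawl_order(["/"], site))
```

In this toy site, both one-click pages ("/a" and "/b") come out before any two-click page, and the "top" link from "/a" back to "/" does not cause a second visit.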

Of course, depending on the site structure, this can 
produce behaviour you might not expect: for example, the 
crawl can go back to "upper" structures through "top" 
links, or the engine can follow links that visitors rarely 
use because they are hidden or written in a very small font 
size.

Anyway, this is generally the desired behaviour.
 


All articles

Subject Author Date
Repost : crawling order (can't see response)

08/11/2002 22:48
Re: Repost : crawling order (can't see response)

08/12/2002 07:27



