HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: More Spider Behavior Options
Author: Xavier Roche
Date: 02/25/2001 11:33
 
Well, I started to make an "HTTrack library" (the lib/
folder), but this project was discontinued due to lack
of time. For now, you can control which pages have to
be parsed, which pages have to be captured, and some
other interesting things (user-defined HTML processing,
such as linguistic analysis).
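
To give the idea, here is a minimal sketch of what such a
user-defined filter could look like; the function name and
signature are purely illustrative, not the actual lib/
interface:

    /* Hypothetical capture-filter callback: return non-zero
       to let the engine fetch and parse the page, zero to
       skip it. The real lib/ callback names may differ. */
    #include <string.h>

    int my_check_link(const char *url, int depth) {
        /* example policy: only crawl pages under /docs/,
           and no deeper than 3 levels */
        if (depth > 3)
            return 0;
        return strstr(url, "/docs/") != NULL;
    }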

A solution here might be a supplemental handler, a sort of
"give me more links or I'll stop", through which you would
feed URLs to the engine step by step.
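
Roughly, such a handler could look like this (again, the
name and signature are hypothetical, just to show the idea
of feeding URLs one at a time):

    /* Hypothetical "more links" handler: called when the
       engine has drained its queue. Return the next URL to
       crawl, or NULL to let the mirror finish. */
    #include <stdio.h>
    #include <string.h>

    const char *my_more_links(void) {
        static char buf[1024];
        /* read the next URL from stdin, one per line */
        if (fgets(buf, sizeof(buf), stdin) == NULL)
            return NULL;        /* no more input: stop */
        buf[strcspn(buf, "\r\n")] = '\0';
        return buf[0] ? buf : NULL;
    }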

But one problem would result: links take some memory. With
1,000,000 links, for example, you can have >100MB of memory
used, because HTTrack does not only "spider" but also
"remembers": it stores every link it has seen, so that it
does not take the same link twice, along with its saved
location. Each link does not take much memory (about 100
bytes), but with a huge crawl this can of course be a
problem!
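
As a rough sketch of the arithmetic, and of what
"remembering" a link costs (field sizes are illustrative,
not HTTrack's actual structures):

    /* Back-of-the-envelope cost of one remembered link:
       the URL string, the saved location, and a pointer
       for the lookup structure - around 100 bytes total. */
    #include <stdio.h>

    struct seen_link {
        char *url;              /* address, ~60 bytes avg  */
        char *local_path;       /* saved location, ~30     */
        struct seen_link *next; /* hash-chain pointer, 4-8 */
    };

    int main(void) {
        long links = 1000000L;
        long bytes_per_link = 100;  /* rough figure */
        printf("%ld links ~ %ld MB\n", links,
               links * bytes_per_link / (1024L * 1024L));
        return 0;
    }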

To summarize, I think such cases might be handled through
some supplemental library hacks - maybe I'll have to
continue the library once I have finished 3.00 :)
 