Well, I started to make an "HTTrack library" (the lib/
folder) but this project was discontinued due to lack
of time. For now, you can control which pages have to
be parsed, which pages have to be captured, and some
other interesting things (user-defined HTML processing,
such as some linguistic analysis).
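Just to picture the idea, here is a minimal sketch of what such user control could look like, using hypothetical callback names (this is not the actual lib/ API, only an illustration of the "decide what to parse/capture and post-process the HTML" concept):

#include <stdio.h>
#include <string.h>

/* Hypothetical callback types: one to accept/refuse a page,
   one to run user-defined processing on the downloaded HTML. */
typedef int  (*check_link_t)(const char *url);      /* 1 = take it, 0 = skip */
typedef void (*process_html_t)(const char *url,
                               const char *html, size_t len);

struct engine_callbacks {
    check_link_t   check_link;
    process_html_t process_html;   /* e.g. linguistic analysis */
};

/* Example user callbacks */
static int my_check(const char *url) {
    return strstr(url, "/private/") == NULL;   /* skip a private area */
}
static void my_process(const char *url, const char *html, size_t len) {
    printf("got %s (%zu bytes of HTML)\n", url, len);
}

int main(void) {
    struct engine_callbacks cb = { my_check, my_process };
    /* The engine would call cb.check_link() / cb.process_html()
       for every page it encounters; here we just simulate one page. */
    const char *demo = "<html>hello</html>";
    const char *url  = "http://example.com/index.html";
    if (cb.check_link(url))
        cb.process_html(url, demo, strlen(demo));
    return 0;
}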
A solution here might be a supplemental handler ("give
me more links or I'll stop") through which you would
feed all the URLs step by step.
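Again only as a sketch of the idea (such a handler does not exist yet, the names are made up): the engine would call a feeder function whenever its queue runs dry, and a NULL return would mean "stop here".

#include <stdio.h>

static const char *pending[] = {
    "http://example.com/a.html",
    "http://example.com/b.html",
    NULL
};
static int next_index = 0;

/* Called by the engine when it has no more URLs; NULL = stop. */
static const char *more_links_handler(void) {
    return pending[next_index] ? pending[next_index++] : NULL;
}

int main(void) {
    const char *url;
    /* Simulated engine loop: keep asking for links until told to stop. */
    while ((url = more_links_handler()) != NULL)
        printf("engine would now fetch: %s\n", url);
    printf("no more links, mirror stops\n");
    return 0;
}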
But a problem will result: links take some memory; for
example, with 1,000,000 links you can have more than
100 MB of memory used, because HTTrack does not only
"spider" but also "remembers": it keeps every link it
has stored so that it does not fetch the same link
twice, and it remembers where each one was saved. Each
link does not take much memory (about 100 bytes), but
of course with a huge crawl this can be a problem!
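A quick back-of-the-envelope illustration of that point (the record layout below is only an assumption about what roughly has to be remembered per link, not HTTrack's real structure):

#include <stdio.h>

struct link_record {            /* roughly what must be kept per link */
    char url[64];               /* the address, so it is not fetched twice */
    char saved_path[32];        /* where the local copy was stored */
    unsigned int hash_next;     /* duplicate-lookup chaining */
};                              /* about 100 bytes in total */

int main(void) {
    const unsigned long links = 1000000UL;
    unsigned long bytes = links * sizeof(struct link_record);
    printf("per link       : %zu bytes\n", sizeof(struct link_record));
    printf("1,000,000 links: %.1f MB\n", bytes / 1e6);
    return 0;
}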
To summarize, I think such cases might be handled
through some supplemental library hacks; maybe I'll
have to continue the library once I have finished the
3.00 :)