HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: How to gracefully kill a crawl
Author: Alain Desilets
Date: 02/28/2012 18:47
 
> If I pause httrack as above and then try to unzip
> the new.zip file, I get:
> 
>    "End-of-central-directory signature not found"
> 
> I can unzip the old.zip file, but not the new.zip
> one. And unfortunately, I need to access the
> original content as it was stored in the new.zip
> file.
> 
> Any suggestion as to how to circumvent this
> problem?
Just thought of one possible solution, but I'm wondering if there's an
easier way.

I could wrap httrack inside a script that repeatedly calls it with:

   --max-time 60 --continue

In other words, continue the previous crawl for 60 seconds. That way, if the
wrapper script is interrupted, httrack will exit gracefully after a maximum of
60 seconds.
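
For instance, something along these lines in Python (untested; the start
URL and project directory are placeholders, not from my actual setup):

   import subprocess

   # Keep resuming the crawl in 60-second slices. Each httrack pass
   # exits cleanly on its own, so stopping this loop between passes
   # never leaves the cache half-written. URL and project directory
   # below are placeholders.
   while True:
       result = subprocess.run(
           ["httrack", "http://example.com/", "-O", "mirror",
            "--max-time", "60", "--continue"])
       if result.returncode != 0:
           break  # stop looping if httrack reports an error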

The problem I see with this is that new.txt, old.txt, new.zip and old.zip
will contain at most the URLs that were downloaded in the last two
60-second crawls. So my wrapper would have to do a bit of work to
concatenate those into cumulative files, say old-cumulative.txt and
old-cumulative.zip.
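
The accumulation step might look roughly like this (also untested; I'm
assuming the per-pass cache files end up in mirror/hts-cache/, and the
cumulative file names are just the ones I made up above):

   import shutil
   import zipfile

   # Append this pass's URL list to the cumulative list.
   with open("old-cumulative.txt", "a") as cum, \
        open("mirror/hts-cache/new.txt") as this_pass:
       shutil.copyfileobj(this_pass, cum)

   # Copy this pass's cache entries into the cumulative zip, keeping
   # the first stored copy of each entry.
   with zipfile.ZipFile("old-cumulative.zip", "a") as cum_zip, \
        zipfile.ZipFile("mirror/hts-cache/new.zip") as pass_zip:
       seen = set(cum_zip.namelist())
       for name in pass_zip.namelist():
           if name not in seen:
               cum_zip.writestr(name, pass_zip.read(name))

That way old-cumulative.zip would grow into one valid archive holding
everything fetched so far, instead of depending on a new.zip that may
have been truncated mid-write.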

Is there an easier way to get to my goal than this? It's not hard to
implement, but it's still not a negligible amount of work.

Alain



 