HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: How to mirror pages containing string(s)?
Author: appyface
Date: 06/30/2010 14:32
 
Yes you are right, time and bandwidth are not saved.  "My bad" -- I must have
needed more coffee! 

The string I wish to scan for is different each time I'm interested in
downloading from this site.  It's an academic database and I'm searching for
pages mentioning a specific keyword, to use offline.  I just don't do this
often enough to warrant keeping a full mirror of the site on my local disk and
excluding it from my own backups, etc., though I could.  For me, I really
don't mind crawling the entire database each time I wish to download a few
thousand pages matching my selected keywords.

FWIW I had been using "ItSucks" to do this task -- it has the ability to crawl
text pages with search string(s) and download only those matching the string
along with the other criteria.  It is usually fast and reasonably efficient.
Unfortunately, this site has at least tripled in size since the last time I used
ItSucks to crawl it.  ItSucks is exhausting the Java heap (1.9G) before it can
finish.  I've filed a bug ticket with the developer in the hope it will be fixed
soon...

So, I've had HTTrack downloading all .html pages from the entire site for the
last day.  It finished during the night with the error message:

"Too many URLs, giving up..(>100000)
To avoid that: use #L option for more links (example: -#L1000000)"

I've just restarted the run in update mode with -#L9999999; I hope that will
do it.

FYI the .html pages downloaded so far occupy just under 13G on disk. I've not
downloaded all of the site's .html pages before, so I don't know what the
finished size will be.  I've got 40G left, I hope it's enough :-)

I realize my specific task is not the intended use of HTTrack.  But HTTrack is
proving to be very efficient and it is very nice to interact with.  Not
having the search string is not the end of the world :-) Perhaps one day
the developer might make text search an option.
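
In the meantime, once the .html pages are on disk, the keyword filtering could
be done locally after the crawl.  Here is a minimal sketch in Python, assuming a
plain case-insensitive substring match is good enough; find_matching_pages is
just a name I made up for illustration (it is not part of HTTrack), and the
mirror directory is whatever path the project was saved to:

```python
import os


def find_matching_pages(root, keywords, extensions=(".html", ".htm")):
    """Yield paths of pages under `root` whose text contains any keyword.

    Matching is a simple case-insensitive substring test on the raw file
    contents (markup included), which is an assumption of this sketch.
    """
    needles = [k.lower() for k in keywords]
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Only look at the page types that were mirrored.
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            try:
                # errors="replace" keeps odd encodings from aborting the scan.
                with open(path, "r", encoding="utf-8", errors="replace") as fh:
                    text = fh.read().lower()
            except OSError:
                # Unreadable file: skip it rather than stop the whole run.
                continue
            if any(needle in text for needle in needles):
                yield path
```

Calling list(find_matching_pages("mirror", ["my keyword"])) would then give the
pages worth keeping, where "mirror" and the keyword are placeholders.  Each file
is read once, so memory use depends on the largest single page rather than on
the size of the whole mirror.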

Thank you again for your help.  Kind regards,
--appyface
 