HTTrack Website Copier
Free software offline browser - FORUM
Subject: Creating a web dataset with httrack
Author: Sami
Date: 02/27/2012 09:00
 
Hi, I want to create a dataset for my research on focused web crawling. I
have chosen "American Football" as the domain.
A collection of 5000 HTML pages is enough for my work.

Under the 'Experts Only' options I set 'Global travel mode' to 'go everywhere on
the web' so that it does a generic crawl (for creating the dataset). I left all
the other options at their defaults.

I did a sample crawl with httrack starting with these seed URLs, but in the
result I can't get beyond the home pages of those seed URLs: the links from
there go to the live site online. (The download is about 400 MB, excluding the
log file, which is relatively large.) I checked the downloaded folder structure
and I see a lot of files with the 'htm.tmp' extension. What are they? If I
rename them, I can see that they contain page content.
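
A minimal sketch for taking stock of a mirror folder like the one described
above, assuming Python 3 is available. The function name `tally_mirror` and the
three groups it reports are hypothetical, chosen for illustration, and are not
part of httrack. It only counts files by extension, so it says nothing about
what the '.tmp' files actually are:

```python
import os
import sys
from collections import Counter


def tally_mirror(root):
    """Count files under root, grouped by extension (illustrative grouping)."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            lower = name.lower()
            if lower.endswith(".tmp"):
                # e.g. the 'htm.tmp' files mentioned in the post
                counts["tmp"] += 1
            elif lower.endswith((".html", ".htm")):
                counts["html"] += 1
            else:
                counts["other"] += 1
    return counts


if __name__ == "__main__":
    # Pass the mirror folder as the first argument; defaults to the current one.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for kind, n in sorted(tally_mirror(root).items()):
        print(f"{kind}: {n}")
```

Running it on the download folder gives a quick comparison of finished HTML
pages against '.tmp' files, which helps judge how far the crawl is from the
5000-page target.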

What is the right way to collect a solid dataset with httrack?
Thanks.
 