Creating a web dataset with httrack - HTTrack Website Copier Forum

Subject: Creating a web dataset with httrack

Author: Sami

Date: 02/27/2012 09:00

Hi, I want to create a dataset for my research on focused web crawling. I
have chosen "American Football" as the domain.
A collection of 5000 HTML pages is enough for my work.

Under the 'Experts Only' options I set 'Global travel mode' to 'go everywhere on
the web' so that it does a generic crawl (for creating the dataset). I left all
other options at their defaults.
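
For reference, the same settings can be written down as a command line. This is
only a rough sketch: the helper name, the example seed URL, the output folder
and the depth value are assumptions for illustration, not values from my crawl.
It relies on the documented httrack options -O (output path), -e (go everywhere
on the web) and -rN (mirror depth), and it only builds the argument list; it
does not start a crawl.

```python
from typing import List


def build_httrack_command(seed_urls: List[str], output_dir: str, depth: int) -> List[str]:
    """Build an argument list for the httrack command-line tool.

    Hypothetical helper: it only assembles the arguments and does not run httrack.
    """
    if not seed_urls:
        raise ValueError("at least one seed URL is required")
    return [
        "httrack",
        *seed_urls,
        "-O", output_dir,  # folder for the mirror and the log files
        "-e",              # travel mode: go everywhere on the web
        f"-r{depth}",      # maximum mirror depth (assumed value, tune as needed)
    ]


if __name__ == "__main__":
    # Placeholder seed URL and output folder, not the ones used in the crawl above.
    cmd = build_httrack_command(["http://www.example.com/"], "./dataset", 3)
    print(" ".join(cmd))
```

The list can be passed to subprocess.run(cmd) once the seed URLs and depth are
filled in.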

I did a sample crawl with httrack starting from my seed URLs, but it does not
go beyond the home pages of those seed URLs, even though it does connect to the
web (the download is about 400 MB, excluding the log file, which is
relatively large). I checked the downloaded folder structure and I see a lot
of files with an 'htm.tmp' extension. What are they? If I rename them, I can see
that they contain page content.
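
To see how far the download is from the 5000-page target, the files in the
mirror folder can be counted by extension. This is a minimal sketch: the
function name is made up for illustration, and it makes no assumption about
what the '.tmp' files are; it only reports how many there are next to the
regular .html/.htm files.

```python
from collections import Counter
from pathlib import Path


def count_mirror_files(mirror_dir: str) -> Counter:
    """Count files under mirror_dir, grouped as 'html', 'tmp' or 'other'."""
    counts: Counter = Counter()
    for path in Path(mirror_dir).rglob("*"):
        if not path.is_file():
            continue
        name = path.name.lower()
        if name.endswith(".tmp"):
            # Covers names such as 'index.htm.tmp'
            counts["tmp"] += 1
        elif name.endswith((".html", ".htm")):
            counts["html"] += 1
        else:
            counts["other"] += 1
    return counts
```

Calling count_mirror_files on the download folder returns a Counter whose
"html" and "tmp" entries can be compared against the number of pages needed.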

What is the way to collect a solid dataset with httrack?
Thanks.
