Hi, I want to create a dataset for my research on focused web crawling. I
have chosen "American Football" as the domain.
A collection of 5000 HTML pages is enough for my work.
In the 'Experts Only' options, I set the 'Global travel mode' to 'go
everywhere on the web' so that it does a generic crawl (for creating the
dataset). I left all the other options at their defaults.
I did a sample crawl with HTTrack starting with these seed URLs, but the crawl
doesn't go beyond the home pages of those seed URLs, even though it does
connect to the web (the download is about 400 MB, excluding the log file,
which is relatively large). I checked the downloaded folder structure and I
see a lot of files with an 'htm.tmp' extension. What are they? If I rename
them, I can see that they have content.
What is the right way to collect a solid dataset with HTTrack?
Thanks.