Hi, I would like to crawl the Yahoo portal, so I use this command:
httrack http://www.yahoo.com -O "/home/user/HTTRACK/yahoo" "*yahoo.com/*" -s0 -r10
-s0 means do not respect robots.txt
-r10 means a mirror depth of 10
After a few seconds I get a log like this:
HTTrack3.43-9+libhtsjava.so.2 launched on Sun, 06 Nov 2011 16:15:53 at http://www.yahoo.com *yahoo.com/*
(httrack http://www.yahoo.com -O /home/marek/HTTRACK/yahoo *yahoo.com/* -s0 -r10
Information, Warnings and Errors reported for this mirror:
note: the hts-log.txt file, and hts-cache folder, may contain sensitive information,
such as username/password authentication for websites mirrored in this project
do not share these files/folders if you want these information to remain private
16:15:54 Error: "Unable to get server's address: No such file or directory" (-5) after 2 retries at link *yahoo.com/* (from primary/primary)
HTTrack Website Copier/3.43-9 mirror complete in 1 seconds : 4 links scanned, 1 files written (78 bytes overall) [686 bytes received at 686 bytes/sec], 78 bytes transfered using HTTP compression in 1 files, ratio 132%
(1 errors, 0 warnings, 0 messages)
I think this is caused by redirects. What should I do to crawl _only_ Yahoo
web pages? (I shouldn't use the filter "*yahoo*", because the word "yahoo" can
appear in a GET parameter, for example.)
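To make the concern about an overly broad filter concrete, here is a small sketch using shell `case` globbing. It only approximates how HTTrack evaluates its wildcards, and both the `matches` helper and the example.org URL are hypothetical, made up for illustration.

```shell
# Sketch only: shell globbing stands in for HTTrack's wildcard matching,
# which may differ in detail. "matches" is a hypothetical helper.
matches() {
  # $1 = glob pattern, $2 = URL to test
  case "$2" in
    $1) echo "match" ;;      # unquoted $1 is interpreted as a glob pattern
    *)  echo "no match" ;;
  esac
}

# A foreign site that merely mentions "yahoo" in a GET parameter (made-up URL).
matches '*yahoo*'      'http://example.org/search?q=yahoo'   # match
matches '*yahoo.com/*' 'http://example.org/search?q=yahoo'   # no match
matches '*yahoo.com/*' 'http://www.yahoo.com/news'           # match
```

The narrower pattern rejects the foreign URL here, although a pattern without an anchor on the host part could still match a URL that carries "yahoo.com/" inside its query string.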
Thank you for your help.
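One thing that may be worth trying, going by the log: the error is reported "at link *yahoo.com/*", which suggests HTTrack read the pattern as a second URL to fetch rather than as a filter. HTTrack scan rules are normally written with a leading `+` (include) or `-` (exclude), so a variant of the command with `+*yahoo.com/*` might avoid the error. The sketch below assumes that this is the cause, which is not confirmed. It only prints the command for review and does not start a mirror.

```shell
# Assumption: the pattern needs a leading '+' to be treated as an include
# scan rule instead of a start URL. Not verified against this HTTrack version.
url='http://www.yahoo.com'
out='/home/user/HTTRACK/yahoo'   # output directory from the original command
filter='+*yahoo.com/*'           # same pattern, prefixed with '+'

# Print the command for review; remove "echo" to actually run the mirror.
# The printed form drops the quotes, so quote the filter when running it by hand.
echo httrack "$url" -O "$out" "$filter" -s0 -r10
```

Keeping the filter in a quoted variable also prevents the shell from expanding the `*` characters before HTTrack sees them.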