HTTrack Website Copier
Free software offline browser - FORUM
Subject: problems with yahoo
Author: hollow.quincy
Date: 12/04/2011 23:31
 
Hi, I would like to crawl the Yahoo portal, so I use this command:

httrack http://www.yahoo.com -O "/home/user/HTTRACK/yahoo" "*yahoo.com/*" -s0 -r10
-s0 - means do not respect robots.txt
-r10 - depth 10
After a few seconds I get a log like this:

HTTrack3.43-9+libhtsjava.so.2 launched on Sun, 06 Nov 2011 16:15:53 at
http://www.yahoo.com *yahoo.com/*
(httrack http://www.yahoo.com -O /home/marek/HTTRACK/yahoo *yahoo.com/* -s0
-r10
Information, Warnings and Errors reported for this mirror:
note:    the hts-log.txt file, and hts-cache folder, may contain sensitive
information,
    such as username/password authentication for websites mirrored in this
project
    do not share these files/folders if you want these information to remain
private

16:15:54    Error:     "Unable to get server's address: No such file or
directory" (-5) after 2 retries at link *yahoo.com/* (from primary/primary)

HTTrack Website Copier/3.43-9 mirror complete in 1 seconds : 4 links scanned,
1 files written (78 bytes overall) [686 bytes received at 686 bytes/sec], 78
bytes transfered using HTTP compression in 1 files, ratio 132%
(1 errors, 0 warnings, 0 messages)

I think this is because of redirections. What should I do to crawl _only_ Yahoo
web pages? (I shouldn't use the filter "*yahoo*", because the word "yahoo" can
appear in a GET parameter of some other site's URL, for example.)
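
To show what I mean about "*yahoo*" being too broad, here is a small sketch. It
uses plain shell glob matching only as a stand-in for wildcard filters (HTTrack's
own filter matching may differ in details), and glob_match, the example.com URL
and the "*.yahoo.com/*" pattern are made up for the illustration:

```shell
#!/bin/sh
# Illustration only: shell 'case' globbing stands in for wildcard URL filters.
# glob_match is a hypothetical helper, not part of HTTrack.
glob_match() {
    # $1 = wildcard pattern, $2 = URL; prints "match" or "no match"
    case "$2" in
        $1) echo "match" ;;
        *)  echo "no match" ;;
    esac
}

# The broad pattern accepts a Yahoo page, but also a foreign URL
# that merely carries "yahoo" in its query string.
glob_match '*yahoo*' 'http://www.yahoo.com/news/'                # match
glob_match '*yahoo*' 'http://example.com/search?q=yahoo'         # match

# A pattern tied to the host name rejects the foreign URL.
glob_match '*.yahoo.com/*' 'http://www.yahoo.com/news/'          # match
glob_match '*.yahoo.com/*' 'http://example.com/search?q=yahoo'   # no match
```

So what I am looking for is a filter that is tied to the host name rather than
to the word "yahoo" appearing anywhere in the URL.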

Thank you for your help.
 