> You didn't use a filter. You used TWO URLs (www.yahoo.com and *yahoo.com/*).
> The latter is not a valid URL.
> Try "+*.yahoo.com/*"
I tried your trick, but I still have problems. My command is:
httrack http://www.yahoo.com/index.html -O "/home/HTTRACK/yahoo"
+*.yahoo.com/* -* +mime:text/html -s0 -r10 -M100000000 -E600 -%l en -F
"Mozilla/5.0 (Windows NT 6.0; WOW64; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
Options:
http://www.yahoo.com/index.html - this is the first problem: when I use just
http://www.yahoo.com/ I get only one page and that's the end. (But not
every page has index.html; should I add it to every domain I want to
crawl?)
+*.yahoo.com/* - only crawl pages from the Yahoo website,
-* +mime:text/html - I want to download only HTML pages (not images and
other files),
-s0 - don't worry about robots.txt,
-r10 - crawl very deep (mirror depth of 10),
-M100000000 - download at most 100 megabytes,
-E600 - limit how long HTTrack can run (I want 10 hours, but the value is
in seconds, so 600 is only 10 minutes; 10 hours would be -E36000),
-%l en - I would like to download English pages first (should I use
quotation marks, -%l "en"?),
-F "Mozilla/5.0 (Windows NT 6.0; WOW64; rv:8.0.1) Gecko/20100101
Firefox/8.0.1" - tell the website that the browser is Firefox.
How should my httrack command look?
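My current best guess is below. I'm assuming that scan rules are applied so
that the last matching rule wins (that's how I read the HTTrack filter docs,
which would mean my "-*" after "+*.yahoo.com/*" was throwing everything
away), that -E counts seconds, and that the filters should be quoted so the
shell doesn't try to expand the * characters:

httrack http://www.yahoo.com/ -O "/home/HTTRACK/yahoo" \
  "-*" "+*.yahoo.com/*" "+mime:text/html" \
  -s0 -r10 -M100000000 -E36000 -%l en \
  -F "Mozilla/5.0 (Windows NT 6.0; WOW64; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"

Is that ordering right, and will +mime:text/html still be restricted to
yahoo.com URLs, or does it let in HTML pages from other sites too?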
Thank you for your help!