> Loop for all the urls:
> httrack -I -w -D -r5 -M1000000 -F <useragent>
> -A100000 -G10000000 -m50000 -O myweb/<url> <url>
> The result is that some pages download fine
> while others do not.
While every site is potentially different, some settings usually work.
The -r5 does not mirror one page; it mirrors many. If you want one page, the
supporting files are at level two, so use -r2.
Add the near flag -n so all supporting files, wherever they are hosted, are
captured.
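As a sketch of the two points above (not a drop-in command; `<url>` stays a
placeholder), a single-page capture might look like:

```shell
# Sketch: grab one page plus its supporting files.
# -r2 : recursion depth 2 (the page itself, plus the files it references)
# -n  : "get near" files -- fetch supporting files even when hosted elsewhere
httrack "<url>" -O "myweb/<url>" -r2 -n
```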
Add -%P so links inside JavaScript are found.
On some sites you might want to override robots.txt with -s0.
-I, -M, and -A are unnecessary, and the -G probably is too.
-m capped at only 50k? Lots of HTML files are bloated with inline JavaScript
now, making them bigger than 100k:
07/02/2010 01:17 PM 121,545 index-11.html
07/02/2010 01:17 PM 114,532 index-14.html
07/02/2010 04:52 PM 114,433 index-17.html
07/02/2010 08:27 PM 110,231 index14.html
I always run with -x so you know where the mirror ends.
I never use an HTTrack browser ID.
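Putting the suggestions above together, a trimmed command might look like the
following. This is a sketch under the assumptions discussed, not a definitive
recipe; `<url>` remains a placeholder and the -m value is an assumed example:

```shell
# Sketch combining the advice in this reply:
# -r2       : one page plus its supporting files (level two)
# -n        : capture "near" supporting files wherever they are hosted
# -%P       : extended parsing, so links inside JavaScript are found
# -s0       : ignore robots.txt (use judgment per site)
# -x        : replace external links with error pages, so you can see
#             where the mirror ends
# -m1000000 : raise the file-size cap well above 100k pages (assumed value)
httrack "<url>" -w -O "myweb/<url>" -r2 -n -%P -s0 -x -m1000000
```

The -I, -M, -A, and -G options from the original command are dropped, per the
notes above.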
<http://www.httrack.com/html/fcguide.html>