| My settings for downloading a Tumblr blog with HTTrack
Title: myhostname
URL: <http://myhostname.tumblr.com>
URL list (.txt): I first extracted the links to the raw images on
<http://data.tumblr.com/> (which I rewrote to start with
<http://s3.amazonaws.com/data.tumblr.com/>) and to the audio and video on
<https://a.tumblr.com/>, <https://www.tumblr.com/audio_file/myhostname> and
<https://vt.tumblr.com>, using TumblrThree, an application for ripping Tumblr
media. It extracted the links with the https protocol, so I used Notepad++ to
change them all to start with http, and then put all those links in a single
txt file.
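The link rewriting described above was done by hand in Notepad++; a minimal Python sketch of the same two substitutions might look like this. The function names and the de-duplication step are my own assumptions for illustration, not something TumblrThree or HTTrack provides.

```python
def rewrite_link(url: str) -> str:
    """Apply the two manual edits: https -> http, and
    data.tumblr.com -> s3.amazonaws.com/data.tumblr.com."""
    url = url.strip()
    # Downgrade only the scheme prefix, leaving the rest of the URL alone.
    if url.startswith("https://"):
        url = "http://" + url[len("https://"):]
    # Route raw-image links through the S3 host instead.
    prefix = "http://data.tumblr.com/"
    if url.startswith(prefix):
        url = "http://s3.amazonaws.com/data.tumblr.com/" + url[len(prefix):]
    return url


def build_url_list(links):
    """Rewrite every link, skipping blanks and duplicates (an added
    convenience, not part of the original manual process)."""
    seen = set()
    result = []
    for link in links:
        fixed = rewrite_link(link)
        if fixed and fixed not in seen:
            seen.add(fixed)
            result.append(fixed)
    return result
```

Writing the result out one URL per line (for example with `"\n".join(build_url_list(links))`) gives a text file in the shape the URL list field expects.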
Tab Scan Rules:
+*.png +*.gif +*.jpg +*.css +*.js -ad.doubleclick.net/*
-mime:application/foobar
+http://myhostname.tumblr.com/* +http://static.tumblr.com/*
+http://assets.tumblr.com/* +http://media.tumblr.com/*
+http://*.media.tumblr.com/* -*?*=* -*=*
+http://myhostname.tumblr.com/archive?*=* +http://www.tumblr.com/photo/*
+http://myhostname.tumblr.com/post/*
+http://s3.amazonaws.com/data.tumblr.com/*
+http://a.tumblr.com/* +http://www.tumblr.com/audio_file/myhostname/*
+http://vt.tumblr.com/*
-en.wikipedia.org/* -www.google-analytics.com/*
--disable-security-limits
--max-rate 5000000
--assume php=text/html
Tab Spider: no robots.txt
Tab Limits:
Max transfer rate (B/s): 0 or a big number like 999999 or 5000000
Max connections / seconds: 50
Tab Flow Control:
Number of connections: 50
Retries: 5
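For anyone using the command-line `httrack` instead of the GUI, the settings above could roughly be assembled as follows. This is only a sketch under my own assumptions: the mapping of the GUI tabs to the short options (`-%L` for the URL list, `-s0` for no robots.txt, `-c` for connections, `-%c` for connections per second, `-R` for retries, `-A` for max rate) reflects my reading of the HTTrack option list and should be checked against `httrack --help`, and the `urls.txt` and `mirror` names are placeholders. The script only builds and prints the command; it does not run it.

```python
import shlex


def httrack_args(blog: str, url_list: str, out_dir: str) -> list:
    """Build an httrack argument list mirroring the GUI settings (assumed mapping)."""
    base = f"http://{blog}.tumblr.com"
    # Scan rules in the same order as in the Scan Rules tab.
    filters = [
        "+*.png", "+*.gif", "+*.jpg", "+*.css", "+*.js",
        "-ad.doubleclick.net/*", "-mime:application/foobar",
        f"+{base}/*", "+http://static.tumblr.com/*",
        "+http://assets.tumblr.com/*", "+http://media.tumblr.com/*",
        "+http://*.media.tumblr.com/*", "-*?*=*", "-*=*",
        f"+{base}/archive?*=*", "+http://www.tumblr.com/photo/*",
        f"+{base}/post/*",
        "+http://s3.amazonaws.com/data.tumblr.com/*",
        "+http://a.tumblr.com/*",
        f"+http://www.tumblr.com/audio_file/{blog}/*",
        "+http://vt.tumblr.com/*",
        "-en.wikipedia.org/*", "-www.google-analytics.com/*",
    ]
    return [
        "httrack", base,
        "-O", out_dir,          # output directory (placeholder name)
        "-%L", url_list,        # extra URLs, one per line
        "-s0",                  # Spider tab: no robots.txt
        "-c50",                 # Flow Control: number of connections
        "-%c50",                # Limits: max connections per second
        "-R5",                  # Flow Control: retries
        "-A5000000",            # Limits: max transfer rate in B/s
        "--disable-security-limits",
        "--assume", "php=text/html",
        *filters,
    ]


if __name__ == "__main__":
    # Print a shell-quoted command line for inspection.
    print(shlex.join(httrack_args("myhostname", "urls.txt", "mirror")))
```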
Some of these scan rules may be unnecessary, but that's it. I'm still
experimenting with the program.
Maria | |