Hello everyone,
I currently have a test site set up like this:
/index.html (default page)
/robots.txt
/images
/images/image1.jpg
/images/hidden-image.jpg
index.html contains an embedded image (/images/image1.jpg)
robots.txt contains:
User-agent: *
Disallow: /hidden-folder
This is what I want to achieve: I want to build a sitemap that is as complete
as possible. In particular, I want two things:
1. traverse each directory and, if a directory is browseable, crawl each
visible file. In this example, the folder /images is browseable, and I want
Httrack to also discover hidden-image.jpg, which is currently not linked from
anywhere
2. use robots.txt to discover hidden folders and crawl them as well (thus
bypassing the restriction), so /hidden-folder should be crawled too (see the
sketch right after this list)
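To make point 2 concrete, here is a rough Python sketch of the behaviour I
have in mind (purely illustrative, not something Httrack does; BASE is a
placeholder for my test site):

# Treat the Disallow entries in robots.txt as extra starting points
# for the crawl instead of as exclusions.
from urllib.parse import urljoin
from urllib.request import urlopen

BASE = "http://testsite.net/"  # placeholder

def disallowed_paths(base):
    # collect the paths listed after "Disallow:" in robots.txt
    text = urlopen(urljoin(base, "/robots.txt")).read().decode("utf-8", "replace")
    paths = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:
                paths.append(path)
    return paths

# the crawl would start from the home page plus every "forbidden" path
seeds = [BASE] + [urljoin(BASE, p) for p in disallowed_paths(BASE)]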
According to the doc, I suppose option -t should do the trick:
"This option 'tests' all links - even those forbidden (by the robot exclusion
protocol) - by using the 'HEAD' protocol to test for the presence of a file.
"
So this example should work:
httrack http://www.shoesizes.com/ -O /tmp/shoesizes -t
When looking at the Apache log file, I see GET requests instead of HEAD
requests:
"GET /robots.txt HTTP/1.1" 200 38 "-" "HTTRACK"
"GET / HTTP/1.1" 200 402 "-" "HTTRACK"
"GET /images/image1 .jpg HTTP/1.1" 200 38228
I am using these flags:
-vz --test
Debug output:
21:22:57 Info: engine: transfer-status: link recorded:
testsite.net/robots.txt ->
21:22:57 Info: Note: robots.txt forbidden links for testsite.net are: /hidden-folder
21:22:57 Info: Note: due to testsite.net remote robots.txt rules, links
beginning with these path will be forbidden: /hidden-folder (see in the
options to disable this)
So the robots.txt file has been read and the hidden folder has been detected,
but strangely Httrack still says it is forbidden.
What I want is a pure spider that crawls every file with HEAD requests; only
when the Content-Type is text/html (or similar) should it then issue a GET
request right after the HEAD and look for more links in the HTML.
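Here is a minimal sketch of that behaviour in Python (assuming the requests
library is available; the link extraction is a crude regex, just to show the
HEAD-then-GET logic):

import re
from urllib.parse import urljoin, urlparse

import requests

def spider(start_url):
    queue, seen, found = [start_url], set(), []
    host = urlparse(start_url).netloc
    while queue:
        url = queue.pop()
        if url in seen or urlparse(url).netloc != host:
            continue
        seen.add(url)
        # HEAD first, for every URL
        head = requests.head(url, allow_redirects=True)
        found.append((url, head.status_code))
        ctype = head.headers.get("Content-Type", "")
        # GET and parse only when the content looks like HTML
        if head.ok and ctype.startswith("text/html"):
            html = requests.get(url).text
            for link in re.findall(r'(?:href|src)=["\']([^"\']+)["\']', html, re.I):
                queue.append(urljoin(url, link))
    return found

If Apache directory listings are enabled, the /images/ index page is itself
text/html, so hidden-image.jpg would be picked up the same way.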
I didn't reach a satisfactory outcome with wget. I was thinking Httrack might
be the answer.
In short, I want to discover and index as many files as possible, including
those prohibited by robots.txt.
I am using the command-line version 3.47-23+libhtsjava.so.2 [XR&CO'2013] on
CentOS.
Thank you