HTTrack Website Copier
Free software offline browser - FORUM
Subject: Pure spider: doing HEAD requests with Httrack
Author: Marj
Date: 08/19/2013 21:55
 
Hello everyone,

I currently have a test site set up like this:

/index.html (default page)
/robots.txt
/images
	/images/image1.jpg
	/images/hidden-image.jpg

index.html contains an embedded image (/images/image1.jpg)

robots.txt contains:
User-agent: *
Disallow: /hidden-folder

This is what I want to achieve:

I want to build a sitemap that is as complete as possible. In particular, I want
two things:
1. traverse each directory and, if a directory is browseable, crawl each
visible file. In this example, the folder /images is browseable, and I want
Httrack to also discover hidden-image.jpg, which is currently not linked from
anywhere
2. use robots.txt to discover the hidden folders, and crawl them as well (thus
bypassing the restriction), so /hidden-folder should be crawled too (see the
sketch after this list)
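
To make point 2 concrete, here is roughly the behaviour I have in mind, as a
small Python sketch (the function name and the base URL are just placeholders
for my test site; this is not HTTrack code):

# Turn the Disallow entries of robots.txt into extra seed URLs for the crawl.
from urllib.request import urlopen
from urllib.parse import urljoin

def robots_seeds(base):
    seeds = []
    with urlopen(urljoin(base, "/robots.txt")) as resp:
        for line in resp.read().decode("utf-8", "replace").splitlines():
            line = line.split("#", 1)[0].strip()       # drop comments
            if line.lower().startswith("disallow:"):
                path = line.split(":", 1)[1].strip()
                if path:                                # empty Disallow allows everything
                    seeds.append(urljoin(base, path))
    return seeds

print(robots_seeds("http://testsite.net/"))   # e.g. ['http://testsite.net/hidden-folder']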

According to the doc, I suppose option -t should do the trick:

"This option 'tests' all links - even those forbidden (by the robot exclusion
protocol) - by using the 'HEAD' protocol to test for the presence of a file.
"
So this example should work:
httrack http://www.shoesizes.com/ -O /tmp/shoesizes -t

When looking at the Apache log file, I see GET requests instead of HEAD
requests:

"GET /robots.txt HTTP/1.1" 200 38 "-" "HTTRACK"
"GET / HTTP/1.1" 200 402 "-" "HTTRACK"
"GET /images/image1 .jpg HTTP/1.1" 200 38228

I am using these flags:
-vz --test

Debug output:
21:22:57	Info: 	engine: transfer-status: link recorded:
testsite.net/robots.txt -> 
21:22:57	Info: 	Note: robots.txt forbidden links for testsite.net are:
/hidden-folder
21:22:57	Info: 	Note: due to testsite.net remote robots.txt rules, links
beginning with these path will be forbidden: /hidden-folder (see in the
options to disable this)

So the robots.txt file has been read and the hidden folder has been detected,
but strangely Httrack still says it is forbidden.
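(I suppose the option the log refers to is -sN / --robots=N, where -s0 means
"never follow robots.txt rules", so perhaps something like this is intended,
although I have not confirmed it:
httrack http://testsite.net/ -O /tmp/testsite -s0 --test -vz)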
What I want is a pure spider that crawls every file using HEAD requests; only
when the content-type is text/html (or similar) should it then perform a GET
request just after the HEAD and look for more links in the HTML (see the rough
sketch below).
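
Here is a rough sketch of that behaviour in plain Python (all names are mine,
purely for illustration; this is not HTTrack code and it ignores robots.txt
entirely):

# Minimal HEAD-first spider: every URL gets a HEAD request; only HTML
# responses are fetched with GET and parsed for further links.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen

class LinkCollector(HTMLParser):
    def __init__(self, base):
        super().__init__()
        self.base, self.links = base, []
    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(urljoin(self.base, value))

def crawl(start):
    seen, queue = set(), [start]
    while queue:
        url = queue.pop(0)
        if url in seen or urlparse(url).netloc != urlparse(start).netloc:
            continue
        seen.add(url)
        try:
            head = urlopen(Request(url, method="HEAD"))   # 1) HEAD only
        except Exception:
            continue                                      # broken link, skip
        ctype = head.headers.get("Content-Type", "")
        print("HEAD", head.status, ctype, url)
        if ctype.startswith("text/html"):                 # 2) GET + parse HTML
            body = urlopen(url).read().decode("utf-8", "replace")
            collector = LinkCollector(url)
            collector.feed(body)
            queue.extend(collector.links)
    return seen

crawl("http://testsite.net/")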

I didn't reach a satisfactory outcome with wget. I was thinking Httrack might
be the answer.
In short, I want to discover and index as many files as possible, including
those prohibited by robots.txt.

I am using the command-line version 3.47-23+libhtsjava.so.2 [XR&CO'2013] on
CentOS.

Thank you
 