HTTrack Website Copier
Free software offline browser - FORUM
Subject: Download one page from a Drupal based site
Author: Dustin
Date: 09/12/2010 01:56
 
I have a website that appears to be based on Drupal (judging from the HTML,
which contains references to Drupal).

This website belongs to an agency that posts legal information, and it
requires a user to log on. I have a subscription to this site, and I would
like to mirror one portion of it so I can scrape the data without having to
key in the information or copy 'n' paste.

I have been playing with HTTrack and I keep getting a 403 error when trying
to access the page. I was able to capture the authentication, but when I try
to retrieve a specific folder, I get 403 Forbidden.

The specific location I want to reach is www.thexxxxx.com/marriage. If I use
Firefox, I log on first and then go to the marriage folder, and I get the
first page of the list. At the bottom are the buttons to go to the next page.
I only want to grab the first page.
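
For reference, the browser flow described above (log on first, then request
the one page with the same session cookie) can be sketched outside HTTrack in
Python. This is only a rough sketch under assumptions: the login form is
assumed to be posted to /user with the stock Drupal field names name, pass and
form_id=user_login, which has not been confirmed for this site. The host is
the redacted one from this post, and the function names are made up for
illustration.

```python
import http.cookiejar
import urllib.parse
import urllib.request

BASE_URL = "http://www.thexxxxx.com"  # redacted host from the post, not a real address


def build_login_request(base_url, username, password):
    """Build the POST that submits the login form (field names are assumed)."""
    form = {
        "name": username,
        "pass": password,
        "form_id": "user_login",  # assumption: stock Drupal login form id
        "op": "Log in",           # assumption: default submit button label
    }
    body = urllib.parse.urlencode(form).encode("ascii")
    # Supplying data makes urllib send this request as a POST.
    return urllib.request.Request(base_url + "/user", data=body)


def fetch_first_page(base_url, username, password, path="/marriage"):
    """Log on, then fetch a single page while reusing the session cookie."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Step 1: submit the login form; any session cookie is stored in the jar.
    opener.open(build_login_request(base_url, username, password)).close()
    # Step 2: request only the page wanted; next-page links are never followed.
    with opener.open(base_url + path) as response:
        return response.read()
```

Calling fetch_first_page(BASE_URL, "myuser", "mypassword") would return the
raw HTML of the first page only, assuming the site accepts the login as
sketched. A real Drupal form may also expect hidden fields such as a form
build id, so the form on the actual login page should be checked first.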

Here is the log file.... maybe someone will have some idea what I am doing
wrong here.

 HTTrack3.43-9+htsswf+htsjava launched on Sat, 11 Sep 2010 19:37:08 at
<http://thexxxxx.com/user?>postfile:C:\playv3\page1\page1\hts-post0> -* +*.png
+*.gif +*.jpg +*.css +*.js -ad.doubleclick.net/* -mime:application/foobar
+http://www.thexxxxx.com/marriage

(winhttrack -qwr1%e0C2%Ps2u1%s%uN0%I0p7DaK0H0%kf2A25000%f0#f -F "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.0)" -%F  -%l "en, en, *"
<http://thexxxxx.com/user?>postfile:C:\playv3\page1\page1\hts-post0> -O1
C:\playv3\page1\page1 -* +*.png +*.gif +*.jpg +*.css +*.js
-ad.doubleclick.net/* -mime:application/foobar +http://www.xxxxx.com/marriage
)



Information, Warnings and Errors reported for this mirror:

note:	the hts-log.txt file, and hts-cache folder, may contain sensitive
information,

	such as username/password authentication for websites mirrored in this
project

	do not share these files/folders if you want these information to remain
private



19:37:09	Info: 	Note: due to thexxxxxx.com remote robots.txt rules, links
begining with these path will be forbidden: /includes/, /misc/, /modules/,
/profiles/, /scripts/, /sites/, /themes/, /CHANGELOG.txt, /cron.php,
/INSTALL.mysql.txt, /INSTALL.pgsql.txt, /install.php, /INSTALL.txt,
/LICENSE.txt, /MAINTAINERS.txt, /update.php, /UPGRADE.txt, /xmlrpc.php,
/admin/, /comment/reply/, /contact/, /logout/, /node/add/, /search/,
/user/register/, /user/password/, /user/login/, /?q=admin/,
/?q=comment/reply/, /?q=contact/, /?q=logout/, /?q=node/add/, /?q=search/,
/?q=user/password/, /?q=user/register/, /?q=user/login/ (see in the options to
disable this)

19:37:09	Error: 	"Forbidden" (403) at link thexxxx.com/user/xxx (from
primary/primary)

19:37:09	Info: 	No data seems to have been transfered during this session! :
restoring previous one!
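
To check whether the robots.txt rules listed in the log could be what blocks
the request, the paths can be tested against those rules with Python's
standard robots.txt parser. The sketch below copies only a few of the Disallow
entries from the log; the helper name allowed is made up for illustration, and
the real robots.txt may contain more than the log shows.

```python
import urllib.robotparser

# A few of the Disallow rules reported in the log above (not the full list).
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /user/register/",
    "Disallow: /user/password/",
    "Disallow: /user/login/",
    "Disallow: /logout/",
]

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_LINES)


def allowed(path):
    """Return True if the sampled robots.txt rules permit fetching this path."""
    return parser.can_fetch("*", path)


for path in ("/marriage", "/user/xxx", "/user/login/"):
    print(path, "allowed" if allowed(path) else "disallowed")
```

With these sampled rules, neither /marriage nor /user/xxx matches a Disallow
prefix, which suggests the 403 in the log is a response from the server
rather than a result of the robots.txt filter.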

 