Greetings,
A few weeks ago, I successfully downloaded 100,000 selected
webpages, out of a much larger site, using the Scan Rules
you suggested. This time, however, with another similar
website, I am facing a problem.
The earlier website was a Question & Answer forum, much
like this one (forum.httrack.com), and so is the site I am
now trying to download.
The problem is that HTTrack is not downloading beyond the
starting URLs.
I am starting the project with the main category plus 8
sub-category URLs as the starting points, and have tested
both with Scan Rules (many times) and without, getting the
same result each time.
Please find below:
A. The General Structure Of The Site,
B. The Objectives Of The Project,
C. The Scan Rules And Starting URLs I Used,
D. The Problem Faced,
E. My Request For A Solution.
A. The site has this general structure:
The main Question & Answer site is:
<https://questionandanswers.thesite.com> (typing this into
the address bar redirects to
<https://questionandanswers.thesite.com/answers/main>)
Each category and sub-category page thereafter is of the
form:
<https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1100>
(for every category and sub-category, only the last four
digits of 'catid' vary, i.e. 1100, 1101, 1102, ... 1108,
... 1127, etc.)
On every category and sub-category page, the Questions'
captions are listed as links, 25 per page, with further
Questions on the following pages (as in any paginated
search results). The links to the next page of 25
Questions & Answers are of the form:
<https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1100&qtype=all&num=25&sort=qestartts:D:R:d1&start=25>
(please note that the 'catid' still appears here)
The links to the Questions (each of which includes both
the Question and its Answers) are of the form:
<https://questionandanswers.thesite.com/answers/main?cmd=threadview&id=121369>
(please note that here there is no 'catid' but
'threadview&id'; only the last six digits vary, uniquely
for every Question & Answer)
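(In case it helps: to confirm that the Question links
actually appear in the raw HTML of a category page, rather
than being generated by script, a quick check like the
Python sketch below could be run. The URL is the
placeholder form from above, and the link pattern is only
my assumption of how the links appear in the page source.)

    import re
    import urllib.request

    # Placeholder category URL, as described in section A above
    url = "https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1100"
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")

    # Count links of the form ...?cmd=threadview&id=NNNNNN
    # ('&' may appear as '&amp;' in the page source, so match both)
    ids = re.findall(r"cmd=threadview&(?:amp;)?id=(\d+)", html)
    print(len(ids), "Question links found in the raw HTML")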
B. The Objectives Of The Project:
As before, I need to download all the Questions & Answers
in one selected category and some sub-categories under it
(each having a unique 'catid'). I need only 1 category
plus 8 sub-categories out of all of them; their 'catid's
run from 1100 to 1108.
C. The Scan Rules And Starting URLs Used:
The starting URLs were:
<https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1100>
<https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1101>
<https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1102>
<https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1103>
<https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1104>
<https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1105>
<https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1106>
<https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1107>
<https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1108>
I used the following Scan Rules. Firstly,
-* +answers.google.com/index*
+answers.google.com/*catid=1100*
+answers.google.com/*catid=1101*
+answers.google.com/*catid=1102*
+answers.google.com/*catid=1103*
+answers.google.com/*catid=1104*
+answers.google.com/*catid=1105*
+answers.google.com/*catid=1106*
+answers.google.com/*catid=1107*
+answers.google.com/*catid=1108*
+answers.google.com/*search&catid*
+answers.google.com/*threadview&id*
Secondly,
-* +answers.google.com/index*
+answers.google.com/*threadview&id*
Thirdly,
-* +answers.google.com/answers*
+answers.google.com/*catid=1100*
+answers.google.com/*catid=1101*
+answers.google.com/*catid=1102*
+answers.google.com/*catid=1103*
+answers.google.com/*catid=1104*
+answers.google.com/*catid=1105*
+answers.google.com/*catid=1106*
+answers.google.com/*catid=1107*
+answers.google.com/*catid=1108*
+answers.google.com/*search&catid*
+answers.google.com/*threadview&id*
Fourthly,
-* +answers.google.com/answers/*
+answers.google.com/answers/*catid=1100*
+answers.google.com/answers/*catid=1101*
+answers.google.com/answers/*catid=1102*
+answers.google.com/answers/*catid=1103*
+answers.google.com/answers/*catid=1104*
+answers.google.com/answers/*catid=1105*
+answers.google.com/answers/*catid=1106*
+answers.google.com/answers/*catid=1107*
+answers.google.com/answers/*catid=1108*
+answers.google.com/answers/*search&catid*
+answers.google.com/*threadview&id*
Fifthly,
-* +answers.google.com/answers*
+answers.google.com/*threadview&id*
Sixthly,
-* +answers.google.com/answers*
+answers.google.com/answers/*threadview&id*
+answers.google.com/answers/*qestartts*
[In the first through fifth rule sets I neglected to
include *qestartts*, but judging by the results of all the
sample runs, that was perhaps irrelevant.]
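[As a sanity check on the filters themselves, I
approximated the wildcard matching with Python's fnmatch
(an assumption on my part; HTTrack's actual matcher may
differ) to see which URLs my sixth rule set accepts:

    from fnmatch import fnmatchcase

    # Allow patterns from the sixth rule set
    # (the leading "-*" excludes everything else)
    allow = [
        "answers.google.com/answers*",
        "answers.google.com/answers/*threadview&id*",
        "answers.google.com/answers/*qestartts*",
    ]

    tests = [
        "answers.google.com/answers/main?cmd=search&catid=1100",
        "answers.google.com/answers/main?cmd=threadview&id=121369",
    ]

    for url in tests:
        print(url, "->", any(fnmatchcase(url, p) for p in allow))

Both test URLs match under this approximation, which is
why I cannot see where the rules themselves fail.]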
D. The Problem Faced:
For each run, from the first to the sixth, HTTrack
finished in less than two minutes; only the starting URL
pages were downloaded, and not one Question & Answer page.
E. My Request For A Solution:
Please guide me as to where I went wrong and why. May I
request you to construct a Scan Rule set and other
settings, based on the information I have provided?
Is it also possible that some websites, this one in
particular, are crawler-proof? If so, how can I get around
this, overcoming any and all such constraints?
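(On the crawler-proof question: one thing I can check from
my side is the site's robots.txt. A minimal Python sketch,
assuming the placeholder hostname from section A:

    from urllib.robotparser import RobotFileParser

    # Placeholder hostname from section A above
    rp = RobotFileParser("https://questionandanswers.thesite.com/robots.txt")
    rp.read()

    # Would a generic crawler be allowed to fetch a Question page?
    page = "https://questionandanswers.thesite.com/answers/main?cmd=threadview&id=121369"
    print(rp.can_fetch("*", page))

If this prints False, the site's robots.txt disallows such
pages, and that could explain why HTTrack stops at the
starting URLs.)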
I would be most genuinely thankful for your help in
getting past this problem completely, and please accept my
sincere thanks for the earlier success in downloading
100,000 pages from the other site.
Warm regards,
Sanjay