Greetings,
A few weeks ago, I successfully downloaded 100,000 selected
webpages, out of a much larger site, using the Scan Rules
you suggested. This time, however, with another similar
website, I am facing a problem.
The earlier website was a Question & Answer forum, much
like this one (forum.httrack.com), and so is the site I am
now trying to download.
The problem is that HTTrack is not downloading beyond the
starting URLs.
I am starting the project with the main category plus 8
sub-category URLs as the starting points, and have tested
both with Scan Rules (many times) and without, getting the
same result each time.
Please find below:
A. The General Structure Of The Site,
B. The Objectives Of The Project,
C. The Scan Rules And Starting URLs I Used,
D. The Problem Faced,
E. My Request For A Solution.
A. The site has this general structure:
The main Question & Answer site is:
<https://questionandanswers.thesite.com> (typing this into
the address bar redirects to
<https://questionandanswers.thesite.com/answers/main>)
Each category and sub-category page thereafter is of the
form:
<https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1100>
(for every category and sub-category, only the last four
digits of 'catid' vary, i.e. 1100, 1101, 1102, ... 1108,
... 1127, etc.)
On every category and sub-category page, the Questions'
captions are listed as links, 25 per page, with further
Questions on the following pages (as in any paginated
search results). The links to the next page of 25
Questions & Answers are of the form:
<https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1100&qtype=all&num=25&sort=qestartts:D:R:d1&start=25>
(please note that the 'catid' still appears here)
The links to the Questions (each of which includes both
the Question and its Answers) are of the form:
<https://questionandanswers.thesite.com/answers/main?cmd=threadview&id=121369>
(please note that here there is no 'catid' but
'threadview&id'; only the last six digits vary, uniquely
for every Question & Answer)
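(In case it helps: to confirm that the Question links
actually appear in the raw HTML of a category page, rather
than being generated by script, a quick check like the
Python sketch below could be run. The URL is the
placeholder form from above, and the link pattern is only
my assumption of how the links appear in the page source.)

    import re
    import urllib.request

    # Placeholder category URL, as described in section A above
    url = "https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1100"
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")

    # Count links of the form ...?cmd=threadview&id=NNNNNN
    # ('&' may appear as '&amp;' in the page source, so match both)
    ids = re.findall(r"cmd=threadview&(?:amp;)?id=(\d+)", html)
    print(len(ids), "Question links found in the raw HTML")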
B. The Objectives Of The Project:
As before, I need to download all the Questions & Answers
in one selected category and some sub-categories under it
(each having a unique 'catid'). I need only 1 category
plus 8 sub-categories out of all of them; their 'catid's
run from 1100 to 1108.
C. The Scan Rules And Starting URLs Used:
The starting URLs were:
<https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1100>
<https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1101>
<https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1102>
<https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1103>
<https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1104>
<https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1105>
<https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1106>
<https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1107>
<https://questionandanswers.thesite.com/answers/main?cmd=search&catid=1108>
I used the following Scan Rules. Firstly,
-* +answers.google.com/index*
+answers.google.com/*catid=1100*
+answers.google.com/*catid=1101*
+answers.google.com/*catid=1102*
+answers.google.com/*catid=1103*
+answers.google.com/*catid=1104*
+answers.google.com/*catid=1105*
+answers.google.com/*catid=1106*
+answers.google.com/*catid=1107*
+answers.google.com/*catid=1108*
+answers.google.com/*search&catid*
+answers.google.com/*threadview&id*
Secondly,
-* +answers.google.com/index*
+answers.google.com/*threadview&id*
Thirdly,
-* +answers.google.com/answers*
+answers.google.com/*catid=1100*
+answers.google.com/*catid=1101*
+answers.google.com/*catid=1102*
+answers.google.com/*catid=1103*
+answers.google.com/*catid=1104*
+answers.google.com/*catid=1105*
+answers.google.com/*catid=1106*
+answers.google.com/*catid=1107*
+answers.google.com/*catid=1108*
+answers.google.com/*search&catid*
+answers.google.com/*threadview&id*
Fourthly,
-* +answers.google.com/answers/*
+answers.google.com/answers/*catid=1100*
+answers.google.com/answers/*catid=1101*
+answers.google.com/answers/*catid=1102*
+answers.google.com/answers/*catid=1103*
+answers.google.com/answers/*catid=1104*
+answers.google.com/answers/*catid=1105*
+answers.google.com/answers/*catid=1106*
+answers.google.com/answers/*catid=1107*
+answers.google.com/answers/*catid=1108*
+answers.google.com/answers/*search&catid*
+answers.google.com/*threadview&id*
Fifthly,
-* +answers.google.com/answers*
+answers.google.com/*threadview&id*
Sixthly,
-* +answers.google.com/answers*
+answers.google.com/answers/*threadview&id*
+answers.google.com/answers/*qestartts*
[In the first through fifth rule sets I neglected to
include *qestartts*, but judging by the results of all the
sample runs, that was perhaps irrelevant.]
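[As a sanity check on the filters themselves, I
approximated the wildcard matching with Python's fnmatch
(an assumption on my part; HTTrack's actual matcher may
differ) to see which URLs my sixth rule set accepts:

    from fnmatch import fnmatchcase

    # Allow patterns from the sixth rule set
    # (the leading "-*" excludes everything else)
    allow = [
        "answers.google.com/answers*",
        "answers.google.com/answers/*threadview&id*",
        "answers.google.com/answers/*qestartts*",
    ]

    tests = [
        "answers.google.com/answers/main?cmd=search&catid=1100",
        "answers.google.com/answers/main?cmd=threadview&id=121369",
    ]

    for url in tests:
        print(url, "->", any(fnmatchcase(url, p) for p in allow))

Both test URLs match under this approximation, which is
why I cannot see where the rules themselves fail.]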
D. The Problem Faced:
For each run, from the first to the sixth, HTTrack
finished in less than two minutes; only the starting URL
pages were downloaded, and not one Question & Answer page.
E. My Request For A Solution:
Please guide me as to where I went wrong and why. May I
request you to construct a Scan Rule set and other
settings, based on the information I have provided?
Is it also possible that some websites, this one in
particular, are crawler-proof? If so, how can I get around
this, overcoming any and all such constraints?
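(On the crawler-proof question: one thing I can check from
my side is the site's robots.txt. A minimal Python sketch,
assuming the placeholder hostname from section A:

    from urllib.robotparser import RobotFileParser

    # Placeholder hostname from section A above
    rp = RobotFileParser("https://questionandanswers.thesite.com/robots.txt")
    rp.read()

    # Would a generic crawler be allowed to fetch a Question page?
    page = "https://questionandanswers.thesite.com/answers/main?cmd=threadview&id=121369"
    print(rp.can_fetch("*", page))

If this prints False, the site's robots.txt disallows such
pages, and that could explain why HTTrack stops at the
starting URLs.)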
I would be most genuinely thankful for your help in
getting past this problem completely, and please accept my
sincere thanks for the earlier success in downloading
100,000 pages from the other site.
Warm regards,
Sanjay