| > For each run, from first to the sixth, the HTTrack ended
> in less than two minutes, where only the starting URL
> pages were downloaded and not one Question & Answer pages.
First, when you are crawling large sites, you *must* set up
reasonable settings (for forums, no more than 2 or 3
simultaneous connections, plus a bandwidth limit), or the
websites will progressively ban all offline browsers.
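
As a rough illustration of such settings, here is a minimal Python sketch that assembles a gentle HTTrack command line. It assumes the `-cN` (`--sockets`) and `-AN` (`--max-rate`, in bytes per second) options listed in the HTTrack command-line help. The helper name `build_httrack_command`, the example URL, the output directory, and the 25000 bytes/s figure are hypothetical placeholders, not recommended values.

```python
import shlex


def build_httrack_command(start_url, output_dir, connections=2, max_rate_bytes=25000):
    """Return an argv list for a deliberately gentle HTTrack run (hypothetical helper)."""
    # The advice above is 2 or 3 simultaneous connections at most for forums.
    if not 1 <= connections <= 3:
        raise ValueError("use between 1 and 3 simultaneous connections")
    if max_rate_bytes <= 0:
        raise ValueError("the bandwidth limit must be a positive number of bytes per second")
    return [
        "httrack",
        start_url,
        "-O", output_dir,        # where the mirror is written
        f"-c{connections}",      # number of simultaneous connections (--sockets)
        f"-A{max_rate_bytes}",   # maximum transfer rate in bytes/second (--max-rate)
    ]


# Example values only; the URL and directory are placeholders.
cmd = build_httrack_command("https://forum.example.com/", "./mirror")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` as-is, or the printed line can be pasted into a shell.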

Okay, for your problem, I did not see any obvious errors.
Please open (in your browser) the top index.html of the
(quickly) mirrored project and check which links were
written. There may be several causes, and testing is
not very simple with https sites. If you hover the mouse over a
link that was not mirrored, what URL do you see? The problem could
be multiple redirect pages, or even crawler protection (I
heard that some bad users were crawling Google with overly
aggressive settings, which is indeed a stupid thing to do).
| |
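
If hovering over each link by hand is tedious, the same check can be sketched in Python: list every `href` in the mirrored index.html and separate the links that point to local files from those that still point to a remote host. This assumes that links to pages that were not mirrored remain absolute URLs; the names `LinkCollector` and `classify_links` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def classify_links(html_text):
    """Split links into (local, remote) lists based on whether they name a host."""
    collector = LinkCollector()
    collector.feed(html_text)
    local, remote = [], []
    for href in collector.links:
        parsed = urlparse(href)
        # Assumption: a link with a scheme or host was not rewritten to a local file.
        if parsed.scheme in ("http", "https") or parsed.netloc:
            remote.append(href)
        else:
            local.append(href)
    return local, remote


# Made-up sample standing in for the contents of the mirrored index.html.
sample = (
    '<a href="questions/index.html">mirrored</a> '
    '<a href="https://example.com/questions/42">not mirrored</a>'
)
local, remote = classify_links(sample)
print("local:", local)
print("remote:", remote)
```

For a real project, read the file with `open(path, encoding="utf-8", errors="replace").read()` and pass the text to `classify_links`. The remote list then shows which URLs to inspect for redirects.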