HTTrack Website Copier
Free software offline browser - FORUM
Subject: site returns a 404 error on a perfectly valid page
Author: Aaron
Date: 07/12/2017 08:02
 
I'm trying to download all the old pages of a forum so I can run a search on
it. The forum doesn't exist anymore, but it's been partly preserved by
web.archive.org.

The problem is that web.archive.org frequently returns a 404 error even on
pages that contain links I need to crawl. In other words, the 404 pages are
pages I need to download and extract links from, because I can't get those
links any other way.

WinHTTrack skips these pages because it sees the 404 status. I need it to NOT
skip 404 pages and simply crawl them like normal pages, because
web.archive.org's 404 pages contain the links to the dates of the snapshots it
saved of the forum. You almost always arrive at a 404 page first, because you
almost always start with an incorrect date. From the 404 page, you are given
links to the valid dates, which you can click to reach a snapshot of that
page.

SO THE 404 PAGES are important, I need to crawl them.

For example, if you go to
<https://web.archive.org/web/20030128091916/http://www.underlight.com:80/forum/>,
it gives a 404 error. But that page is important, because if you look at it,
you'll see it gives you links to all the snapshots it saved for
underlight.com:80/forum. I need those links so I can crawl those snapshots.
But HTTrack skips the page because of the 404. Help?
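
To illustrate what I mean by "crawl the 404 page anyway", here is a rough
Python sketch of the behavior I'm after. This is NOT an HTTrack feature or
setting, just a standalone illustration using the Python standard library.
The function and class names (fetch_body_even_on_error, extract_links,
LinkCollector) are made up for this example. It relies on the fact that
urllib raises HTTPError on a 404 but still lets you read the body of the
error page.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def fetch_body_even_on_error(url):
    """Return (status, body), keeping the body when the server answers 404."""
    try:
        with urlopen(url) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except HTTPError as err:
        # HTTPError doubles as a response object, so the error page is readable.
        return err.code, err.read().decode("utf-8", "replace")
```

Calling fetch_body_even_on_error on the URL above and passing the body to
extract_links should give the list of links on the 404 page, which is what I'd
like HTTrack to follow instead of discarding.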
 