I'm trying to grab a site which has a number of items, each with 3 sub-pages. On some items all 3 pages work; on others the third page isn't there (on the original site). Both of these cases behave, as you'd expect, fine when grabbing the site. A third case provides the link, just like the items with 3 working pages, but the link leaves you sitting in limbo, waiting for a response for a while, and then the server drops you off on the page with the list of all the items. This is where WinHTTrack breaks down and starts recursing through all the items, over and over.
I went in and manually added all the broken links to the exclusion list, which seems to have worked, but now I'm thinking... isn't there a way to make WinHTTrack recognize that these are in fact pages it has already been through?
Page structure at the point of breakdown, and links leading away from it:
www.site.com/items (this is the page with the listing)
www.site.com/items/item_x (the item page, with its 3 pages)
www.site.com/items/item_x/where (supposedly a list of suppliers, but in 99% of cases it drops you at "www.site.com/items" without letting you know something went wrong. Clicking any item in the list now brings you to the "correct" page with its 3 pages, and does so through a correct link, yet WinHTTrack keeps recursing this over and over down to its maximum depth.)
What I did, manually adding all the broken links, seems to work fairly well. There are a link or two I haven't been able to locate, but rather than every broken item taking 100+ MB on my drive, there's one with a 5 MB storage need; the rest are, if "large", 200-500 kB, and the majority are just 10-20 kB. The overall amount of data won't reach the 9 GB it did before.
Bottom line is, this seems to be working; I just wanted to know if there is a "right" way of doing this, rather than manually adding a couple of hundred links to the exclusion list.
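(For what it's worth, HTTrack's scan rules accept wildcards, so a single exclusion pattern might cover all the broken links at once instead of listing each URL. A sketch, assuming the dead pages all share the /where path segment:

    -www.site.com/items/*/where

In WinHTTrack this would go under Set options -> Scan rules; the leading minus excludes any URL matching the pattern, while the rest of the crawl proceeds normally.)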
Thank you guys n' girls for your time and effort.