Hello. I'm trying to download/archive a particular page and the sub-pages
linked from it, but for the life of me I can't get the filters just right so
that it snags what I'm after but ignores everything else.
The page is <https://tcrf.net/The_Legend_of_Zelda:_Ocarina_of_Time>
Sub-pages include, for example
<https://tcrf.net/The_Legend_of_Zelda:_Ocarina_of_Time/Unused_Link_Animations>
which includes thumbnails of images, linked as
<https://tcrf.net/File:OoT_Link%27s_Animation_2310.gif>
and the full-size image itself is stored at
<https://tcrf.net/images/9/93/OoT_Link%27s_Animation_2310.gif>
Ideally I'd like to pick up only the content on
"/The_Legend_of_Zelda:_Ocarina_of_Time" and below, but since the files are
stored at "tcrf.net/File:*" and "tcrf.net/images/*" I'm having difficulty
working out the correct filter combination to achieve this. If I filter
everything (-*), then + the main page, I get the sub-pages, but lose all
images, including the skin for the pages. But if I add +/File:* then it starts
picking up anything it can get its hands on under that directory. And if I
don't filter everything (-*) then it will continue to pick up anything else on
the main directory (e.g. <https://tcrf.net/GoldenEye_007_(Nintendo_64)>).
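In case it helps, here's roughly the kind of command I've been running (this
is HTTrack from the command line; I may well have the rule syntax or order
wrong, and "./tcrf-oot" is just my output folder):

  httrack "https://tcrf.net/The_Legend_of_Zelda:_Ocarina_of_Time" -O ./tcrf-oot \
    "-*" \
    "+tcrf.net/The_Legend_of_Zelda:_Ocarina_of_Time*" \
    "+tcrf.net/File:*" \
    "+tcrf.net/images/*"

The -* drops everything, the second rule re-includes the main page and its
sub-pages (that part works), and the last two are my attempt to get the images
back, which is where it starts wandering off into files from every other game
on the site.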
I assume this is partially because of the category pages linked everywhere:
"Nintendo 64", "Games published by Nintendo", etc. Theoretically, everything
I'm after should be under "/The_Legend_of_Zelda:_Ocarina_of_Time" and below,
barring the images, but I don't understand why it goes scanning for anything
under "/File:*" instead of only grabbing what's directly linked in the
sub-pages and in that directory.
I admit, I'm not tech savvy when it comes to this, so I'm way outside my
element, but if this is not achievable via filters, could the opposite
approach be taken? Mirror the whole site, then remove what I'm not after from
the mirror? I haven't seen anything along those lines mentioned, but I thought
it wouldn't hurt to ask. The only other option I can think of is if there's a
way to compile the list of URLs linked to by the pages, then feed those in
via a list and have it only pick up what's under those URLs.
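If that's possible, I picture it looking something like this (I believe
HTTrack has a -%L option that reads start URLs from a file, one per line,
though I haven't managed to test it, and "oot-urls.txt" is just a name I made
up):

  httrack -%L oot-urls.txt -O ./tcrf-oot "-*" "+tcrf.net/images/*"

where oot-urls.txt would hold the sub-page and image URLs collected from the
main page. Generating that list in the first place is exactly the part I don't
know how to do.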
Anyway, any help would be appreciated.
Regards, El3mental