> I'm downloading a text file list of urls but am
> having trouble getting the images from those pages
> to download.
Sorry, but this doesn't quite make sense to me. A "text file of URLs" would
contain URLs only, in plain text, which is why it is called a "text file". And
you wouldn't be "downloading" it, as my thought process goes, because it would
already be downloadED (i.e. saved) and presumed to be the starting point (the
URL list) of the mirror.
> If the urls are all at:
> <http://articles.site.com/article/pagename>
So to start with, am I correct in thinking that you have a text file saved (in
your project folder, perhaps) that has a list of URLs which refer to various
articles published at <http://articles.site.com/article/{pagename}>, with the
{pagename} as the only difference between each of the URLs in the list?
> I know the images are at:
> <http://img.site.com/img/pages/articles/subfolder/image-name.jpg>
So there is no correlation between the article "pagename" and the location of
the image files?
So you have this file, say "links.txt", that contains URLs such as
<http://articles.site.com/article/pageA1.html>
<http://articles.site.com/article/pageB1.html>
<http://articles.site.com/article/pageB2.html>
<http://articles.site.com/article/pageB3.html>
<http://articles.site.com/article/pageC1.html>
and you know (by peeking ahead, I'm supposing) that each of those "pages"
contains images in the form of <img src=...> tags pointing to the "img" server
of site.com, i.e. something like
<http://img.site.com/{something}/{something}/*.jpg>
If that's the case, it looks like all is well to me.
(That is, until... "Maximum mirroring depth" - see below)
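Just to make my assumption concrete, I'm picturing that each article page
contains markup roughly like this (the folder and file names here are only my
guesses based on the example URL you gave):

  <p>Some article text...</p>
  <img src="http://img.site.com/img/pages/articles/subfolder/image-name.jpg" alt="...">

In other words, the pages on articles.site.com merely reference images hosted
on img.site.com, so the engine has to be allowed (by the filters) and willing
(by the depth settings) to follow those references over to the other host.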
>
> So I set my filters to be:
> -*
> +articles.site.com/article/*.*
> +images.site.com/*.*
> +img.site.com/*.*
> -*.pdf
Where did images.site.com come from? Also, for the middle three filters (or
maybe it should only be two), I would use either just /* after the .com, or
something like this (if "this" is what you want):
+articles.site.com/*[path]*.html
+img.site.com/*[path]*.jpg
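Putting that together, the whole filter list I would try first looks something
like this (just a sketch - I'm reusing your -* and -*.pdf lines plus my
suggested + lines; add a similar + line for .png or .gif if the site uses
those formats too):

  -*
  +articles.site.com/*[path]*.html
  +img.site.com/*[path]*.jpg
  -*.pdf

Keep in mind that when more than one rule matches, the later one wins: the
leading -* throws everything out by default, the two + lines let the article
pages and their images back in, and the trailing -*.pdf keeps PDFs out even if
one of the + rules would otherwise let them through.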
>
> My mirroring depth and external depth are set to 0
> because I only want the pages in the list of URLs. I
> have checked the "Get non-HTML files related to a
> link" checkbox.
>
I wish I could explain "Why?", but I have NEVER been able to get even the
images embedded in the starting URL page unless I've left the maximum
mirroring depth blank (the default) or set it to at least 2 - even with the
"get near" option selected. I dunno; maybe I'm doing something wrong myself,
because to me, if I set the max depth to 1, that logically means I want the
starting page and everything on it (per the filters) mirrored locally. But
using "-*" as the first filter, and following up with exclusively what I want,
means that setting the max depth to 2 isn't a problem - nothing unwanted gets
pulled in.
If I'm not wrong about that, it seems pretty unintuitive to me. (No offense,
Xavier!)
The way I see it, the "get near files" selection just helps you out by saving
you from writing out all the + filters you would otherwise need. You still
seem to have to leave the max depth blank or set it to something other than 0
or 1.
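For what it's worth, here is roughly what the same setup looks like as a
command line, in case that's easier to experiment with. Treat it as a sketch:
"links.txt" and "./mirror" are placeholder names, the filters are the ones
discussed above, and you should double-check the option letters against
httrack --help (or the man page) before relying on them:

  httrack --list links.txt -O ./mirror -r2 -n -s0 "-*" "+articles.site.com/*[path]*.html" "+img.site.com/*[path]*.jpg"

Here -r2 is the maximum mirroring depth of 2, -n is the "get non-HTML files
related to a link" option, -s0 ignores robots.txt, and --list feeds in your
URL file; the external depth already defaults to 0, so I haven't touched it.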
> There's nothing in the robots.txt that would
> preclude the images from downloading (although I'll
> ignore robots.txt next time I try to mirror anyway
> just to be sure).
Your log file will tell you near the top if there is a robots.txt restriction.
It's extremely clear when there is one...
> Why are the images from the pages not downloading?
Regardless of the fact that I was having trouble following what you were
trying to do, I think changing the Maximum Mirror Depth (not External Depth)
to 2 or higher will get you the images you've been looking for!
> Thanks,
> Ari
HTH,
~Bp