I am trying to rip a portion of the documentation on MSDN, starting from a few
specific pages and limiting how far up the logical site hierarchy the crawl goes.
Microsoft ideals being what they are, the site does not seem to care about the
case of most characters in links; thus:
<https://msdn.microsoft.com/en-us/library/xxxxxx.aspx>
*is equivalent to*
<https://msdn.microsoft.com/EN-US/library/xxxxxx.aspx>.
It seems that all the links to 'EN-US' resources are being ignored (not
ripped), while all the links to 'en-us' resources are being saved and rewritten,
which makes me think HTTrack is somehow ignoring them because of the case (?)
What I imagine is happening: it creates a local 'en-us' directory, writes the
linked files into it (whether they were linked via 'en-us' or 'EN-US'), then
checks each URL against the local directory name while rewriting. When the URL
contains 'en-us', it sees the 'en-us' directory and the check passes. When it
gets to a URL with 'EN-US', I'm guessing it doesn't see a local resource that
matches that capitalization, and leaves the URL alone(?) - that's all I can
imagine :-(
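
To make that guess concrete, here is a rough Python sketch of the behaviour I am
imagining. It is only a model of my hypothesis, not HTTrack's actual code: the
`local_path` and `rewrite_link` functions, the `ignore_case` switch and the
lower-cased mirror layout are all made up for illustration.

```python
from urllib.parse import urlsplit


def local_path(url):
    # Assumption: the mirror stores every page under a lower-cased path,
    # no matter which capitalization the original link used.
    parts = urlsplit(url)
    return (parts.netloc + parts.path).lower()


def rewrite_link(link, saved, ignore_case):
    # Hypothetical rewrite step: point the link at the local copy if one
    # is found, otherwise leave the original URL untouched.
    parts = urlsplit(link)
    candidate = parts.netloc + parts.path
    if ignore_case:
        candidate = candidate.lower()
    return candidate if candidate in saved else link


lower = "https://msdn.microsoft.com/en-us/library/xxxxxx.aspx"
upper = "https://msdn.microsoft.com/EN-US/library/xxxxxx.aspx"
saved = {local_path(lower)}

print(rewrite_link(lower, saved, ignore_case=False))  # local copy found
print(rewrite_link(upper, saved, ignore_case=False))  # URL left alone
print(rewrite_link(upper, saved, ignore_case=True))   # local copy found
```

With a case-sensitive comparison only the 'en-us' link gets rewritten, which
matches what I am seeing; folding the case before the lookup would catch both.
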
The release notes seem to mention an intent to handle URLs with different cases
in release 3.33 (I'm using 3.48-21) - could it have gotten broken? Or is this
not what it was supposed to handle?
Has anyone encountered this kind of thing? Is there a way to work around it?