Hi-
I have a few websites that use an odd <img> tag format, possibly for archival,
debugging, or to prevent broken links (I'm not sure which).
It's in the form:
<img SRC="img1.gif" NATURALSIZEFLAG="3" height=200 width=300 align=BOTTOM
OLDREF="old_img1.gif">
or even
<img SRC="imgA.gif" NATURALSIZEFLAG="3" height=200 width=300 align=BOTTOM
OLDREF="old_imgA.html">
Here the older version must have been its own page instead of an image.
There are two issues:
1) HTTrack tries to download the OLDREF file, which typically does not exist,
or if it does, is either outdated or a duplicate. I'm not sure if this is by
design or not. (Possibly because of the "Attempt to detect all links" setting?
I want the tag processed, but only the SRC, not any other attribute.) The
generated site backup still seems to use only the SRC attribute, so nothing
seems broken, but it can result in larger backup sizes. Not a big deal.
2) If the OLDREF attribute points to a different file type, HTTrack attempts to
download the old file name with the new extension. In the second example above,
HTTrack will try to download "old_imgA.gif", even though there is no such
text string anywhere in the file. Even if it's downloading the OLDREF as part
of the "Attempt to detect all links" setting, it's using the wrong extension.
(I'm guessing the parsing logic sets a variable for the extension only once
per tag, then downloads whatever looks like a file path in the tag, combining
each later file name with that first extension. I haven't looked at any code,
but that's my guess.)
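To make the behaviour I'd expect a bit more concrete, here is a rough Python sketch. It is not HTTrack's code (I haven't read it), and the names `ImgLinkCollector`, `looks_like_path`, and `collect` are made up for illustration. It contrasts a "SRC only" scan with a naive "detect all links" scan that grabs any attribute value resembling a file path, using the second example tag above.

```python
import re
from html.parser import HTMLParser

# The second example tag from this report, line break included.
TAG = ('<img SRC="imgA.gif" NATURALSIZEFLAG="3" height=200 width=300 '
       'align=BOTTOM\nOLDREF="old_imgA.html">')


def looks_like_path(value):
    # Crude, assumed heuristic: a name followed by a short file extension.
    return re.fullmatch(r"[\w./-]+\.[A-Za-z0-9]{2,5}", value) is not None


class ImgLinkCollector(HTMLParser):
    """Collect candidate download targets from <img> tags."""

    def __init__(self, src_only):
        super().__init__()
        self.src_only = src_only
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "img":
            return
        for name, value in attrs:
            if value is None:
                continue
            if self.src_only:
                if name == "src":
                    self.links.append(value)
            elif looks_like_path(value):
                # Each value keeps its own extension; nothing is cached per tag.
                self.links.append(value)


def collect(html, src_only):
    parser = ImgLinkCollector(src_only)
    parser.feed(html)
    parser.close()
    return parser.links


print(collect(TAG, src_only=True))   # ['imgA.gif']
print(collect(TAG, src_only=False))  # ['imgA.gif', 'old_imgA.html']
```

Either way, "old_imgA.gif" never shows up in the results, which is why the request for that file looks like a bug rather than a side effect of link detection.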
I don't use the program nearly enough to justify spending time to learn the
code and make proper edits, but wanted to bring attention to it in case it's a
bug.
***End Bug Report***
***Begin Over-The-Top Suggestion for Fixing***
(aka sneaking in a feature suggestion as a bug fix)
Besides changing the recursive logic, another possible solution might be to
implement some sort of include/exclude filter in the settings (I can't think
of a way to do that off hand).
However, I would suggest a debug or find-and-replace style filter that could
catch and filter tags, attributes, and/or generic HTML.
This could also double as an error catcher, so that broken links or other
references can be fixed during the run. The scan would then include the
corrected link and continue to propagate down that path.
It would also prevent having to fix the same known errors when re-running, as
well as prevent trying to download "bad" links.
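As a very rough idea of what I mean, here is a Python sketch of a find-and-replace filter applied to a page before links are extracted. Everything in it is hypothetical: the rule format, the `apply_rules` name, and the "oldpage.htm"/"newpage.html" file names are only for illustration, and none of it corresponds to an existing HTTrack option.

```python
import re

# Hypothetical rule list: (compiled pattern, replacement text).
RULES = [
    # Drop the OLDREF attribute so it is never seen as a link.
    (re.compile(r'\s+OLDREF\s*=\s*"[^"]*"', re.IGNORECASE), ""),
    # Correct a known broken link (made-up file names).
    (re.compile(r'href="oldpage\.htm"', re.IGNORECASE), 'href="newpage.html"'),
]


def apply_rules(html, rules=RULES):
    """Run every rule over the page text, in order, and return the result."""
    for pattern, replacement in rules:
        html = pattern.sub(replacement, html)
    return html


page = '<img SRC="imgA.gif" align=BOTTOM\nOLDREF="old_imgA.html">'
print(apply_rules(page))  # <img SRC="imgA.gif" align=BOTTOM>
```

Because the rules would run before the page is scanned, the crawler would only ever see the cleaned-up tag, and the same rule list could be reused on later runs.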
I probably don't use the program enough to ever need that sort of feature, or
even to run into these issues again, but hopefully this will help someone else
down the road who relies on it more heavily.
Cheers,
-B