> Did you have extended parsing turned on (attempt to
> detect all)?
> The parser may be very simplistic. Simply skips the
> initial quote if present and stops at the trailing
> quote or space or angle brackets.
By default extended parsing is turned on:
"Attempt to detect all links (even in unknown tags/javascript code)"
I unchecked that option and ran the capture again.
(Both runs were set to "no robots.txt rules")
No difference: the images with alt= attributes containing <br> were not
downloaded.
Extended parsing option checked (default):
winhttrack -qiC2%Ps0u1%s%uN0%I0p3DaK0H0%kf2A25000%f#f
Extended parsing option unchecked:
winhttrack -qiC2%P0s0u1%s%uN0%I0p3DaK0H0%kf2A25000%f#f
So the extended parsing option makes no difference.
As you suggest: most likely the parser stops parsing the IMG element when it
encounters a closing angle bracket, even if that closing angle bracket is
inside a quoted attribute value.
Presumably web browsers take locating the src= attribute as the first
priority.
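
To illustrate the suspected behaviour, here is a minimal Python sketch. It is
not HTTrack's actual code: naive_img_src is a hypothetical scanner that ends
the tag at the first ">", and photo.jpg is a made-up file name. It is compared
with Python's standard html.parser, which treats a ">" inside a quoted
attribute value as part of the value.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup resembling the failing case: <br> inside the alt= value.
SAMPLE = '<img alt="first line<br>second line" src="photo.jpg">'


def naive_img_src(html):
    """Assumed naive scanner: the first '>' ends the tag, quoted or not."""
    found = []
    for tag in re.findall(r'<img\b[^>]*>', html, flags=re.IGNORECASE):
        match = re.search(r'src\s*=\s*["\']?([^"\'\s<>]+)', tag, flags=re.IGNORECASE)
        if match:
            found.append(match.group(1))
    return found


class ImgSrcCollector(HTMLParser):
    """Collects src= values from IMG start tags using the stdlib parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def quote_aware_img_src(html):
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


if __name__ == "__main__":
    print("naive scanner:", naive_img_src(SAMPLE))
    print("quote-aware parser:", quote_aware_img_src(SAMPLE))
```

With the sample above, the naive scanner cuts the tag off at the ">" of <br>,
so it never reaches src=, while the quote-aware parser still finds it.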
Anyway, I've purged every <br> from the alt= attributes in my source code, and
that change will be pushed out soon.