Wikipedia jpg pages - HTTrack Website Copier Forum

Subject: Wikipedia jpg pages
Author: x
Date: 09/24/2012 01:56
I think I fell in the same problem reported at
<http://forum.httrack.com/readmsg/29267/index.html> , a wikipedia jpg page led
to the download of possibly the whole wikipedia.
I tried to reduce the problem though and it seems even stranger than at first
sight.

The jpg page I fell into is
<http://en.wikipedia.org/wiki/File:Romanian-kirilitza-tatal-nostru.jpg>, which
was referenced by two of the pages I was downloading,
<http://www.unicode.org/mail-arch/unicode-ml/y2006-m04/0216.html> and
<http://www.unicode.org/mail-arch/unicode-ml/y2006-m04/0235.html> .

I made some reduced tests using as starting web addresses one or both of the
unicode mail archive pages, starting with the default settings.

To make the bug come out you need to have either a +*.jpg filter or the "Get
non-HTML files related" option active, but in the second case it tbe bug is
much weirder (see below).

I always had Maximum external depth set to 0, of course httrack doesn't
respect this when it gets at the jpg,
 but it seems to consider the maximum mirroring depth, with a depth of 5 it
downloads a lot more than with a depth of 4, and with a depth of 3 it only
gets a few hundred files (this for the "Get non-HTML files" with 2 starting
web addresses case - see below).

With respect to the default options you only need additionally to change the
browser id, the default getting blocked by everyone.

In the commands here I also filtered out *unicode.org* and the other wikipedia
link to isolate the bug.

This is the command I used for the +*.jpg case:
winhttrack -qwr4%e0C2%Ps2u1%s%uN0%I0p3DaK0H0%kf2A50000%f#f -F "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.0)" -%F "<!-- Mirrored from %s%s by
HTTrack Website Copier/3.x [XR&CO'2010], %s -->" -%l "en, en, *"
<http://www.unicode.org/mail-arch/unicode-ml/y2006-m04/0216.html> -O1
"C:\Test\unicode.org - romanian-kirilitza bug" +*.jpg
-http://en.wikipedia.org/wiki/Romanian_Cyrillic_alphabet -*unicode.org*
Starting from the 0235.html page is the same of course.

The problem is weirder with "Get non-HTML files" case though: it appears to
occur only when MORE THAN ONE downloaded page links to the wikipedia jpg. If
you start from only one of the forum pages and leave the -*unicode.org* filter
I used the scan will stop in a few seconds, no matter the depth. If on the
other hand you set both pages as starting web addresses, or only one but add a
filter to reach the other one and have enough depth
(+http://www.unicode.org/mail-arch/unicode-ml/y2006-m04/0235.html) you will
get the bug.

This is a command for the "Get non-HTML files" case:
winhttrack -qwr3%e0C2%Pns2u1%s%uN0%I0p3DaK0H0%kf2A50000%f#f -F "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.0)" -%F "<!-- Mirrored from %s%s by
HTTrack Website Copier/3.x [XR&CO'2010], %s -->" -%l "en, en, *"
<http://www.unicode.org/mail-arch/unicode-ml/y2006-m04/0216.html>
<http://www.unicode.org/mail-arch/unicode-ml/y2006-m04/0235.html> -O1
"C:\Test\unicode.org - romanian-kirilitza bug"
-http://en.wikipedia.org/wiki/Romanian_Cyrillic_alphabet -*unicode.org*

Note that for some reason the +*.jpg case requires one level of depth more
than the "Get non-HTML files" with two starting web addresses case to produce
the same results (4 vs 3 minimum), while "Get non-HTML files" with one
starting web addresses and the +<otherpage> case requires a minimum of 4,
likely because both link-following trails have to reach a certain point.

A final strange thing is that new.txt always reports (from
<http://en.wikipedia.org/wiki/File:Romanian-kirilitza-tatal-nostru.jpg>) for all
the unwanted pages, even if there was no link to them at that address.

Until this bug is corrected (it has likely been introduced recently) it's for
sure best to filter out at least *wikipedia.org/*.jpg in every project, but
note that it is likely that this affect any html page ending with an extension
normally associated with some other media type (there are just not many of
them outside wikipedia).
All articles
Subject	Author	Date
Wikipedia jpg pages		09/24/2012 01:56
can not download wikipedia pages		10/06/2012 10:50
Re: can not download wikipedia pages		10/06/2012 16:31