Hi. Before I start I want to express very sincere thanks to the developers of
HTTrack, which is clearly a mature product of long standing and absolutely
invaluable, and I am very grateful to have discovered it.
Although I've seen a number of threads here around this subject (and have read
through Fred's user guide) I haven't found a proper answer to this question:
why does "maximum external depth" not work? Nor indeed other rules, such as
exclusions under scan rules. Actually, more importantly, I want to know a
solution to my problem, but I'd really like it to be a methodical one that can
be logically understood and depended on rather than a hack that I must just
hope will work.
My scenario:
Using WinHTTrack Website Copier 3.49-2
I wish to crawl a site of perhaps 10-20k published and public pages -
predominantly Drupal but also aliasing through to some static HTML from older
sites. In it are links to a large number of PDFs and other documents on a
large and unknown number of third-party sites, which I want to archive too.
There are also images embedded from Wikipedia - I am ambivalent about grabbing
these, though it would be nice. There are third-party script libraries and
probably CSS, all of which I want to archive. And of course there are many
links to pages on other sites, but I am NOT interested in archiving them. In
other words I want to download files (documents of various sorts, JS and CSS,
and optionally images) but not HTML pages.
I have moved towards ever-stricter rules in an effort not to archive the
entirety of Wikipedia AND Wikimedia in all available languages, and to stop
drilling down into external sites. My most recent set of options and rules was
as follows (ignoring defaults):
Scan rules:
  +*.png +*.gif +*.jpg +*.jpeg +*.css +*.js +*.pdf +*.doc +*.docx +*.xls +*.xlsx
  +*.ppt +*.pptx
  -ad.doubleclick.net/*
  -*wikipedia.org/* -*wikimedia.org/*
Limits:
  max external depth: 0
  max number of links: 1000000
  max size HTML: 1000000
  max size non-HTML: 20000000
Links:
  get non-HTML files (I think this is default actually)
  get HTML first
Build:
  no external pages (I have also tried without this, when I realised that it had
  nothing to do with actually *archiving* external files and only related to how
  to deal with external links when not archiving an external file - preserve the
  link, or make an error page for it?)
Experts:
  store HTML first
--------
This has had better results than previous efforts, but still archives a huge
amount of irrelevant material and I have no confidence that it will ever stop.
I cancelled the crawl before it ended, but only after it appeared that HTTrack
had finished archiving the HTML on my target site (the count of HTML files
within it hadn't increased for a long time, although archiving was still ongoing).
My understanding of the rules above is that, firstly, nothing should have been
archived from wikipedia.org, wikimedia.org or any of their subdomains. No documentation I
have read suggests that other options might override the scan rules, and
anyway the only option that seems pertinent is "maximum external depth", which
is 0. But when I stopped the crawl I had approaching 5000 files from those
sites, including hundreds of HTML pages and a couple of thousand JPGs. Tracing
through the log files it appeared that the first reference to Wiki*edia was a
link to an image on Wikipedia (not an embedded image, a link):
<https://en.wikipedia.org/w/index.php?title=File%3AFace_mask_torres_strait.JPG>
Of course, annoyingly, although it has a ".jpg" extension, this is actually an
HTML page. HTTrack archived it, and then made hundreds of other requests for
all of the other resources linked from that page. It moved onto the Wikimedia
page for that image:
<https://commons.wikimedia.org/wiki/File:Face_mask_torres_strait.JPG>
and on from there to pages about museums, users and so on, grabbing images,
HTML and everything else. So it was neither respecting the scan rules
excluding wiki*edia.org files, nor the "max external depth=0" option. I should
say that the situation had improved a lot from an earlier scan, where I had
the same external scan depth but hadn't excluded those domains, so the scan
rule had helped, but not fixed the issue as expected.
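To make that expectation concrete, here is a minimal Python sketch of the reading of the scan rules described above. It is not HTTrack's code. It assumes that rules are checked in order and the last matching rule wins, that matching is case-insensitive and done against the URL without its scheme, and that Python's `fnmatch` wildcards are a fair stand-in for HTTrack's. The `example.org` URLs are hypothetical placeholders for the target site.

```python
from fnmatch import fnmatchcase
from typing import Optional

# The scan rules from the post, in the order they were entered.
SCAN_RULES = [
    "+*.png", "+*.gif", "+*.jpg", "+*.jpeg", "+*.css", "+*.js", "+*.pdf",
    "+*.doc", "+*.docx", "+*.xls", "+*.xlsx", "+*.ppt", "+*.pptx",
    "-ad.doubleclick.net/*",
    "-*wikipedia.org/*", "-*wikimedia.org/*",
]


def verdict(url: str, rules=SCAN_RULES) -> Optional[bool]:
    """Return True (include), False (exclude) or None (no rule matched).

    Assumption: the last matching rule decides the outcome.
    """
    # Assumption: rules are matched against "host/path?query", lowercased.
    target = url.split("://", 1)[-1].lower()
    result = None
    for rule in rules:
        sign, pattern = rule[0], rule[1:].lower()
        if fnmatchcase(target, pattern):
            result = (sign == "+")  # a later match overrides an earlier one
    return result


if __name__ == "__main__":
    for u in [
        "https://en.wikipedia.org/w/index.php?title=File%3AFace_mask_torres_strait.JPG",
        "https://commons.wikimedia.org/wiki/File:Face_mask_torres_strait.JPG",
        "https://example.org/files/report.pdf",   # hypothetical
        "https://example.org/node/1",             # hypothetical
    ]:
        print(verdict(u), u)
```

Under these assumptions both Wiki*edia URLs match `+*.jpg` first and are then overridden by the later exclusion, so they come out as excluded. A URL that no rule matches returns `None`, meaning the decision would fall to the other options, such as the external depth.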
I wondered whether it was the fact that these pages have a ".jpg" extension
that tricked HTTrack into thinking they were images, at least for the purpose
of considering this rule - because actual image files wouldn't have outbound
links in them. But they are being processed as HTML and new links extracted
and followed, so it would be quite some bug to then ignore that rule based on
them being images. It does fit with what I'm seeing, but I don't think it's
the diagnosis if only because the problem applies to other sites too. In fact
what prompted me to finally stop the scan was actually not the Wiki*edia
files, but the fact that HTTrack appeared to be downloading an entire site
(www.vitae.ac.uk), which it had reached from a single link on my site. Again,
the max external depth option seemed to be having no effect and I stopped it
once I saw it had created a folder of 2 GB for that site and wasn't slowing
down!
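One detail relevant to the ".jpg extension" hypothesis: in the first Wikipedia URL the `.JPG` is part of the query string, while the path itself ends in `index.php`. The short Python sketch below only takes the two URLs apart with the standard library. It says nothing about how HTTrack itself classifies them, which would depend on whether it looks at the path, the whole URL or the MIME type returned by the server.

```python
import posixpath
from urllib.parse import parse_qs, urlsplit

URLS = [
    "https://en.wikipedia.org/w/index.php?title=File%3AFace_mask_torres_strait.JPG",
    "https://commons.wikimedia.org/wiki/File:Face_mask_torres_strait.JPG",
]


def describe(url: str) -> dict:
    """Split a URL into host, path extension and decoded query parameters."""
    parts = urlsplit(url)
    _, ext = posixpath.splitext(parts.path)  # extension of the path only
    return {
        "host": parts.netloc,
        "path": parts.path,
        "path_ext": ext,
        "query": parse_qs(parts.query),      # percent-decoding is applied here
    }


if __name__ == "__main__":
    for u in URLS:
        print(describe(u))
```

For the first URL the path extension is `.php` and the file name only appears as the value of `title`. For the second, the path does end in `.JPG`, even though the server returns an HTML page.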
I understand, of course, that there can be complex interactions between rules,
but I'm struggling to see how any of my rules and parameters would interact to
create the result I'm seeing.
I can also see that it's possible to add exclusions to the scan list - I will
certainly be adding www.vitae.ac.uk - but this is not how it's meant to work,
right? I should be able to rely on the max external depth. To be frank I'd quite
like to archive some of the external target pages of links on my site, such as
the *first* Wikipedia page at the end of a link from my site (and its assets,
but no HTML pages beyond that), so I'd rather use scan depth than exclusions.
If I could apply different scan rules to external sites compared to my target
site (in particular, scan rules based on MIME type) that might also help, but
I can't see how to do that.
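The behaviour being asked for can be written down as a small predicate. This is a sketch of the desired policy only, not a description of what HTTrack does or of any option it offers. The `Link` type, the `external_hops` counter, the `max_external_html` parameter and the `example.org` host are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    host: str
    is_html: bool        # decided by MIME type, not by file extension
    external_hops: int   # external HTML pages traversed to discover this link


def wanted(link: Link, own_host: str = "example.org",
           max_external_html: int = 1) -> bool:
    """Decide whether a discovered link should be archived."""
    if link.host == own_host:
        return True  # everything on the target site is wanted
    if link.is_html:
        # External HTML is wanted only while the page budget is not used up.
        return link.external_hops < max_external_html
    # External files (PDF, CSS, JS, images) are wanted when linked from the
    # target site or from an external page that was itself archived.
    return link.external_hops <= max_external_html


if __name__ == "__main__":
    first_page = Link("en.wikipedia.org", is_html=True, external_hops=0)
    its_image = Link("upload.wikimedia.org", is_html=False, external_hops=1)
    next_page = Link("commons.wikimedia.org", is_html=True, external_hops=1)
    print(wanted(first_page), wanted(its_image), wanted(next_page))
```

With `max_external_html=1` this accepts the first external page and its assets but no HTML beyond it. With `max_external_html=0` it accepts only external files linked directly from the target site, which is the stricter behaviour described earlier in the post.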
Any thoughts or advice would be much appreciated.
Thanks, Jeremy