The aim is to download all linked PDF files from the following site (and
combine them later):
<http://www66.statcan.gc.ca/eng/acyb_c1867-eng.aspx> [A statistical yearbook of
Canada]. I plan to do this for several yearbooks linked here:
<http://www66.statcan.gc.ca/acyb_000-eng.htm>
Possibility 1. Using HTTrack on the site directly
Running HTTrack against the site itself, as in

    httrack http://www66.statcan.gc.ca/eng/acyb_c1867-eng.aspx "/statcan/" "-*" "+mime:text/html" "+*.pdf"
returns only the four PDF files that can be accessed directly. The problem is
that the other links need to be uncovered by JavaScript first, and I am unsure
how to do this.
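Since neither HTTrack nor wget executes JavaScript, one workaround is to save the page from a browser after the scripts have run, and then pull the PDF hrefs out of the saved HTML. A minimal sketch, assuming the links appear as plain `href="....pdf"` attributes (the file name `page.html` and the sample links in it are hypothetical):

```shell
# Hypothetical local copy of the yearbook page, saved from the
# browser after the JavaScript has revealed the links.
cat > page.html <<'EOF'
<a href="/eng/1867/acyb_c1867_001.pdf">Part 1</a>
<a href="/eng/1867/acyb_c1867_002.pdf">Part 2</a>
<a href="/eng/acyb_c1867-eng.aspx">Index</a>
EOF

# Keep only the href attributes ending in .pdf, then strip the
# surrounding href="..." quoting, leaving one path per line.
grep -oE 'href="[^"]*\.pdf"' page.html \
  | sed -e 's/^href="//' -e 's/"$//' > pdf-links.txt

cat pdf-links.txt
```

The resulting `pdf-links.txt` filters out the non-PDF index link and keeps only the two PDF paths, which can then be fed to a downloader.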
Possibility 2. Using HTTrack on the built-in search function
The site features a search function.
<http://www66.statcan.gc.ca/acyb_001-eng.htm>
Searching for "1" and "a" essentially returns all pages of a yearbook, I
presume. We can thus get a long list of PDF file links (see here).
Unfortunately, running HTTrack on this list returns lots of HTML files, but not
the required PDFs.
    httrack "http://www76.statcan.gc.ca/stcsrdcyb/query.html?f4=&f5=&f6=1+a&f9=&qp=%2Btopic%3A500001867&charset=iso-8859-1&style=eclfcyb&nh=50&lk=2&la=en&col=cybpdfen&qm=0&qp=topic%3A333333" "/statcan/" "+http://www66.statcan.gc.ca/eng/*" "+*.pdf"
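If a plain list of PDF links can be extracted from the search results, wget may be simpler than HTTrack here: its `-i` option reads URLs from a file and downloads each one. The paths on the site are relative, so the host has to be prefixed first. A sketch, assuming the extracted links sit in a file named `pdf-links.txt` (both file names and the sample paths are hypothetical):

```shell
# Hypothetical list of relative PDF paths scraped from the
# search-result pages, one per line.
cat > pdf-links.txt <<'EOF'
/eng/1867/acyb_c1867_001.pdf
/eng/1867/acyb_c1867_002.pdf
EOF

# Prefix the host to turn each relative path into an absolute URL.
sed 's|^|http://www66.statcan.gc.ca|' pdf-links.txt > pdf-urls.txt

cat pdf-urls.txt
```

The absolute URLs could then be fetched in one go with `wget -nc -i pdf-urls.txt` (`-nc` skips files that were already downloaded), and the resulting PDFs combined afterwards.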
Any ideas how to change the options in HTTrack? (Or perhaps wget works as
well?)