HTTrack Website Copier
Free software offline browser - FORUM
Subject: PDF Files from a site with Javascript
Author: TB
Date: 05/21/2015 13:11
 
The aim is to download all linked PDF files from the following site (and
combine them later):

<http://www66.statcan.gc.ca/eng/acyb_c1867-eng.aspx> [A statistical yearbook of
Canada]. I plan to do this for several yearbooks linked here:
<http://www66.statcan.gc.ca/acyb_000-eng.htm>

Possibility 1. Using HTTRACK on the site directly

Using HTTRACK for the site itself as

httrack "http://www66.statcan.gc.ca/eng/acyb_c1867-eng.aspx" "/statcan/" -*
+mime:/aspx/text/html +*.pdf

returns only the four PDF files that can be accessed directly. The problem is
that the other links need to be uncovered by Javascript first, and I am unsure
how to do this.

Possibility 2. Using HTTRACK on the built-in search function

The site features a search function.
<http://www66.statcan.gc.ca/acyb_001-eng.htm>

Searching for "1" and "a" returns essentially all pages of a yearbook, I
presume. We can thus get a long list of PDF file links (see here).

Unfortunately, using HTTRACK on this list returns lots of HTML files, but not
the required PDFs.

httrack
"http://www76.statcan.gc.ca/stcsrdcyb/query.html?f4=&f5=&f6=1+a&f9=&qp=%2Btopic%3A500001867&charset=iso-8859-1&style=eclfcyb&nh=50&lk=2&la=en&col=cybpdfen&qm=0&qp=topic%3A333333"
"/statcan/" "+http://www66.statcan.gc.ca/eng/*" +*.pdf

Any ideas on how to change the options in HTTRACK? (Or perhaps WGET would work
as well?)
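
A possible workaround for Possibility 2, sketched under assumptions rather than
tested against the site: save the search results page to a local file, pull the
PDF links out of it, and hand the resulting list to a downloader. The Python
sketch below uses only the standard library. The names PdfLinkParser and
extract_pdf_links are hypothetical, and it assumes the results page exposes the
PDFs as ordinary <a href="..."> links and is encoded as iso-8859-1 (as the
charset parameter in the query URL suggests).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect the targets of <a> tags whose path ends in .pdf (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page the HTML came from.
        absolute = urljoin(self.base_url, href)
        # Look at the path only, ignoring any query string or fragment.
        path = absolute.split("#", 1)[0].split("?", 1)[0]
        if path.lower().endswith(".pdf") and absolute not in self.links:
            self.links.append(absolute)


def extract_pdf_links(html_text, base_url):
    """Return the unique absolute PDF URLs found in html_text, in page order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html_text)
    return parser.links


if __name__ == "__main__":
    import sys

    # Usage: python pdf_links.py results.html BASE_URL > pdf_urls.txt
    # The encoding is an assumption taken from charset=iso-8859-1 in the query.
    with open(sys.argv[1], encoding="iso-8859-1") as fh:
        for url in extract_pdf_links(fh.read(), sys.argv[2]):
            print(url)
```

The output file holds one URL per line, so it could then be passed to WGET with
its -i option (wget -i pdf_urls.txt). If the results page builds its links with
Javascript as well, this approach will not find them either.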
 