Hello,
I posted a question on September 24, 2015 and never received any response.
Does anyone have any insights or similar problems?
I am working on scraping websites with the WiderNet Project in Chapel Hill, NC
and have run into some problems with drop-down menus. The site is
<http://smallwarsjournal.com/>. The problem pages are
<http://smallwarsjournal.com/jrnl/iss/archive> and
<http://smallwarsjournal.com/blog/archive/201509>, which lead to the journal
and blog archives and have drop-down menus to access the archives by month.
These pages themselves are scraped, but when I try to select another month I
get a 404 error. For example, for the journal archive I get the message “The
page you requested <http://smallwarsjournal.com/jrnl/iss/archive> could not
be found” and for the blog archive I get a message like “The page you
requested <http://smallwarsjournal.com/blog/archive/201509> could not be
found.” I looked at the page source and the drop-down options point to
relative links. I’m wondering whether some database or JavaScript setup is
causing the problem. Here are the parameters I used, copied from the doit
file. The hts-log file was too large to open.
-%F "<!-- this file was mirrored for the egranary digital library from %s%s on
%s -->" -F "mozilla [en] egranary digital library system" -Q -C2 -t -%P -n -s0
-%s -%u -N0 -p3 -D -a -K5 -H0 -%k -f2 -A25000 -%A cgi,php,php3,asp=text/html
-%f0 -#f -q -X -#L -o0 -u2 -qwC2%Pns0u2k%s%uN0I0%I0p3DaH0%kf2o0A25000%f#f -F
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" -%l "en, en, *"
smallwarsjournal.com/ -O1 "X:\\egCache\\smallwarsjournal.com" -* +*.css +*.js
-ad.doubleclick.net/* -mime:application/foobar +smallwarsjournal.com*
-smallwarsjournal.com/user* +*.gif +*.jpg +*.png +*.tif +*.bmp +*.zip +*.tar
+*.tgz +*.gz +*.rar +*.z +*.exe +*.mov +*.mpg +*.mpeg +*.avi +*.asf +*.mp3
+*.mp2 +*.rm +*.wav +*.vob +*.qt +*.vid +*.ac3 +*.wma +*.wmv -#L10000000 -O
"X:\\egRawScraped\\smallwarsjournal.com,X:\\egCache\\smallwarsjournal.com"
-%A cgi=text/html -%A php,php3,asp=text/html
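To make the relative-link observation above concrete, here is a minimal Python sketch (not part of my HTTrack setup) of how a drop-down's option values would resolve against the archive page URL. The markup in SAMPLE_HTML is made up for illustration and is not copied from the site, and OptionCollector and resolve_options are hypothetical helper names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OptionCollector(HTMLParser):
    """Collect the value attribute of every <option> tag."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            value = dict(attrs).get("value")
            if value:  # skip placeholder options with no value
                self.values.append(value)


def resolve_options(page_url, html):
    """Return absolute URLs for the option values found in html."""
    parser = OptionCollector()
    parser.feed(html)
    return [urljoin(page_url, value) for value in parser.values]


# Hypothetical markup: the real option values on the site may differ.
SAMPLE_HTML = """
<select name="archive">
  <option value="">Select a month</option>
  <option value="/blog/archive/201508">August 2015</option>
  <option value="201507">July 2015</option>
</select>
"""

if __name__ == "__main__":
    page = "http://smallwarsjournal.com/blog/archive/201509"
    for url in resolve_options(page, SAMPLE_HTML):
        print(url)
```

If the menu navigates through a script handler or a form submission rather than plain links, the scraper may have no link to follow, which might explain the missing pages. That is only a guess on my part.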
I’m also having similar issues with this site: <http://www.mmrjournal.org/>.
The page <http://www.mmrjournal.org/archive> has two drop-down menus that link
to journal volumes. In my scrape the volume 2 menu is already expanded and the
volume 1 menu doesn’t respond; when I click on volume 2 it collapses and then
doesn’t work at all.
Here are the parameters from the log file:
HTTrack3.45-3+htsswf+htsjava launched on Fri, 18 Sep 2015 16:34:03 at -*
+*.jpg
+*.gif
+*.css
+*.js
+*.png
-*page2ad*
-*google-analytics*
-*paypal.com*
-*akamaitech.com*
-*google.com*
-*doubleclick.com*
-*googlesyndication*
-*webtrendslive*
-*sitemeter.com*
-*ads.web.aol.com*
-*nedstatbasic.net*
-*webstats4u.com*
-*webstats.motigo.com*
-*statcounter.com*
-*service.urchin.com*
-*snap.com*
-*stats.wordpress.com*
-*visit.webhosting.yahoo.com*
-*mailto*
+www.mmrjournal.org/* www.mmrjournal.org/
(httrack -%F "<!-- this file was mirrored for the egranary digital library
from %s%s on %s -->" -F "mozilla [en] egranary digital library system" -Q -C2
-t -%P -n -s0 -%s -%u -N0 -p3 -D -a -K5 -H0 -%k -f2 -A25000 -%A
cgi,php,php3,asp=text/html -%f0 -#f -q -X -#L -o0 -u2 -O
X:\egRawScraped\www.mmrjournal.org,X:\egCache\www.mmrjournal.org -%S
X:\UpdateSW\HTTrackScanRules\ScanRulesFull.txt +www.mmrjournal.org/*
www.mmrjournal.org/ )
Thank you for any suggestions or help!