For personal use, I'm trying to save search results from Simply Hired (SH). SH
paginates search results with 50 hits per page and HTML navigation links to
the other pages. An SH search results page also includes a large number of
links irrelevant to my purpose.
The URLs I want HTTrack to follow and save have the form
www.simplyhired.com/search?bunch-of-arguments&pn=nn, where nn is the page
number. I want those pages saved without going any deeper (i.e., save the
list of jobs, not each job linked to on the page).
If this is possible with HTTrack, I haven't figured out how. Here is the log. I
canceled the download when I saw it was grabbing everything, and I snipped the
resulting errors from the log for brevity's sake.
HTTrack3.48-21+htsswf+htsjava launched on Fri, 06 Nov 2015 11:35:02 at
-www.simplyhired.com +www.simplyhired.com/search*&pn= +*.png +*.gif +*.jpg
+*.jpeg +*.css +*.js -ad.doubleclick.net/* -mime:application/foobar
(winhttrack -qir2C1%Ps2u1%s%uN0%I0p3DaK0H0%kf2A25000%f#f -F "Mozilla/4.5
(compatible; HTTrack 3.0x; Windows 98)" -%F "<!-- Mirrored from %s%s by
HTTrack Website Copier/3.x [XR&CO'2014], %s -->" -%l "en, *"
-O1 "C:\Users\admin\Documents\httrack\SimplyHired" -www.simplyhired.com
+www.simplyhired.com/search*&pn= +*.png +*.gif +*.jpg +*.jpeg +*.css +*.js
-ad.doubleclick.net/* -mime:application/foobar )
Information, Warnings and Errors reported for this mirror:
note: the hts-log.txt file, and hts-cache folder, may contain sensitive
such as username/password authentication for websites mirrored in this
do not share these files/folders if you want these information to remain
11:35:02 Warning: Note: due to www.simplyhired.com remote robots.txt rules,
links beginning with these path will be forbidden: /a/legal/,
/a/job-feed/rss/, /a/job-details/view/, /a/job-details/, /a/job-alerts/cancel,
/a/jump/, /a/error/, /a/backend/, /account/, /job/, /job-id/, /jobs/,
/a/jobs/list/tab/, /a/job-details/bounce/, /serp, /a/salary/search/,
/a/job-alerts/create-json/, /a/local-jobs/city/, /event-logging/,
/static/widgets/fancybox/source/, /suggest/, /myresume/, /a/job-alerts/,
/a/saved-*/get, /a/jobtrends, /a/jobs/rss, /a/job-feed/rss/, /a/jbb (see in
the options to disable this)
11:35:06 Warning: HTML file (0 bytes) retransferred due to lack of cache:
11:35:07 Warning: HTML file (0 bytes) retransferred due to lack of cache:
11:35:07 Error: Exit requested by shell or user
11:35:09 Warning: File not added due to mirror cancel:
11:35:09 Warning: File not added due to mirror cancel:
11:35:09 Warning: File not added due to mirror cancel:
11:35:10 Warning: File not added due to mirror cancel:
11:35:10 Warning: File not added due to mirror cancel:
11:35:10 Warning: File not added due to mirror cancel:
11:35:10 Warning: File not added due to mirror cancel:
11:35:10 Warning: File not added due to mirror cancel:
11:35:10 Warning: File not added due to mirror cancel:
HTTrack Website Copier/3.48-21 mirror complete in 10 seconds : 157 links
scanned, 30 files written (2438206 bytes overall), 4 files updated [239969
bytes received at 23996 bytes/sec], 198584 bytes transferred using HTTP
compression in 2 files, ratio 18%, 2.0 requests per connection
(127 errors, 375 warnings, 0 messages)
As you might notice from the URLs in the log, matching on "search" alone isn't
enough, because URLs I don't want also contain "search". The URLs I need must
contain the string "&pn=". Thanks for any help.
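
To make the selection rule concrete, here is a minimal Python sketch of the test I want applied to each link. It is not HTTrack's filter engine and says nothing about how HTTrack matches its own patterns; the function name `is_wanted` and the `q=editor` argument in the sample URLs are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def is_wanted(url: str) -> bool:
    """Return True only for SH search-result pages that carry a numeric pn= argument.

    Hypothetical helper: it illustrates the rule described in this post,
    not anything HTTrack itself does.
    """
    # URLs in this post are written without a scheme; add one so urlsplit
    # can separate host, path, and query.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)

    # Only the main site, and only the /search endpoint.
    if parts.hostname != "www.simplyhired.com":
        return False
    if parts.path != "/search":
        return False

    # Having "search" in the URL is not enough: a page number must be present.
    page_values = parse_qs(parts.query).get("pn", [])
    return any(value.isdigit() for value in page_values)


if __name__ == "__main__":
    samples = [
        "www.simplyhired.com/search?q=editor&pn=2",   # paginated results page
        "www.simplyhired.com/search?q=editor",        # "search" but no page number
        "www.simplyhired.com/a/jobs/list/q-editor",   # unrelated link
    ]
    for sample in samples:
        print(sample, is_wanted(sample))
```

In other words, a link qualifies only when the host, the /search path, and the pn argument all match; everything else on the page should be skipped.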