HTTrack Website Copier
Free software offline browser - FORUM
Subject: Scraping sitemap.xml links
Author: ng
Date: 09/18/2015 17:54
 
I'm trying to archive a Drupal site. I ran HTTrack against it and it did a
fine job of pulling some of what I need. However, the live site doesn't link
to ALL the content ever published. The only place everything is listed is
sitemap.xml, but because that is an XML document, HTTrack doesn't pick up any
of the links in it, so it doesn't crawl and pull the pages the way it would if
I pointed it at the site directly (it only pulls the XML file and nothing
else, as there are no actual 'links' in the file).

So my question is: is it possible to set up HTTrack to pick up the URLs
between the '<loc></loc>' tags when I point it at sitemap.xml? Alternatively,
I have a list of all the URLs (30,000), and I'm wondering whether I can just
feed it that list instead of a single starting URL.
Cheers.
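
For reference, this is roughly how such a URL list can be built from the
sitemap. It is only a minimal Python sketch, not HTTrack functionality: the
function names and the example.com URLs are made up for illustration, and it
assumes a plain sitemap (a sitemap index or image sitemap would also contain
'<loc>' entries that are not pages).

```python
import xml.etree.ElementTree as ET


def extract_sitemap_urls(xml_bytes):
    """Return the text of every <loc> element found in a sitemap document."""
    root = ET.fromstring(xml_bytes)
    urls = []
    for element in root.iter():
        # Namespaced tags look like '{namespace}loc'; compare the local name only.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            urls.append(element.text.strip())
    return urls


def to_url_list(urls):
    """Format URLs as plain text, one URL per line."""
    return "\n".join(urls) + "\n"


# Hypothetical two-entry sitemap standing in for the real 30,000-entry file.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/node/1</loc></url>
  <url><loc>http://example.com/node/2</loc></url>
</urlset>"""

sample_urls = extract_sitemap_urls(SAMPLE)

# With a real file, something along these lines (file names are assumptions):
#   with open("sitemap.xml", "rb") as f:
#       urls = extract_sitemap_urls(f.read())
#   with open("urls.txt", "w") as f:
#       f.write(to_url_list(urls))
```

The resulting one-URL-per-line file is the kind of list I mean. The HTTrack
command-line documentation lists a '-%L <file>' ('--list') option that adds
all URLs from a text file; I have not verified how it behaves with 30,000
entries.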