HTTrack Website Copier
Free software offline browser - FORUM
Subject: Filtering links based on the anchor text
Author: Alain Desilets
Date: 02/16/2012 21:47
I want to run HTTrack and only crawl "interlanguage links". By this, I mean
links that take you from a page written in, say, English to its equivalent page
written in, say, French.

In other words, I'd like to be able to tell HTTrack to only follow links whose
anchor text contains French or Français or fr (case insensitive), or to click
on buttons or select options whose labels match the above.
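
To make the request concrete, here is a rough sketch in Python (not HTTrack
code) of the kind of filter I have in mind. It only handles plain <a> links,
not buttons or select options. The names ANCHOR_PATTERN, AnchorTextLinkFilter
and links_matching_anchor_text are made up for illustration, and matching
"fr" only as a whole word is my own assumption, so that words like "free"
are not picked up.

```python
import re
from html.parser import HTMLParser

# Assumed pattern: "French", "Français", or a standalone "fr", in any case.
ANCHOR_PATTERN = re.compile(r"\b(french|français|fr)\b", re.IGNORECASE)


class AnchorTextLinkFilter(HTMLParser):
    """Collect the href of every <a> whose anchor text matches a pattern."""

    def __init__(self, pattern=ANCHOR_PATTERN):
        super().__init__()
        self.pattern = pattern
        self.matching_links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only accumulate text while inside an <a> that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor_text = "".join(self._text).strip()
            if self.pattern.search(anchor_text):
                self.matching_links.append(self._href)
            self._href = None


def links_matching_anchor_text(html, pattern=ANCHOR_PATTERN):
    """Return the hrefs of links in `html` whose anchor text matches."""
    parser = AnchorTextLinkFilter(pattern)
    parser.feed(html)
    parser.close()
    return parser.matching_links
```

Given a page's HTML, links_matching_anchor_text(html) returns the hrefs
whose anchor text matched, which is the set of links I would want the
crawler to follow.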

Looking at the HTTrack command line options, it seems you can filter based
on the URLs, but not based on the anchor text. I also looked in the
documentation of the callbacks, and it seems the linkdetected callback only
receives the URL, with no other information about the element whose
"clicking" led to that URL.

Is there a way I could get to this functionality?
If not, I'd be willing to see if I could add this to the source, but I would
need some guidance. HTTrack is a pretty intimidating piece of code for someone
who hasn't done C in 20 years ;-).


