HTTrack Website Copier
Free software offline browser - FORUM
Subject: Stop Words
Author: guenter strubinsky
Date: 03/12/2018 12:46
 
The website that I am tracking (and, I believe, others too) contains junk
entries. I am NOT interested in increasing my breast size at the moment and I
don't want to own a yacht.

There are patterns in those entries, and it would be wonderful if we could
define stop-words or stop-regexes: when a page matches one, the crawler would
skip it and move straight on to the next page (since a page containing the
stop words will not have ANY useful links worth following). With stop-regexes
we could also avoid following links in good pages that have embedded
commercials/ads. If the scan runs over the HTML itself, the possibilities seem
endless.
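
To make the request concrete, here is a minimal sketch of the idea in Python. It is not HTTrack code and does not use any HTTrack API; the names links_to_follow, page_stops and link_stops are hypothetical, chosen only for illustration. It assumes two separate pattern lists: one that rejects a whole page, and one that only drops individual links (for example ads embedded in an otherwise good page).

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only record text while inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def links_to_follow(html, page_stops, link_stops):
    """Return the hrefs worth following (hypothetical helper, not HTTrack).

    page_stops: regexes; if any matches the page, no link is followed.
    link_stops: regexes; a link is dropped if its href or text matches.
    """
    page_res = [re.compile(p, re.IGNORECASE) for p in page_stops]
    link_res = [re.compile(p, re.IGNORECASE) for p in link_stops]

    # Junk page: stop processing it and move on to the next page.
    if any(r.search(html) for r in page_res):
        return []

    collector = LinkCollector()
    collector.feed(html)

    # Good page: keep only links that match no link-level stop regex.
    return [
        href
        for href, text in collector.links
        if not any(r.search(href) or r.search(text) for r in link_res)
    ]


if __name__ == "__main__":
    good = ('<p>Article</p><a href="/next">Next page</a>'
            '<a href="http://ads.example/x">Cheap yacht deals</a>')
    print(links_to_follow(good, [r"own a yacht"], [r"yacht", r"ads\.example"]))
```

Keeping the two lists separate matters: if the ad patterns were applied to the whole page, a single embedded ad would cause an otherwise useful page to be skipped.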

I will take a look at the code, but honestly, I last worked in C about 15
years ago and may be a tad rusty, while the authors know their baby and know
exactly where to add the pattern scanner so that it stops processing the
current page and moves on to the next.
 