Ok, I should just say RTFM, but here is the real answer.
Your scan rules are
+*[name].*[name]tpg.ch*[name].*[name]/*
-*[name].*[name].com*[name].*[name]/
-*[name].*[name].net*[name].*[name]/*
First problem: you don't want everything, so your first FILTER line should be
-*
That blocks everything! All pages, all sites!
Then we add the stuff we do want:
+www.tpg.ch/documents/*
Ok, as for the <http://www.tpg.ch/fr/horaires> page, I only get to see the
"International version", and on that version of the page all the colored
numbers are hard links to the PDFs (no PHP), e.g.
"/documents/10162/16057571/tpg_ligne_18-11dec2016.pdf" for the 18 line
But going by your description we should also add
+www.tpg.ch/html/pdf/*
You must also allow all the pages that contain the links to the PDFs, so that
HTTrack can fetch those pages and find the links in them.
+www.tpg.ch/fr/horaires/rechercher*
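
To see why the order matters (block everything first, then allow), here is a rough Python sketch of how the rules above combine. It assumes the last matching rule wins, that `*` matches any run of characters, and that URLs are compared without the `http://` prefix. It uses Python's `fnmatch` as a stand-in and is not HTTrack's actual matcher; `is_allowed` is a made-up helper name.

```python
from fnmatch import fnmatchcase

# The filter lines from this post, in the order they would be entered.
RULES = [
    "-*",
    "+www.tpg.ch/documents/*",
    "+www.tpg.ch/html/pdf/*",
    "+www.tpg.ch/fr/horaires/rechercher*",
]

def is_allowed(url, rules):
    """Return True/False from the last rule that matches, or None if none match.

    Approximation only: fnmatch wildcards stand in for HTTrack's filter syntax.
    """
    verdict = None
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if fnmatchcase(url, pattern):
            verdict = (sign == "+")  # a later match overrides an earlier one
    return verdict

pdf = "www.tpg.ch/documents/10162/16057571/tpg_ligne_18-11dec2016.pdf"
other = "www.tpg.ch/fr/actualites"  # hypothetical page, for illustration only

print(is_allowed(pdf, RULES))    # allowed by the /documents/ rule
print(is_allowed(other, RULES))  # only "-*" matches, so it is blocked
```

If "-*" were placed last instead of first, it would override every "+" rule and nothing would be downloaded.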
Now your start pages. As we're restricting the site content so much, there
might not be any links from your start page to the pages that hold the PDF
links, so we add all of the base pages you need.
www.tpg.ch/fr/plans-du-reseau
www.tpg.ch/fr/plans-de-connexion
www.tpg.ch/livre_horaire
www.tpg.ch/fr/horaires
We don't need to add '+' rules for these base pages, as it's implied.
We also don't need the "www.tpg.ch" you have currently so remove it.
Other Things:
Be nice and turn some of the settings down:
MaxRate 250000
MaxConn 5
Sockets 5
Travel 1
Clear these settings
ExtDepth
Depth
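
If you prefer the command line, the same setup can be sketched roughly like this. This is a sketch only: I am assuming MaxRate maps to `-A`, Sockets to `-c` and MaxConn to `-%c`. The output folder `./tpg-mirror` is made up, the Travel setting is left out because I am not sure of its command-line equivalent, and Depth/ExtDepth are "cleared" simply by not passing them. The script only prints the command, it does not run it.

```python
import shlex

# Start pages from this post (no '+' rules needed for these).
START_URLS = [
    "www.tpg.ch/fr/plans-du-reseau",
    "www.tpg.ch/fr/plans-de-connexion",
    "www.tpg.ch/livre_horaire",
    "www.tpg.ch/fr/horaires",
]

# Filters, block-everything rule first.
FILTERS = [
    "-*",
    "+www.tpg.ch/documents/*",
    "+www.tpg.ch/html/pdf/*",
    "+www.tpg.ch/fr/horaires/rechercher*",
]

# Assumed mapping: -A = max rate (bytes/s), -c = sockets, -%c = connections/s.
LIMITS = ["-A250000", "-c5", "-%c5"]

def build_command(output_dir):
    """Assemble the httrack argument list without running anything."""
    return ["httrack", *START_URLS, "-O", output_dir, *LIMITS, *FILTERS]

# shlex.join quotes the wildcard filters so the shell does not expand them.
print(shlex.join(build_command("./tpg-mirror")))
```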
Give that a try.