HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: filter setting
Author: Xavier Roche
Date: 01/22/2002 22:36
 
> 1. When I want to download jpg and jpeg, will the regexp
>    '+*jpe*g'
>    and for htm and html
>    '*html*'
>    work?
+*.jpg +*.jpeg +*.html
should do the trick
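
To see why the patterns from the question fall short, here is a minimal sketch that uses Python's `fnmatch` as a stand-in for the `*` wildcard. It only approximates HTTrack's own matcher, and the file names are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical file names; fnmatch's "*" only approximates the filter wildcard.
names = ["photo.jpg", "photo.jpeg", "index.htm", "index.html"]

def matching(pattern):
    """Return the sample names that the wildcard pattern matches."""
    return [n for n in names if fnmatchcase(n, pattern)]

print(matching("*jpe*g"))  # needs the literal "jpe", so photo.jpg is missed
print(matching("*html*"))  # needs the literal "html", so index.htm is missed
print(matching("*.htm"))   # a separate pattern is needed for the short extension
```

Under this approximation, one explicit pattern per extension is the simplest way to cover every variant; pages ending in `.htm` would need their own `+*.htm` entry.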

> 2. I want to download files ending with
> e.g. 10, 20, 1003, 1289, 1345
> from a URI like this:
> <http://foo.com/print.phtml?id=1234>
> Is it possible to do it with one command, or is it
> necessary to use separate commands for each?
> (And is it possible to specify a range, e.g. 12-100,
> or even mixed: 5, 8, 12-100, 128, 1006-1152?)

Ranges, no, but you can use:
+foo.com/print.phtml?id=*
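
Since the filters cannot express ranges, one workaround is to expand the ranges into explicit URLs outside HTTrack. The following is a minimal sketch under that assumption; `expand_ids` and `print_urls` are hypothetical helper names, and the base URL is the example from the question.

```python
def expand_ids(spec):
    """Expand a spec such as "5, 8, 12-100" into a sorted list of integers."""
    ids = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # "12-100" means every id from 12 to 100 inclusive
            low, high = (int(x) for x in part.split("-", 1))
            ids.update(range(low, high + 1))
        else:
            ids.add(int(part))
    return sorted(ids)

def print_urls(spec, base="http://foo.com/print.phtml?id="):
    """Build one URL per id; the base URL is the example from the question."""
    return [f"{base}{n}" for n in expand_ids(spec)]

# Print one URL per line so the output can be redirected into a text file.
for url in print_urls("5, 8, 12-100, 128, 1006-1152"):
    print(url)
```

The resulting list can be saved to a file and handed to HTTrack as a URL list; the command-line help describes a `-%L <file>` option for this (one URL per line), which is worth checking against your version.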

> 3. On a page, e.g.
> 
> <http://root.cz/index.html>
>  (there are articles and discussions)
>  (clanek = article)
>  Art.1011 (http://root.cz/clanek.phtml?id=1011)
>  Disc.1011
>  <http://root.cz/forum/diskuse.php3?clanek=1011&vlakno=0&stav=0&vse=Zobrazit+v%B9e>
> 
>  Art.1010 (http://root.cz/clanek.phtml?id=1010)
>  Disc.1010
>  <http://root.cz/forum/diskuse.php3?clanek=1010&vlakno=0&stav=0&vse=Zobrazit+v%B9e>
> ('end' of index.html)
> 
> Their printer-friendly versions are
>  articles
>     <http://root.cz/print.phtml?id=1011>
>     <http://root.cz/print.phtml?id=1010>
> ('print' instead of 'clanek')
> 
>  discussions
>     <http://root.cz/forum/diskuse.php3?clanek=1011&vlakno=0&stav=0&vse=Zobrazit+v%B9e&print=1>
> 
>     <http://root.cz/forum/diskuse.php3?clanek=1010&vlakno=0&stav=0&vse=Zobrazit+v%B9e&print=1>
> ('&print=1' is appended to the end of the URI)
> 
> But they aren't linked on index.html (though they are
> linked on Art.1011 (http://root.cz/clanek.phtml?id=1011)).
> Is it possible to download only the index.html
> file and only the printer-friendly pages, with images
> and other wanted data?
> I tried it, but with no success: it was only possible
> to do it by downloading Art.10xy too.
> (Yes, there is probably a way: download index.html,
> extract the URIs with some bash scripting, replace
> clanek with print (probably with sed), and feed it
> back to httrack. But that is the hard way.)

Wow, quite a complex situation. If I understand correctly, you 
want to capture URLs that aren't linked in 
index.html but only in deeper pages, without mirroring 
those pages. This isn't possible: you have to include 
the pages which contain links to the pages you want to 
download.
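
The scripted workaround mentioned in the question (fetch index.html separately, rewrite 'clanek' to 'print', feed the result back to httrack) could look roughly like this. It is a minimal sketch, not a tested recipe: the sample HTML is invented, and `printer_urls` is a hypothetical helper name.

```python
import re

# Matches article links such as clanek.phtml?id=1011 (form taken from the question).
ARTICLE_LINK = re.compile(r"clanek\.phtml\?id=(\d+)")

def printer_urls(index_html, site="http://root.cz"):
    """Return print.phtml URLs for every article id linked in index_html."""
    ids = []
    for article_id in ARTICLE_LINK.findall(index_html):
        if article_id not in ids:  # keep first-seen order, drop duplicates
            ids.append(article_id)
    return [f"{site}/print.phtml?id={i}" for i in ids]

# Invented stand-in for the real index.html content.
sample = (
    '<a href="clanek.phtml?id=1011">Art.1011</a> '
    '<a href="/clanek.phtml?id=1010">Art.1010</a>'
)
for url in printer_urls(sample):
    print(url)
```

The printed URLs could then be given to httrack as start addresses, so the article pages themselves never need to be mirrored.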

You can exclude links, however, as in:
+root.cz/forum/diskuse.php3*
-root.cz/forum/diskuse.php3?*print=1*

These filters will accept all diskuse.php3 links except 
those with 'print=1' in the query string. By combining 
several filters, you may manage to sharpen the scope 
of the mirror.
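
How a pair of include/exclude filters combines can be modeled with a short sketch. It rests on three assumptions that are not HTTrack's actual code: `*` is the only wildcard (so `?` is literal), the URL is compared without its scheme, and the last matching rule decides. The function names are hypothetical.

```python
import re

def wildcard_to_regex(pattern):
    """Treat "*" as "any run of characters"; every other character is literal."""
    return re.compile(".*".join(re.escape(p) for p in pattern.split("*")))

def is_accepted(url, rules, default=False):
    """Apply +/- rules in order; the last rule that matches decides (assumed)."""
    target = re.sub(r"^[a-z]+://", "", url)  # assumption: compare without scheme
    verdict = default
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if wildcard_to_regex(pattern).fullmatch(target):
            verdict = (sign == "+")
    return verdict

rules = [
    "+root.cz/forum/diskuse.php3*",
    "-root.cz/forum/diskuse.php3?*print=1*",
]
plain = "http://root.cz/forum/diskuse.php3?clanek=1011&vlakno=0&stav=0"
printable = plain + "&print=1"
print(is_accepted(plain, rules))      # kept by the "+" rule
print(is_accepted(printable, rules))  # rejected by the later "-" rule
```

In this model, swapping the two rules would let the broader `+` rule override the exclusion, which is why the more specific filter is listed last.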

 