HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Scan Rules
Author: Christian
Date: 04/21/2015 06:57
 
Hi John!

Let me see if I understand what you want to do: you want to keep links that
end in "?like_comment=?" from being downloaded?
I am assuming you are mirroring a forum or a site that has a "comments"
section linked at the end of each page, correct? Or do the pages you are
copying have the comments at the bottom of the page itself? In that case it
will be much harder to avoid copying them, UNLESS the comments come from an
external site (like Facebook comments at the bottom of a page).

Assuming the "comments" are on separate pages... let me give you an example:

Since you did not name the site you are mirroring, say, for the sake of
explanation, that you WANT the following site, which links to a "comment"
section:

<http://games.gamepressure.com/game1.html>

But the page for each game (/game1.html, /game2.html, etc.) has links to
comment and video sections, which you want to avoid.

In this scenario, you WANT the game info section (example:
<http://games.gamepressure.com/game_info.asp?ID=game1>) of each game as well, so
we need to create Rules to ensure you get the pages you want and not the ones
you do not want.

So here is what you want to do: First, let's create a Rule to GET PAGES YOU
WANT (in the scenario sample above):

+*http://games.gamepressure.com/game_info.asp?ID=*
+*http://games.gamepressure.com/game_info.asp?*
+*http://games.gamepressure.com/game_info.asp*
+*http://games.gamepressure.com/game_info.asp?ID=
+*http://games.gamepressure.com/game_info.asp?
+*http://games.gamepressure.com/game_info.asp
+*/game_info.asp?ID=*
+*/game_info.asp?*
+*/game_info.asp*
+*/game_info.asp?ID=
+*/game_info.asp?
+*/game_info.asp

Do you see how I used positive (+) and wildcard (*) tags? This is to ensure I
get the "game_info" pages of each game on that site (the mirror will be
pretty big, but that is OK, since it will have the info I need). :)
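
If it helps to see why I list both the full-URL Rules and the short "/game_info.asp" ones, here is a rough Python sketch. It is only an approximation of wildcard matching, not HTTrack's real matcher: it assumes "*" means "any run of characters" and everything else is literal. The helper name rule_matches and the example.com URL are made up for the illustration.

```python
import re

def rule_matches(rule, url):
    # Drop the leading "+" or "-", then treat "*" as "any run of characters".
    pattern = rule[1:]
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, url) is not None

full_rule = "+*http://games.gamepressure.com/game_info.asp?ID=*"
short_rule = "+*/game_info.asp?ID=*"

same_site = "http://games.gamepressure.com/game_info.asp?ID=game1"
other_site = "http://example.com/game_info.asp?ID=game1"  # hypothetical URL

for url in (same_site, other_site):
    # Prints the URL, then whether the full rule and the short rule match it.
    print(url, rule_matches(full_rule, url), rule_matches(short_rule, url))
```

Under those assumptions the full-URL Rule only matches the one host, while the short Rule matches the same path on any host, so the short form is the broader of the two.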

Second, let's create a Rule to AVOID PAGES/COMMENTS YOU DO NOT WANT (in the
scenario sample I created above, you want to avoid the videos, since they are
huge in size):

-*http://games.gamepressure.com/movie.asp?ID=*
-*http://games.gamepressure.com/movie.asp?*
-*http://games.gamepressure.com/movie.asp*
-*http://games.gamepressure.com/games_movies.asp*
-*http://games.gamepressure.com/movies_list.asp?ID=*
-*http://games.gamepressure.com/movies_list.asp?*
-*http://games.gamepressure.com/movies_list.asp*
-*http://games.gamepressure.com/movies_list.*
-*http://games.gamepressure.com/movies_list*
-*/movie.asp*
-*/movie.asp?ID=*
-*/movie.asp?*
-*games_movies.asp*
-*/games_movies.asp*
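
To check how positive and negative Rules combine before running a long mirror, a small Python sketch like this can be used. It is a simplification and not HTTrack's own code: it assumes "*" is the only wildcard, that a later matching Rule overrides an earlier one (which is how I understand HTTrack's priority), and that a URL matched by no Rule falls back to a default. The names rule_matches and allowed are made up for this example.

```python
import re

def rule_matches(pattern, url):
    # "*" matches any run of characters; everything else is literal.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, url) is not None

def allowed(url, rules, default=True):
    # Assumption: the last Rule that matches decides the outcome.
    verdict = default
    for rule in rules:
        if rule_matches(rule[1:], url):
            verdict = rule.startswith("+")
    return verdict

rules = [
    "+*/game_info.asp?ID=*",
    "-*/movie.asp*",
    "-*/games_movies.asp*",
]
base = "http://games.gamepressure.com"

print(allowed(base + "/game_info.asp?ID=game1", rules))  # True
print(allowed(base + "/movie.asp?ID=game1", rules))      # False
print(allowed(base + "/games_movies.asp", rules))        # False
print(allowed(base + "/game1.html", rules))              # True (no Rule matched)
```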

I also added the negative wildcards  -*.mov -*.mpg -*.mpeg -*.avi -*.asf
-*.mp3 -*.mp2 -*.rm -*.wav -*.vob -*.qt -*.vid -*.ac3 -*.wma -*.wmv -*.mp4
-*.mp5  so I can be even more certain that I will not get the huge video
files. :) I can also add negative Rules for YouTube if the page links to
YouTube, and check the "no external pages" box on the "Build" tab.
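
If you do not want to type that list by hand, a few lines of Python can build it. This is just a convenience sketch; the function name exclusion_rules is my own, and the extensions are the ones listed above.

```python
# Extensions taken from the list above.
MEDIA_EXTENSIONS = [
    "mov", "mpg", "mpeg", "avi", "asf", "mp3", "mp2", "rm", "wav",
    "vob", "qt", "vid", "ac3", "wma", "wmv", "mp4", "mp5",
]

def exclusion_rules(extensions):
    # Build one negative wildcard Rule per file extension.
    return ["-*." + ext for ext in extensions]

# One space-separated line, ready to paste into the Scan Rules box.
print(" ".join(exclusion_rules(MEDIA_EXTENSIONS)))
```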

Makes sense?

Third, your scenario. Since I do not know what page you are trying to block
comments from, it is hard to get specific.

But GENERALLY speaking, you can try the following Rules to avoid the comments.
(I try to be broad with what I do not want, and if that accidentally skips a
lot of things I do want, I modify the Rule and "update" the project.) :)

-*?like_comment=?*
-*?like_comment=?
-*like_comment=?*
-*like_comment=?
-*like_comment=*
-*like_comment=
-*like_comment*
-*like_comment


As you can see, I put a "*" at the end of some Rules and not others. There is
a reason for that, but unless I know your specific site, it is hard to know
whether you need that * at the end or not.
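
To give a rough idea of what the trailing * changes, here is a small Python sketch. It is an approximation only: it assumes "*" means "any run of characters" and every other character is literal, which may not match HTTrack's real behaviour in every case. The helper name rule_matches and both example.com URLs are made up, and the patterns are simplified versions of the like_comment Rules.

```python
import re

def rule_matches(pattern, url):
    # "*" matches any run of characters; everything else is literal.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, url) is not None

# Hypothetical URLs, one with a value after "like_comment=" and one without.
url_with_value = "http://example.com/post.php?like_comment=12"
url_bare = "http://example.com/post.php?like_comment="

for pattern in ("*like_comment=*", "*like_comment="):
    # Prints the pattern, then whether it matches each of the two URLs.
    print(pattern, rule_matches(pattern, url_with_value),
          rule_matches(pattern, url_bare))
```

Under these assumptions, the pattern with the trailing * matches both links, while the one without it only matches a link that ends exactly at "like_comment=". Which of the two you need depends on what the links on your site look like.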



I can help you further if you provide more info. Please clarify, or provide
sample links for the pages you want and the ones you want to avoid.

Also, I posted a bunch of pre-made Rules on the Forum last week to help other
average users. :) Check it out:
<http://forum.httrack.com/readmsg/33945/index.html>


I hope this helped. :)

Christian ^_^
 