HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: All around help
Author: Matt
Date: 09/30/2017 22:42
 
There are two major parts that you are confusing.

1) The Start or Base Page. You can think of this as the first link that
you're forcing HTTrack to click on.

2) The Filter Rules. These are what stop you from downloading every page on
the entire internet; it's where the +, -, and * rules go.


So let's say you wanted to download OUR site, httrack.com.

In the "Web Adresses: (URL)" box you would enter
"www.httrack.com"
Its recommended tyhat you don't enter the 'http:\\' part unless you later have
problems.
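
(Side note: if you're using the command-line version instead of the GUI, the
same starting point looks roughly like

httrack www.httrack.com -O ./httrack-mirror

where -O tells HTTrack what folder to write the mirror into. The folder name
./httrack-mirror is just an example.)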


THIS WILL GENERATE A DEFAULT SET OF RULES THAT ARE BUGGY AT BEST
 - The gist of the default rules is "pages similar-ish to the base page"

So we want to write our own rules.
Click the "Set options..." button
Click the "Filter" tab

Delete everything that is there.

First rule (BLOCK EVERYTHING): for the first rule line, write
-*

Now we allow back the stuff we want
+www.httrack.com/*
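
(On the command line, these same two rules are passed after the URL, quoted
so the shell doesn't expand the *, something like

httrack www.httrack.com -O ./httrack-mirror "-*" "+www.httrack.com/*"

The quoting is a shell detail; the rules themselves are identical either way.)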

See, if you look on the front page there is a link below "Welcome" labeled
"free" that points to <http://www.gnu.org/philosophy/free-sw.html>, and a
link next to it labeled "GPL" pointing at
<http://www.gnu.org/licenses/gpl.txt>.

Without blocking everything first and then adding back the stuff we want,
HTTrack would grab all of that too, and all the stuff that stuff points to,
until we have the entire internet.


Oh, but wait, that DIDN'T WORK!!!! The blog is missing!!! Why???? Because the
blog is really at blog.httrack.com.

So we could add
+blog.httrack.com/*

OR
change the rule from "+www.httrack.com/*" to the more inclusive
+*httrack.com/*


So we're good?? NO, still not working.

The blog requires a bit of code that it doesn't serve itself, but tells your
browser to go get from Google, called jQuery. So we need to allow HTTrack to
get that too:
+ajax.googleapis.com/*
This is found by looking at the actual site's source code, or looking at the
broken downloaded site's "External" links.
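
So, putting it all together, the Filters box now reads:

-*
+*httrack.com/*
+ajax.googleapis.com/*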

Usually it will take only one or two extra rules like that to get the site to
download properly.


The other thing to remember is that when you limit yourself to just a certain
part of a website, you must allow HTTrack to get the pages that have the
links to the stuff you want.

Let's say you're downloading a site full of PDFs about fruit.

If all the PDFs on apple varieties are linked off a page about apples, you
need to get that page, and likewise the pear page for pears. AND you would
need to get the Fruit Trees page to get the links to the pear page and the
apple page. Now, if there are 50 fruit tree pages and you really only want
those two, you could do a few things to limit it.


You could add rules to your filter to limit which links found on the "Fruit
Trees" page get followed (there's a sketch of both approaches below)

OR 

Not get the "Fruit Trees" page at all, and instead use two base pages (the
apple page and the pear page), since the base pages are just starting links
you provide to HTTrack.
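
For illustration (the domain and paths here are completely made up), the
filter approach might look like:

-*
+www.fruit-example.com/fruit-trees.html
+www.fruit-example.com/apples/*
+www.fruit-example.com/pears/*

The leading -* blocks the other 48 fruit tree links, while the two wildcard
rules pull in the apple and pear pages plus any PDFs living under those
paths. The two-base-pages approach would drop the fruit-trees.html rule,
keep the rest, and list both pages in the "Web Addresses: (URL)" box:

www.fruit-example.com/apples/
www.fruit-example.com/pears/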


I hope that helps.
