HTTrack version 3.03beta (Jul 2001) pre-release notes
This file is a set of various notes about new features
introduced since 2.X releases of HTTrack. Many new
features and changes haven't yet been included in the
current documentation. Most features can be listed
using the commandline version of HTTrack:
$ httrack --quiet --help | more
Important note:
These remarks apply to the commandline version of
HTTrack, but also to the GUI version: most options
have been implemented in the interface.
Options not implemented in the GUI version can be
inserted directly in the "filters" list. Therefore,
all these options can be used in the GUI version.
FAQ
---
The "general" FAQ, which lists most-frequently-asked
questions, has been updated.
See "faq.html"
New filters options:
--------------------
A new filter type has been added, which allows you to
accept/refuse files depending on their size. This
filter has to be included at the end of a standard
filter.
The syntax is:
<filter>*[<NN]
or
<filter>*[>NN]
or
<filter>*[<NN>PP]
The meaning of each filter:
*[<NN]
Means "match if the file is smaller than NN KB"
*[>NN]
Means "match if the file is bigger than NN KB"
*[<NN>PP]
Means "match if the file is smaller than NN KB AND
bigger than PP KB"
Example:
-*.gif*[>5] -*.zip +*.zip*[<10]
Will:
refuse all gif files bigger than 5KB,
exclude all zip files,
EXCEPT zip files smaller than 10KB
Note that in -*.gif*[>5] you can consider two parts:
- first part: "-*.gif" which is the well-known filter
- second part: "*[>5]" which is used when a filesize
is available
You can use the "size" filter anywhere, like in:
-*[>5]*.gif
BUT this is not recommended, as the syntax is not very
clear. Therefore, always put "size filters" at the end
of expressions.
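The size filter rules above can be modelled in a few
lines of Python. This is only a sketch of the semantics
described in these notes, not HTTrack's own code: the
function names are hypothetical, Python's fnmatch
stands in for HTTrack's wildcard matching, and it
assumes that a later matching filter overrides an
earlier one.

```python
import re
from fnmatch import fnmatchcase

# Trailing size condition: *[<NN], *[>NN] or *[<NN>PP] (sizes in KB).
SIZE_RE = re.compile(r"\*\[(?:<(\d+))?(?:>(\d+))?\]$")


def parse_filter(rule):
    """Split '+pattern*[<NN>PP]' into (accept, glob, max_kb, min_kb)."""
    accept = rule[0] == "+"
    body = rule[1:]
    max_kb = min_kb = None
    match = SIZE_RE.search(body)
    if match:
        body = body[:match.start()]
        max_kb = int(match.group(1)) if match.group(1) else None
        min_kb = int(match.group(2)) if match.group(2) else None
    return accept, body, max_kb, min_kb


def decide(rules, name, size_kb, default=True):
    """Return True if the file is accepted, False if it is refused."""
    verdict = default
    for rule in rules:
        accept, glob, max_kb, min_kb = parse_filter(rule)
        if not fnmatchcase(name, glob):
            continue
        if max_kb is not None and not size_kb < max_kb:
            continue
        if min_kb is not None and not size_kb > min_kb:
            continue
        # Assumption: the last matching rule wins.
        verdict = accept
    return verdict


rules = ["-*.gif*[>5]", "-*.zip", "+*.zip*[<10]"]
for name, size_kb in [("logo.gif", 3), ("photo.gif", 40),
                      ("big.zip", 500), ("small.zip", 4)]:
    print(name, size_kb, decide(rules, name, size_kb))
```

With the example rules, the 40KB gif and the 500KB zip
are refused, while the 3KB gif and the 4KB zip are kept.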
Limits options:
---------------
%eN set the external links depth to N (* %e0)
(--ext-depth[=N])
This option is very useful when you have a website
with external HTML links you would like to download
as well.
The default behaviour is not to get external HTML
files, but here, you can tell the engine "if you see
an external HTML link, download it and get N levels"
Of course this option has to be used with care!
%cN maximum number of connections/second (*%c10)
You can tell the engine to limit the number of
connections per second. This allows you to limit the
bandwidth or the processing load on webservers, and
helps prevent "overloads"
%L <file> add all URLs located in this text file (one
URL per line) (--list <param>)
This option is not new, but its speed has been greatly
improved!
Build options:
-------------
%x do not include any password for external password
protected websites (%x0 include) (--no-passwords)
If you mirror
<http://smith_john:foobar@www.privatefoo.com/smith/>,
and exclude some links using filters, these links will
by default be rewritten with password data. For
example, "bar.html" will be renamed into
<http://smith_john:foobar@www.privatefoo.com/smith/bar.html>
This can be a problem if you don't want to disclose
the username/password!
The %x option tells the engine not to include
username/password data in rewritten URLs
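To illustrate what "not including username/password
data" means for a rewritten URL, here is a small Python
sketch. It only shows the transformation on a single
URL; the function name strip_credentials is
hypothetical and this is not how the engine itself is
written.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_credentials(url):
    """Drop the 'user:password@' part of a URL, keep everything else."""
    parts = urlsplit(url)
    # Everything after the last '@' in the network location is host[:port].
    netloc = parts.netloc.rpartition("@")[2]
    return urlunsplit(parts._replace(netloc=netloc))


print(strip_credentials(
    "http://smith_john:foobar@www.privatefoo.com/smith/bar.html"))
```

URLs without credentials are returned unchanged.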
%q *include query string for local files (useless,
for information purpose only) (%q0 don't include)
(--include-query-string)
This option is not very useful, because parameters are
useless once pages are mirrored and no longer dynamic.
But some javascript code may use the query string, and
it can give useful information. For example:
catalog4FB8.html?page=computer-science
is clearer than
catalog4FB8.html
Therefore, this option is activated by default
Spider options:
---------------
%s update hacks: various hacks to limit re-transfers
when updating (identical size, bogus response..)
(--updatehack)
This is a collection of "tricks" which are not
really "RFC compliant" but which can save bandwidth
by trying not to retransfer data in several cases
%A assume that a type (cgi,asp..) is always linked
with a mime type (-%A php3=text/html)
(--assume <param>)
The most important new feature for some people, maybe.
This option tells the engine that if a link with a
specific type (.cgi, .asp, or .php3 for example) is
encountered, it MUST assume that this link always has
the same MIME type, for example the "text/html" MIME
type.
This is VERY important to speed up many mirrors.
I have done tests on big HTML files (approx. 150 MB,
150,000,000 bytes!) with 100,000 links inside.
Such files are parsed in approx. 20 seconds on
my own PC by the latest optimized releases of HTTrack.
But these tests have been done with links of known
types, that is, html, gif, and so on..
If you have, say, 10,000 links of unknown type, such
as ".asp", this will cause the engine to test ALL
these files, and this will SLOOOOW down the parser. In
this example, the parser will take hours, instead of
20 seconds!
In this case, it would be great to tell HTTrack: ".asp
pages are in fact HTML pages"
This is possible, using:
-%A asp=text/html
The -%A option can be replaced by the alias
--assume asp=text/html
which is MUCH clearer.
You can use multiple definitions, separated by ","
(but ";" can also be used), or use multiple options.
As the third line shows, several types listed before
one "=" share the same MIME type.
Therefore, these three lines are identical:
--assume asp=text/html --assume php3=text/html --assume cgi=image/gif
--assume asp=text/html,php3=text/html,cgi=image/gif
--assume asp,php3=text/html,cgi=image/gif
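The equivalence of these forms can be checked with a
short Python sketch that turns an "assume" definition
into a type-to-MIME table. This is an illustration
only: the parse_assume function is hypothetical, and
the grouping rule (types without "=" take the MIME type
of the next definition) is inferred from the examples
above.

```python
import re


def parse_assume(spec):
    """Parse 'asp,php3=text/html;cgi=image/gif' into a dict."""
    mapping = {}
    pending = []  # types seen so far that still wait for a MIME type
    for item in re.split(r"[,;]", spec):
        item = item.strip()
        if not item:
            continue
        ext, sep, mime = item.partition("=")
        pending.append(ext)
        if sep:
            # Assumption: every pending type shares this MIME type.
            for name in pending:
                mapping[name] = mime
            pending = []
    return mapping


print(parse_assume("asp,php3=text/html,cgi=image/gif"))
```

Both the long form and the grouped form give the same
table, and "," and ";" are treated alike.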
The MIME type is the standard well known "MIME" type.
Here are the most important ones:
text/html Html files, parsed by HTTrack
image/gif GIF files
image/jpeg Jpeg files
image/png PNG files
There is also a collection of "non standard" MIME
types. Example:
application/x-foo Files with "foo" type
Therefore, you can give all files ending
with ".mp3" the MIME type:
application/x-mp3
This allows you to rename files on a mirror. If you
KNOW that all "dat" files are in fact "zip" files
renamed into "dat", you can tell httrack:
--assume dat=application/x-zip
You can also "name" a file type, with its original
MIME type, if this type is not known by HTTrack. This
will avoid a test when the link is reached:
--assume foo=application/foobar
In this case, HTTrack won't check the type, because it
has learned that "foo" is a known type with the MIME
type "application/foobar". Therefore, it will leave
the "foo" type untouched.
A last remark, you can use complex definitions like:
--assume asp,php3=text/html,cgi=image/gif,dat=application/x-zip,mpg=application/x-mp3,foo=application/foobar
..and save it in your .httrackrc file:
set assume asp=text/html,php3=text/html,cgi=image/gif,dat=application/x-zip,mpg=application/x-mp3,foo=application/foobar
Browser ID:
----------
%l preferred language (-%l "fr, en, jp, *")
(--language <param>)
This option allows you to define the preferred
language. Some websites will use this parameter to
generate pages in the desired language.
Example:
-%l "fr, en, jp, *"
"I prefer to have pages in French, then
English, then Japanese, then any other language"
Log, index, cache:
-----------------
%v display on screen filenames downloaded (in
realtime) (--display)
Animated information when using the console-based
version, example:
17/95: localhost/manual/handler.html (6387 bytes) - OK
f2 one single log file (--single-log)
Do not split error and information log (hts-log.txt
and hts-err.txt) - use only one file (hts-log.txt)
%I make a searchable index for this mirror (* %I0
don't make) (--search-index)
Still in testing, this option asks the engine to
generate an index.txt, usable by third-party programs
or scripts, to index all words contained in html files
Example:
$ httrack -%I linux.localdomain
..
$ more index.txt
..
abridged
1 linux/manual/misc/API.html
=1
(0)
absence
3 linux/manual/mod/core.html
2 linux/manual/mod/mod_imap.html
1 linux/manual/misc/nopgp.html
1 linux/manual/mod/mod_proxy.html
1 linux/manual/new_features_1_3.html
=8
(0)
absolute
3 linux/manual/mod/mod_auth_digest.html
1 linux/manual/mod/mod_so.html
=4
(0)
..
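As an example of a "third-party script", the listing
above can be read with a few lines of Python. The
layout is guessed from this sample only: a word on its
own line, then "count path" lines, then an "=" line
that appears to hold the total (the sample totals match
the sums of the counts) and a "(N)" line whose meaning
is not documented here and is ignored. The parse_index
function is hypothetical.

```python
def parse_index(text):
    """Return {word: {path: count}} from index.txt-style text."""
    index = {}
    word = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "..":
            continue
        first, _, rest = line.partition(" ")
        if first.isdigit() and rest and word is not None:
            # "count path" line belonging to the current word.
            index[word][rest.strip()] = int(first)
        elif line.startswith("=") or line.startswith("("):
            # "=total" and "(N)" lines are skipped in this sketch.
            continue
        else:
            word = line
            index.setdefault(word, {})
    return index


sample = """abridged
1 linux/manual/misc/API.html
=1
(0)
absolute
3 linux/manual/mod/mod_auth_digest.html
1 linux/manual/mod/mod_so.html
=4
(0)
"""
for word, hits in parse_index(sample).items():
    print(word, sum(hits.values()), sorted(hits))
```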
Guru options: (do NOT use)
--------------------------
This is a new section, for all "not very well
documented options". You can use them, in fact: do not
believe what is written above!
#0 Filter test (-#0 '*.gif' 'www.bar.com/foo.gif')
To test the filter system. Example:
$ httrack -#0 'www.*.com/*foo*bar.gif' 'www.mysite.com/test/foo4bar.gif'
www.mysite.com/test/foo4bar.gif does match
www.*.com/*foo*bar.gif
#f Always flush log files
Useful if you want the hts-log.txt file to be flushed
regularly (not buffered)
#FN Maximum number of filters
Use it if you want to use more than the maximum
default number of filters, that is, 500 filters:
-#F2000
for 2,000 filters
#h Version info
Information on the version number
#K Scan stdin (debug)
Not useful (debug only)
#L Maximum number of links (-#L1000000)
Use it if you want to use more than the maximum
default number of links, that is, 100,000 links:
-#L2000000
for 2,000,000 links
#p Display ugly progress information
Self-explanatory :)
I will have to improve this one
#P Catch URL
"Catch URL" feature: lets you set up a temporary proxy
to capture complex URLs, often linked with a POST
action (when using form based authentication)
#R Old FTP routines (debug)
Debug..
#T Generate transfer ops. log every minute
Generate a log file with transfer statistics
#u Wait time
"On hold" option, in seconds
#Z Generate transfer rate statistics every minute
Generate a log file with transfer statistics
#! Execute a shell command (-#! "echo hello")
Debug..
Command-line specific options:
-----------------------------
V execute system command after each file ($0 is the
filename: -V "rm \$0") (--userdef-cmd <param>)
Useful when you want to launch a command each time a
new file is downloaded. Example:
httrack -%I localhost -V "tar uvf foo.tar \$0; rm -f \$0"
%U run the engine with another id when called as root
(-%U smith) (--user <param>)
Change the UID of the owner when running as root
Details: User-defined option N
%[param] param variable in query string
This new option is important: you can include
query-string content when forming the destination
filename!
Example: you are mirroring a huge website, with many
pages named as:
www.foo.com/catalog.php3?page=engineering
www.foo.com/catalog.php3?page=biology
www.foo.com/catalog.php3?page=computing
..
Then you can use the -N option:
httrack www.foo.com -N "%h%p/%n%[page].%t"
If found, the "page" parameter will be included after
the filename, and the URLs above will be saved as:
/home/mywebsites/foo/www.foo.com/catalogengineering.php3
/home/mywebsites/foo/www.foo.com/catalogbiology.php3
/home/mywebsites/foo/www.foo.com/catalogcomputing.php3
..
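The way the pattern is expanded can be sketched in
Python. This is a simplified model, not the engine's
code: the expand function is hypothetical, and it
assumes the usual meaning of the -N fields (%h host,
%p path without trailing slash, %n name without type,
%t type). It returns the part below the mirror
directory only.

```python
import posixpath
import re
from urllib.parse import parse_qs, urlsplit


def expand(pattern, url):
    """Expand %h, %p, %n, %t and %[param] for one URL."""
    parts = urlsplit(url if "://" in url else "//" + url)
    directory, filename = posixpath.split(parts.path)
    name, ext = posixpath.splitext(filename)
    query = parse_qs(parts.query)
    fields = {
        "h": parts.netloc,
        "p": directory.rstrip("/"),
        "n": name,
        "t": ext.lstrip("."),
    }

    def substitute(match):
        if match.group(1) is not None:
            # %[param]: empty string if the parameter is not found.
            return query.get(match.group(1), [""])[0]
        # Unknown single-letter fields are left as they are.
        return fields.get(match.group(2), match.group(0))

    return re.sub(r"%\[(\w+)\]|%(\w)", substitute, pattern)


for page in ("engineering", "biology", "computing"):
    url = "www.foo.com/catalog.php3?page=" + page
    print(expand("%h%p/%n%[page].%t", url))
```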