HTTrack 3.03 (Jul 2001) pre-release notes - HTTrack Website Copier Forum

Subject: HTTrack 3.03 (Jul 2001) pre-release notes
Author: Xavier Roche
Date: 07/03/2001 08:23
HTTrack version 3.03beta (Jul 2001) pre-release notes

This file is a set of various notes about new features 
introduced since 2.X releases of HTTrack. Many new
features and changes haven't yet been included in the 
current documentation. Most features can be listed 
using the commandline version of HTTrack:
$ httrack --quiet --help | more

Important note:
These remarks apply to the commandline version of 
HTTrack, but also to the GUI version: most options 
have been implemented in the interface.
Options not implemented in the GUI version can be 
inserted directly in the "filters" list. Therefore, 
all these options can be used in the GUI version.


FAQ
---

The "general" FAQ, which lists most-frequently-asked 
questions, has been updated.
See "faq.html"


New filters options:
--------------------

A new filter type has been added, which allows you to 
accept/refuse files depending on their size. This 
filter has to be appended to the end of a standard 
filter.
The syntax is:

<filter>*[<NN]
or
<filter>*[>NN]
or
<filter>*[<NN>PP]

The meaning of each filter:

*[<NN]
Means "match if the file is smaller than NN KB"

*[>NN]
Means "match if the file is bigger than NN KB"

*[<NN>PP]
Means "match if the file is smaller than NN KB AND 
bigger than PP KB"


Example:
-*.gif*[>5] -*.zip +*.zip*[<10]

Will:
refuse all gif files bigger than 5KB, 
exclude all zip files,
EXCEPT zip files smaller than 10KB

Note that in -*.gif*[>5] you can consider two parts:
- first part: "-*.gif" which is the well-known filter
- second part: "*[>5]" which is used when a filesize 
is available

You can use the "size" filter anywhere, like in:
-*[>5]*.gif

BUT this is not recommended, as the syntax is not very 
clear. Therefore, always put "size filters" at the end 
of expressions.
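
The rules above can be sketched in a few lines of Python. 
This is only an illustration of how the example reads, not 
HTTrack's own code. It assumes that a later rule overrides 
an earlier one (which is what the zip example suggests), 
that sizes are given in KB, and that a file matched by no 
rule is accepted. The names parse_rule, rule_matches and 
decide are made up for this sketch.

```python
import re
from fnmatch import fnmatchcase

# Trailing size part of a rule, e.g. "*[>5]" or "*[<10>2]"
SIZE_SUFFIX = re.compile(r"\*\[((?:[<>]\d+)+)\]$")


def parse_rule(rule):
    """Split '+pattern*[<NN>PP]' into (accept, glob, max_kb, min_kb)."""
    accept = rule[0] == "+"
    body = rule[1:]
    max_kb = min_kb = None
    m = SIZE_SUFFIX.search(body)
    if m:
        body = body[:m.start()]          # keep only the "well-known" part
        for op, num in re.findall(r"([<>])(\d+)", m.group(1)):
            if op == "<":
                max_kb = int(num)        # "smaller than NN KB"
            else:
                min_kb = int(num)        # "bigger than PP KB"
    return accept, body, max_kb, min_kb


def rule_matches(rule, url, size_kb):
    """True if both the name part and the size part of the rule match."""
    _, glob, max_kb, min_kb = parse_rule(rule)
    if not fnmatchcase(url, glob):
        return False
    if max_kb is not None and not size_kb < max_kb:
        return False
    if min_kb is not None and not size_kb > min_kb:
        return False
    return True


def decide(rules, url, size_kb, default=True):
    """Assumption: the last matching rule wins; no match means `default`."""
    verdict = default
    for rule in rules:
        if rule_matches(rule, url, size_kb):
            verdict = rule.startswith("+")
    return verdict


rules = ["-*.gif*[>5]", "-*.zip", "+*.zip*[<10]"]
print(decide(rules, "www.foo.com/big.gif", 8))
print(decide(rules, "www.foo.com/small.zip", 4))
```

With the rules from the example, an 8KB gif is refused and 
a 4KB zip is accepted, while a 50KB zip stays excluded.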


Limits options:
---------------

 %eN set the external links depth to N (* %e0) 
(--ext-depth[=N])

This option is very useful when a website links to 
external HTML pages that you would like to download 
together with it.
The default behaviour is not to get external HTML 
files, but here you can tell the engine "if you see 
an external HTML link, download it and follow it for 
N levels".
Of course this option has to be used with care!

 %cN maximum number of connections/seconds (*%c10)

You can tell the engine to limit the number of 
connections per second. This limits the bandwidth used 
and the load on webservers, and helps prevent 
"overloads".
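
To give an idea of what such a limit means in practice, 
here is a small Python sketch of one possible throttling 
scheme: connections are spaced evenly, 1/N second apart. 
This is an assumption for illustration only; these notes do 
not say how the engine actually schedules connections, and 
the class name ConnectionThrottle is hypothetical.

```python
class ConnectionThrottle:
    """Space connections at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second):
        self.min_interval = 1.0 / max_per_second
        self.next_allowed = 0.0   # earliest time the next connection may start

    def delay_before_next(self, now):
        """Return how long to wait at time `now`, and reserve that slot."""
        wait = max(0.0, self.next_allowed - now)
        self.next_allowed = max(now, self.next_allowed) + self.min_interval
        return wait


throttle = ConnectionThrottle(10)          # same value as the %c10 default
for _ in range(3):
    # Three connections requested at the same instant (t = 0.0)
    print(throttle.delay_before_next(0.0))
```

The caller is expected to sleep for the returned delay 
before opening the connection; passing the clock in as 
`now` keeps the sketch easy to test.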

 %L <file> add all URL located in this text file (one 
URL per line) (--list <param>)

This option is not new, but its speed has been greatly 
improved!


Build options:
-------------

 %x  do not include any password for external password 
protected websites (%x0 include) (--no-passwords)

If you mirror 
<http://smith_john:foobar@www.privatefoo.com/smith/>, 
and exclude some links using filters, these links will 
by default be rewritten with the password data. For 
example, "bar.html" will be rewritten as
<http://smith_john:foobar@www.privatefoo.com/smith/bar.html>
This can be a problem if you don't want to disclose 
the username/password!
The %x option tells the engine not to include 
username/password data in rewritten URLs.

 %q *include query string for local files (useless, 
for information purpose only) (%q0 don't include) 
(--include-query-string)

This option is not very useful in itself, because 
query parameters have no effect once pages are 
mirrored and no longer dynamic. But some javascript 
code may use the query string, and
it can give useful information. For example:
catalog4FB8.html?page=computer-science
is clearer than
catalog4FB8.html
Therefore, this option is activated by default.

Spider options:
---------------

 %s  update hacks: various hacks to limit re-transfers 
when updating (identical size, bogus response..) 
(--updatehack)

This is a collection of "tricks" which are not 
really "RFC compliant" but which can save bandwidth
by trying not to retransfer data in several cases

 %A  assume that a type (cgi,asp..) is always linked 
with a mime type (-%A php3=text/html) (--assume 
<param>)

For some people, this may be the most important new 
feature. This option tells the engine that whenever it 
encounters a link with a specific type (.cgi, .asp, or 
.php3 for example), it MUST assume that this link 
always has the same MIME type, for example
the "text/html" MIME type.
This is VERY important to speed up many mirrors. 

I have done tests on big HTML files (approx. 150 MB, 
150,000,000 bytes!) with 100,000 links inside.
Such files are parsed in approx. 20 seconds on 
my own PC by the latest optimized releases of HTTrack. 
But these tests were done with links of known 
types, that is, html, gif, and so on.
If you have, say, 10,000 links of unknown type, such 
as ".asp", the engine has to test ALL these files, 
and this will SLOOOOW down the parser. In this example, 
the parser will take hours, instead of 20 seconds!

In this case, it would be great to tell HTTrack: ".asp 
pages are in fact HTML pages"
This is possible, using:
-%A asp=text/html

The -%A option can be replaced by the alias
--assume asp=text/html

which is MUCH clearer.

You can use multiple definitions, separated by "," 
(";" can also be used), or use multiple options. 
Therefore, these three lines are equivalent:
--assume asp=text/html --assume php3=text/html --assume cgi=image/gif
--assume asp=text/html,php3=text/html,cgi=image/gif
--assume asp,php3=text/html,cgi=image/gif
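
To make the equivalence concrete, here is a short Python 
sketch that reads such definitions into a table of 
extension -> MIME type. It is only my reading of the syntax 
described above (separators "," or ";", and a bare 
extension such as "asp" taking the type of the next 
"ext=type" item), not HTTrack's parser. The function name 
parse_assume is hypothetical.

```python
import re


def parse_assume(*values):
    """Turn one or more --assume values into {extension: mime_type}."""
    mapping = {}
    for value in values:
        pending = []                      # extensions still waiting for "=type"
        for item in re.split(r"[,;]", value):
            item = item.strip()
            if not item:
                continue
            if "=" in item:
                ext, mime = item.split("=", 1)
                for name in pending + [ext]:
                    mapping[name] = mime
                pending = []
            else:
                pending.append(item)      # e.g. "asp" in "asp,php3=text/html"
    return mapping


print(parse_assume("asp=text/html", "php3=text/html", "cgi=image/gif"))
print(parse_assume("asp=text/html,php3=text/html,cgi=image/gif"))
print(parse_assume("asp,php3=text/html,cgi=image/gif"))
```

All three calls build the same table, which is why the 
three command lines can be used interchangeably.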

The MIME type is the standard, well-known "MIME" type. 
Here are the most important ones:
text/html	HTML files, parsed by HTTrack
image/gif	GIF files
image/jpeg	JPEG files
image/png	PNG files

There is also a collection of "non standard" MIME 
types. Example:
application/x-foo	Files with "foo" type

Therefore, you can give all files ending 
with ".mp3" the MIME type:
application/x-mp3

This allows you to rename files on a mirror. If you 
KNOW that all "dat" files are in fact "zip" files 
renamed into "dat", you can
tell httrack:

--assume dat=application/x-zip

You can also "name" a file type with its original 
MIME type, if this type is not known by HTTrack. This 
avoids a test when the link is reached:

--assume foo=application/foobar

In this case, HTTrack won't check the type, because it 
has learned that "foo" is a known type, with the MIME 
type "application/foobar". Therefore, it will leave 
the "foo" type untouched.


A last remark: you can use complex definitions like:
--assume asp,php3=text/html,cgi=image/gif,dat=application/x-zip,mpg=application/x-mp3,foo=application/foobar

..and save them in your .httrackrc file:

set assume asp=text/html,php3=text/html,cgi=image/gif,dat=application/x-zip,mpg=application/x-mp3,foo=application/foobar


Browser ID:
----------

 %l  preferred language (-%l "fr, en, jp, *") 
(--language <param>)

This option allows you to define the preferred 
language. Some websites will use this parameter to 
generate pages in the desired language.
Example:
-%l "fr, en, jp, *"

"I prefer to have pages in French, then 
English, then Japanese, then any other language"

Log, index, cache:
-----------------

 %v  display on screen filenames downloaded (in 
realtime) (--display)

Animated information when using the console-based 
version, example:
17/95: localhost/manual/handler.html (6387 bytes) - OK

  f2 one single log file (--single-log)

Do not split error and information log (hts-log.txt 
and hts-err.txt) - use only one file (hts-log.txt)

 %I  make a searchable index for this mirror (* %I0 
don't make) (--search-index)

Still in testing, this option asks the engine to 
generate an index.txt, useable by third-party programs 
or scripts,
to index all words contained in html files

Example:
$ httrack -%I linux.localdomain
..
$ more index.txt
..
abridged
        1 linux/manual/misc/API.html
        =1
        (0)
absence
        3 linux/manual/mod/core.html
        2 linux/manual/mod/mod_imap.html
        1 linux/manual/misc/nopgp.html
        1 linux/manual/mod/mod_proxy.html
        1 linux/manual/new_features_1_3.html
        =8
        (0)
absolute
        3 linux/manual/mod/mod_auth_digest.html
        1 linux/manual/mod/mod_so.html
        =4
        (0)
..
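
Since index.txt is meant for third-party programs or 
scripts, here is a small Python sketch that reads the 
layout shown in the example. It assumes only what the 
sample shows: a word on an unindented line, then indented 
"count path" lines, then an "=total" line. The "(0)" line 
is not explained in these notes, so the sketch simply skips 
it. As the option is still in testing, the real format may 
differ; parse_index is a hypothetical helper.

```python
def parse_index(text):
    """Parse index.txt-style text into {word: {"hits": {path: n}, "total": n}}."""
    index = {}
    word = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "..":
            continue                       # blank line or elision marker
        if not raw[0].isspace():
            word = line                    # unindented line: a new word
            index[word] = {"hits": {}, "total": None}
        elif word is None:
            continue                       # indented data before any word
        elif line.startswith("="):
            index[word]["total"] = int(line[1:])
        elif line.startswith("("):
            continue                       # meaning not documented here; ignored
        else:
            count, path = line.split(None, 1)
            index[word]["hits"][path] = int(count)
    return index


SAMPLE = """\
abridged
        1 linux/manual/misc/API.html
        =1
        (0)
absolute
        3 linux/manual/mod/mod_auth_digest.html
        1 linux/manual/mod/mod_so.html
        =4
        (0)
"""

for word, entry in parse_index(SAMPLE).items():
    print(word, entry["total"], sorted(entry["hits"]))
```

In the sample, the "=N" value is the sum of the per-file 
counts for the word, which gives a quick consistency check.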


Guru options: (do NOT use)
--------------------------

This is a new section, for all "not very well 
documented options". You can in fact use them: do not 
take the "do NOT use" warning above literally!

 #0  Filter test (-#0 '*.gif' 'www.bar.com/foo.gif')

To test the filter system. Example:
$ httrack -#0 'www.*.com/*foo*bar.gif' 'www.mysite.com/test/foo4bar.gif'
www.mysite.com/test/foo4bar.gif does match www.*.com/*foo*bar.gif

 #f  Always flush log files

Useful if you want the hts-log.txt file to be flushed 
regularly (not buffered)

 #FN Maximum number of filters

Use it if you want to use more than the default 
maximum number of filters, that is, 500 filters:
-#F2000
for 2,000 filters

 #h  Version info

Information on the version number

 #K  Scan stdin (debug)

Not useful (debug only)

 #L  Maximum number of links (-#L1000000)

Use it if you want to use more than the default 
maximum number of links, that is, 100,000 links:
-#L2000000
for 2,000,000 links

 #p  Display ugly progress information

Self-explanatory :)
I will have to improve this one

 #P  Catch URL

"Catch URL" feature: lets you set up a temporary proxy 
to capture complex URLs, often linked with a POST action 
(when using form-based authentication)

 #R  Old FTP routines (debug)

Debug..

 #T  Generate transfer ops. log every minute

Generate a log file with transfer statistics

 #u  Wait time

"On hold" option, in seconds

 #Z  Generate transfer rate statistics every minute

Generate a log file with transfer statistics

 #!  Execute a shell command (-#! "echo hello")

Debug..


Command-line specific options:
-----------------------------

  V execute system command after each file ($0 is the 
filename: -V "rm \$0") (--userdef-cmd <param>)

Useful when you want to launch a command each time a 
new file is downloaded. Example:

httrack -%I localhost -V "tar uvf foo.tar \$0; rm -f \$0"

 %U run the engine with another id when called as root 
(-%U smith) (--user <param>)

Change the UID of the owner when running as root

  Details: User-defined option N
    %[param] param variable in query string

This new option is important: you can include 
query-string content when forming the destination filename!

Example: you are mirroring a huge website, with many 
pages named as:
www.foo.com/catalog.php3?page=engineering
www.foo.com/catalog.php3?page=biology
www.foo.com/catalog.php3?page=computing
..

Then you can use the -N option:
httrack www.foo.com -N "%h%p/%n%[page].%t"

If found, the "page" parameter will be included after 
the filename, and the URLs above will be saved as:

/home/mywebsites/foo/www.foo.com/catalogengineering.php3
/home/mywebsites/foo/www.foo.com/catalogbiology.php3
/home/mywebsites/foo/www.foo.com/catalogcomputing.php3
..
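
The expansion of the -N template can be sketched in Python 
as follows. This is an illustration, not HTTrack's code: it 
reads %h as the host name, %p as the path without its 
trailing "/", %n as the file name without its type and %t 
as the file type, which is an assumption that happens to 
reproduce the names listed above. The function local_name 
is hypothetical, and the result is relative to the mirror 
directory (/home/mywebsites/foo/ in the example).

```python
import posixpath
import re
from urllib.parse import parse_qs, urlsplit


def local_name(url, template):
    """Expand %h, %p, %n, %t and %[param] in `template` for `url`."""
    parts = urlsplit(url if "://" in url else "//" + url)
    query = parse_qs(parts.query)
    directory, filename = posixpath.split(parts.path)
    name, ext = posixpath.splitext(filename)
    fields = {
        "%h": parts.netloc,               # host name
        "%p": directory.rstrip("/"),      # path without trailing "/"
        "%n": name,                       # file name without its type
        "%t": ext.lstrip("."),            # file type (extension)
    }

    def expand(match):
        token = match.group(0)
        if token.startswith("%["):
            # %[param]: value of the query-string parameter, empty if absent
            return query.get(match.group(1), [""])[0]
        return fields[token]

    return re.sub(r"%\[(\w+)\]|%[hpnt]", expand, template)


template = "%h%p/%n%[page].%t"
for page in ("engineering", "biology", "computing"):
    print(local_name("www.foo.com/catalog.php3?page=" + page, template))
```

A URL without a "page" parameter keeps its plain name, 
which matches the "if found" wording above.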