HTTrack Website Copier
Free software offline browser - FORUM
Subject: Question about the FAQ article on slow scanning
Author: Norbert Meier
Date: 01/05/2003 12:31
 
Hi!

HTTrack is really great, but I often experience long
delays while it is scanning a single page.

I found the FAQ article quoted below, and I am hoping for
some background information.

Is this link checking done in parallel? Is there any way to
make this part faster?
Does HTTrack really access many pages twice? Or, when it
finds a link to a PHP page in the current page, does it read
that page completely and store the content in its cache, so
that what appears to be idle time is in reality spent
fetching lots of pages?

Nevertheless, the way HTTrack works is different from that
of Teleport Pro (which calculates the links later), but it
has many advantages.

Norbert



The FAQ article in question:

Sometimes, links in pages are malformed. Writing
<a href="/foo"> instead of <a href="/foo/">, for example,
is a common mistake. It forces the engine to make a
supplemental request to find the real /foo/ location.
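
To illustrate, here is a hypothetical exchange (example.com
is a placeholder; the exact status code depends on the
server, but a 301 redirect is typical when only the trailing
slash is missing):

  GET /foo HTTP/1.1
  Host: example.com

  HTTP/1.1 301 Moved Permanently
  Location: http://example.com/foo/

Only after this round trip can the engine request /foo/
itself, so every malformed link of this kind costs an extra
request.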


Dynamic pages. Links whose names end in .php3, .asp, or
another extension different from the regular .html or .htm
also require a supplemental request. HTTrack has to "know"
the type (called the "MIME type") of a file before forming
the destination filename. Files like foo.gif are "known" to
be images, and ".html" files are obviously HTML pages -
but a ".php3" page may be a dynamically generated HTML
page, an image, a data file...
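
For example, two ".php3" links can turn out to be completely
different types, and only the server's response headers
reveal which (a hypothetical exchange; the URLs are
placeholders):

  GET /gallery/photo.php3?id=1 HTTP/1.1
  Host: example.com

  HTTP/1.1 200 OK
  Content-Type: image/jpeg

  GET /news/index.php3 HTTP/1.1
  Host: example.com

  HTTP/1.1 200 OK
  Content-Type: text/html

The engine has to wait for these headers before it can pick
a local filename such as photo.jpg or index.html.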

If you KNOW that ALL ".php3" and ".asp" pages on the site
you are mirroring are in fact HTML pages, use the assume
option:
--assume php3=text/html,asp=text/html 
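
As a sketch of a complete command line (the URL and the
output path after -O are placeholders):

  httrack "http://www.example.com/" -O "/tmp/mirror" --assume php3=text/html,asp=text/html

With the types declared up front, the engine can form the
destination filenames for .php3 and .asp links without the
supplemental type-probing requests described above.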

This option can also be used to change the type of a file:
a MIME type of "application/x-MYTYPE" is always mapped to
the "MYTYPE" extension. Therefore,
--assume dat=application/x-zip
will force the engine to rename all .dat files to .zip
files.
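
For example (hypothetical URL and filenames), a run such as

  httrack "http://www.example.com/" -O "/tmp/mirror" --assume dat=application/x-zip

would save a file linked as data/archive.dat locally as
data/archive.zip.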
 