HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Slow scanning frustration
Author: Xavier Roche
Date: 05/07/2005 13:52
 
> But one thing that has always frustrated me is its slow
> scanning of pages.
> You know, the entry at the top of the list where it's
> scanning and testing links.

It happens when httrack doesn't know the document type in advance. Most
offline browsers "delay" the document type resolution, but this technique
also has its drawbacks (need to post-process all pages, risk of missing
references ..)

The type is necessary to name local files - .php or .asp files can be HTML
data or JPG images. Browsers don't care, because the file extension is not used
to resolve the MIME type: HTTP gives its own MIME type in the headers. But when
transformed into local files, you MUST rename .php into .html or .gif,
depending on the type. And some offline browsers don't care, because they
require an internal browser (yuk!).
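
To make the renaming step concrete, here is a minimal Python sketch (not httrack's actual code, which is written in C). The MIME_TO_EXT table and the local_name function are hypothetical names chosen for illustration; the point is only that the local extension comes from the Content-Type header, not from the URL.

```python
import posixpath
from urllib.parse import urlsplit

# Illustrative table only; a real mirror tool would know many more types.
MIME_TO_EXT = {
    "text/html": ".html",
    "image/gif": ".gif",
    "image/jpeg": ".jpg",
}

def local_name(url, content_type):
    """Pick a local file name whose extension matches the served MIME type."""
    path = urlsplit(url).path                  # drops the query string
    base = posixpath.basename(path) or "index"
    stem, ext = posixpath.splitext(base)
    # "text/html; charset=utf-8" -> "text/html"
    mime = content_type.split(";", 1)[0].strip().lower()
    # Unknown MIME types keep whatever extension the URL already had.
    return stem + MIME_TO_EXT.get(mime, ext)
```

With this sketch, local_name("http://www.example.com/photo.php", "image/gif") gives photo.gif, while the same script served as "text/html; charset=utf-8" would be saved as photo.html.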

httrack "knows" that .gif is generally "GIF file", and .html "html file". The
engine is also configured by default to assume that .php or .asp are generally
"HTML files".

But some sites may contain many "unknown" files, such as
<http://www.example.com/gallery> (without an ending /, with a gallery.php script
behind) - this leads to numerous requests, and this can slow down the mirror.
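
The cost described above can be sketched as follows, again in Python and only as an illustration. The ASSUMED_TYPES table mirrors the defaults mentioned in this post; the function names are hypothetical, and using a HEAD request for the test is an assumption (the post does not say which request the engine sends).

```python
import posixpath
import urllib.request
from urllib.parse import urlsplit

# Default guesses by extension; the table itself is illustrative.
ASSUMED_TYPES = {
    ".gif": "image/gif",
    ".html": "text/html",
    ".php": "text/html",
    ".asp": "text/html",
}

def probe_content_type(url, opener=urllib.request.urlopen):
    """Ask the server for the MIME type: one extra round trip per URL."""
    # Assumption: a HEAD request is enough to get the Content-Type header.
    request = urllib.request.Request(url, method="HEAD")
    with opener(request) as response:
        return response.headers.get_content_type()

def resolve_type(url, opener=urllib.request.urlopen):
    """Guess from the extension when possible, otherwise test the link."""
    ext = posixpath.splitext(urlsplit(url).path)[1].lower()
    guessed = ASSUMED_TYPES.get(ext)
    if guessed is not None:
        return guessed                      # no network traffic needed
    return probe_content_type(url, opener)  # "unknown" file: must ask
```

A link such as index.php is resolved for free, but every extensionless link like /gallery triggers a request before the page referencing it can be written, which is where the time goes on sites full of such links. Error handling (timeouts, HTTP errors) is left out of the sketch.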

To summarize, this was a design choice. Should I rewrite the engine, I would
probably abandon this "testing links" madness, and do post-processing instead.

I have several workarounds in my TODO list (since 1999!) to bypass this
"pre-testing" thing, but I haven't found the time and the courage yet to
implement them (this is a big change in the very deep core engine routines)

The best workaround yet would be to hack the internal type resolver, and
consider "unknown" files as special "application/x-unknown" files. The biggest
problem is then to reparse ALL html files at the end of the mirror and patch all
"unresolved" links. This might take a lot of time and CPU for large-scale
mirrors.
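
A rough Python sketch of what that final pass could look like, under loud assumptions: the ".unknown" placeholder suffix, the function names, and the regex-based link matching are all hypothetical and much simpler than what a real engine would need (a real pass would use the HTML parser and handle relative paths).

```python
import re

# Matches href="..." and src="..." attribute values (double quotes only).
LINK_RE = re.compile(r'((?:href|src)=")([^"]*)(")')

def patch_links(html, resolved):
    """Replace placeholder link targets with their final local names."""
    def swap(match):
        target = match.group(2)
        # Links that were never placeholders are left untouched.
        return match.group(1) + resolved.get(target, target) + match.group(3)
    return LINK_RE.sub(swap, html)

def patch_mirror(pages, resolved):
    """Reparse EVERY saved page; any of them may hold an unresolved link."""
    return {name: patch_links(html, resolved) for name, html in pages.items()}
```

Here pages maps local file names to their HTML text, and resolved maps each placeholder name (for example "gallery.unknown") to the name chosen once the real type is known (for example "gallery.html"). The whole mirror has to be walked once more even if only a handful of links changed, hence the CPU cost on large mirrors.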

 