> I previously reported that HTTrack insisted on
> downloading and treating binary files as HTML.
> [..]
> But I ended up trying GetLeft (on sourceforge) and I
> can definitely say it *IS* able to download the web
> site correctly, including all those files that
> HTTrack screwed up.
All offline browsers are different, and each is better
at downloading certain types of sites. Type checking is
essential with HTTrack, as the parser always renames
files according to the remote MIME type. If the server
is BOGUS and sends bogus responses to HEAD requests,
this will not work, and HTTrack will rename binary
files into .html files. Note that you can bypass type
checking using the MIME type option (forcing .asp files
to be treated as HTML, for example, or .dll files to
stay binary ('dll' -> 'application/octet-stream')).
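With the command-line version, that would look something
like this (the option is --assume; the URL and output
path here are made-up examples):

  httrack "http://www.example.com/" -O /tmp/mirror \
    --assume asp=text/html,dll=application/octet-stream

This tells the parser to treat .asp files as HTML and to
leave .dll files as plain binary data, whatever the
server claims about them.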
Anyway, trying several offline browsers is a good
idea - HTTrack, for example, is not yet able to parse
Flash sites (yuk), and will certainly never be able
to. But the internal HTTrack parser is able to detect
links that most offline browsers can't detect (simple
javascript links, or HTML generated on the fly).
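For example, a link buried in script code like this
(a made-up snippet) is invisible to a parser that only
follows plain <a href=...> tags:

  <a href="javascript:go()">news</a>
  <script>
  function go() { location.href = "news/latest.html"; }
  </script>

The HTTrack parser can spot the "news/latest.html"
string inside the script and fetch that page anyway.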
Options allow you to customize the mirror very
precisely, to make remote mirrors of websites. You can
also use the engine as a library and transform it into
a linguistic analysis tool, and so on (see the sketch
below). And the internal parser is FAST (despite the
performance loss when testing links on some sites, as
you have noticed - that's true): spidering a regular
website will generally be really fast. But HTTrack
can't (yet) handle all cases, especially unknown file
types.
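As a rough illustration of the library use mentioned
above, here is a minimal sketch in C. The header name
and the hts_main(argc, argv) entry point are my
assumptions about the libhttrack interface, and the URL
and output path are made up:

  #include <stdio.h>
  #include "httrack-library.h"  /* assumed header name */

  int main(void) {
      /* Build an argv as if httrack had been started from
         the shell; URL and output path are examples. */
      char *argv[] = { "httrack", "http://www.example.com/",
                       "-O", "/tmp/mirror", NULL };
      int ret = hts_main(4, argv);  /* assumed entry point */
      if (ret != 0)
          fprintf(stderr, "mirror failed (code %d)\n", ret);
      return ret;
  }

A linguistic analysis tool would then hook into the
engine to inspect each downloaded page instead of just
saving it to disk.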