> I previously reported that HTTrack insisted on
> downloading and treating binary files as HTML.
> [..]
> But I ended up trying GetLeft (on sourceforge) and I
> can definitely say it *IS* able to download the web
> site correctly, including all those files that
> HTTrack screwed up.
All offline browsers are different, and each is better
at downloading certain types of sites. Type checking is
essential with HTTrack, as the parser always renames
files according to the remote MIME type. If the server
is BOGUS and sends bogus responses to HEAD requests,
this will not work, and HTTrack will rename binary
files into .html files. Note that you can bypass type
checking using the MIME type option (forcing .asp files
to be treated as HTML, for example, or .dll files to
stay binary ('dll' -> 'application/octet-stream')).
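With the command-line version, that would look something
like this (the option is --assume; the URL and output
path here are made-up examples):

  httrack "http://www.example.com/" -O /tmp/mirror \
    --assume asp=text/html,dll=application/octet-stream

This tells the parser to treat .asp files as HTML and to
leave .dll files as plain binary data, whatever the
server claims about them.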
Anyway, trying several offline browsers is a good
idea - HTTrack, for example, is not yet able to parse
Flash sites (yuk), and will certainly never be able
to. But the internal HTTrack parser is able to detect
links that most offline browsers can't detect (simple
javascript links, or HTML generated on the fly).
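For example, a link buried in script code like this
(a made-up snippet) is invisible to a parser that only
follows plain <a href=...> tags:

  <a href="javascript:go()">news</a>
  <script>
  function go() { location.href = "news/latest.html"; }
  </script>

The HTTrack parser can spot the "news/latest.html"
string inside the script and fetch that page anyway.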
Options allow you to customize the mirror very
precisely, to make remote mirrors of websites. You can
also use the engine as a library and transform it into
a linguistic analysis tool, and so on (see the sketch
below). And the internal parser is FAST (despite the
performance loss when testing links on some sites, as
you have noticed - that's true): spidering a regular
website will generally be really fast. But HTTrack
can't (yet) handle all cases, especially unknown file
types.
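As a rough illustration of the library use mentioned
above, here is a minimal sketch in C. The header name
and the hts_main(argc, argv) entry point are my
assumptions about the libhttrack interface, and the URL
and output path are made up:

  #include <stdio.h>
  #include "httrack-library.h"  /* assumed header name */

  int main(void) {
      /* Build an argv as if httrack had been started from
         the shell; URL and output path are examples. */
      char *argv[] = { "httrack", "http://www.example.com/",
                       "-O", "/tmp/mirror", NULL };
      int ret = hts_main(4, argv);  /* assumed entry point */
      if (ret != 0)
          fprintf(stderr, "mirror failed (code %d)\n", ret);
      return ret;
  }

A linguistic analysis tool would then hook into the
engine to inspect each downloaded page instead of just
saving it to disk.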