Re: URL capture and MIME type check - HTTrack Website Copier Forum

Subject: Re: URL capture and MIME type check

Author: JM

Date: 03/08/2006 20:12

An update on this matter.

I was able to force HTTrack to save the archived PDF document, which is
generated by a PHP script, as a .pdf file by using the "--assume" option. With
it, the spider engine does not check the MIME type of PHP files and assumes
that they are in fact PDF files. The problem is that when a PHP script doesn't
generate a PDF document, which is often the case, you still get a .pdf file
instead of the file the script is supposed to produce: generally an HTML page,
but it could also be an image (a chart, for example).
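
To illustrate the trade-off, here is a rough Python sketch. It is not
HTTrack's actual code; the function name, the MIME table and the example URLs
are all made up for the illustration. It only shows why a blanket assumption
for .php mislabels scripts that return HTML, while a per-response check of the
Content-Type header does not.

```python
import os
from urllib.parse import urlsplit

# Hypothetical MIME-to-extension table, for illustration only.
MIME_TO_EXT = {
    "application/pdf": ".pdf",
    "text/html": ".html",
    "image/png": ".png",
}


def local_extension(url, content_type, assume=None):
    """Pick the extension for the saved copy of `url`.

    `assume` maps a source extension (e.g. "php") to a MIME type and
    mimics the idea behind an "assume" rule: when it matches, the
    server's Content-Type header is ignored entirely.
    """
    source_ext = os.path.splitext(urlsplit(url).path)[1].lstrip(".").lower()
    if assume and source_ext in assume:
        # The assumed type wins; the real type is never consulted.
        mime = assume[source_ext]
    else:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        mime = content_type.split(";")[0].strip().lower()
    fallback = "." + source_ext if source_ext else ""
    return MIME_TO_EXT.get(mime, fallback)


# A script that really returns a PDF is named correctly either way.
pdf_checked = local_extension(
    "http://www.example.com/report.php?id=1", "application/pdf")

# A script that returns HTML gets mislabeled once php is assumed to be PDF.
html_assumed = local_extension(
    "http://www.example.com/page.php", "text/html; charset=utf-8",
    assume={"php": "application/pdf"})
```

In this sketch the second call yields ".pdf" for what is really an HTML page,
which is the side effect described above.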

I also accidentally found a "Check document type" option in the "Spider" tab of
the options dialog. It's the GUI version of the "u" spider option: « u check
document type if unknown (cgi,asp..) (u0 don't check, * u1 check but /, u2
check always) (--check-type[=N]) ». From the GUI you can set it to "If
unknown" ("check always"), but from the info log file I got, it seems HTTrack
knows the generated file is a PDF document:

« 19:51:37 Warning: Warning moved treated for
www.*.com/*.php?*?>postfile:C:\HTTrack\http___www.*.com\hts-post4 (real one is
www.*.com/*.pdf) 
19:51:40 Info: engine: transfer-status: link added: www.*.com/*.pdf ->
C:/WinHTTrack/www.*.com/*.html  »

As you can see, it found the real file that the PHP script generates, a PDF
document, but it saves it as an HTML page instead.
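
To spot other cases like this in a large log, something along these lines
could help. It is only a sketch: it assumes each "link added" entry sits on a
single line in the real log file (the excerpt above is wrapped), and the
function names and the sample line with example.com are invented for the
illustration. Note that it reports every extension difference, so intended
renames such as .php to .pdf will show up as well.

```python
import posixpath
import re

# Matches the "link added: <url> -> <local path>" part of an info log line.
LINK_ADDED = re.compile(r"link added: (\S+) -> (\S+)")


def ext_of(path):
    """Return the lowercased extension of the last path segment, or ''."""
    path = path.split("?", 1)[0]            # ignore any query string
    _, slash, name = path.rpartition("/")
    if not slash:                           # bare host name, no file part
        return ""
    return posixpath.splitext(name)[1].lower()


def mismatched_links(log_lines):
    """Yield (url, saved_path) pairs whose extensions differ."""
    for line in log_lines:
        match = LINK_ADDED.search(line)
        if not match:
            continue
        url, saved = match.groups()
        url_ext = ext_of(url)
        if url_ext and url_ext != ext_of(saved):
            yield url, saved


# Hypothetical log line modeled on the excerpt above.
sample = [
    "19:51:40 Info: engine: transfer-status: link added: "
    "www.example.com/report.pdf -> C:/WinHTTrack/www.example.com/report.html",
]
found = list(mismatched_links(sample))
```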

So it seems HTTrack does not apply MIME type checking when it submits forms
automatically. In general it handles PHP scripts that generate files such as
PDF documents perfectly well: it renames the PHP files (.php) to PDF files
(.pdf).
