Re: broken servers with bad responses... - HTTrack Website Copier Forum

Subject: Re: broken servers with bad responses...
Author: Haudy Kazemi
Date: 01/08/2003 08:22
I just pasted this in from Notepad so I'm not sure this is 
formatted correctly anymore.  The original is at:
<http://kazemizadeh.net/httrack/linkscan.txt>
(use Notepad wordwrap)

> > Humm. I will try to implement a feature to detect redirects
> > from 'clearly binary files' (gif, zip..) to html files -
> > but I'm wondering if this is going to break things or not..
>
> Correction: to detect html responses to 'clearly binary files'
>
> Potential problem: www.foo.com/cgi.exe?page=2 ; .exe can be
> considered as 'binary' ... but here is a cgi.

Well, I suppose that could be a problem on occasion, but I think
the idea is to put something in that at least has a chance of
detecting broken 404s.  (Hmm...'broken 404'...sounds a little
redundant, as 404 is meant for a broken link to begin with...)

MIME-type       URL                        Status/Result
application/exe www.foo.com/not-a-cgi.exe  ok,easy,download
text/html       www.foo.com/not-a-cgi.exe  Case1,broken 404
text/html       www.foo.com/cgi.exe        Case2,valid cgi

----------------------------------
Case1: the MIME-type says it is html, the URL says it is 
binary/exe.  In reality, it is html from a broken 404 error.

Some options in Case 1 are:
1.) download the file, saving it as named in the URL
(www.foo.com/not-a-cgi.exe) even though the MIME-type says it is
html.  Then scan the html for near/related links (because the
status code was not 404).  (I think this is the current behavior.)
Problem: httrack behavior is unchanged, so it follows links from
broken 404 pages, potentially endlessly.  Also, the 404 page is
saved as a .exe, so it is not going to run, nor is it going to be
viewable in the browser.

2.) download the file, saving it as named in the URL
(www.foo.com/not-a-cgi.exe) plus the correct extension for its
MIME type.  The saved file is named and accessed as
www.foo.com/not-a-cgi.exe.htm (two extensions).  This makes broken
files readable when the local mirror is browsed, because a 404
page saved as .exe.htm will be viewable in the browser.  Do not
scan this saved html page for near/related links.

Problem: valid html pages coming from cgi URLs with 
ambiguous names could be skipped in the scanning process.

(I would prefer the files saved in cases like this to have 
their names match the mime type.)

----------------------------------
Case2: the MIME-type says it is html, the URL says it is
binary/exe.  In reality, it is html from a cgi that simply has the
name 'cgi.exe'.

Desired behavior: identify this link with cgi.exe as being 'okay'
to scan and download from if it isn't a 'broken-404' situation.

----------------------------------
In both cases we have these conditions:
1.) both mime types are text/html
2.) both URLs have a *.exe at the end of them (usually means
binary, but sometimes CGI)  (CGI may or may not have any
parameters.)
3.) server returned status is not '404' (might be 200, 301, 
302, etc.)

--------------------------------------------------------------------
Improved link-testing method with broken-404 detection, and a
solution to the ambiguous CGI.EXE problem (broken 404 for a file
named CGI.EXE vs a real CGI program returning text/html files)
(case1 vs case2).  I think this algorithm could augment the
current link-testing code in HTTrack:
(I am probably forgetting/overlooking things here...)

1-IF.) check the MIME-type that the url returns to see if it is
different from what the URL extension would lead you to expect.
This could be implemented as a table of correlated values of the
valid file extensions for any particular (known) MIME type.
MIME-Type, Valid Extensions
image/gif, gif
image/jpeg, jpeg, jpg
etc.
(In my preliminary testing I've found that only about 5% of URLs
fail this test, and even fewer would have failed if I had
accounted for variations like image/jpeg, which could have either
URL extension .jpg or .jpeg.  Testing 5% of the links more
thoroughly hopefully won't slow HTTrack down very much.)
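
As a rough illustration of the table lookup in step 1, here is a minimal Python sketch.  It is not HTTrack code.  The names MIME_EXTENSIONS, url_extension and mime_matches_url are made up for this example, the table rows beyond image/gif and image/jpeg are my own assumptions, and treating an unknown MIME type as a mismatch is just one possible choice.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical correlation table: MIME type -> URL extensions that count as a match.
# Only the image/gif and image/jpeg rows come from the post; the rest are assumed.
MIME_EXTENSIONS = {
    "image/gif": {"gif"},
    "image/jpeg": {"jpeg", "jpg"},
    "text/html": {"htm", "html"},
    "application/exe": {"exe"},
}


def url_extension(url):
    """Return the lowercased extension of the URL path, ignoring any query string."""
    if "://" not in url:
        url = "http://" + url  # the post writes URLs without a scheme
    path = urlsplit(url).path
    return posixpath.splitext(path)[1].lstrip(".").lower()


def mime_matches_url(mime_type, url):
    """True when the URL extension is listed for the MIME type (step 1a)."""
    base_type = mime_type.split(";")[0].strip().lower()  # drop "; charset=..." if present
    # Unknown MIME types have no table row, so they are reported as a mismatch.
    return url_extension(url) in MIME_EXTENSIONS.get(base_type, set())
```

With this table, image/jpeg against www.foo.com/a.jpg is a match, while text/html against www.foo.com/cgi.exe?page=2 is a mismatch and would go on to the further checks.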

1a-CASE MATCH.) if the MIME-type matches the URL's extension,
download the file, saving it under its own name.  No further
link-testing checks needed.
Ex. image/gif==gif   --> DONE
(Examples: MIME-type application/exe matches with
www.foo.com/cgi.exe so it passes.  A cgi named my-image-cgi.gif
could return a different .gif file each time it is called, with
the correct MIME-type image/gif.  It would pass this test and not
require further checks.  A cgi named cgi.exe that returned
text/html would not pass this test, and would therefore require
additional checks.  cgi's can be named anything, but usually
aren't.)

1b-CASE MISMATCH.) if the MIME-type does NOT match a known URL
extension, save the file to disk under the file name from the URL
with the MIME-type's extension tacked on at the end.
Ex. text/html!=.gif   --> save to disk as 'image.gif.htm'  --> goto NEXT TEST
Ex. text/html!=.asp   --> save to disk as 'serverscript.asp.htm'  --> goto NEXT TEST
Ex. text/html!=www.foo.com/maybe-cgi.exe  --> save to disk as 'maybe-cgi.exe.htm'  --> goto NEXT TEST
(if 'www.foo.com/maybe-cgi.exe' had been a valid EXE intended for
download, the MIME type should NOT have been text/html, and it
would have passed the first level of testing.)
Ex. image/gif!=.asp   --> save to disk as image.gif (or image.asp.gif ?) --> goto NEXT TEST
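
The renaming in step 1b could look something like the Python sketch below.  Again this is only an illustration under my own assumptions: VALID_EXTENSIONS, PREFERRED_EXTENSION and local_name are hypothetical names, files with an unknown MIME type are left alone, and for the image/gif vs .asp example it picks the double-extension form (serverscript.asp.gif), which is only one of the two options mentioned above.

```python
import posixpath

# Hypothetical tables; the extensions chosen here are assumptions for the example.
VALID_EXTENSIONS = {
    "text/html": {"htm", "html"},
    "image/gif": {"gif"},
    "image/jpeg": {"jpeg", "jpg"},
}
PREFERRED_EXTENSION = {
    "text/html": "htm",
    "image/gif": "gif",
    "image/jpeg": "jpg",
}


def local_name(url_path, mime_type):
    """Return (file name to save as, True if the MIME type and extension mismatched)."""
    name = posixpath.basename(url_path)
    ext = posixpath.splitext(name)[1].lstrip(".").lower()
    if ext in VALID_EXTENSIONS.get(mime_type, set()):
        return name, False  # step 1a: keep the name from the URL
    preferred = PREFERRED_EXTENSION.get(mime_type)
    if preferred is None:
        return name, False  # unknown MIME type: this sketch does not rename
    return name + "." + preferred, True  # step 1b: tack the MIME extension on
```

For example, local_name("/maybe-cgi.exe", "text/html") gives ("maybe-cgi.exe.htm", True), so the file stays browsable locally and is flagged for the next test.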

NOTE: what about .ASP, .CGI, .PL, etc.?  Should URLs ending in
these always fail the first level of testing and thus be saved as
filename.ext.htm?  Perhaps only URLs with MIME-types of text/*
should get such renaming.  I'm not sure of the best choice here
(ideas?).  As-is, this algorithm would send nearly all
.PL/.ASP/.CGI URLs to the second test, and then they'd be checked
against the 404 rules, which wouldn't discriminate between pages
that merely contain the text 404 and the REAL 404 error pages.
Perhaps telling the engine to go 2 levels beyond a suspected
broken 404 would work...unless you're grabbing a website that
talks about 404 errors, using ASP/PL/CGI files, on many different
pages.  **How about a way to prompt the user about these weird
potential false positives, or even better, to create a list of
pages HTTrack is unsure about that the user should check and
verify?  The only pages that would

2-IF.) if filetype (extension) of file saved in step 1b
(CASE MISMATCH) == (htm or html, or other filetypes that are
scanned for links)
       then STRING SEARCH file for string combinations:
        a. "404" and "not" and "found", or
        b. "404" and "error", or
        c. (others?)

2a-CASE MATCH.) if result was TRUE/"string combinations 
were found", then assume page is a broken 404 and do not 
scan it for additional links (or scan it for just 1 or 2 
more levels?).  Note this in the log and provide a way to 
override on next update.

2b-CASE NO MATCH.) if result was FALSE/"no string 
combinations were found", assume page is a valid page and 
treat as usual (scan it for additional links to download).
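
The string search in step 2 might be sketched in Python like this.  The function name looks_like_broken_404 is hypothetical.  Note that I match whole words rather than raw substrings, which is my own choice and not part of the proposal: it keeps "0404" or "another" from counting as hits, but it is still only a heuristic.

```python
import re


def looks_like_broken_404(html_text):
    """Heuristic for step 2: True if the page contains the 404 trigger words."""
    # Split the lowercased page into alphanumeric words (tags included, for simplicity).
    words = set(re.findall(r"[a-z0-9]+", html_text.lower()))
    if "404" not in words:
        return False
    # Combination a: "404" and "not" and "found"; combination b: "404" and "error".
    return {"not", "found"} <= words or "error" in words
```

A page containing "404 Not Found" or "Error 404" returns True (case 2a), and a page without "404" as a separate word returns False (case 2b).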


----------------------------------
Note:
The solution I propose will have a false-positive/negative 
under these conditions:
1.) false-negative case: server is returning broken 404 status
pages that do not contain the strings searched for in step 2
above.  This would prevent detection of the 404 page, and cause it
to be scanned for links.  Could happen if the broken 404 response
is simply html plus a graphic file.

2.) false-positive case: server returns a page that aroused
suspicion when the MIME-type and URL were compared.  Then this new
page contains the trigger terms searched for in step 2 above.
This would prevent the (valid, non-404) page from being scanned
for further links.  Ex. a .ASP url with text/html mime type would
fail the first test, and then if it contained the text '404'
'error' (because it was a help page talking about 404s), the
algorithm would think it is a broken-404 case.
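
To make the two failure modes concrete, here is a small self-contained Python sketch of the whole check for a text/html response.  It is a simplification under my own assumptions, not HTTrack's behavior: classify_html_response and its return values 'scan' and 'suspected-404' are hypothetical, it assumes scheme-less URLs as written in this post, and it matches whole words.

```python
import posixpath
import re
from urllib.parse import urlsplit

HTML_EXTENSIONS = {"htm", "html"}


def classify_html_response(url, body):
    """Classify a response whose MIME type is text/html as 'scan' or 'suspected-404'."""
    path = urlsplit("http://" + url).path  # assumes the URL has no scheme
    ext = posixpath.splitext(path)[1].lstrip(".").lower()
    if ext in HTML_EXTENSIONS:
        return "scan"  # step 1a: MIME type matches the extension
    words = set(re.findall(r"[a-z0-9]+", body.lower()))
    if "404" in words and ({"not", "found"} <= words or "error" in words):
        return "suspected-404"  # step 2a: do not scan for links
    return "scan"  # step 2b: treat as a valid page


# False negative: a broken 404 that is just html plus a graphic slips through.
print(classify_html_response("www.foo.com/not-a-cgi.exe", '<img src="oops.gif">'))

# False positive: a valid .asp help page about 404 errors gets flagged.
print(classify_html_response("www.foo.com/help.asp", "<p>What a 404 error means</p>"))
```

The first call is classified as 'scan' even though it is really an error page, and the second is classified as 'suspected-404' even though it is a valid page, which is why a user-checkable list of suspected pages seems worth having.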