I just pasted this in from Notepad so I'm not sure this is
formatted correctly anymore. The original is at:
<http://kazemizadeh.net/httrack/linkscan.txt>
(use Notepad wordwrap)
> > Humm. I will try to implement a feature to detect redirects
> > from 'clearly binary files' (gif, zip..) to html files -
> > but I'm wondering if this is going to break things or not..
>
> Correction: to detect html responses to 'clearly binary files'
>
> Potential problem: www.foo.com/cgi.exe?page=2 ; .exe can be
> considered as 'binary' ... but here is a cgi.
Well, I suppose that could be a problem upon occasion, but I think
the idea is to put something in that at least has a chance of
detecting broken 404s. (Hmm...'broken 404'...sounds a little
redundant as 404 is meant for a broken link to begin with...)
MIME-type         URL                         Status/Result
application/exe   www.foo.com/not-a-cgi.exe   ok, easy, download
text/html         www.foo.com/not-a-cgi.exe   Case1, broken 404
text/html         www.foo.com/cgi.exe         Case2, valid cgi
----------------------------------
Case1: the MIME-type says it is html, the URL says it is binary/exe.
In reality, it is html from a broken 404 error.
Some options in Case 1 are:
1.) download the file, saving it as named in the URL
(www.foo.com/not-a-cgi.exe) even though the MIME-type says it is
html. Then scan the html for near/related links (because the error
code was not 404). (I think this is the current behavior.)
Problem: httrack behavior is unchanged, so it goes on to broken 404
pages, potentially endlessly. Also, the 404 page is saved as a .exe,
so it is not going to run, nor is it going to be viewable in the
browser.
2.) download the file, saving it under the name from the URL
(www.foo.com/not-a-cgi.exe) with the correct extension for its MIME
type appended. The saved file is named and accessed as
www.foo.com/not-a-cgi.exe.htm (two extensions). This makes broken
files readable when the local mirror is browsed, because a 404 page
saved as .exe.htm will be viewable in the browser. Do not scan this
saved html page for near/related links.
Problem: valid html pages coming from cgi URLs with ambiguous names
could be skipped in the scanning process.
(I would prefer the files saved in cases like this to have their
names match the MIME type.)
----------------------------------
Case2: the MIME-type says it is html, the URL says it is binary/exe.
In reality, it is html from a cgi that simply has the name 'cgi.exe'.
Desired behavior: identify this link with cgi.exe as being 'okay' to
scan and download from if it isn't a 'broken-404' situation.
----------------------------------
In both cases we have these conditions:
1.) both mime types are text/html
2.) both URLs have a *.exe at the end of them (usually means binary,
but sometimes CGI) (CGI may or may not have any parameters.)
3.) server returned status is not '404' (might be 200, 301, 302,
etc.)
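
A minimal sketch of those three shared conditions as a single check, in
Python. The function name is made up for illustration and is not part of
HTTrack; it only flags a response as ambiguous and cannot tell Case1
from Case2.

```python
import posixpath
from urllib.parse import urlsplit

def is_ambiguous_exe_response(url, mime_type, status):
    """True when all three conditions shared by Case1 and Case2 hold.

    Hypothetical helper for illustration only.
    """
    # urlsplit only separates host and path reliably when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    # Taking the extension from the path ignores CGI parameters such as ?page=2.
    ext = posixpath.splitext(urlsplit(url).path)[1].lower()
    # Drop any "; charset=..." suffix from the MIME type.
    mime = mime_type.split(";")[0].strip().lower()
    return mime == "text/html" and ext == ".exe" and status != 404
```

Both www.foo.com/not-a-cgi.exe and www.foo.com/cgi.exe?page=2 give True
here when served as text/html with status 200, which is why a further
test is needed.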
--------------------------------------------------------------------
Improved link-testing method with broken-404 detection, and a
solution to the ambiguous CGI.EXE problem (broken 404 for a file
named CGI.EXE vs a real CGI program returning text/html files)
(case1 vs case2). I think this algorithm could augment the current
link-testing code in HTTrack:
(I am probably forgetting/overlooking things here...)
1-IF.) check the MIME-type that the url returns to see if it is
different than expected from the URL extension.
This could be implemented as a table of correlated values of the
valid file extensions for any particular (known) MIME type.
MIME-Type     Valid Extensions
image/gif     gif
image/jpeg    jpeg, jpg
etc.
(In my preliminary testing I've found that only about 5% of URLs
fail this test, and even fewer would if I had accounted for
variations like image/jpeg, which can have either URL extension .jpg
or .jpeg. Testing 5% of the links more thoroughly hopefully won't
slow HTTrack down very much.)
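
The table lookup in step 1 could be sketched roughly like this in
Python. The table contents beyond the two rows above, and the function
names, are assumptions for illustration; this is not HTTrack code.
Unknown MIME types are treated as a mismatch here, which is one possible
choice among several.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical table: MIME type -> URL extensions considered valid for it.
# Only image/gif and image/jpeg come from the post; the rest are assumed.
VALID_EXTENSIONS = {
    "image/gif": {"gif"},
    "image/jpeg": {"jpeg", "jpg"},
    "text/html": {"htm", "html"},
    "application/exe": {"exe"},
}

def url_extension(url):
    """Return the lowercase extension of the URL path, without the dot."""
    # urlsplit only separates host and path reliably when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    path = urlsplit(url).path  # query string (?page=2) is not part of the path
    return posixpath.splitext(path)[1].lstrip(".").lower()

def mime_matches_url(mime_type, url):
    """True if the URL extension is listed as valid for this MIME type."""
    mime_type = mime_type.split(";")[0].strip().lower()
    # A MIME type missing from the table yields an empty set, i.e. a mismatch.
    return url_extension(url) in VALID_EXTENSIONS.get(mime_type, set())
```

With this table, image/jpeg matches both .jpg and .jpeg URLs, while
text/html returned for www.foo.com/cgi.exe?page=2 is a mismatch and
would go on to further testing.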
1a-CASE MATCH.) if the MIME-type matches the URL's extension,
download the file, saving it under its own name. No further
link-testing checks needed.
Ex. image/gif==gif --> DONE
(Examples: MIME-type application/exe matches www.foo.com/cgi.exe so
it passes. A cgi named my-image-cgi.gif could return a different
.gif file each time it is called, with the correct MIME-type
image/gif. It would pass this test and not require further checks.
A cgi named cgi.exe that returned text/html would not pass this
test, and would therefore require additional checks. cgi's can be
named anything, but usually aren't.)
1b-CASE MISMATCH.) if the MIME-type does NOT match a known URL
extension, save the file to disk under the file name from the URL
with the extension for the MIME-type tacked on at the end.
Ex. text/html!=.gif --> save to disk as 'image.gif.htm' --> goto NEXT TEST
Ex. text/html!=.asp --> save to disk as 'serverscript.asp.htm' --> goto NEXT TEST
Ex. text/html!=www.foo.com/maybe-cgi.exe --> save to disk as 'maybe-cgi.exe.htm' --> goto NEXT TEST
(if 'www.foo.com/maybe-cgi.exe' had been a valid EXE intended for
download, the MIME type should NOT have been text/html, and it would
have passed the first level of testing.)
Ex. image/gif!=.asp --> save to disk as image.gif (or image.asp.gif ?) --> goto NEXT TEST
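
A rough Python sketch of the renaming in step 1b. The table of preferred
extensions and the function name are assumptions for illustration. For
the open image.gif vs image.asp.gif question, this sketch simply always
appends the extension, which is only one of the two options.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical table: MIME type -> extension to append on a mismatch.
PREFERRED_EXTENSION = {
    "text/html": "htm",
    "image/gif": "gif",
    "image/jpeg": "jpg",
}

def local_name_for_mismatch(mime_type, url):
    """Return the file name to save under when MIME type and URL disagree."""
    # urlsplit only separates host and path reliably when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    name = posixpath.basename(urlsplit(url).path)
    ext = PREFERRED_EXTENSION.get(mime_type.split(";")[0].strip().lower())
    if ext is None:
        # Unknown MIME type: keep the URL's own name unchanged (assumption).
        return name
    return f"{name}.{ext}"
```

For example, text/html from www.foo.com/maybe-cgi.exe is saved as
maybe-cgi.exe.htm, so the page stays viewable in a browser.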
NOTE: what about .ASP, .CGI, .PL, etc? Should URLs ending in these
always fail the first level of testing and thus be saved as
filename.ext.htm? Perhaps only URLs with MIME-types of text/* should
get such renaming. I'm not sure of the best choice here (ideas?).
As-is, this algorithm would send nearly all .PL/.ASP/.CGI URLs to
the second test, and then they'd be checked against the 404 rules,
which wouldn't discriminate between pages with the text 404 in them
and the REAL 404 error pages. Perhaps telling the engine to go 2
levels beyond a suspected broken 404 would work...unless you're
grabbing a website that talks about 404 errors, using ASP/PL/CGI
files, on many different pages. **How about a way to prompt the user
about these weird potential false positives, or even better to
create a list of pages HTTrack is unsure about that the user should
check and verify? The only pages that would
2-IF.) if filetype (extension) of file saved in step 1b (CASE
MISMATCH) is htm or html (or another filetype that is scanned for
links), then STRING SEARCH the file for string combinations:
a. "404" and "not" and "found", or
b. "404" and "error"
c. (others?)
2a-CASE MATCH.) if result was TRUE/"string combinations
were found", then assume page is a broken 404 and do not
scan it for additional links (or scan it for just 1 or 2
more levels?). Note this in the log and provide a way to
override on next update.
2b-CASE NO MATCH.) if result was FALSE/"no string
combinations were found", assume page is a valid page and
treat as usual (scan it for additional links to download).
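
Step 2 as a whole might look something like the following Python sketch.
All names here are hypothetical, and the override set is just one guess
at what "a way to override on next update" could mean. The search is a
plain case-insensitive substring match, so "not" also matches inside a
word like "nothing"; a real implementation might want something
stricter.

```python
import logging

log = logging.getLogger("linkscan_sketch")  # hypothetical logger name

# String combinations from step 2; every term in a tuple must be present.
TRIGGER_COMBOS = [("404", "not", "found"), ("404", "error")]

def has_404_strings(html_text):
    """Plain case-insensitive substring search for any trigger combination."""
    text = html_text.lower()
    return any(all(term in text for term in combo) for combo in TRIGGER_COMBOS)

def should_scan_for_links(url, html_text, user_overrides=frozenset()):
    """Decide 2a vs 2b for a page that already failed the step 1 test."""
    if url in user_overrides:
        # The user marked this URL as valid on a previous run (assumption).
        return True
    if has_404_strings(html_text):
        # 2a: assume a broken 404, note it in the log, do not follow links.
        log.warning("suspected broken 404, links not followed: %s", url)
        return False
    # 2b: no trigger strings found, treat as a normal page.
    return True
```

A page containing "404 Not Found" is skipped unless its URL is in the
override set; a page without any trigger combination is scanned as
usual.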
----------------------------------
Note:
The solution I propose will have a false-positive/negative under
these conditions:
1.) false-negative case: server is returning broken 404 status pages
that do not contain the strings searched for in step 2 above. This
would prevent detection of the 404 pages, and cause them to be
scanned for links. Could happen if the broken 404 response is simply
html plus a graphic file.
2.) false-positive case: server returns a page that aroused
suspicion when the MIME-type and URL were compared. Then this new
page contains the trigger terms searched for in step 2 above. This
would prevent the (valid, non-404) page from being scanned for
further links. Ex. a .ASP url with text/html mime type would fail
the first test, and then if it contained the text '404' 'error'
(because it was a help page talking about 404s), the algorithm would
think it is a broken-404 case.