HTTrack Website Copier
Free software offline browser - FORUM
Subject: Bug(?) with "Get non-HTML files related to a link"
Author: Tiago Paolini
Date: 12/23/2013 17:32
 
The problem described here happens when the option "Get non-HTML files
related to a link" is checked, the (internal) mirroring depth is not set
(which should mean that internal links are followed without limit), and
the external mirroring depth is not set (which should mean that HTTrack does
not mirror external pages).

Bottom line: HTTrack starts mirroring a lot of external sites, even though
with these settings it presumably should not.

When the source URL of an image returns an external HTML page instead of an
image file, that page is saved and HTTrack starts mirroring the external
site. An example is a deleted image that was hosted on an image hosting
service (Photobucket, ImageShack etc.). When one tries to access the link of
a deleted image, the provider serves its "not found" page instead.

The odd part is that HTTrack treats these external pages as if they were
internal pages and mirrors them, even though the external mirroring depth is
set to 0. Also, the name "Get non-HTML files related to a link" implies that
it should not be getting external HTML pages, so if an external file is
returned with MIME type text/html it should not be accepted.
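
To make the behavior I expected concrete, here is a minimal sketch in Python.
It is not HTTrack's code; the function names and the example hosts are made
up for illustration, and "external" is simplified to "the host differs from
the host being mirrored".

```python
from urllib.parse import urlsplit


def is_external(url: str, site_host: str) -> bool:
    """True when the URL's host differs from the host being mirrored."""
    host = (urlsplit(url).hostname or "").lower()
    return host != site_host.lower()


def should_save(url: str, content_type: str, site_host: str) -> bool:
    """Hypothetical rule: external depth 0, related non-HTML files enabled."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()
    if not is_external(url, site_host):
        return True  # internal resources are always in scope
    # External resources are only wanted as non-HTML related files.
    return mime != "text/html"
```

With this rule, an external image URL that answers with a "not found" page
(Content-Type text/html) would be rejected, so its links would never be
queued.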

I could find no way to filter out pages with MIME type text/html only from
external domains (when I added a MIME type filter for this, it always ended
up filtering out all pages, even those from the site I am mirroring).

My workaround was to accept URLs on external domains only when they end in
an image extension, and it worked for preventing HTTrack from mirroring the
external sites any further. But it would be better if HTTrack did not accept
external pages with MIME type text/html when the option "Get non-HTML files
related to a link" is checked.
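
The effect of the workaround can be sketched as a URL test in Python. This
only illustrates the rule; it is not the filter syntax I entered in HTTrack,
and the extension list and function name are assumptions.

```python
from urllib.parse import urlsplit

# Assumed list of image extensions; extend as needed.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def allowed_by_workaround(url: str, site_host: str) -> bool:
    """Accept every internal URL, and external URLs only if they look like images."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host == site_host.lower():
        return True
    # Only the path is checked, so a query string does not hide the extension.
    return parts.path.lower().endswith(IMAGE_EXTENSIONS)
```

Note that a "not found" page served at an image URL still passes this test
and gets saved; what the rule blocks are the links found inside that page, so
the crawl does not spread over the external site.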

Best regards,
Tiago
 