HTTrack Website Copier
Free software offline browser - FORUM
Subject: Search known dirs for more files automatically?
Author: NucAr
Date: 08/19/2019 12:04
 
When attempting to mirror an entire website such as <http://www.someweb.com/>, I
begin by entering that URL into the "Web Addresses" field, and

-* +http://www.someweb.com/*

into the "Scan Rules" field. This method misses files that are not linked from
any HTML page of the website (no backlinks).  For example, if a file

<http://www.someweb.com/files/foo/non_linked_file.zip>

appears in a file-server-style directory listing that is accessible simply by
visiting <http://www.someweb.com/files/foo/>, but is never linked in any HTML
page, then it will not be downloaded.  Another file such as

<http://www.someweb.com/files/foo/linked_file.zip>

that is linked in an HTML page naturally will be downloaded.  But even though
the directory /files/foo/ is now known to exist thanks to linked_file.zip,
HTTrack does not search through that directory for more files.  If I add
<http://www.someweb.com/files/foo/> to "Web Addresses," then I get all files in
that directory.  This especially becomes a problem when higher-level
directories, such as /files/, are not publicly accessible, meaning that every
single known subdirectory must be appended to "Web Addresses" manually after
it is discovered in a previous mirroring attempt.
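To make the manual step concrete, here is a minimal Python sketch of how the candidate directories could be derived from one linked file URL. The function name `ancestor_dirs` is hypothetical and this is not something HTTrack provides; it only does string handling on the URL and makes no claim about which of the resulting directories are publicly accessible.

```python
from urllib.parse import urlsplit, urlunsplit


def ancestor_dirs(url):
    """Return the directory URLs above `url`, deepest first, excluding the site root.

    Hypothetical helper for illustration only; it does not contact the server.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if not parts.path.endswith("/"):
        # The last segment is a file name, so drop it.
        segments = segments[:-1]
    dirs = []
    for depth in range(len(segments), 0, -1):
        path = "/" + "/".join(segments[:depth]) + "/"
        # Query string and fragment are discarded on purpose.
        dirs.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return dirs


if __name__ == "__main__":
    for d in ancestor_dirs("http://www.someweb.com/files/foo/linked_file.zip"):
        print(d)
```

For the linked file above this yields /files/foo/ and then /files/; each one would still have to be tried, since a directory such as /files/ may refuse to list its contents.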

Thus it seems that HTTrack relies only on links, and does not search through
known directories for additional files.  Various settings such as the "Action"
and "Spider" options do not help.  Therefore, the only way I can see to
guarantee that all files in all known directories are downloaded is first to
mirror the website naively with only <http://www.someweb.com/> in "Web
Addresses," then to mirror it a second time with all known directories
appended to "Web Addresses," and in general to repeat this process until no
new files are discovered.  For a large tree of directories with varying
permissions, this requires using something such as the "find" command in a
Unix terminal to get the full paths of every known directory in the previous
mirror attempt to append to "Web Addresses" in the next mirror attempt.
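As a rough illustration of that "find" step, the following Python sketch walks a previous mirror on disk and prints one directory URL per line, ready to paste into "Web Addresses." It rests on two assumptions that should be checked against the actual mirror: that the mirrored files sit under a folder named after the host inside the project folder, and that directory names on disk match the URL path segments unchanged. The function name `known_dir_urls` and the path `my_mirror` are made up for the example.

```python
import os


def known_dir_urls(mirror_root, host, scheme="http"):
    """List a URL for every subdirectory found under <mirror_root>/<host>.

    Assumes the on-disk layout mirrors the URL paths one to one.
    """
    host_dir = os.path.join(mirror_root, host)
    urls = []
    for dirpath, dirnames, _filenames in os.walk(host_dir):
        dirnames.sort()  # sort in place so the walk order is deterministic
        rel = os.path.relpath(dirpath, host_dir)
        if rel == ".":
            continue  # the site root is already in "Web Addresses"
        urls.append(f"{scheme}://{host}/" + rel.replace(os.sep, "/") + "/")
    return urls


if __name__ == "__main__":
    # "my_mirror" is a placeholder for the project folder of the previous run.
    for url in known_dir_urls("my_mirror", "www.someweb.com"):
        print(url)
```

Running this after each mirror attempt and comparing the output with the previous list would show when no new directories have turned up, which is the stopping point of the repeat-until-done procedure described above.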

Is there a way to tell HTTrack to search through known directories for more
files automatically?
 