| I'm trying to get the information from a message board,
smarthome.custhelp.com, but WinHTTrack appears to be
ignoring some scan rules I've added to simplify copying
the board. Message boards tend to generate a lot of links
and extra downloaded pages unless they are handled
carefully, with some PHP/ASP/etc. scripts excluded. I
usually exclude the 'reply', 'login', and other irrelevant
script links. In this case I'm trying to exclude
acct_login.php, so I added the scan rule
-*acct_login.php*, but when I start copying the site I
notice that links to this PHP page are still being
accessed and copied. The top of each copied file says
something like:
<!-- Mirrored from smarthome.custhelp.com/cgi-bin/smarthome.cfg/php/enduser/acct_login.php?p_sid=1r5QrPrg&p_lva=17&p_sp=cF9zcmNoPSZwX2dyaWRzb3J0PSZwX3Jvd19jbnQ9MjEzJnBfcGFnZT0x&p_next_page=myovr.php&p_li= by HTTrack Website Copier/3.21 [XR&CO'2002], Tue, 15 Oct 2002 22:47:27 GMT -->
The top of the hts-log.txt file shows this:
HTTrack3.21+swf launched on Tue, 15 Oct 2002 17:46:34 at
http://smarthome.custhelp.com/
http://smarthome.custhelp.com/cgi-bin/smarthome.cfg/php/enduser/std_alp.php
-*acct_login.php* -*ask.php* -*email_adp.php*
-*answer_fdbck.php* -*help_general.php* -*help_search.php*
+*.css +*.js -ad.doubleclick.net/* +*.gif +*.jpg +*.png
+*.tif +*.bmp +*.zip +*.tar +*.tgz +*.gz +*.rar +*.z +*.exe
+*.mov +*.mpg +*.mpeg +*.avi +*.asf +*.mp3 +*.mp2 +*.rm
+*.wav +*.vob +*.qt +*.vid +*.ac3
(winhttrack -qwr5C2%Pns0u1z%sN0%I0p3DaK0c4T1200R6H0f2%c5#f
-F "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
-%F "<!-- Mirrored from %s%s by HTTrack Website Copier/3.20RC8D [XR&CO'2002], %s -->"
-%l "en, *"
http://smarthome.custhelp.com/
http://smarthome.custhelp.com/cgi-bin/smarthome.cfg/php/enduser/std_alp.php
-O "I:\web-archive problematic\smarthome.custhelp.com 20021015","I:\web-archive problematic\smarthome.custhelp.com 20021015"
-*acct_login.php* -*ask.php* -*email_adp.php*
-*answer_fdbck.php* -*help_general.php* -*help_search.php*
+*.css +*.js -ad.doubleclick.net/* +*.gif +*.jpg +*.png
+*.tif +*.bmp +*.zip +*.tar +*.tgz +*.gz +*.rar +*.z +*.exe
+*.mov +*.mpg +*.mpeg +*.avi +*.asf +*.mp3 +*.mp2 +*.rm
+*.wav +*.vob +*.qt +*.vid +*.ac3 )
And in hts-log.txt, within the first 20-30 downloaded
files, there is this:
17:46:40 Info: engine: save-name: local name: smarthome.custhelp.com/cgi-bin/smarthome.cfg/php/enduser/acct_login.html -> smarthome.custhelp.com/cgi-bin/smarthome.cfg/php/enduser/acct_loginc652.html
17:46:40 Info: engine: transfer-status: link recorded:
Is HTTrack treating the PHP page as HTML and therefore not
applying the scan rules to it?
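To rule out a typo in the pattern itself, here is a rough
sanity check in Python. It is only a sketch: this is not
HTTrack's own matcher, Python's fnmatchcase merely
approximates its wildcard handling, and matching_rules is a
helper name I made up. It tests the URL from the "Mirrored
from" comment above against a few of my scan rules.

```python
from fnmatch import fnmatchcase

# URL copied from the "Mirrored from" comment (no scheme, as in the comment).
URL = (
    "smarthome.custhelp.com/cgi-bin/smarthome.cfg/php/enduser/acct_login.php"
    "?p_sid=1r5QrPrg&p_lva=17"
    "&p_sp=cF9zcmNoPSZwX2dyaWRzb3J0PSZwX3Jvd19jbnQ9MjEzJnBfcGFnZT0x"
    "&p_next_page=myovr.php&p_li="
)

# A subset of the scan rules from my project.
# A leading "-" means exclude, a leading "+" means include.
RULES = ["-*acct_login.php*", "-*ask.php*", "+*.css", "+*.js", "+*.gif"]


def matching_rules(url, rules):
    """Return the rules whose wildcard part matches the whole url.

    Hypothetical helper: fnmatchcase stands in for HTTrack's matcher,
    which may differ in details.
    """
    return [rule for rule in rules if fnmatchcase(url, rule[1:])]


print(matching_rules(URL, RULES))
```

With this approximation, the only rule that matches the URL
is the exclusion for acct_login.php, so the pattern looks
right to me and none of my "+" rules should be re-including
the page.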
Finally, I know there are JavaScript problems with HTTrack
handling this board, but I'll handle those another way
(manually specifying the start pages in HTTrack...). I'm
also going to lower the mirror depth level. This problem
in particular looks different from a JavaScript problem. | |