HTTrack Website Copier
Free software offline browser - FORUM
Subject: HTTrack ignoring scan rule @smarthome.custhelp.com
Author: Haudy Kazemi
Date: 10/16/2002 08:39
 
I'm trying to get the information from a message board, 
smarthome.custhelp.com, but WinHTTrack appears to be 
ignoring some scan rules I've added to simplify copying of 
the board.  Message boards tend to generate a lot of links 
and extra downloaded pages unless they are handled carefully, 
with some of the PHP/ASP/etc. scripts excluded.  I usually 
exclude the 'reply', 'login', and other irrelevant script links.

In this case I'm trying to exclude acct_login.php, so I 
added the scan rule -*acct_login.php*, but when I start 
copying the site I notice that links to this PHP page are 
still being accessed and copied.  The top of the copied files 
says something like:

<!-- Mirrored from
smarthome.custhelp.com/cgi-bin/smarthome.cfg/php/enduser/acct_login.php?p_sid=1r5QrPrg&p_lva=17&p_sp=cF9zcmNoPSZwX2dyaWRzb3J0PSZwX3Jvd19jbnQ9MjEzJnBfcGFnZT0x&p_next_page=myovr.php&p_li=
by HTTrack Website Copier/3.21 [XR&CO'2002], Tue, 15 Oct 2002
22:47:27 GMT -->


The top of the hts-log.txt file shows this:

HTTrack3.21+swf launched on Tue, 15 Oct 2002 17:46:34 at
http://smarthome.custhelp.com/
http://smarthome.custhelp.com/cgi-bin/smarthome.cfg/php/enduser/std_alp.php
-*acct_login.php* -*ask.php* -*email_adp.php*
-*answer_fdbck.php* -*help_general.php* -*help_search.php*
+*.css +*.js -ad.doubleclick.net/* +*.gif +*.jpg +*.png
+*.tif +*.bmp +*.zip +*.tar +*.tgz +*.gz +*.rar +*.z +*.exe
+*.mov +*.mpg +*.mpeg +*.avi +*.asf +*.mp3 +*.mp2 +*.rm
+*.wav +*.vob +*.qt +*.vid +*.ac3

(winhttrack -qwr5C2%Pns0u1z%sN0%I0p3DaK0c4T1200R6H0f2%c5#f
-F "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
-%F "<!-- Mirrored from %s%s by HTTrack Website Copier/3.20RC8D [XR&CO'2002], %s -->"
-%l "en, *"
http://smarthome.custhelp.com/
http://smarthome.custhelp.com/cgi-bin/smarthome.cfg/php/enduser/std_alp.php
-O "I:\web-archive problematic\smarthome.custhelp.com 20021015","I:\web-archive problematic\smarthome.custhelp.com 20021015"
-*acct_login.php* -*ask.php* -*email_adp.php*
-*answer_fdbck.php* -*help_general.php* -*help_search.php*
+*.css +*.js -ad.doubleclick.net/* +*.gif +*.jpg +*.png
+*.tif +*.bmp +*.zip +*.tar +*.tgz +*.gz +*.rar +*.z +*.exe
+*.mov +*.mpg +*.mpeg +*.avi +*.asf +*.mp3 +*.mp2 +*.rm
+*.wav +*.vob +*.qt +*.vid +*.ac3 )
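
As a sanity check of the rules themselves, here is a minimal Python sketch. It is not HTTrack's own matcher: it assumes that `*` behaves like a plain shell wildcard (approximated here with `fnmatch`) and that the last matching rule wins. The `decide` helper is hypothetical, the query string is shortened, and only a subset of the rules from the log is listed.

```python
from fnmatch import fnmatchcase

# A subset of the scan rules from the hts-log.txt header, in the same order.
RULES = [
    "-*acct_login.php*",
    "-*ask.php*",
    "-*email_adp.php*",
    "+*.css",
    "+*.js",
    "-ad.doubleclick.net/*",
    "+*.gif",
]

def decide(url, rules, default=None):
    """Return '+' or '-' from the last matching rule (assumed priority order).

    Returns `default` when no rule matches at all.
    """
    verdict = default
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if fnmatchcase(url, pattern):
            verdict = sign  # a later match overrides an earlier one
    return verdict

# The login URL from the "Mirrored from" comment, query string shortened.
URL = ("smarthome.custhelp.com/cgi-bin/smarthome.cfg/php/enduser/"
       "acct_login.php?p_sid=1r5QrPrg&p_lva=17")

print(decide(URL, RULES))  # prints "-"
```

Under those assumptions the login URL comes out as excluded, so the wording of the rule itself looks right and the problem is more likely in how or when the rule is applied.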

And in hts-log.txt within the first 20-30 downloaded files 
is:
17:46:40	Info: 	engine: save-name: local name:
smarthome.custhelp.com/cgi-bin/smarthome.cfg/php/enduser/acct_login.html ->
smarthome.custhelp.com/cgi-bin/smarthome.cfg/php/enduser/acct_loginc652.html
17:46:40	Info: 	engine: transfer-status: link recorded:

Is HTTrack considering the PHP page to be HTML and not 
applying the scan rules to it?

Finally, I know there are JavaScript problems with HTTrack 
handling this board, but I'll handle those another way 
(manually specifying the start pages in HTTrack...).  I'm 
also going to lower the mirror depth level.  This problem 
in particular looks different from a JavaScript problem. 
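
To make the PHP-versus-HTML question concrete, here is a small Python sketch. It only illustrates the guess and does not show how HTTrack really works: it assumes plain shell-style wildcards (via `fnmatch`) and compares the rule against the remote URL and against the local name reported in the save-name log line. The `matches` helper is hypothetical and the query string is shortened.

```python
from fnmatch import fnmatchcase

# The exclusion rule, without its leading '-'.
PATTERN = "*acct_login.php*"

BASE = "smarthome.custhelp.com/cgi-bin/smarthome.cfg/php/enduser/"
REMOTE = BASE + "acct_login.php?p_sid=1r5QrPrg"  # remote URL, query shortened
LOCAL = BASE + "acct_loginc652.html"             # local name from hts-log.txt

def matches(name, pattern=PATTERN):
    """True if the wildcard pattern matches the whole name."""
    return fnmatchcase(name, pattern)

for label, name in (("remote", REMOTE), ("local", LOCAL)):
    print(label, matches(name))
```

The pattern matches the remote URL but not the renamed local file, so if the rule were being tested against the saved .html name it would never fire. Whether HTTrack does that is exactly what I'm asking.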
 