HTTrack Website Copier
Free software offline browser - FORUM
Subject: Unusual possible bug, ignoring scan rules.
Author: Cryptor
Date: 04/15/2002 17:33
 
Recently, I was trying to mirror a website with HTTrack
v3.16 for Windows. It was an archive of files, with the
files in www.website.com/subdir/YYYY/M/D/index.html,
where YYYY is the 4-digit year, and M and D are the month
and day, in either 1- or 2-digit form. Each 'index.html'
linked to pages in its own directory. Unfortunately,
there was no page that linked to all of these index
pages; they could only be reached via a CGI search.
So I made a file consisting of this, using a little
Python script...

<http://www.website.com/subdir/1999/1/1/>
<http://www.website.com/subdir/1999/1/2/>
<http://www.website.com/subdir/1999/1/3/>
.
.
.
<http://www.website.com/subdir/2001/12/31/>


... where those dots in between were about 1000 days
from 1999 to 2001. I fed this file to HTTrack as a list
of URLs to read. I then made a few scan rules to limit
the size of this mirror to only the necessary
information. It started mirroring fine, but it ignored
my scan rules, and still downloaded the files I had
banned with '-(filespec)'. I played with the scan rules
a little, but to no avail. I then created a new URL
file, this time with only one month's worth of entries
(30). This time it followed my scan rules fine. I
increased the number of entries to the entirety of 1999,
and the scan rules worked fine again. Only when I made
the number of entries the full three and a bit years
were the scan rules ignored. I know this is an unusually
large file (over 1000 lines, around 60k) but I could
think of no other way to mirror this site. But since the
scan rules failed, I have to split the mirror into three
mirrors, one for each year.
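
For anyone wanting to reproduce this, here is a rough sketch of how such a URL list could be generated and split per year. It is not the script I originally used. The host is the placeholder from this post, and the function names and the 'urls_YYYY.txt' file names are made up for illustration. It assumes one plain URL per line.

```python
from collections import defaultdict
from datetime import date, timedelta

# Placeholder base URL taken from the post; replace with the real site.
BASE = "http://www.website.com/subdir"


def urls_by_year(start, end):
    """Group one index URL per day (start..end inclusive) by year."""
    groups = defaultdict(list)
    day = start
    while day <= end:
        # Month and day are not zero-padded, matching the YYYY/M/D layout.
        groups[day.year].append(f"{BASE}/{day.year}/{day.month}/{day.day}/")
        day += timedelta(days=1)
    return dict(groups)


def write_url_lists(start, end, prefix="urls"):
    """Write one URL list file per year; file names are hypothetical."""
    paths = []
    for year, urls in sorted(urls_by_year(start, end).items()):
        path = f"{prefix}_{year}.txt"
        with open(path, "w") as handle:
            handle.write("\n".join(urls) + "\n")
        paths.append(path)
    return paths
```

Calling write_url_lists(date(1999, 1, 1), date(2001, 12, 31)) would produce three files, one per year, each of which can be fed to a separate mirror as described above.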

Any help would be appreciated, and thanks for the great
website copier.

Thanks
 