HTTrack Website Copier
Free software offline browser - FORUM
Subject: User-Defined Structure & scan rules
Author: Bandit
Date: 11/25/2009 18:52
 


> >
> > -* +*.jpg
> > Along with a URL list from a txt file, each URL in a line:
> > <http://www.thecoverproject.net/view.php?cover_id=1>
> > And it downloads download_cover_randomnumber.php
> > files that are like the jpg images I'm trying to get
> > in size. So I tried to change their extension to
> > jpg, and it worked :)
> >
>
> I don't understand that. Unless you overrode mime types
>
<http://www.thecoverproject.net/download_cover.php?file=n64_1080snowboarding.jpg>
> should have returned an image and be properly
> renamed to download_coverHHHH.jpg
>

I don't understand it either, but I have tested this at length and have found
no way around it (other than the build-structure workaround below).  I have set
"--assume download_cover.php=image/jpeg" on and off and turned "Force old
HTTP/1.0 requests" on and off (among other things, in various combinations).

Using the debug log, I've found these response headers for
 www.thecoverproject.net/download_cover.php?file={misc}.jpg:
 Content-Disposition: attachment;filename={misc}.jpg
 Content-Type: application/x-download
but I don't know why HTT still saves the file locally with a .php extension.

The headers also show that the server does not report a "Content-Length" for a
download_cover.php request.  Because of this, I presume, trying to filter
(include OR exclude) by size fails.  Too bad, because some of these are big-ass
files :)
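
(If you want to double-check those headers outside of HTTrack, a few lines of
Python will do it.  This is only a sketch, and the file= value is just the
example URL quoted above.)

 # check_headers.py - print the response headers HTTrack sees (illustration only)
 import urllib.request

 URL = ("http://www.thecoverproject.net/"
        "download_cover.php?file=n64_1080snowboarding.jpg")   # example URL

 # Headers only; the body is never read.
 with urllib.request.urlopen(URL) as resp:
     for field in ("Content-Type", "Content-Disposition", "Content-Length",
                   "Expires", "Pragma", "Cache-Control"):
         print(f"{field}: {resp.headers.get(field)}")   # None = header not sent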

I also noticed these headers:
 Expires: Thu, 19 Nov 1981 08:52:00 GMT
 Pragma: no-cache
 Cache-Control: private
I'm not sure if it's related, but HTT seems to think all these files have
never been cached and need to be re-downloaded each time I run a test.  Seems
like the PHP may be poorly written (ya think? see above) as this is a HUGE
waste of bandwidth for that site.  Some of these images are 4-8MB and it
appears that the files will be transferred EVERY time a user clicks the link,
even if the data from a previous request is sitting in their browser's cache.

Just a note, then, that "updating" is not a good plan because it will
(probably) take the same amount of time and bandwidth as an original download. 
(IDK, at this point maybe I "tweaked" something wrong and shouldn't have this
problem LOL)
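
(For what it's worth, here's a quick way to check whether the server even gives
an updater anything to revalidate against.  If it sends no Last-Modified or
ETag, then as far as I know an "update" run has little choice but to pull the
whole file again.  Same example URL as before; just a sketch.)

 # check_validators.py - can a re-run skip unchanged covers? (illustration only)
 import urllib.error
 import urllib.request

 URL = ("http://www.thecoverproject.net/"
        "download_cover.php?file=n64_1080snowboarding.jpg")   # example URL

 with urllib.request.urlopen(URL) as resp:
     last_mod = resp.headers.get("Last-Modified")
     etag = resp.headers.get("ETag")

 if not last_mod and not etag:
     print("No Last-Modified or ETag: nothing to revalidate, expect re-downloads.")
 else:
     # Ask the server "has this changed since I last fetched it?"
     req = urllib.request.Request(URL)
     if last_mod:
         req.add_header("If-Modified-Since", last_mod)
     if etag:
         req.add_header("If-None-Match", etag)
     try:
         with urllib.request.urlopen(req) as resp2:
             print(f"Server re-sent the file anyway (HTTP {resp2.status}).")
     except urllib.error.HTTPError as err:
         print(f"Server answered HTTP {err.code}; 304 would mean updates can skip it.")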

> >
> > Is there any way that it could grab the jpg in their
> > original name? I mean, as they are displayed on the
> > original site (console_game_number_region.jpg).
> >
> 
> <http://httrack.kauler.com/help/User-defined_structure>
>

This isn't exactly what you want, but it's as close to it as I could figure
out.  Again I am assuming WinHTTrack: go to the Build tab and, under Local
Structure Type, click Options.
Change
%h%p/%n%q.%t
to
%h%p/%n%[cover_id:.ID=:::].%t%[file:.:::]

Tell it "OK" when it warns you about using a user-defined structure.  (Check
"Do Not Purge Old Files" while you're there.  Unless you *specifically*
want/need it to purge them, you won't want it getting rid of these on a whim. 
And "un"-purging is a pain even when possible! LOL)

Note the "file" param appends a "." and the value (i.e. filename) found in the
query string's "file=" parameter to the end of "download_cover.php",
effectively making it a ".jpg" file (as long as the server-side output is
correct) and giving you the filename that you want.  In the end you get a
bunch of files named download_cover.php.{misc}.jpg and you can use a utility
like CKRename[*1] to clean those up.  (You could also probably avoid that
cleanup step by using a structure like
"%h/%[cover_id:View.ID=:.html::]%[file::::]" instead.)
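
(If you'd rather not install CKRename, a few lines of Python can do the same
cleanup.  This assumes the download_cover.php.{misc}.jpg names that the
structure above produces; the mirror folder path is just a placeholder you'd
change.)

 # rename_covers.py - strip the "download_cover.php." prefix (illustration only)
 import os

 MIRROR_DIR = r"C:\My Web Sites\covers\www.thecoverproject.net"   # placeholder

 PREFIX = "download_cover.php."
 for name in os.listdir(MIRROR_DIR):
     if name.startswith(PREFIX) and name.lower().endswith(".jpg"):
         new_name = name[len(PREFIX):]            # e.g. n64_1080snowboarding.jpg
         dst = os.path.join(MIRROR_DIR, new_name)
         if not os.path.exists(dst):              # don't clobber anything
             os.rename(os.path.join(MIRROR_DIR, name), dst)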

The "cover_id" param puts that ID number in your view.?.html filenames for
your reference, so you can quickly and easily tell which cover IDs have been
scanned or mirrored.  (The default structure's %q MD5 may prevent filename
clashes, but it gives no clue what the mirror contains.)  I've found this build
structure to work with
the following filters:
-* +*thecoverproject.net/view.php?cover_id=* 
+*thecoverproject.net/download_cover*.jpg 
and starting URL's in the form of
<http://www.thecoverproject.net/view.php?cover_id=1>
<http://www.thecoverproject.net/view.php?cover_id=2>
...
<http://www.thecoverproject.net/view.php?cover_id=12345>
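
(If you're building that starting-URL list by hand, a tiny script saves some
typing.  12345 here is only because it's the example above; set the range to
whatever the highest cover_id you actually want is.)

 # make_url_list.py - write the starting URLs, one per line (illustration only)
 LAST_ID = 12345   # example upper bound from above; adjust to taste

 with open("cover_urls.txt", "w") as f:
     for cover_id in range(1, LAST_ID + 1):
         f.write(f"http://www.thecoverproject.net/view.php?cover_id={cover_id}\n")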

Setting the max mirror depth to 2 yields 1 "cover" for each starting URL, even
though additional view.php's are found and scanned (not a problem).  If you
set it to 1, it will not download the images.  If you set it to 3, all the
starting URL's are scanned and all the links contained in those are scanned,
but this may make it more difficult to keep organized if you are going to set
up multiple mirrors for this project.  Setting the max mirror depth above 3
(or leaving it blank/default) does not seem to matter (i.e. same as setting it
to 3).

Sorry this is so long, but I was determined to make this work and this was the
best I could come up with.  I tried to explain in the best detail I could
think of to leave you with the fewest unanswered questions :)

Really hope it helps!
B^p


[*1]link to CKRename:
<http://www.softpedia.com/get/System/File-Management/CKRename.shtml>
 