Hi there,
I spent hours reading the FAQs and searching the forums, but I couldn't find
an answer to my question. Please forgive me if it has already been answered
and I simply failed to apply the information to my problem.
This is the situation:
There's a website offering file downloads. My primary aim is to get those
downloads. Those downloads are 7z files.
The site structure looks roughly like this:
- list of categories
  - category a
    - alphabetical list: A
      - file description page with link to 7z file
    - alphabetical list: B
    - alphabetical list: C
    - alphabetical list: D
    - alphabetical list: E
    - ...
    - alphabetical list: Z
  - category b
  - category c
  - ...
  - category n
My approach is to start at the point of least complexity and then wrap the
automation for the whole site around it. A specific file description page
seemed like a good place to start. Unfortunately, what I thought would be
the "least complex" part is giving me serious headaches.
A file description page has a URL such as
<http://website.net/details-840.htm>
The corresponding download URL would be
<http://website.net/download.php?id=840>
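
To make the relationship between the two URLs explicit, here is a minimal
Python sketch. The function name is made up for illustration, and it assumes
the numeric ID is the only thing that differs between a details URL and its
download URL, which I have only seen for this one example.

```python
import re

def details_to_download_url(details_url):
    """Map a details page URL to its download URL (hypothetical helper)."""
    # Assumption: details pages always look like <site>/details-<id>.htm
    match = re.fullmatch(r"(https?://[^/]+)/details-(\d+)\.htm", details_url)
    if match is None:
        raise ValueError(f"not a details URL: {details_url}")
    base, file_id = match.groups()
    # Assumption: the same ID is reused as the "id" query parameter
    return f"{base}/download.php?id={file_id}"
```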
When I start the download in a web browser, Firefox's download dialog appears,
offering to save the file "Dai Meiro - Meikyuu no Tatsujin.7z".
LiveHTTPHeaders transcribes the following:
----------------------------------------------------------
<http://website.net/download.php?id=840>
GET /download.php?id=840 HTTP/1.1
Host: website.net
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15 ( .NET CLR 3.5.30729; .NET4.0C)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Referer: <http://website.net/details-840.htm>
Cookie: __utma=8087398.1021369519.1286700001.1301221866.1301227586.14;
__utmz=8087398.1286700001.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none);
style_cookie=printonly; phpbb3_u76eg_u=1; phpbb3_u76eg_k=;
phpbb3_u76eg_sid=4410861c3004b346d3c3e0a478e20e8c;
PHPSESSID=04eec00f42dc667046d3ca4db1e6cb7a; __utmb=8087398.2.10.1301227586;
__utmc=8087398
HTTP/1.1 200 OK
Date: Sun, 27 Mar 2011 12:11:25 GMT
Server: Apache/2.2.9 (Debian) PHP/5.2.6-1+lenny9 with Suhosin-Patch
X-Powered-By: PHP/5.2.6-1+lenny9
Expires: 0
Cache-Control: must-revalidate, post-check=0, pre-check=0, private
Pragma: public
Content-Disposition: attachment; filename="Dai Meiro - Meikyuu no Tatsujin.7z";
Content-Transfer-Encoding: binary
Content-Length: 101729
Keep-Alive: timeout=15, max=99
Connection: Keep-Alive
Content-Type: application/force-download
----------------------------------------------------------
It seems like the website owner doesn't like hot linking. Retrieving e.g.
<http://website.net/download.php?id=840> directly leads to a notice page.
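
My guess, based on the Referer header in the transcript above, is that the
server checks where the request comes from. A minimal Python sketch of such a
request, outside of HTTrack, might look like the following. Whether the site
checks the Referer, the session cookie, or both is an assumption I have not
verified, and `build_download_request` is a made-up name.

```python
import urllib.request

def build_download_request(file_id, site="http://website.net"):
    """Build (but do not send) a download request with a Referer header."""
    download_url = f"{site}/download.php?id={file_id}"
    # Assumption: the server accepts the matching details page as referrer
    referer = f"{site}/details-{file_id}.htm"
    return urllib.request.Request(download_url, headers={"Referer": referer})
```

Passing the returned object to `urllib.request.urlopen` would actually send
the request.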
Letting HTTrack crawl the site works, sort of. It does download the file,
but instead of naming it "Dai Meiro - Meikyuu no Tatsujin.7z", it saves it
as download2c95.php. That's the point where I'm stuck.
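
The name I expect is the one the server sends in the Content-Disposition
header shown in the transcript. As an illustration of what I mean (this is
plain Python from the standard library, not an HTTrack option, and the helper
name is my own), the filename could be read from that header like this:

```python
from email.message import Message

def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    msg = Message()
    msg["Content-Disposition"] = header_value
    # get_filename() reads the "filename" parameter and strips the quotes
    return msg.get_filename()
```

A file saved under a generic name such as download2c95.php could then be
renamed to the returned value after the download.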
I found several threads about this behaviour, all pointing to
<http://httrack.kauler.com/help/User-defined_structure>
But I'm not able to turn this information into a solution for my case.
This thread seems to be similar to my problem:
<http://forum.httrack.com/readmsg/21112/index.html>
But, unfortunately, I couldn't figure out any solution for my case.
Is this
<http://forum.httrack.com/readmsg/21116/21112/index.html>
still the case, causing all my trouble?
Or am I doing something wrong?