I've seen larger, more established websites do this: the page loads a file,
and you can find the file's direct URL by viewing the page source and access
it that way. However, when you try to access the directory containing that
file and others like it, the site returns a 404.
For example, sites like NotDoppler have this protection: you can crawl
<http://i.notdoppler.com/files/strikeforceheroes2.swf>, but when you try to
spider <http://i.notdoppler.com/files/> or <http://i.notdoppler.com>, HTTrack
returns:

    Error: "Not Found" (404) at link i.notdoppler.com/
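
To rule out an HTTrack quirk, the same asymmetry is easy to reproduce with a
few HEAD requests. This is just a minimal Python sketch using the standard
library; the URLs are the ones above, and I'm assuming the server answers
HEAD the same way it answers GET:

    import urllib.request
    import urllib.error

    # Probe the direct file URL and the directory URLs from the example above.
    urls = [
        "http://i.notdoppler.com/files/strikeforceheroes2.swf",  # direct file
        "http://i.notdoppler.com/files/",                        # directory
        "http://i.notdoppler.com/",                              # site root
    ]

    for url in urls:
        req = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(req) as resp:
                print(url, "->", resp.status)   # the file URL succeeds (200)
        except urllib.error.HTTPError as e:
            print(url, "->", e.code)            # the directories report 404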
Other similar sites, such as 1cup1coffee.com, do not have these restrictions
and allow downloading all of the SWF content on the site.
HTTrack is set to ignore robots.txt, and entering the directory URLs above in
a browser returns a 404 as well. How can I download the data, and how does
this protection work?
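
For what it's worth, the only workaround I can think of is to skip spidering
entirely and fetch files by explicit URL. A minimal sketch, assuming the
filenames are already known (e.g. collected from the HTML pages that embed
them, since the directory itself can't be listed):

    import os
    import urllib.request

    base = "http://i.notdoppler.com/files/"
    # Hypothetical list: only the first name comes from the example above;
    # in practice these would be scraped from the embedding pages.
    filenames = ["strikeforceheroes2.swf"]

    os.makedirs("swf", exist_ok=True)
    for name in filenames:
        dest = os.path.join("swf", name)
        urllib.request.urlretrieve(base + name, dest)  # direct GET, no crawling
        print("saved", dest)

Is that the right approach, or is there a way to get a crawler to discover
the files despite the 404s?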