HTTrack Website Copier
Free software offline browser - FORUM
Subject: general understanding
Author: CB
Date: 09/14/2012 01:13
This may only require a simple yes or no answer ... 
If I understand correctly, when we create a mirror, we end up with a
collection of html files and associated images, etc.
So even if a server runs php, cgi, etc., we don't actually download that code, just
the html those server-side pages generate for our browsers to see ... therefore
a mirror may take much more hard-drive space than the original host,
since that content may be dynamically generated. If we download html only,
there is no need to use filters to get .asp, .php, etc. ... we would only assign a
mime type to .asp, .php, etc. if we know those dynamic pages produce html, to
help the httrack engine know how to parse them.
Correct? Does that also apply to Flash-based websites? Essentially I want to get
a list of the links in a website (from the new.txt log file) without
downloading anything if possible, but I know the engine needs to download the
html so it can parse it - I just want to do that as efficiently as possible. 
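For reference, pulling just the URLs back out of a log like new.txt could be done with a short script along these lines (a sketch only: it assumes each log line carries its URLs as plain http:// tokens, and I haven't verified the exact new.txt column layout across HTTrack versions):

```python
import re

# Grab anything that looks like an http(s) URL on each log line.
# The exact new.txt column layout may differ between HTTrack versions,
# so we deliberately avoid depending on column positions.
URL_RE = re.compile(r'https?://[^\s()]+')

def extract_links(log_text):
    """Return the unique URLs found in the log text, in first-seen order."""
    seen = []
    for line in log_text.splitlines():
        for url in URL_RE.findall(line):
            if url not in seen:
                seen.append(url)
    return seen
```

Usage would just be `extract_links(open("new.txt").read())`.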
I currently use "Store html files" in the Experts Only tab (Primary Scan Rule)
and have filters excluding all the various image, audio, and video file types.
Should I skip the filters and use "Just scan" in the Experts Only tab
instead? Also, if the engine stalls on parsing a page, is there a way to set a
timeout so it will skip parsing that page?
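In case it helps, here is roughly what my current setup looks like as a command line; the flag spellings are from my reading of the httrack manual, so please double-check them against your version:

```shell
# Hypothetical command-line equivalent of the GUI settings above.
#   -p1   store html files only  (the "Store html files" scan rule)
#   -p0   just scan, store nothing (the "Just scan" scan rule)
#   -T30  give up on a non-responding link after 30 seconds
#   "-*.jpg" etc. are the exclusion filters for media files
httrack "http://example.com/" -O ./mirror -p1 -T30 "-*.jpg" "-*.png" "-*.mp3" "-*.avi"
```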

All articles:
general understanding, CB, 09/14/2012 01:13
Re: general understanding, 09/14/2012 14:22

