Re: general understanding - HTTrack Website Copier Forum

Subject: Re: general understanding

Author: WHRoeder

Date: 09/14/2012 14:22

1) Always post the command line used (or log file line two) so we know what the
site is, what your settings are, etc.
2) Always post the URLs you're not getting and from what URL.
3) Always post anything USEFUL from the log file.
4) If you want everything, use the near flag (get non-HTML files related to a
page), not filters.

> If I understand correctly, when we create a mirror,
> we end up with a collection of html files and
> associated images, etc.
> So even if a server has php, cgi, etc, we don't
> actually download that, just the html those
> server-side pages create for our browsers to see ...
> therefore a mirror may be much larger (hard drive
> space taken) than the original host since that
> content may be dynamically generated.  
So far correct

> If we
> download html only, there is no need to use filters
There's an option to download only HTML.

> to get .asp, .php, etc ... we would only use a mime
> type for .asp, .php, etc if we know the dynamic
Server pages may have no extension.

> pages create html in order to help the httrack
> engine know how to parse the .asp/.php, etc.
> correct?
You can only GET HTML and the files it references from a web server. That has
nothing to do with HTTrack.

> Does that also apply to Flash-based websites?
HTTrack will download the Flash file just fine. But it is the Flash file that
downloads the content from the server. In a mirror, your machine is the server
but the files are not there, so it does not work.

> Essentially I want to get a list of the links in a
> website (from the new.txt log file) without
> downloading anything if possible, but I know the
If you don't download an image, I don't think its url will be in the file.

>  I currently use "Store html files" in the Experts
That may work.

> Only tab (Primary Scan Rule) and have filters
> excluding all the various image, audio, and video
> extensions.
If you exclude them, they won't be downloaded, so they won't be in new.txt.
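
If the goal is just a list of URLs after a run, pulling them out of new.txt
can be sketched roughly as below. This is an assumption-heavy sketch: it
assumes the file is tab-separated and that the URL field starts with http://
or https://. The function name extract_urls and the file path in the usage
note are hypothetical, and the column layout is not taken from this thread, so
check it against your own new.txt first.

```python
def extract_urls(lines):
    """Return the first URL-looking field from each tab-separated line.

    Assumption: fields are separated by tabs and the downloaded URL is the
    first field that starts with http:// or https://. Lines without such a
    field (for example a header line) are skipped.
    """
    urls = []
    for line in lines:
        for field in line.rstrip("\n").split("\t"):
            field = field.strip()
            if field.startswith(("http://", "https://")):
                urls.append(field)
                break  # keep only the first URL-looking field per line
    return urls
```

Usage would be something like opening the log with
open("hts-cache/new.txt", encoding="utf-8", errors="replace") and passing the
file object to extract_urls; that path is an assumption about where your
mirror keeps the file.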

> Also, if the engine stalls on parsing a page, is
> there a way to set a timeout so it will skip parsing
> that page?
If it stalls, the program is dead. What you're seeing is misleading: it is
stalling on the download, not the parse.
I always run with timeout=60 retry=9 to avoid temporary problems.
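
For anyone scripting this, the timeout and retry settings above might be put
on a command line as in the sketch below. It assumes the command-line httrack
binary and its short options -T (timeout in seconds), -R (retries), -n (near)
and -O (output path); verify them against httrack --help for your version.
The helper name build_httrack_command, the example URL, and the output
directory are all hypothetical.

```python
import shlex


def build_httrack_command(url, output_dir, timeout_s=60, retries=9, near=False):
    """Build an httrack argument list (a sketch, not an official recipe).

    Assumed options: -O output path, -T timeout in seconds, -R retries,
    -n "near" (also fetch non-HTML files linked from a page).
    """
    cmd = ["httrack", url, "-O", output_dir, f"-T{timeout_s}", f"-R{retries}"]
    if near:
        cmd.append("-n")
    return cmd


if __name__ == "__main__":
    # Hypothetical site and output directory, printed for inspection only.
    print(shlex.join(build_httrack_command("http://example.com/", "./mirror")))
```

The function only builds the argument list and does not start a download, so
you can look at the command (and post it, per point 1) before running it.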
