Thank you for your reply.
All the files have the same head meta tag i.e.
http-equiv="Content-type" content="text/html;charset=iso-8859-1"
and yet some are given the correct .html extension and others are given the
incorrect .txt extension.
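
To put a number on this, a small script along these lines could list the mirrored .txt files that actually contain the HTML meta tag. It is only a rough sketch: the function name, the 4096-byte read limit and the mirror directory argument are my own assumptions, not anything provided by HTTrack.

```python
import os
import re
import sys

# Matches the Content-type meta attribute quoted above, ignoring case,
# optional quotes and spacing around '='.
META_RE = re.compile(rb'http-equiv\s*=\s*["\']?content-type["\']?', re.IGNORECASE)


def find_mislabelled_html(mirror_dir, head_bytes=4096):
    """Return sorted paths of .txt files whose first bytes contain the meta tag."""
    hits = []
    for root, _dirs, files in os.walk(mirror_dir):
        for name in files:
            if not name.lower().endswith(".txt"):
                continue
            path = os.path.join(root, name)
            # Read only the start of the file; the head section comes first.
            with open(path, "rb") as fh:
                head = fh.read(head_bytes)
            if META_RE.search(head):
                hits.append(path)
    return sorted(hits)


if __name__ == "__main__":
    # Usage (hypothetical script name): python find_txt_html.py <mirror directory>
    for found in find_mislabelled_html(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(found)
```

Running it after the first pass and again after the update pass would show whether every affected file really is renamed the second time.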
Interestingly, if I update the archive i.e. run the script again, against an
existing archive (leaving all the cache, log files, etc. in place), all the
files with incorrect .txt extensions are renamed to the correct .html
extensions.
Can you suggest any explanation for this behaviour?
In my experience so far, running each archive twice gives the correct
result, but without a logical explanation of why this happens, I don't feel
I can use the product in a production environment.
Could it be something to do with processing capacity / response times?
The mirroring process takes 2 hours to complete - 5286 links scanned, 5216
files written (178165717 bytes overall) [181347071 bytes received at 24140
bytes/sec], 2.5 requests per connection - and most files on this site are
password protected.
This page <http://httrack.kauler.com/help/MIME_types> talks about using switches
to wait for the file type before starting any download (-%N0) and to re-ask
the server the file type of links (-%D0) - a 'cached-delayed-type-check' (?)
but these switches do not appear in the help for the version I am using and
cause an error if I try to use them.
I hope you don't mind me asking for your opinion on the points I've raised.
HTTrack has served me very well on static .html sites in the past and I don't
want to have to look elsewhere for a solution to these 'extensionless' sites.