Re: a few ideas - HTTrack Website Copier Forum

Subject: Re: a few ideas
Author: Xavier Roche
Date: 05/29/2003 15:00
> i would imagine that the same code that HTTrack uses to 
> save websites could also be used as a link checker - to 
> check that all the links on a website work, up to some 
> depth...  so that's my first suggestion - a link checker 
> be incorporated into HTTrack, sorta.

Err, the "test all links" option should do the trick? 
Note: you cannot test the links of a site without 
downloading the pages that refer to those links.
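
To illustrate the note above, here is a minimal sketch of a depth-limited link check. It is not HTTrack's code: the names LinkExtractor and check_links are made up for this example, and fetch is a hypothetical callable that returns the page body, or None when the URL cannot be retrieved. Each page has to be fetched before the links it contains are even known.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href/src attribute values from a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def check_links(start_url, fetch, max_depth=1):
    """Return {url: ok} for every URL reachable within max_depth.

    fetch(url) is a hypothetical callable (an assumption of this
    sketch) returning the body as a string, or None on failure.
    """
    results = {}
    queue = [(start_url, 0)]
    while queue:
        url, depth = queue.pop(0)
        if url in results:
            continue
        body = fetch(url)  # the page must be downloaded to be tested
        results[url] = body is not None
        if body is None or depth >= max_depth:
            continue
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            queue.append((urljoin(url, link), depth + 1))
    return results
```

With a dictionary standing in for the network, `check_links("http://example.com/", pages.get)` reports every linked URL that is missing from `pages` as broken.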

> my second suggestion is that perhaps HTTrack could 
> rename files based on the document's title, or a 
> portion of the title

Hmm, this isn't possible, unfortunately, because httrack 
names all files BEFORE downloading them. This is also a 
problem when redirections are encountered; that is a real 
limitation, but it was a design choice: the other method, 
waiting for all links before generating the pages, has many 
other problems, such as memory (stack management) problems 
and reget issues.
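
A rough sketch of what "naming before downloading" implies: the local name can only depend on the URL text, because the title is inside a document that has not been fetched yet. The function local_name and its mapping rules are assumptions for illustration, not HTTrack's actual naming scheme.

```python
import posixpath
from urllib.parse import urlsplit


def local_name(url):
    """Map a URL to a local path using only the URL text.

    Hypothetical rule set for this sketch: host name as top-level
    directory, and index.html for directory-style URLs.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"  # directory URLs need a file name
    return posixpath.join(parts.netloc, path.lstrip("/"))
```

Because the name is known as soon as the link is seen, pages that point to this URL can be rewritten immediately, without waiting for the target to arrive.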

> my third suggestion is that perhaps HTTrack could perform 
> alterations on the html as it is saving it.  for 
> example...  say some homepage contains code to display an 
> ad.  there would be a table, an img command, and perhaps 
> some simple javascript.  when archiving a page that's 
> hosted on, say, geocities, you'll be getting ads that 
> aren't part of the original code.  while i don't think 
> HTTrack could figure out which parts of the code to remove, 
> the user could.  the user could figure it out by looking at 
> one or maybe two pages, and then paste that into some text 
> area within some window of HTTrack, and then, as HTTrack is 
> downloading each file, it would delete that portion of the 
> file.

Hmm. Such post-processing is rather complex; and even if 
it can be done using some C callbacks (see the --callback 
wrappers in the newest releases), I doubt that most users 
would be able to use it :(
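
For reference, the simplest form of the filtering suggested in the quote could look like the sketch below. It does not use HTTrack's callback interface; strip_fragments is a hypothetical helper that only shows the idea of deleting user-pasted snippets verbatim from each saved page.

```python
def strip_fragments(html, fragments):
    """Remove each user-supplied fragment verbatim from the page."""
    for fragment in fragments:
        if fragment:  # skip empty entries
            html = html.replace(fragment, "")
    return html
```

Exact matching like this only works when the injected block is byte-for-byte identical on every page; any per-page variation (a counter, a session id) defeats it, which is part of why such post-processing gets complex quickly.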

> also, internet explorer can save homepages in a 'web 
> archive' mht format...  this format saves an ind. page and 
> all the images on that page into one file that is viewable 
> by ie (maybe by other browsers, too...  i dunno).  if this 
> is an open format, perhaps it could be incorporated into 
> HTTrack as a feature that can be enabled, but that is 
> disabled by default?

This is something like an open format: mht archives are in 
fact.. eml files, which are in fact.. MIME messages! These 
files are compatible with all environments, including Unix 
(this is the standard e-mail message format)

I could add this feature in the future ; but I don't know 
how many users would use it
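
Since an mht archive is a MIME message, it can be sketched with any mail library. The example below uses Python's standard email package to bundle one page and its images into a multipart/related message; build_mht and its parameters are made up for this illustration, and whether a given browser opens the result has not been verified here.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mht(page_url, html, images):
    """Bundle a page and its images into one MIME message.

    images maps an absolute image URL to (raw_bytes, subtype),
    for example {"http://example.com/a.gif": (data, "gif")}.
    """
    msg = MIMEMultipart("related", type="text/html")
    msg["Subject"] = page_url

    root = MIMEText(html, "html", "utf-8")
    root["Content-Location"] = page_url
    msg.attach(root)

    for url, (data, subtype) in images.items():
        part = MIMEImage(data, _subtype=subtype)  # base64 by default
        part["Content-Location"] = url
        msg.attach(part)
    return msg.as_bytes()
```

Each part carries a Content-Location header with its original URL, which is how references inside the HTML part can be matched to the embedded images.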

> finally, usenet messages bundle images / attachments within 
> them with uuencode...  would it be possible to do this with 
> html pages, as well?  i would think it would be easy enough 
> to try, but...  i never have, heh, and am too lazy to :)

"mht" files would perfectly fit this usage. I also suspect 
that a complete website could be embedded in a single mht 
file, including links inside (you could "browse" the site 
from a single file).

It would produce a huge file, however (all data plus more 
than 33% overhead due to base64 encoding).
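
The overhead figure follows from how base64 works: every 3 input bytes become 4 output characters, so the payload grows by at least one third, and the line breaks required in MIME bodies add a little more. A small sketch to measure it (base64_overhead is a name chosen for this example):

```python
import base64


def base64_overhead(raw_size):
    """Return the relative size increase after base64 encoding.

    Encodes raw_size zero bytes with encodebytes, which inserts a
    newline after every 76 output characters.
    """
    encoded = base64.encodebytes(bytes(raw_size))
    return len(encoded) / raw_size - 1
```

For inputs of a few kilobytes or more, this comes out at roughly 35% once the line breaks are counted.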