HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Not change file if it's to be updated w. callb
Author: Juan Fco Rodriguez
Date: 05/30/2005 09:54
 
> > I would like to detect the situation where a file
> > needs to be updated. If that is the case, I would like
> > to not overwrite that file, instead I would rename the file
> > using a suffix like "<original_name>_newFile".
> 
> Humm, this is not an easy task for html data, because httrack always modifies
> them (as internal links might have to be updated to apply modified options, or
> structural changes)

Hello Xavier,

I don't quite catch what you are telling me. Are you saying
that httrack rewrites each and every html file because
it doesn't know in advance whether an internal link will
have changed? I thought the behaviour was simpler. I mean,
if httrack receives "HTTP/1.1 304 Not Modified", then
it shouldn't rewrite the previously downloaded file, because
it has not changed. Otherwise, it would be possible to
detect that the file was already there, and save it under
a different name... all of this only if the running options
have not changed and httrack is called with an adequate
command line option. Am I wrong?
I'm considering the option of renaming the project when
I want to make an update, and using diff -r to detect
changes. This solution implies downloading everything from
scratch to be able to make comparisons. I'm not very happy
with this.

I think that you understand the importance of a feature
like this one. If you are able to keep both the original
file and the new file, then you can process the words of
both of them to, for example, keep a searchable index up
to date. Of course, after post-processing I would make the
new files overwrite the old ones in order to start the
next iteration in the same "situation", i.e., the
original site structure as it was seen at a specific date.
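
The post-processing step described above (letting the new copy replace the old one once both have been indexed) could look roughly like the following. This is only a minimal sketch: the "_newFile" suffix comes from the original question, while original_name() and promote_new_file() are hypothetical helper names and not part of httrack.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NEW_SUFFIX "_newFile" /* suffix from the original question (assumption) */

/* Copy `path` without its trailing NEW_SUFFIX into `out`.
   Returns 1 on success, 0 if the suffix is absent or `out` is too small. */
int original_name(const char *path, char *out, size_t out_size) {
  size_t n, s;
  assert(path != NULL && out != NULL);
  n = strlen(path);
  s = strlen(NEW_SUFFIX);
  if (n <= s || strcmp(path + n - s, NEW_SUFFIX) != 0)
    return 0;                 /* not one of the renamed files */
  if (n - s + 1 > out_size)
    return 0;                 /* no room for the name plus terminator */
  memcpy(out, path, n - s);
  out[n - s] = '\0';
  return 1;
}

/* Let the new copy take the place of the old one.
   Returns 0 on success, -1 (or rename()'s error value) on failure. */
int promote_new_file(const char *new_path) {
  char target[1024];
  if (!original_name(new_path, target, sizeof target))
    return -1;
  /* rename() is not guaranteed to replace an existing file on every
     platform, so the old copy is removed first (this is not atomic). */
  remove(target);
  return rename(new_path, target);
}
```

Calling promote_new_file("index.html_newFile") would then leave a single "index.html" holding the new content, ready for the next update run.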

I've been able to imitate this behaviour with the use
of the "save-name", "transfer-status" and "end" callbacks.

In "save-name", if the filename already exists, I rename it
to "<filename>_XXXX". In "transfer-status", I check
both "back->r.notmodified" and "back->is_update".
Depending on the results of these checks, I add the filename
either to a shared global list of filenames that are updates, or to
another list of filenames that haven't changed (but that I wrongly
renamed in "save-name" previously).
With this information, I think I can create the behaviour
I'm looking for.
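
To make the bookkeeping above concrete, here is a minimal sketch of the classification step. The transfer_flags struct is a hypothetical stand-in for the two fields named in the post ("back->r.notmodified" and "back->is_update"); it is not httrack's real structure, and the list handling is simplified to fixed-size arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for back->r.notmodified and back->is_update. */
typedef struct {
  int notmodified; /* non-zero: server answered "304 Not Modified" */
  int is_update;   /* non-zero: file was already part of the mirror */
} transfer_flags;

typedef enum {
  FILE_ADDED,     /* new file, not seen in the previous mirror   */
  FILE_UPDATED,   /* existed before and its content has changed  */
  FILE_UNCHANGED  /* existed before and was not modified         */
} file_state;

/* Classify one finished transfer from the two flags. */
file_state classify_transfer(const transfer_flags *t) {
  assert(t != NULL);
  if (t->notmodified)
    return FILE_UNCHANGED;
  return t->is_update ? FILE_UPDATED : FILE_ADDED;
}

#define MAX_NAMES 64   /* arbitrary small limits for the sketch */
#define MAX_LEN   256

typedef struct {
  char names[MAX_NAMES][MAX_LEN];
  size_t count;
} name_list;

static name_list updated_files;   /* renamed copies that are real updates */
static name_list unchanged_files; /* renamed by mistake, rename to undo   */

/* Append a copy of `name`; returns 1 on success, 0 if it does not fit. */
static int list_add(name_list *l, const char *name) {
  if (l->count >= MAX_NAMES || strlen(name) >= MAX_LEN)
    return 0;
  strcpy(l->names[l->count++], name);
  return 1;
}

/* Record the saved name in the list matching its state. */
int record_transfer(const transfer_flags *t, const char *save_name) {
  switch (classify_transfer(t)) {
  case FILE_UPDATED:   return list_add(&updated_files, save_name);
  case FILE_UNCHANGED: return list_add(&unchanged_files, save_name);
  default:             return 1; /* new file: nothing to track */
  }
}
```

At the "end" callback, the names in unchanged_files would be the ones whose rename has to be undone, while updated_files holds the pairs worth comparing.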

Would you like to review this implementation? I've got
some weird problems. For example, sometimes I get a file that has been
modified ("!back->r.notmodified" is true) and that is not an update
("!back->is_update" is true). This means that the file is a new file
("state added"). If this were actually a new file, I would not have appended
the suffix "_XXXX" to the name, because "fexist()" would have told me it
didn't exist (in "save-name"). Strangely, if I check for the presence of the
suffix when those conditions hold in "transfer-status", sometimes it says
that the filename does have the suffix!
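
The contradictory case described above can be isolated with a small check like the one below. It is only a sketch to help with debugging: ends_with() and is_contradictory() are hypothetical helpers, and "_XXXX" is the placeholder suffix used in this post.

```c
#include <assert.h>
#include <string.h>

#define RENAME_SUFFIX "_XXXX" /* placeholder suffix used in this post */

/* Return 1 if `name` ends with `suffix`, 0 otherwise. */
int ends_with(const char *name, const char *suffix) {
  size_t n, s;
  assert(name != NULL && suffix != NULL);
  n = strlen(name);
  s = strlen(suffix);
  return n >= s && strcmp(name + n - s, suffix) == 0;
}

/* A transfer looks contradictory when the flags say "new file"
   (modified and not an update) yet the saved name carries the suffix
   that "save-name" only adds to files that already existed. */
int is_contradictory(int notmodified, int is_update, const char *save_name) {
  return !notmodified && !is_update && ends_with(save_name, RENAME_SUFFIX);
}
```

Logging the saved name and both flags whenever is_contradictory() returns 1 might help narrow down which files trigger the problem.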

Thanks in advance.


>
> You can't use the "receive-header" callback either, as the filename is not
> always known at this stage.
>
> I'll probably have to create a "store-link" callback or something similar,
> called each time a link location is being created (added on the TODO list)
>
> In the meantime, a copy of the project, and a recursive diff after the
> mirror, might fit your needs?
>
> > PS: I'm a Spaniard... what does "lien_back" mean? :P
>
> Humm, some bad frenglish name :p

 