Re: clarification on the update functionality needed

Subject: Re: clarification on the update functionality needed

Author: Krys

Date: 04/17/2013 23:13

> > > Is there functionality in httrack not to save the
> > > html files if they were previously saved?
> 
> To be more precise, you have to revalidate the html
> file remotely (that is, ask the server whether the
> file has changed or not) in all cases for the update
> process.
> 
> (The html file is rewritten on disk anyway, because
> link computation may change the link layout - for
> example, foo.html and Foo.html might be different
> files, and might be ordered differently, leading to
> different foo.html and foo2.html files)

Thank you for the thorough description :)

I must have frustrated the most prolific answer giver here quite a lot.
The problem is that unless you already understand the issue well, the
succinct answers are not clear.
 
> The real problem is that MANY servers DO NOT CARE to
> fulfill update requests: they always return a "this
> page has been modified" status when httrack asks
> "has this page changed since last time ?" - and you
> end up retransmitting all data. That's unfortunate,
> but I can not do anything against lazy webmasters
> :(

Yes, I can see that this is a problem.
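
For anyone else reading, here is a rough sketch of the revalidation exchange described in the quote, as I understand it. It assumes the standard HTTP conditional request (an If-Modified-Since header, optionally If-None-Match) and the 304/200 status codes; the function names are made up for illustration, and this is not httrack's actual code.

```python
from email.utils import formatdate


def conditional_headers(last_fetch_epoch, etag=None):
    """Build the headers a client sends to ask "has this page changed?".

    last_fetch_epoch: time of the previous download, in seconds since the epoch.
    etag: optional validator saved from the previous response.
    (Both parameter names are illustrative assumptions.)
    """
    headers = {"If-Modified-Since": formatdate(last_fetch_epoch, usegmt=True)}
    if etag is not None:
        headers["If-None-Match"] = etag
    return headers


def needs_download(status_code):
    """Interpret the server's answer to a conditional request."""
    if status_code == 304:
        # Not Modified: the local copy is still good, no body is sent.
        return False
    if status_code == 200:
        # The server sends the whole page again, changed or not.
        return True
    raise ValueError(f"status {status_code} is not handled in this sketch")
```

A server that ignores the conditional headers always answers 200, which is why everything gets retransmitted.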

My problem is not the server, but the fact that I don't need to maintain a mirror,
for which httrack is really nicely set up, BTW.
I want a thorough mirror of a news site to start with, which is easy now after a few
slip-ups with filters and options.
Then I want to fetch the major new content efficiently, before it
becomes old news and gets thrown off the server.

Unfortunately the news pages tend to carry a panel of links to
"latest/most read/now trending" pages, and that panel changes regularly.
So the pages really have changed, and they are a legitimate target for an update.

Something that I could try, if possible, is to accept only links _from_ one
specific type of page: index pages.

Well, httrack calls them index.html; I am not sure what they are called on the server.
They are what you see when the URL ends with "/", for example:
<http://www.bbc.co.uk/news/scotland/>.
They sit at several levels of the site, alongside a lot of ordinary pages.

Is there a way of doing this? I know now, to some extent, how to filter the
target pages of the links, but what about the source of those links?
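
To make the rule concrete, here is a minimal sketch of what I mean, written outside httrack. It is not an httrack option (I don't know whether one exists); the function names and the article URL are made up for illustration, and it assumes an "index page" is simply any URL whose path ends with "/".

```python
from urllib.parse import urlsplit


def is_index_page(url):
    """Assumed rule: a page is an index page if its URL path ends with "/"."""
    path = urlsplit(url).path
    # An empty path (bare host name) is treated as the site root.
    return path == "" or path.endswith("/")


def links_to_follow(links):
    """Keep only the targets of links that were found on index pages.

    links: iterable of (source_url, target_url) pairs.
    """
    return [target for source, target in links if is_index_page(source)]


# Hypothetical link pairs; only the first source is an index page.
found = [
    ("http://www.bbc.co.uk/news/scotland/", "http://www.bbc.co.uk/news/example-article"),
    ("http://www.bbc.co.uk/news/example-article", "http://www.bbc.co.uk/news/example-trending"),
]
new_targets = links_to_follow(found)
```

Here the link in the "trending" panel of the article page would be ignored, because its source is not an index page.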
Cheers
