> > > Is there functionality in httrack not to save the
> > > html files if they were previously saved?
> To be more precise, you have to revalidate the html
> file remotely (that is, ask the server whether the
> file has changed or not) in all cases for the update
> process.
>
> (The html file is rewritten on disk anyway, because
> link computation may change the link layout - for
> example, foo.html and Foo.html might be different
> files, and might be ordered differently, leading to
> different foo.html and foo2.html files)
Thank you for the thorough description :)
I must have frustrated the most prolific answer giver here quite a lot.
The problem is that, unless you already understand the issue well, the
succinct answers are not clear.
> The real problem is that MANY servers DO NOT CARE to
> fulfill update requests: they always return a "this
> page has been modified" status when httrack asks
> "has this page changed since last time ?" - and you
> end up retransmitting all data. That's unfortunate,
> but I can not do anything against lazy webmasters
> :(
Yes, I can see that this is a problem.
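To check my own understanding of the revalidation described above, here is a minimal Python sketch of a conditional request. It is not httrack's code; the function names are my own, and it assumes the standard HTTP behaviour where a server answers 304 when the page is unchanged.

```python
from email.utils import formatdate


def conditional_headers(last_fetch_epoch, etag=None):
    """Build the headers that ask "has this page changed since last time?".

    last_fetch_epoch: time of the previous download, in seconds since the epoch.
    etag: the ETag value from the previous response, if the server sent one.
    """
    headers = {"If-Modified-Since": formatdate(last_fetch_epoch, usegmt=True)}
    if etag:
        headers["If-None-Match"] = etag
    return headers


def needs_download(status_code):
    """304 Not Modified means the local copy is still current.

    A server that ignores the conditional headers answers 200 with the
    full body every time, so everything is retransmitted.
    """
    return status_code != 304
```

A server that always replies 200 makes needs_download() true for every page, which is exactly the retransmission problem described in the quote.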
My problem is not the server, but the fact that I don't need to maintain a mirror,
for which httrack is really nicely set up BTW.
I want a thorough mirror of a news site to start with, which is easy now after a few
oopses with filters and options.
Then I want to fetch the major new content efficiently, before it
becomes old news and gets thrown off the server.
Unfortunately the news pages have a habit of having a panel of links to
"latest/most read/now trending" pages which do change regularly.
So they are a legitimate target for an update.
Something that I could try, if possible, is to accept only links _from_ one
specific type of page: index pages.
Well, httrack calls them index.html; I am not sure what they are on the server.
They are what you see when the URL ends with "/", for example:
<http://www.bbc.co.uk/news/scotland/>.
These reside on several levels together with a lot of pages.
Is there a way of doing this? I know now, to some extent, how to filter the
target pages in the links, but what about the source of said links?
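To make clear what I mean by filtering on the source, here is a rough Python sketch of the idea. It is not an httrack feature as far as I know. The function names are my own, and the article URLs in the usage example are made up for illustration.

```python
from urllib.parse import urlsplit


def is_index_page(url):
    """Treat a URL as an index page when its path ends with "/"."""
    path = urlsplit(url).path or "/"  # a bare host counts as "/"
    return path.endswith("/")


def links_from_index_pages(links):
    """Keep only the targets of links whose source is an index page.

    links: iterable of (source_url, target_url) pairs.
    """
    return [target for source, target in links if is_index_page(source)]


if __name__ == "__main__":
    # Hypothetical link pairs; only the first source is an index page.
    found = [
        ("http://www.bbc.co.uk/news/scotland/",
         "http://www.bbc.co.uk/news/scotland/story-1"),
        ("http://www.bbc.co.uk/news/scotland/story-1",
         "http://www.bbc.co.uk/news/scotland/story-2"),
    ]
    for url in links_from_index_pages(found):
        print(url)
```

With this, the "latest/most read/now trending" panels on ordinary article pages would be ignored, and only links listed on the "/" pages would be followed.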
Cheers