> Is httrack a suitable program for this?
Yes and no.
> download all new post made on a phpbb forum
The first run creates the initial mirror.
Any subsequent update run then fetches all new posts.
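As a rough sketch, the two invocations could look like this (the URL and the project path are placeholders, not your actual setup):

  # first run: create the initial mirror
  httrack "https://example.com/forum/" -O /home/user/mirrors/myforum

  # later runs, started from the project folder: fetch changes only
  cd /home/user/mirrors/myforum && httrack --update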
> program needs to be running 24/7
Schedule HTTrack with the scheduler of your choice, for example cron (see also
the hint in the manual).
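For example, a crontab entry along these lines (paths invented) would run the update every night at 03:00:

  # m h dom mon dow  command
  0 3 * * * cd /home/user/mirrors/myforum && httrack --update >> /home/user/mirrors/update.log 2>&1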
> and keep posts that are deleted
Here we come to the difficult part.
Neither HTTrack nor any other website-copying tool is able to track or
tell apart deleted posts.
If a phpBB post gets deleted but the actual URI stays the same (for example
when the URI carries nothing but pagination parameters), then HTTrack would
indeed fetch the modified phpBB page, so the mirror matches the current state
of the board.
However, this can (and usually will) result in an update of an existing,
previously fetched web page. HTTrack cannot know that a deleted post is
inside, and therefore it will not keep the previous copy as well.
A work-around would be a script that takes a snapshot copy of each HTTrack
run, for example by zipping up the whole website folder with a reasonable
naming convention. Keep always using the latest/current artifacts downloaded
by HTTrack for the subsequent update runs.
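A minimal sketch of such a snapshot script, assuming the mirror lives under /home/user/mirrors/myforum and that zip is installed (all paths and names are made up):

  #!/bin/sh
  MIRROR=/home/user/mirrors/myforum      # folder HTTrack updates in place
  SNAPS=/home/user/mirrors/snapshots     # where the dated zips accumulate
  STAMP=$(date +%Y%m%d-%H%M%S)           # dynamic timestamp for the name

  cd "$MIRROR" && httrack --update       # refresh the mirror first
  mkdir -p "$SNAPS"
  zip -r -q "$SNAPS/forum-$STAMP.zip" "$MIRROR"   # freeze the current state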
By doing so, you would technically keep all posts since you started capturing
the phpBB board with this setup, but at the cost of ever-growing snapshot zips
and some discomfort when looking for posts, as you would need to scan through
all snapshots.
With the snapshots, however, you could run a custom program that figures out
the diffs, consolidates the content file by file into a single destination,
and purges any processed snapshots to save space and clean up a bit. You may
want to think about a naming convention for such identified files; they would
have to be viewed "manually" in the browser, as they would not be linked up
with the current mirror.
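To illustrate the diff idea only (snapshot folder names are invented; assume two snapshots were already unzipped): compare the file lists of an old and a new snapshot and keep every page that exists only in the old one, i.e. pages that vanished between the runs.

  #!/bin/sh
  OLD=snap-old; NEW=snap-new; KEEP=consolidated

  ( cd "$OLD" && find . -type f | sort ) > old.lst
  ( cd "$NEW" && find . -type f | sort ) > new.lst

  # files present only in the old snapshot = pages that disappeared
  comm -23 old.lst new.lst | while read -r f; do
      mkdir -p "$KEEP/$(dirname "$f")"
      cp "$OLD/$f" "$KEEP/$f"
  done

Note that this only catches whole pages that disappeared; posts deleted inside a page that still exists would additionally need a content diff of the changed files.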
A totally different, but workable, solution would be to write a custom
scraping utility that fetches the phpBB content by knowing and dissecting its
DOM structure(s) and request/response calls, and in the optimal case even
stores the posts in a local database. That could actually be a fun little
project.
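Just to show the direction, a deliberately crude shell sketch; everything in it is an assumption (the board URL, the topic id, and that your phpBB theme marks each post with an element id of the form "pNNNN"), so treat it as a starting point, not a working tool:

  #!/bin/sh
  BOARD=https://example.com/forum        # made-up board URL
  DB=posts.db

  sqlite3 "$DB" "CREATE TABLE IF NOT EXISTS posts (id INTEGER PRIMARY KEY, seen TEXT);"

  # pull one topic page and record every post id found on it
  curl -s "$BOARD/viewtopic.php?t=42" |
  grep -o 'id="p[0-9]*"' | tr -dc '0-9\n' |
  while read -r id; do
      sqlite3 "$DB" "INSERT OR IGNORE INTO posts (id, seen) VALUES ($id, datetime('now'));"
  done

A real utility would of course also store the post bodies and walk all topics and pages; a language with a proper HTML parser would make that part much more pleasant.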
For the HTTrack configuration, please check the manual.
For the cron configuration to schedule the update process, please check the
internet, which has plenty of resources.
For the "snapshot zip up" proposal, search the internet for keywords such as:
shell programming, creating zip files from a folder with dynamic timestamp
naming, moving/renaming files, and the like.