> My question is this - has anyone ever thought about what it
> would take to have httrack save text based documents to a
> database, rather than as files on a hard drive? Does anyone
> have a methodology about how we might go about this or has
> anyone done it?
Well, the simplest way would be to mirror the sites as
usual and then copy the files into a database in a
separate step :)) Another option would be to use the
postprocess-html callback; in this callback you could issue
an SQL INSERT or UPDATE statement. But the postprocess
callback is not called for images, for example, so you must
traverse the directories of the mirror anyway if you
really want to store everything in a database.
Alternatively, you can record at least the names of all
saved files via the save-name callback.
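
To make the "separate step" idea concrete, here is a rough sketch in Python that walks a finished mirror and copies every file into a database. It does not touch HTTrack at all. The choice of SQLite, the function name load_mirror and the table name files are just my assumptions for illustration; adapt them to whatever database you actually use.

```python
import sqlite3
from pathlib import Path


def load_mirror(mirror_dir: str, conn: sqlite3.Connection) -> int:
    """Copy every regular file under mirror_dir into the 'files' table.

    Hypothetical helper: the table layout is an assumption, not anything
    HTTrack prescribes. Returns the number of files stored.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " path TEXT PRIMARY KEY,"  # path relative to the mirror root
        " data BLOB NOT NULL)"     # raw bytes, so images work as well as HTML
    )
    root = Path(mirror_dir)
    count = 0
    for p in sorted(root.rglob("*")):
        if not p.is_file():
            continue  # skip directories
        rel = p.relative_to(root).as_posix()
        # INSERT OR REPLACE refreshes the row when the mirror is updated
        # and this step is run again.
        conn.execute(
            "INSERT OR REPLACE INTO files (path, data) VALUES (?, ?)",
            (rel, p.read_bytes()),
        )
        count += 1
    conn.commit()
    return count
```

You would call it with something like load_mirror("/path/to/mirror", sqlite3.connect("mirror.db")) once the mirror run has finished. Reading whole files into memory is fine for ordinary pages but may need rethinking for very large downloads.
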
The main question is what you want to do with the database.
If you want to build a search engine which returns the
original URLs, again the save-name callback is your friend.
It allows you to record how the original URLs are mapped
to file system paths. In the simplest case, you can use
a table with three columns: document data, file path,
original URL. Use the file path column as the key when
updating the database, and return the original URLs in
search results.
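
The three-column table could look roughly like this in Python with SQLite. This is only a sketch under my own assumptions: the table name documents, the helper names open_index, store and search, and the naive substring search are all made up for illustration, and how you obtain the path-to-URL mapping from the save-name callback is left out entirely.

```python
import sqlite3


def open_index(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the three-column table if needed."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        " file_path TEXT PRIMARY KEY,"  # key used for updates
        " url TEXT NOT NULL,"           # original URL, returned by searches
        " data TEXT NOT NULL)"          # document text
    )
    return conn


def store(conn: sqlite3.Connection, file_path: str, url: str, data: str) -> None:
    """Insert a document, or overwrite the row that has the same file path."""
    conn.execute(
        "INSERT OR REPLACE INTO documents (file_path, url, data)"
        " VALUES (?, ?, ?)",
        (file_path, url, data),
    )
    conn.commit()


def search(conn: sqlite3.Connection, term: str) -> list:
    """Return the original URLs of documents containing term.

    Plain case-sensitive substring match; a real search engine would
    want a proper full-text index instead.
    """
    rows = conn.execute(
        "SELECT url FROM documents WHERE instr(data, ?) > 0 ORDER BY url",
        (term,),
    )
    return [row[0] for row in rows]
```

Because file_path is the primary key, storing the same path again after a re-mirror replaces the old text instead of adding a duplicate row, while search only ever hands back the original URLs.
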