HTTrack Website Copier
Free software offline browser - FORUM
Subject: 1) Indexing PDFs and 2) A novel use for HTTracker
Author: EricW
Date: 12/14/2006 15:25
 
1) I would like to include the contents of PDF files in the index file
generated by HTTracker. I am considering doing this by using an open source
PDF to HTML converter (XPDF at www.foolabs.com/xpdf) and modifying HTTracker
to exhibit the following behavior: When it downloads PDFs, call XPDF to make a
temporary HTML, and then include the words in that temporary HTML file in the
index file, but with the associated URL of the PDF, not of the temporary HTML
file. Does anyone have any thoughts as to the easiest way of accomplishing
this? Would playing with the MIME settings in HTTracker help? I am a moderate
level C programmer, and I am willing to contribute my solution back to the
HTTracker community in the spirit of open source.
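
To make the idea in 1) concrete, here is a rough sketch in Python (not a patch to HTTracker itself, which is written in C). It assumes the Xpdf `pdftotext` command-line tool is installed and on the PATH, and it extracts plain text rather than HTML, which is enough for indexing. The function names and the shape of the index entries are made up for illustration; they are not HTTracker's actual index format.

```python
import re
import subprocess
from collections import Counter


def pdf_to_text(pdf_path):
    """Extract text from a PDF with Xpdf's pdftotext.

    Assumption: pdftotext is installed; "-" as the output file
    sends the extracted text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"], capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def index_entries(text, url):
    """Count each word in text and file it under the given URL.

    The URL passed in is the original PDF's URL, so no temporary
    file name ever reaches the index.
    """
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return {word: {url: n} for word, n in counts.items()}
```

A caller would do something like `index_entries(pdf_to_text(local_copy), original_pdf_url)` after each PDF download, where `local_copy` and `original_pdf_url` are whatever the mirroring step already knows about the file.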

2) FYI, I am intending to incorporate HTTracker in a project for school (I
attend Hood College in Frederick, MD). Below is a description of the project. I
intend to share the results with the HTTracker community. I will likely write
the database management portion in Visual Basic and not in C, but you may
find this application interesting anyway. Any comments/suggestions are
encouraged.

There are numerous instances where information you want is on a small local
government web site – but there is either no search engine or it does not
perform well. An example is the Washington County Government web site in
Washington County, Maryland (www.washco-md.net). In addition, if a concerned
citizen wishes to be aware of meetings and events which may affect them,
County Commissioner meetings for example, they must check these web sites
every week or so. Some web sites allow citizens to subscribe to email lists which
deliver meeting agendas. However, there are several factors making this
impractical for busy citizens. First, there are many committees one must track
to make sure “all the bases” are covered. In Washington County, there are
County Commissioner meetings, Planning Commission meetings, Zoning Appeals
meetings, and others. Not only must meeting agendas be monitored, but the
decisions which result from these meetings are often reached a week or more
after the meeting and are posted on the web site separately. Also, the citizen
must go through a lot of information that is not of interest to find the
occasional pertinent fact. One may argue that the local government should
provide better, more targeted, email services to citizens, but they simply do
not.

The Citizen Participation Enhancement Tool (CPET) Project is a response to
this problem. CPET is a software program I intend to create and make available
as open source software. It will work as follows: Users can indicate the
governmental agency they are interested in. I will initially offer only the
Washington County government web site. Next, the user may enter a few key
words, perhaps their street or town name. Once a week, NEW occurrences of
these key words are searched for and, if any are found, their associated URLs
are emailed to the user. Of course, users may unsubscribe from the service or
change their key words at any time.
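
The weekly "NEW occurrences" step could look roughly like the sketch below. It is only an illustration in Python: the index is assumed to be a simple mapping from a word to the set of URLs containing it, and `seen` is a hypothetical record of (key word, URL) pairs already reported to the user. Neither is HTTracker's real index format.

```python
def new_matches(index, keywords, seen):
    """Return (keyword, url) pairs that have not been reported before.

    index:    dict mapping a lowercase word to a set of URLs (assumed shape)
    keywords: the user's key words, in any letter case
    seen:     set of (keyword, url) pairs reported in earlier runs
    """
    found = set()
    for kw in keywords:
        word = kw.lower()
        for url in index.get(word, ()):
            if (word, url) not in seen:
                found.add((word, url))
    return found
```

After each weekly run, the returned pairs would be added to `seen` so that the same page is not reported twice.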

The system consists of three pieces of software which work together. The first
is WinHTTrack Website Copier. This will handle the indexing (and mirroring if
desired) of the web sites in question. The one shortcoming of this software is
that, while Adobe Acrobat PDFs may be mirrored, the index file indexes HTML
files only, not the PDFs. This software is open source, and I hope to
integrate the indexing function with a second piece of open source software,
XPDF (www.foolabs.com/xpdf), which can convert the textual information in PDF
files into HTML files. The third software component of the system is custom
software that I will write, which will process requests for keyword searches
and generate emails when new matches are found.
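
For the email half of the third component, a minimal sketch using Python's standard `email` package might be as follows. The sender address, the subject line, and the function name are placeholders, not part of any existing tool, and the message is only built here, not sent.

```python
from email.message import EmailMessage


def build_alert(to_addr, matches, from_addr="cpet@example.org"):
    """Build a plain-text alert listing each new (keyword, url) match.

    from_addr is a placeholder; a real deployment would use its own address.
    """
    msg = EmailMessage()
    msg["From"] = from_addr
    msg["To"] = to_addr
    msg["Subject"] = "CPET: new keyword matches"
    lines = [f"{kw}: {url}" for kw, url in sorted(matches)]
    msg.set_content("\n".join(lines))
    return msg
```

The resulting message could then be handed to `smtplib.SMTP(...).send_message(msg)` with whatever mail server is available.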

Thank You.
 