This document is a how-to for delivering full-text, in-database search of Word, Excel, PowerPoint, and PDF documents using OpenACS, PostgreSQL, tsearch2, and a collection of command-line utilities that convert binary formats to text or HTML. These techniques allow search to return results both from OpenACS applications, such as forum posts and blogs, and from full-text indexing of files in file-storage. OpenACS search is integrated with the OpenACS permissions model, so search results are only returned for documents the searcher can read.

Full-text, in-database search will be included in OpenACS 5.3, scheduled for fall 2006. It will work with PostgreSQL 7.4 and up. (There is also support for full-text document indexing with Oracle under OpenACS; that will be addressed in a future post.)

If you would like to use this feature now, the Search package from OpenACS CVS (HEAD) is required. To check out this package from anonymous CVS, use the following command:
cvs -d:pserver:anonymous@cvs.openacs.org:/cvsroot co openacs-4/packages/search
Once you have the new search package installed, you will need the following utilities.

Any other document format can be supported by installing a filter or utility that converts the document to text or HTML. If you install the utilities in /usr/local/bin, they should work as soon as you index your documents. If the utilities are installed somewhere else, you will need to edit packages/search/tcl/search-convert-procs.tcl to point to the location of the executable file for each utility.

The final step is to reindex all your files. If you have documents in file-storage, a query similar to this one can be used to queue the files for indexing:
insert into search_observer_queue
(select live_revision, now(), 'UPDATE'
 from cr_items ci,
      cr_revisions cr
 where ci.live_revision = cr.revision_id
 and   ci.content_type = 'file_storage_object'
 and   ci.name like '%.doc');
You can repeat that query, changing the like '%.doc' criterion to like '%.xls' and so on, for each file type you want to index. Note that pdftotext will not extract text from a PDF document that does not allow copy/paste of its text. In this case, only the text of the filename will be indexed.
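As a rough sketch of repeating the query for several file types, the shell script below prints one copy of the statement per extension. The function name queue_sql and the extension list (doc, xls, ppt, pdf) are illustrative assumptions, not part of the search package; keep only the extensions for which you have a converter installed.

```shell
#!/bin/sh
# Sketch only: prints the queueing statement from this post once per
# file extension. queue_sql is a hypothetical helper name.
queue_sql() {
    ext=$1
    # Same query as above; only the LIKE pattern changes.
    cat <<EOF
insert into search_observer_queue
(select live_revision, now(), 'UPDATE'
 from cr_items ci,
      cr_revisions cr
 where ci.live_revision = cr.revision_id
 and   ci.content_type = 'file_storage_object'
 and   ci.name like '%.${ext}');
EOF
}

# Assumed extension list; adjust it to the formats you want indexed.
for ext in doc xls ppt pdf; do
    queue_sql "$ext"
done
```

The script only writes SQL to standard output, so you can review it before running it. The output could then be piped to psql, for example sh queue-files.sh | psql yourdb, where both the script name and the database name are placeholders.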

This sounds very cool. I intend to try it out for PDFs on one of my sites soon!

by Joe Oldak on 10/05/06
