Preprocessing for ZCatalog Indexing


DTMLMethod's PrincipiaSearchSource returns raw unprocessed content. If this is used for document indexing, documents may be indexed for irrelevant terms, such as HTML tags. On the other hand, essential index terms may be missing or wrong, e.g. due to the use of HTML entities or because document components (included e.g. via DTML tags) are not included. This is especially serious for internationalized text, where almost all content is generated dynamically based on the effective language.

The following module can be used to preprocess DTML methods and documents before indexing. It renders the object and then applies HTML filtering to the result. This filtering strips HTML/SGML tags and translates HTML 2.0 entities. The result can then be fed to ZCatalog's indexing machinery.


#	$Id: CatalogSupport.html,v 2002/02/23 13:40:19 dieter Exp $
'''Catalog Support Routines.'''

from sgmllib import SGMLParser
from string import join

class _StripTagParser(SGMLParser):
  '''SGML Parser removing any tags and translating HTML entities.'''

  from htmlentitydefs import entitydefs

  data= None

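  def handle_data(self,data):
    # collect the text chunks; tags never reach this method
    if self.data is None: self.data=[]
    self.data.append(data)

  def __str__(self):
    if self.data is None: return ''
    return join(self.data,'')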

def filterRenderedHTML(self):
  '''renders *self* and filters HTML.

  can be used as method for DTML Methods/Documents indexing.
  '''
  # we pass "render_for_catalog__", such that a catalog aware object
  #  may take special actions, e.g. not create sessions
  # rendering may raise exceptions; in this case, this document
  #  does not provide information for this indexing category.
  try: render= self(self,self.REQUEST, render_for_catalog__=1)
  except: return ''

  # filter
  try:
    p= _StripTagParser()
    p.feed(render); p.close()
    return str(p)
  except: return ''

You can download this module, too.
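
For readers without sgmllib (it is gone from current Python versions), the same filtering idea can be sketched with html.parser from the Python 3 standard library. This is not the module above, only a rough equivalent under that assumption; the names StripTagParser and strip_html are made up for this illustration.

```python
from html.parser import HTMLParser


class StripTagParser(HTMLParser):
    """Keep text content only; tags, comments and declarations are dropped."""

    def __init__(self):
        # convert_charrefs=True lets the parser decode entities such as
        # &amp; or &lt; before the text reaches handle_data
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # called for text between tags only
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)


def strip_html(rendered):
    """Return *rendered* with tags removed and entities decoded."""
    parser = StripTagParser()
    parser.feed(rendered)
    parser.close()
    return parser.text()


print(strip_html("<p>Fish &amp; <b>Chips</b> &lt; 5</p>"))
# prints: Fish & Chips < 5
```

The tag names never show up as index terms, and the entities arrive in the index as the characters they stand for.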

Installation and Use

To use it automatically, do the following steps:

  1. make filterRenderedHTML an external method accessible from your catalogued objects,
  2. add the name of this external method as a text index to your catalog, either in addition to PrincipiaSearchSource or as a replacement for it, and in analogy to it,
  3. use the DTML variable render_for_catalog__ inside a DTML Method/Document, if it requires special treatment during index preprocessing.
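
The calling convention behind step 3 can be tried out without Zope. The sketch below is plain Python 3 and only imitates the behaviour: FakeDocument, BrokenDocument and render_for_index are hypothetical stand-ins invented for this illustration, not Zope classes. It shows a document that skips session-related output when render_for_catalog__ is passed, and a caller that treats a failed rendering as "no index information".

```python
class FakeDocument:
    """Hypothetical stand-in for a DTML Document; not a Zope class."""

    REQUEST = {}  # placeholder for the Zope request object

    def __call__(self, client=None, REQUEST=None, **kw):
        parts = ["<h1>Price list</h1>"]
        if not kw.get("render_for_catalog__"):
            # normal page views only; skipped while indexing
            parts.append("<p>Your session has started.</p>")
        return "".join(parts)


class BrokenDocument(FakeDocument):
    """Stand-in for a document whose rendering fails."""

    def __call__(self, client=None, REQUEST=None, **kw):
        raise RuntimeError("rendering failed")


def render_for_index(doc):
    # same calling convention as in filterRenderedHTML: pass the flag,
    # and return an empty string if rendering raises
    try:
        return doc(doc, doc.REQUEST, render_for_catalog__=1)
    except Exception:
        return ""


print(render_for_index(FakeDocument()))
# prints: <h1>Price list</h1>
print(repr(render_for_index(BrokenDocument())))
# prints: ''
```

In a real DTML Method/Document the flag would be tested with DTML rather than Python; the stand-in only shows what the flag is meant to achieve.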

You may need a patch for "ZCatalog.ZopeFindAndApply" (or wait for Zope 2.2.1), because older ZopeFindAndApply versions strip the acquisition context. Without that context, the objects cannot find your external method, and the index remains empty.


The module is in an alpha state, so expect some problems. The following problems are foreseen:

Dieter Maurer
Last modified: Sun Jul 30 14:57:48 CEST 2000