Preprocessing for ZCatalog Indexing

Motivation

DTMLMethod's PrincipiaSearchSource returns the raw, unprocessed content. If this is used for document indexing, documents may be indexed under irrelevant terms, such as HTML tags. Conversely, essential index terms may be missing or wrong, e.g. because HTML entities are used or because document components (included e.g. via DTML tags) are left out. This is especially serious for internationalized text, where almost all content is generated dynamically based on the effective language.

The following module can be used to preprocess DTML methods and documents before indexing. It renders the object and then applies HTML filtering to the result. This filtering strips HTML/SGML tags and translates HTML 2.0 entities. The result can then be fed to ZCatalog's indexing machinery.
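To illustrate the filtering step in isolation, here is a minimal sketch. It is not the module itself: it assumes Python 3 and uses the standard library's html.parser, because sgmllib and htmlentitydefs exist only in Python 2. The names TagStripper and strip_tags are hypothetical and chosen for this example only.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects text content only; tags are dropped (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True lets the parser decode entities such as
        # &auml; or &amp; before the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)


def strip_tags(html):
    """Return the text of *html* with tags removed and entities decoded."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    print(strip_tags("<p>K&auml;se &amp; <b>Brot</b></p>"))
```

With this input, only the words and the decoded entities remain as index terms; the tag names p and b never reach the index.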

Module

#	$Id: CatalogSupport.html,v 1.1.1.1 2002/02/23 13:40:19 dieter Exp $
'''Catalog Support Routines.'''

from sgmllib import SGMLParser
from string import join

class _StripTagParser(SGMLParser):
  '''SGML Parser removing any tags and translating HTML entities.'''

  from htmlentitydefs import entitydefs

  data= None

  def handle_data(self,data):
    if self.data is None: self.data=[]
    self.data.append(data)

  def __str__(self):
    if self.data is None: return ''
    return join(self.data,'')


def filterRenderedHTML(self):
  '''renders *self* and filters HTML.

  can be used as method for DTML Methods/Documents indexing.
  '''

  # we pass "render_for_catalog__", such that a catalog aware object
  #  may take special actions, e.g. not create sessions
  # rendering may raise exceptions; in this case, this document
  #  does not provide information for this indexing category.
  try: render= self(self,self.REQUEST, render_for_catalog__=1)
  except: return ''

  # filter
  try:
    p= _StripTagParser()
    p.feed(render); p.close()
    return str(p)
  except: return ''

You can download this module, too.

Installation and Use

To use it automatically, perform the following steps:

  1. make filterRenderedHTML an external method that is accessible from your catalogued objects,
  2. add the name of this external method as a text index to your catalog, either in addition to PrincipiaSearchSource or as a replacement for it, and set it up in the same way,
  3. use the DTML variable render_for_catalog__ inside a DTML Method/Document if it requires special treatment during index preprocessing.
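The role of render_for_catalog__ in step 3 can be sketched outside Zope as follows. This is an illustration under stated assumptions, not Zope code: FakeDocument is a hypothetical stand-in for a DTML Method/Document, and render_for_index mirrors only the rendering call and the error fallback of filterRenderedHTML.

```python
class FakeDocument:
    """Hypothetical stand-in for a DTML Method/Document; not a Zope class."""

    REQUEST = None  # placeholder for the Zope request object

    def __init__(self, body, needs_session=False):
        self.body = body
        self.needs_session = needs_session
        self.sessions_created = 0

    def __call__(self, client=None, REQUEST=None, **kw):
        # A catalog-aware object checks the flag and skips side effects
        # (here: creating a session) while it is rendered for indexing.
        if self.needs_session and not kw.get("render_for_catalog__"):
            self.sessions_created += 1
        return self.body


def render_for_index(obj):
    """Render *obj* as the indexing step would; return '' if rendering fails."""
    try:
        return obj(obj, obj.REQUEST, render_for_catalog__=1)
    except Exception:
        # A document that cannot be rendered contributes no index terms.
        return ""


if __name__ == "__main__":
    doc = FakeDocument("<p>Hello</p>", needs_session=True)
    print(render_for_index(doc))
    print(doc.sessions_created)
```

Inside a real DTML Method/Document, the same decision would be made by testing the render_for_catalog__ variable in the DTML source.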

You may need a patch for "ZCatalog.ZopeFindAndApply" (or wait for Zope 2.2.1), because older ZopeFindAndApply versions strip the acquisition context. With those versions, the objects cannot find your external method and the index remains empty.

Caveats

The module is in an alpha state, so expect some problems. The following problems are foreseen:


Dieter Maurer
Last modified: Sun Jul 30 14:57:48 CEST 2000