ManagableIndex

1 Introduction

ManagableIndex can mean different things for different people. For a content manager, it brings flexible field, keyword and efficient range indexes managable via the ZMI to

For a developer, ManagableIndex provides a framework for index definition, improving on PluginIndexes. It provides for managability, automatically and intelligently handles unindexing when an object is reindexed and implements and, or and range queries (for not too complex indexes).

2 Concepts

The main tasks of an index are to index objects under a set of index terms derived from the object and efficiently locate all objects indexed under a given term.

Indexing consists of 3 stages: evaluation of the object to obtain the object's value (with respect to the index), deriving index terms from the value and storing the association term/object in a way such that the objects associated with a term can quickly be retrieved.

2.1 ValueProvider

Evaluation is specified by a sequence of ValueProviders associated with the index. A ValueProvider is a device that returns a value for an object. If the return value is not None, then it is interpreted as the object's value with respect to this ValueProvider.

A ValueProvider can have an associated IgnorePredicate, a TALES expression. When the IgnorePredicate yields true for a value, then it is ignored. You may e.g. specify that empty strings should be ignored.

A ValueProvider can have an associated Normalizer, a TALES expression. The Normalizer is applied to not ignored values to transform them in a standard form, e.g. the normalizer for a keyword index can transform a string into a one element tuple containing the string as a keyword index requires a sequence value.

The most essential ValueProviders are AttributeLookups. An AttributeLookup determines the object's value through the lookup of an attribute. The AttributeLookup's configuration determines whether acquisition can be used, whether a callable value should be called and whether exceptions should be ignored.

ExpressionEvaluators are another type of ValueProviders. These are TALES expressions defining an objects value. ExpressionEvaluators often avoid to define simple scripts just for indexing purposes.
Warning: Until Zope 2.7, TALES expressions have been trusted when used outside of Zope; from Zope 2.8 on, TALES expression evaluation is always subject to Zope's security restrictions even when used outside of Zope. This may have strange consequences when you perform index management (e.g. mass reindexing) in an external script (run outside of Zope). In such cases, you should probably let the script login as a Zope user with sufficient priviledges.

2.2 Value Combination

When an index has several ValueProviders, their values must be combined to define the object's value.

None and ignored values are always ignored and are not combined.

Currently, there are three combiners:

useFirst
use the first non-ignored value
This is available for all kinds of indexes.
union
combine all non-ignored values in a union like way.
This is only available for indexes that split the object's value into a set of index terms (such as keyword or text indexes).
aggregate
combine all values into a list.
This is only available for index types that expect aggregate values such as e.g. a RangeIndex.

2.3 Term Standardization

Once the object's value (with respect to the index) has been determined, the value is split into a set of index terms in an index specific way (e.g. a FieldIndex uses the value directly, a KeyWordIndex requires the value to be a sequence and uses its elements as index terms and a TextIndex splits the value into words and uses the words). The terms are then standardized.

Standardization consists of serveral steps: prenormalization such as case normalization, stemming, phonetic normalization, ... elimination of stop terms, normalization such as type conversion, type checking and copying.

2.3.1 Prenormalization

A index can define a term prenormalizer, a TALES expression. The prenormalizer is applied to terms before term expansion in queries and always as the first step of normalization during indexing. It can be used e.g. for case normalization, stemming, phonetic normalization, synonym normalization, ...

2.3.2 Stop Terms

An index can define a StopTermPredicate, a TALES expression. When the predicate yields true for a term, the term is ignored.

2.3.3 Term Normalization

An index can define a NormalizeTerm TALES expression. The expression can be used to transform the term into some standard form, e.g. convert a descriptive string or complex object into a code used for efficient indexing.

2.4 Type Restriction

The BTrees used in index implementation require that any index term must be persistently comparable to any other index term in this index (otherwise, the index gets corrupted). To help observing this essential property, the index term's type can be restricted.

The following type restrictions are defined:

not checked
the type is not restricted at all,
numeric
the type must be a numeric type (int, float or long),
string
the type must be a string,
ustring
the type must be a unicode string; TermTypeExtra can specify an encoding used for conversion.
integer
the type must be an integer,
DateTime
the type must be a DateTime object or float -- the index stores the date as a float (seconds since epoch),
DateTimeInteger
the type must be a DateTime object or integer -- the index stores the date as an integer (seconds since epoch; truncated, if necessary),
DateInteger
the type must be a DateTime object or integer -- the index stores an integer representing the date (400 * year + 31 * (month-1) + (day-1),
tuple with tuplespec in TermTypeExtra
The value must be a tuple of the structure given by tuplespec. Each tuple component is specified by a letter or by a parenthesized tuplespec (for "subtuples"). Recognized letters are n (numeric), s (string), u (unicode string), d (datetime). An encoding for unicode conversion can be specified after tuplespec, separated by ;.
instance with fullClassName in TermTypeExtra
the type must be an instance of the given class and the class must define an __cmp__ method. It is assumed without check, that this __cmp__ implements a persistent comparison.
expression checked with TermTypeExtra specifying checking TALES expression
the expression may try to transform the value into something acceptable. If it yields None or raises an exception, the type is considered inacceptable, otherwise the value is used instead of the original term.

All checkers try to convert a term into an acceptable type. Only when this fails, an exception is raised.

If you choose one of the integer types, i.e. integer, DateTimeInteger or DateInteger, an especially efficient index type is build. You must clear and reindex the index when the index type is changed. You must clear the index even when it is already empty (because the data structures may need to be changed).

2.4.1 Copying

It is dangerous to index mutable types. If a indexed mutable object is later changed, its ordering with respect to other indexed objects may change, corrupting the index. Corruption of this kind can be avoided when the mutable object is copied before it is stored in the index.

TermCopy specifies whether the value should be directly used, shallow copied or deep copied.

3 Index Types

Currently, ManagableIndexes supports 3 types of indexes: FieldIndex, KeywordIndex and RangeIndex. Further types can easily be implemented.

3.1 FieldIndex

A FieldIndex indexes an object under a single term, the object's value with respect to the index.

You get efficient DateTime and Date indexes by selecting DateTime (stored as float), DateTimeInteger (stored as integer) or DateInteger (stored as integer) as TermType. You must clear and reindex when you change the type.

You can very efficiently sort search results via a FieldIndex. The ZCatalog's method searchResults (aka __call__) supports this via its sort_on argument. My AdvancedQuery product supports arbitrary levels of sorting via FieldIndexes.

AdvancedQuery's FieldIndex based ascending sorting is much more efficient than descending sorting. To make descending sorting more efficient, you can tell your FieldIndex to maintain the sort keys (also) in reverse order. This will slow indexing down a bit but make descending sorts much faster. When you change this attribute, you must clear and reindex the index (otherwise, the reverse order is not consistent).

3.2 KeywordIndex

A KeywordIndex expects the value of an object to be a sequence. It indexes the object under each element of this sequence.

3.3 RangeIndex

A RangeIndex expects an object's value to be a pair specifying an interval low to high. The index can efficiently locate documents for which a given term t lies within the document's low and high bounds.

The object's value can either by constructed by a single attribute or with an aggregate combiner.

To provide for a partial plugin replacement for CMF's effective and expires indexes, RangeIndex supports a Boundary names property. If set, it should be a pair of two names low and high. The index will then execute queries of the form low <= val <= high. To be compatible with AdvancedQuery, the index replacing effective and expires should have the name ValidityRange.

RangeIndex efficiently supports improper ranges, i.e. those where at least one boundary is unlimited. You use its Mininal value and Maximal value properties to define which values should be considered as unlimited. These properties are TALES valued and are evaluated when the index is cleared. All values at or below (the value of) Minimal value are identified and interpreted as no lower limit; similarly, all values at or above (the value of) Maximal value are identified and interpreted as no upper limit.

The boolean property Organisation 'high-then-low' controls the index organisation. With high-then-low organisation, the high index is primary and the low index is subordinate; low-then-high indicates the opposite organisation. You should use high-then-low when val <= high is less likely than low <= val. This is to be expected for date ranges and typical queries against them.

3.4 WordIndex

A WordIndex indexes an object under a set of words. It uses a ZCTextIndex.PLexicon for splitting a text into a sequence of words.

A WordIndex lies between a KeywordIndex and a TextIndex. Like a TextIndex, it uses a Lexicon to split values into a sequence of words, stores integer word ids in its index and does not support range queries. Like a KeywordIndex, it indexes an object under a set of words -- no near or phrase queries and no relevancy ranking. Due to these restrictions, a WordIndex is very efficient with respect to transaction and load size.

The motivation to implement a WordIndex came from my observation, that almost all our ZEO loads above 100ms were caused by loading large word frequency IOBuckets used by TextIndexes for relevancy ranking -- a feature we do not need and use. Many of these loads transfered buckets of several 10 kB and a considerable portion of them took more than a second. As word frequency information is necessary for each document in a hit, you can imagine how fast our queries were.

On the other hand, a WordIndex is only useful when you use a flexible query framework, such as e.g. AdvancedQuery or CatalogQuery. The standard ZCatalog framework is too weak as it does not allow to have several subqueries against the same index in a query. That's the reason why TextIndexes come with their own query parser. I live in Germany and therefore the standard query parser is useless for us (we use und, oder, nicht instead of and, or and not) and I have AdvancedQuery -- thus I did not care to give the new WordIndex a query parser. You could easily provide one -- should you feel a need for it.

3.5 SimpleTextIndex

A simple text index can perform word or phrase queries.

It uses a ZCLexicon like lexicon to parse text into a word sequence.

Unlike almost all Zope text indexes, it does not have a built in query parser. If you need more complex queries use something like AdvancedQuery to specify your complex queries. Moreover, SimpleTextIndex does not support ranking.

Search terms are either a text or a sequence of ints. A text is converted via the lexicon into a sequence of ints. If an int sequence is passed in, then it is assumed that the text -> wordid conversion was performed outside. The query is either interpreted as an and query of the given words or as a phrase query, dependent on the phrase option.

For phrase queries, the use of IncrementalSearch2 is strongly recommended as it drastically speeds up phrase queries.

3.6 PathIndex

A PathIndex indexes an object under a path. A path is is either a tuple or a '/' separated string. The index supports path queries. For a given path, it locates objects that contain this path as subpath in its path value. Two search parameters level and depth control where in the object's path the given path may occur.

level and depth can be either None or an integer. When level and depth are both greater or equal 0, then an object with path op matches path p with respect to level and depth, iff op = p1 p p2 and len(p1) = level and len(p2) = depth. This means that level controls p's distance from the beginning and depth that from the end of op. A None value means that there is no restriction for the respective side. A negative value means that the distance from the respective side may be up to (including) the negated value. E.g. a "-2" allows up to 2 segments.

The default value for level is 0, that for depth is None.

Note, that getPhysicalPath requires acquisition to work properly. You must not set the acquisition type for such a value provider to none.

4 Matching

Most string and unicode based indexes (exception RangeIndex) support regular expression and glob matching. You use the query option match with a value of 'glob' or 'regexp' to call for this feature.

Note that for large indexes, the match term should begin with an (easily recognizable) plain text string. Otherwise, this query type can be very inefficient.

5 TALES Expressions

ManagableIndex uses TALES expressions at many places. They always can use the predefined variables:

index
the index
catalog
the catalog
root
the zope root object
modules
the modules importer
nothing
None
value
the value to be checked
object
the object currently indexed

If the TALES expression evaluates to a callable object, then this is called on the value and the result used; otherwise, the evaluation result is used directly.

6 Utilities

The module Utils contains some useful functions: convertToDateTime(value), convertToDateTimeInteger(value, exc=0) and convertToDateInteger(value, round_dir=-1). Please see the source documentation, for details.

7 Programmatic Creation

A primary goal for ManagableIndex is the easy and flexible configuration via the Zope Management Interface (ZMI). Occasionally, however, you want to create your ManagableIndexes programmatically and want to configure them in the same step. The ZCatalog's [manage_]addIndex allow you to pass configuration information down to the index construction in the extra argument.

For the creation of a ManagableIndex, extra must have a dict value. Its key 'ValueProviders' determines the indexes' value providers. All other keys can provide values for the indexes properties.

If the key 'ValueProviders' is present, its value must be a sequence of value provider specifications which define the value providers for the index. If the key is not present, a default value provider is created.

A value provider specification is a dict with the mandatory keys 'id' and 'type'. type specifies the value provider's type ('AttributeLookup' or 'ExpressionEvaluator'). Additional keys can defined values for the value providers properties.

If the configuration dicts contain keys different from the above mentioned special ones and not corresponding to a property, a ValueError exception is raised.

Look for the _properties definitions in the source, to determine the available properties for the various types of objects.


Dieter Maurer
Last modified: Thu Oct 4 16:51:15 CEST 2007