ManagableIndex
can mean different things for
different people. For a content manager, it brings flexible field,
keyword and efficient range indexes managable via the ZMI to
portal_type
to the Subject
index transforming it into a sequence (as required by a keyword
index).For a developer, ManagableIndex
provides a
framework for index definition, improving on
PluginIndexes
. It provides for managability,
automatically and intelligently handles unindexing when an object
is reindexed and implements and, or and
range queries (for not too complex indexes).
The main tasks of an index are to index objects under a set of index terms derived from the object and efficiently locate all objects indexed under a given term.
Indexing consists of 3 stages: evaluation of the object to obtain the object's value (with respect to the index), deriving index terms from the value and storing the association term/object in a way such that the objects associated with a term can quickly be retrieved.
Evaluation is specified by a sequence of ValueProviders
associated with the index. A ValueProvider is a device that returns
a value for an object. If the return value is not
None
, then it is interpreted as the object's value
with respect to this ValueProvider.
A ValueProvider
can have an associated
IgnorePredicate, a TALES expression. When the
IgnorePredicate
yields true for a value, then it is
ignored. You may e.g. specify that empty strings should be
ignored.
A ValueProvider
can have an associated
Normalizer, a TALES expression. The
Normalizer
is applied to not ignored values to
transform them in a standard form, e.g. the normalizer for a
keyword index can transform a string into a one element tuple
containing the string as a keyword index requires a sequence
value.
The most essential ValueProviders are AttributeLookups. An AttributeLookup determines the object's value through the lookup of an attribute. The AttributeLookup's configuration determines whether acquisition can be used, whether a callable value should be called and whether exceptions should be ignored.
ExpressionEvaluators are another type of
ValueProviders. These are TALES expressions defining an objects
value. ExpressionEvaluator
s often avoid to define
simple scripts just for indexing purposes.
Warning: Until Zope 2.7, TALES
expressions have been trusted when used outside of Zope; from Zope
2.8 on, TALES expression evaluation is always subject to Zope's
security restrictions even when used outside of Zope. This may have
strange consequences when you perform index management (e.g. mass
reindexing) in an external script (run outside of Zope). In such
cases, you should probably let the script login as a Zope user with
sufficient priviledges.
When an index has several ValueProviders, their values must be combined to define the object's value.
None
and ignored values are always ignored and are
not combined.
Currently, there are three combiners:
useFirst
union
aggregate
RangeIndex
.Once the object's value (with respect to the index) has been
determined, the value is split into a set of index terms in an
index specific way (e.g. a FieldIndex
uses the value
directly, a KeyWordIndex
requires the value to be a
sequence and uses its elements as index terms and a
TextIndex
splits the value into words and uses the
words). The terms are then standardized.
Standardization consists of serveral steps: prenormalization such as case normalization, stemming, phonetic normalization, ... elimination of stop terms, normalization such as type conversion, type checking and copying.
A index can define a term prenormalizer, a TALES expression. The prenormalizer is applied to terms before term expansion in queries and always as the first step of normalization during indexing. It can be used e.g. for case normalization, stemming, phonetic normalization, synonym normalization, ...
An index can define a StopTermPredicate, a TALES expression. When the predicate yields true for a term, the term is ignored.
An index can define a NormalizeTerm TALES expression. The expression can be used to transform the term into some standard form, e.g. convert a descriptive string or complex object into a code used for efficient indexing.
The BTrees used in index implementation require that any index term must be persistently comparable to any other index term in this index (otherwise, the index gets corrupted). To help observing this essential property, the index term's type can be restricted.
The following type restrictions are defined:
not checked
numeric
int
,
float
or long
),string
ustring
TermTypeExtra
can specify an encoding used for conversion.integer
DateTime
DateTime
object or float -- the
index stores the date as a float (seconds since epoch),DateTimeInteger
DateTime
object or integer --
the index stores the date as an integer (seconds since epoch;
truncated, if necessary),DateInteger
DateTime
object or integer --
the index stores an integer representing the date (400 *
year + 31 * (month-1) +
(day-1)
,tuple
with tuplespec in
TermTypeExtra
n
(numeric), s
(string), u
(unicode string), d
(datetime). An encoding for unicode conversion can be specified
after tuplespec, separated by ;
.instance
with fullClassName in
TermTypeExtra
__cmp__
method. It is assumed without
check, that this __cmp__
implements a persistent
comparison.expression checked
with TermTypeExtra
specifying checking TALES expressionNone
or raises an exception,
the type is considered inacceptable, otherwise the value is used
instead of the original term.All checkers try to convert a term into an acceptable type. Only when this fails, an exception is raised.
If you choose one of the integer types, i.e.
integer
, DateTimeInteger
or
DateInteger
, an especially efficient index type is
build. You must clear and reindex the index when the index type is
changed. You must clear the index even when it is already empty
(because the data structures may need to be changed).
It is dangerous to index mutable types. If a indexed mutable object is later changed, its ordering with respect to other indexed objects may change, corrupting the index. Corruption of this kind can be avoided when the mutable object is copied before it is stored in the index.
TermCopy
specifies whether the value should be
directly used, shallow copied or deep copied.
Currently, ManagableIndexes
supports 3 types of
indexes: FieldIndex
, KeywordIndex
and
RangeIndex
. Further types can easily be
implemented.
FieldIndex
A FieldIndex
indexes an object under a single term,
the object's value with respect to the index.
You get efficient DateTime and Date indexes by selecting
DateTime
(stored as float),
DateTimeInteger
(stored as integer) or
DateInteger
(stored as integer) as
TermType
. You must clear and reindex when you change
the type.
You can very efficiently sort search results via a
FieldIndex
. The ZCatalog
's method
searchResults
(aka __call__
) supports
this via its sort_on
argument. My
AdvancedQuery product supports arbitrary levels of sorting via
FieldIndex
es.
AdvancedQuery
's FieldIndex
based
ascending sorting is much more efficient than descending sorting.
To make descending sorting more efficient, you can tell your
FieldIndex
to maintain the sort keys (also) in reverse
order. This will slow indexing down a bit but make descending sorts
much faster. When you change this attribute, you must clear and
reindex the index (otherwise, the reverse order is not
consistent).
KeywordIndex
A KeywordIndex
expects the value of an object to be
a sequence. It indexes the object under each element of this
sequence.
RangeIndex
A RangeIndex
expects an object's value to be a pair
specifying an interval low to high. The index
can efficiently locate documents for which a given term
t lies within the document's low and
high bounds.
The object's value can either by constructed by a single
attribute or with an aggregate
combiner.
To provide for a partial plugin replacement for CMF's
effective
and expires
indexes,
RangeIndex
supports a Boundary names
property. If set, it should be a pair of two names low
and high. The index will then execute queries of the
form low <= val <=
high
. To be compatible with AdvancedQuery
, the index
replacing effective
and expires
should
have the name ValidityRange
.
RangeIndex
efficiently supports improper ranges,
i.e. those where at least one boundary is unlimited. You use its
Mininal value
and Maximal value
properties to define which values should be considered as
unlimited. These properties are TALES valued and are evaluated when
the index is cleared. All values at or below (the value of)
Minimal value
are identified and interpreted as no
lower limit; similarly, all values at or above (the value of)
Maximal value
are identified and interpreted as no
upper limit.
The boolean property Organisation 'high-then-low'
controls the index organisation. With high-then-low
organisation, the high
index is primary and the
low
index is subordinate; low-then-high
indicates the opposite organisation. You should use
high-then-low
when val <=
high
is less likely than low
<= val
. This is to be expected for date ranges
and typical queries against them.
WordIndex
A WordIndex
indexes an object under a set of words.
It uses a ZCTextIndex.PLexicon
for splitting a text
into a sequence of words.
A WordIndex
lies between a
KeywordIndex
and a TextIndex
. Like a
TextIndex
, it uses a Lexicon
to split
values into a sequence of words, stores integer word ids in its
index and does not support range queries. Like a
KeywordIndex
, it indexes an object under a set of
words -- no near or phrase queries and no relevancy ranking. Due to
these restrictions, a WordIndex
is very efficient with
respect to transaction and load size.
The motivation to implement a WordIndex
came from
my observation, that almost all our ZEO loads above 100ms were
caused by loading large word frequency IOBuckets used by
TextIndex
es for relevancy ranking -- a feature we do
not need and use. Many of these loads transfered buckets of several
10 kB and a considerable portion of them took more than a second.
As word frequency information is necessary for each document in a
hit, you can imagine how fast our queries were.
On the other hand, a WordIndex
is only useful when
you use a flexible query framework, such as e.g.
AdvancedQuery
or CatalogQuery
. The
standard ZCatalog
framework is too weak as it does not
allow to have several subqueries against the same index in a query.
That's the reason why TextIndex
es come with their own
query parser. I live in Germany and therefore the standard query
parser is useless for us (we use und
,
oder
, nicht
instead of and
,
or
and not
) and I have
AdvancedQuery
-- thus I did not care to give the new
WordIndex
a query parser. You could easily provide one
-- should you feel a need for it.
SimpleTextIndex
A simple text index can perform word or phrase queries.
It uses a ZCLexicon
like lexicon to parse text into
a word sequence.
Unlike almost all Zope text indexes, it does not have a built in
query parser. If you need more complex queries use something like
AdvancedQuery
to specify your complex queries.
Moreover, SimpleTextIndex
does not support
ranking.
Search terms are either a text or a sequence of ints. A text is
converted via the lexicon into a sequence of ints. If an int
sequence is passed in, then it is assumed that the text ->
wordid conversion was performed outside. The query is either
interpreted as an and query of the given words or as a phrase
query, dependent on the phrase
option.
For phrase queries, the use of IncrementalSearch2
is strongly recommended as it drastically speeds up phrase
queries.
PathIndex
A PathIndex
indexes an object under a path. A path
is is either a tuple or a '/' separated string. The index supports
path queries. For a given path, it locates objects that contain
this path as subpath in its path value. Two search parameters
level and depth control where in the object's
path the given path may occur.
level and depth can be either
None
or an integer. When level and
depth are both greater or equal 0, then an object with
path op matches path p with respect to
level and depth, iff op =
p1 p p2
and
len(p1) = level
and
len(p2) = depth
. This means that
level controls p's distance from the
beginning and depth that from the end of op.
A None
value means that there is no restriction for
the respective side. A negative value means that the distance from
the respective side may be up to (including) the negated value.
E.g. a "-2" allows up to 2 segments.
The default value for level is 0
, that
for depth is None
.
Note, that getPhysicalPath
requires acquisition to
work properly. You must not set the acquisition type for such a
value provider to none
.
Most string and unicode based indexes (exception
RangeIndex
) support regular expression and glob
matching. You use the query option match
with a value
of 'glob'
or 'regexp'
to call for this
feature.
Note that for large indexes, the match term should begin with an (easily recognizable) plain text string. Otherwise, this query type can be very inefficient.
ManagableIndex
uses TALES expressions at many
places. They always can use the predefined variables:
index
catalog
root
modules
nothing
None
value
object
If the TALES expression evaluates to a callable object, then this is called on the value and the result used; otherwise, the evaluation result is used directly.
The module Utils
contains some useful functions:
convertToDateTime(value)
,
convertToDateTimeInteger(value, exc=0)
and
convertToDateInteger(value, round_dir=-1)
. Please see
the source documentation, for details.
A primary goal for ManagableIndex
is the easy and
flexible configuration via the Zope Management Interface (ZMI).
Occasionally, however, you want to create your ManagableIndexes
programmatically and want to configure them in the same step. The
ZCatalog
's [manage_]addIndex
allow you to
pass configuration information down to the index construction in
the extra
argument.
For the creation of a ManagableIndex, extra
must
have a dict value. Its key 'ValueProviders'
determines
the indexes' value providers. All other keys can provide values for
the indexes properties.
If the key 'ValueProviders'
is present, its value
must be a sequence of value provider specifications which define
the value providers for the index. If the key is not present, a
default value provider is created.
A value provider specification is a dict with the mandatory keys
'id'
and 'type'
. type
specifies the value provider's type ('AttributeLookup'
or 'ExpressionEvaluator'
). Additional keys can defined
values for the value providers properties.
If the configuration dicts contain keys different from the above
mentioned special ones and not corresponding to a property, a
ValueError
exception is raised.
Look for the _properties
definitions in the source,
to determine the available properties for the various types of
objects.