AdvancedQuery

1 Introduction

AdvancedQuery is a Zope product aimed to overcome several limitations of ZCatalog's native search function.

Like ZCatalog's search, it supports elementary index searches. While ZCatalog can combine such elementary searches only by "and", AdvancedQuery allows them to be combined arbitrary with & (and), | (or) and ~ (not).

While ZCatalog supports an efficient sorting via an index on one level, AdvancedQuery supports sorting via any level of (field) indexes. Moreover, it sorts the result incrementally -- only as far as you access your result. This can drastically speed up the time required for sorting. It uses Python's generators for this (and thus requires Python 2.2 or better).

2 Query Objects

Queries are specified by (full blown) Python objects. They are constructed in the following way:

Expression printed as Meaning
Eq(index, value, filter=False) index = value the documents indexed by index under value
Le(index, value, filter=False) index <= value the documents indexed by index under a value less or equal value
Ge(index, value, filter=False) index >= value the documents indexed by index under a value greater or equal value
MatchGlob(index, pattern, filter=False) index =~ pattern the documents indexed by index under a value matching the glob pattern. A glob pattern can contain wildcards * (matches any sequence of characters) and ? (matches any single character).
This query type is only supported by most string or unicode based ManagableIndexes (exception: RangeIndex). Many TextIndexes support glob matching via the Eq query.
MatchRegexp(index, regexp, filter=False) index =~~ regexp the documents indexed by index under a value matching the regular expression regexp. See the re module documentation in the Python Library Reference, for a description of regular expressions.
This query type is only supported by most string or unicode based ManagableIndexes (exception: RangeIndex).
Between(index, low, high, filter=False) low <= index <= high the documents indexed by index under a value between low and high
In(index, sequence, filter=False) index in sequence the documents indexed by index under a value in sequence
Generic(index, value, filter=False) index ~~ value this query type is used to pass any search expression to index as understood by it. Such search expressions usually take the form of a dictionary with query as the most essential key. Generic is necessary to use the full power of specialized indexes, such as the level argument for PathIndex searches.
Indexed(index) Indexed(index) the documents that are indexed by index. This does not work for all index types.
LiteralResultSet(set) LiteralResultSet(set) the documents specified by set.
set must be an IISet, IITreeSet or sequence of catalog "data_record_id_"s.
This can e.g. be used to further restrict the document set previously obtained through a query.
~ query ~ query Not: the documents that do not satisfy query
query1 & query2 (query1 & query2) And: the documents satisfying both query1 and query2
And(*queries) (query1 & ... & queryn) And: the documents satisfying all queries; if queries is empty, any document satisfies this And query
query1 | query2 (query1 | query2) Or: the documents satisfying either query1 or query2 (or both)
Or(*queries) (query1 | ... | queryn) Or: the documents satisfying (at least) one of queries; if queries is empty, no document satisfies this Or query

And and Or queries are so called CompositeQuerys. They possess a method addSubquery(query) to add an additional subquery.

The constructors are imported from Products.AdvancedQuery.

AdvancedQuery uses so called Monkey Patching to give ZCatalog the new method makeAdvancedQuery(catalogSearchSpec). A catalogSearchSpec is a search specification as described in the Zope Book for ZCatalog searches (essentially a dictionary mapping index names to search specifications). makeAdvancedQuery returns the equivalent AdvancedQuery search object.

3 Query evaluation

AdvancedQuery uses so called Monkey Patching to give ZCatalog and (if available) the CMF CatalogTool the new method evalAdvancedQuery(query, sortSpecs=(), withSortValues=_notPassed). evalAdvancedQuery evaluates query and then sorts the document result set according to sortSpecs.
If withSortValues is not passed in, it is set to True if sortSpecs contains a ranking specification (as you are probably interested in the rank) and to False otherwise.
If withSortValues, then the data_record_score_ attribute of the returned proxies is abused to hold the sort value. It is a tuple with one component per component in sortSpecs. The attribute data_record_normalized_score_ is set to None.

The CatalogTool's evalAdvancedQuery uses a new ValidityRange index (if present) in preference to the effective and expires indexes to restrict searches to valid objects. ValidityRange is expected to be a ManagableIndex RangeIndex. Searches via such a range index are considerably more efficient than those via the individual indexes.

4 Sorting

AdvancedQuery supports incremental multi-level lexicographic sorting via field index like indexes. If an index used for sorting is not field index like (i.e. does not index an object under at most one value), you may get funny (and partly non determistic) results.

Sorting is specified by a sequence of sort specifications, each for a single level. Such a specification is either an index name, a pair index name and direction or a ranking specification (see below). A direction is 'asc' (ascending) or 'desc' (descending); if the direction is not specified, 'asc' is assumed.

When the result contains documents not indexed by a sorting index, such documents are delivered after indexed documents. This happens always, independant of search direction.

5 Incremental Filtering

From version 1.1 on, AdvancedQuery supports incremental filtering. Incremental filtering can be very promissing for an unspecific subquery inside an otherwise specific And query, especially for large Le, Ge, Between and range subqueries. If we use the index in the normal way a huge Or query is constructed for such subqueries. Even IncrementalSearch2 cannot fully optimize the search against this huge Or query. Whith incremental filtering the index is not used in the normal way. Instead, the remaining And subqueries are used to produce a set of document candidates. These are then filtered by the filtering subquery, discarding documents not matching the subquery. Provided that the other And subqueries already have reduced the document set sufficiently, incremental filtering can save a lot of time.

You request incremental filtering for an (elementary) subquery with the filter keyword argument. Usually, you use it only for some subqueries of specific And queries. Otherwise, incremental filtering may not reduce but increase the query time (even considerably).

If you have more than a single filtering subquery in an And query, their order might be relevant for efficiency. You should put filtering subqueries that are likely to reduce the document set more before other filtering subqueries.

Incremental filtering requires that you have version 1.0 of IncrementalSearch or IncrementalSearch2 installed. Furthermore, incremental filtering is only effective, if the index supports it. This is true for most version 1.3 ManagableIndex index types. If some condition for incremental filtering is not met, the filter keyword is simply ignored.

6 Ranking

From version 2.0 on, AdvancedQuery supports incremental ranking. Ranking is a form of sorting. Therefore, you specify it as a sort spec. Ranking can be combined with other sort specs in the usual way (leading to multi-level sorting).

Like sorting in general, ranking is performed incrementally -- just as far as you have looked at the result. Therefore, although ranking in general is very expensive, its effect can be small if you only look at the first few (hundred) result objects (rather than the several hundred thousands).

Currently, the ranking specifications RankByQueries_Sum, and RankByQueries_Max are supported. In both cases, you call the constructors with one or more pairs (q, vq), i.e. with a sequence of weighted queries.
The rank of a document is the sum or the maximum of the weights for queries matching the document, respectively.
Note that the runtime behaviour for RankByQueries_Sum is exponential, that of RankByQueries_Max linear in the number of queries involved in the ranking.
Note that you probably want to normalize the document ranks. The ranking classes above have methods getQueryValueSum() and getQueryValueMax(), respectively, that can help with this.

7 Examples

from Products.AdvancedQuery import Eq, Between, Le

# search for objects below 'a/b/c' with ids between 'a' and 'z~'
query = Eq('path','a/b/c') & Between('id', 'a', 'z~')

# evaluate and sort descending by 'modified' and ascending by 'Creator'
context.Catalog.evalAdvancedQuery(query, (('modified','desc'), 'Creator',))

# search 'News' not yet archived and 'File's not yet expired.
now = context.ZopeTime()
query = Eq('portal_type', 'News') & ~ Le('ArchivalDate', now)
        | Eq('portal_type', 'File') & ~ Le('expires', now)
context.Catalog.evalAdvancedQuery(query)

# search 'News' containing 'AdvancedQuery' and filter out
# not yet effective or still expired documents.
query = Eq('portal_type', 'News') & Eq('SearchableText', 'AdvancedQuery') \
  & Ge('expires', now, filter=True) & Le('effective', now, filter=True)
context.Catalog.evalAdvancedQuery(query)

# search for 'ranking' in 'SearchableText' and rank very high
# when the term is in 'Subject' and high when it is in 'Title'.
# print the id and the normalized rank
from Products.AdvancedQuery import RankByQueries_Sum
term = 'ranking'
rs = RankByQueries_Sum((Eq('Subject', term),16), (Eq('Title', term),8))
norm = 1 + rs.getQueryValueSum()
for r in context.Catalog.evalAdvancedQuery(
    Eq('SearchableText', term), (rs,)
    ):
    print r.getId, (1 + r.data_record_score_) / norm

8 Important note about caching

You must not cache the result of an AdvancedQuery unless you have ensured that sorting has finished (e.g. by accessing the last element in the result). This is because AdvancedQuery uses incremental sorting with BTrees iterators. Like any iterator, they do not like when the base object changes during iteration. Nasty types of (apparently) non-deterministic errors can happen when the index changes during sorting.

9 Download and installation

Download the most recent version via my Zope Page.

Install by unpacking the tar archive into your Products folder and restart your Zope.

10 License

This software is open source and licensed under a BSD style license. See the license file in the distribution for details.


Dieter Maurer
Last modified: Sun Jun 25 21:13:29 CEST 2006