PyXPath

1 Introduction
2 Installation
3 Usage
4 Requirements
5 Usage Conditions
6 Known Bugs/Weaknesses
7 Download

1 Introduction

PyXPath is an implementation of the XPath working draft 9-July-1999.

XPath is the expression sublanguage of both XSLT and XPointer.

PyXPath is laid on top of PyDOM, the DOM implementation of Python's XML special interest group.

For parser construction, PyXPath uses bison and Scott Hassan's PyBison package.

2 Installation

There are two TGZ archives: pyxpath.tgz and pyxpath_s.tgz.

pyxpath.tgz is necessary to use the software. The archive contains two Python packages, dmutil and PyBison. Unpack it in Python's site-packages directory. dmutil contains code developed by me; PyBison is a small (and slightly patched) part of Scott Hassan's PyBison distribution.

pyxpath_s.tgz contains additional sources. You need this archive only if you want to change (or view) the bison grammar for XPath. To change the grammar, you will need bison, PyBison, a C development system and the patch utility. PyBison must be patched with the patch file env.pat provided in the archive. The patch allows customization specific to each parser object.

3 Usage

The central module is dmutil.xsl.xpath. It contains the parser factory makeParser and two evaluation context classes, ParseContext and Env.

makeParser constructs an XPath parser. Such a parser parses XPath expression strings and constructs corresponding XPath objects. An XPath object can be evaluated with a node, a nodelist and an Env instance to obtain a value.

makeParser accepts two optional parameters: a ParseContext instance context and a BaseFactory instance factory. context defines the namespaces and function library available for XPath parsing. factory is the factory object used to construct XPath objects.

An XPath parser can be applied to an XPath expression string. It accepts a ParseContext instance context as an optional argument. The expression is parsed with the namespaces defined in this context and in the context given at parser construction. The available functions are those defined by the context argument, by the context given at parser construction, or by the factory. Both context parameters default to None; the factory defaults to dmutil.xsl.DomFactory.DomFactory. With these defaults, the constructed XPath object does not recognize namespaces and can use the functions defined in the XPath core library. It is evaluated with three parameters: a PyDOM node, a PyDOM nodelist containing the node, and an Env instance specifying the available variables and their values.


from dmutil.xsl.xpath import makeParser, Env

domtree=....                # create a PyDOM document

# make a parser and a variable environment
P = makeParser()
E = Env()
E.setVariable('x', 'Hallo') # binds x to 'Hallo'

# selects all links in an HTML document
links = P('//A[@HREF]').eval(domtree, [domtree], E)
# selects all anchors in an HTML document
anchors = P('//A[@NAME]').eval(domtree, [domtree], E)

You find more XPath examples in the test_xpath.py test case file.
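
To get a feel for what the two expressions above select, the following sketch runs comparable queries with xml.etree.ElementTree from the Python standard library. This is only a stand-in and not PyXPath: ElementTree supports a small XPath subset and needs a leading `.` for a descendant search, and the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are upper case
# to match the expressions used above.
SAMPLE = """<HTML><BODY>
<A NAME="top">Top</A>
<P>See <A HREF="http://www.python.org/">Python</A> and
<A HREF="#top">back to top</A>.</P>
</BODY></HTML>"""

root = ET.fromstring(SAMPLE)

# ElementTree analogue of '//A[@HREF]': all A elements with an HREF attribute.
links = root.findall(".//A[@HREF]")
# ElementTree analogue of '//A[@NAME]': all A elements with a NAME attribute.
anchors = root.findall(".//A[@NAME]")

link_targets = [a.get("HREF") for a in links]
anchor_names = [a.get("NAME") for a in anchors]
print(link_targets)
print(anchor_names)
```

Both queries return the matching elements in document order, so the one named anchor and the two links are reported separately.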

XPath knows five data types (the set is extensible): Boolean, Number, String, Nodeset and Result Tree Fragment. PyXPath maps these to the Python data types int, float, string, list and (unspecified) instance, respectively. You must use one of these types for values of variables.
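
The mapping can be read in the other direction when choosing variable values. The sketch below is a hypothetical helper, not part of PyXPath, and it is written for a current Python rather than Python 1.5. It only restates the mapping given above; in particular, the reading that a Python int stands for a Boolean (so that numbers are presumably best passed as floats) is an inference from the text and not verified behavior.

```python
def xpath_type_of(value):
    """Hypothetical helper (not part of PyXPath): name the XPath type that
    a Python value would stand for under the mapping described above."""
    # bool is a subclass of int in current Python, so it is covered here too.
    if isinstance(value, int):
        return "Boolean"
    if isinstance(value, float):
        return "Number"
    if isinstance(value, str):
        return "String"
    if isinstance(value, list):
        return "Nodeset"
    # Anything else is assumed to be an instance representing a tree fragment.
    return "Tree Fragment"

# Made-up variable bindings, one per directly mapped type.
variables = {"flag": 1, "ratio": 0.5, "greeting": "Hallo", "nodes": []}
kinds = {name: xpath_type_of(value) for name, value in variables.items()}
print(kinds)
```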

4 Requirements

PyXPath requires a Python 1.5.x installation together with the Python XML package xml-0.5.1.

Python can be downloaded from the Python homepage, the XML package from the XML-SIG repository.

5 Usage Conditions

You can use PyXPath under an Open Source license at your own risk. Please see the copyright notice at the beginning of dmutil/xsl/xpath.py for details.

6 Known Bugs/Weaknesses

6.1 XML/XPath incompatibilities

PyXPath works with the ISO Latin-1 subset of Unicode rather than with full Unicode, which XML requires.
Namespaces are recognized by the parser; however, namespace prefixes are not transformed into URIs for matching, contrary to the XPath specification.
Special floating point values, such as NaN and Infinity, are not supported. Exceptions are raised instead.
The namespace axis is not yet recognized.

6.2 `id` references

PyXPath has no way to determine which attributes are used as id attributes. In order to support id references, PyXPath requires each node._document to have an attribute _idMap. This attribute must be a dictionary mapping elements to their id attribute. If such an attribute does not exist, id references are not found.
The DomFactory module contains the class IdDecl. Its constructor takes a dictionary mapping element names to id attributes as its idMap parameter. If an IdDecl instance is applied to a document, it installs its idMap as the _idMap attribute of the document.
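
The shape of such an id map, and the kind of lookup it makes possible, can be sketched with xml.dom.minidom from the standard library as a stand-in for PyDOM. The names ID_MAP and find_by_id are hypothetical and belong neither to PyXPath nor to IdDecl. The sketch assumes a dictionary from element names to the name of their id attribute, as described for the idMap parameter.

```python
from xml.dom import minidom

# Assumed shape: element name -> name of the attribute that acts as its id.
ID_MAP = {"chapter": "label", "figure": "ref"}

def find_by_id(document, id_map, wanted):
    """Return an element whose declared id attribute equals `wanted`,
    or None if no such element exists."""
    for name, id_attr in id_map.items():
        for element in document.getElementsByTagName(name):
            if element.getAttribute(id_attr) == wanted:
                return element
    return None

doc = minidom.parseString(
    '<book><chapter label="c1"/><figure ref="f1"/><note label="c1x"/></book>'
)
hit = find_by_id(doc, ID_MAP, "f1")
# <note> is not listed in ID_MAP, so its label attribute is not an id.
miss = find_by_id(doc, ID_MAP, "c1x")
print(hit.tagName if hit else None, miss)
```

Elements that are not listed in the map are never found by id, which mirrors the statement above that id references are not found without a map.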

6.3 Incompatibilities between DOM and XSL

XSL does not support CDATA, EntityReference and Notation nodes, which can occur in DOM trees. This incompatibility is not yet handled correctly.
Fortunately, most (SAX based) parsers transform CDATA and EntityReference into normal text nodes, at least if not explicitly told otherwise. However, this process violates the XSL requirement that no two text nodes be adjacent. The function normalize can be applied to a document to merge adjacent text nodes.
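
The effect of merging adjacent text nodes can be illustrated with the normalize method of xml.dom.minidom from the standard library. This is a stand-in and not the normalize function mentioned above, whose exact signature is not shown here; the two text nodes are created by hand to imitate what a parser may leave behind.

```python
from xml.dom import minidom

doc = minidom.getDOMImplementation().createDocument(None, "p", None)
root = doc.documentElement

# Two adjacent text nodes, as can result when a parser turns a CDATA section
# or an entity reference into plain text.
root.appendChild(doc.createTextNode("Hello, "))
root.appendChild(doc.createTextNode("world"))
before = len(root.childNodes)

# Merge adjacent text nodes in the subtree rooted at `root`.
root.normalize()
after = len(root.childNodes)
merged_text = root.firstChild.data
print(before, after, merged_text)
```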

6.4 Efficiency

Most XPath objects can be evaluated with one to three passes over the document tree. Some XPath constructs, however, may have quadratic complexity (in the tree size).

7 Download

PyXPath -- required for PyXPath usage: pyxpath-0.1.tgz 25 kB TGZ archive
Additional Sources -- only required for modifications to the XPath grammar: pyxpath_s-0.1.tgz 2 kB TGZ archive

Dieter Maurer

Last modified: Tue Aug 3 23:30:30 CEST 1999