This page lists some of the modules in pypy/rlib together with hints about what they can be used for. Together these modules make up a general library useful for RPython programs (since most of the standard library modules are not RPython). Most of them are still somewhat rough and are likely to change at some point. It is usually helpful to look at the tests in pypy/rlib/test to get an impression of how to use a module.
The pypy/rlib/listsort.py module contains an implementation of the timsort sorting algorithm (the sort method of lists is not RPython). To use it, subclass listsort.TimSort and override the lt method to change the comparison behaviour. The constructor of TimSort takes a list as an argument, which will be sorted in place when the sort method of the TimSort instance is called. Warning: currently only one type of list can be sorted using the listsort module within a single program; otherwise the annotator will be confused.
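The usage pattern can be illustrated with a pure-Python stand-in that mimics the interface described above (constructor taking a list, overridable lt, in-place sort); the real timsort implementation lives in pypy/rlib/listsort.py.

```python
# Stand-in mimicking the listsort.TimSort interface described above.
# The real class in pypy/rlib/listsort.py implements actual timsort;
# here a simple insertion sort stands in to show the usage pattern.
class TimSort(object):
    def __init__(self, list):
        self.list = list

    def lt(self, a, b):
        # Override this method in a subclass to change the comparison.
        return a < b

    def sort(self):
        # Sorts self.list in place, using self.lt for all comparisons.
        lst = self.list
        for i in range(1, len(lst)):
            item = lst[i]
            j = i - 1
            while j >= 0 and self.lt(item, lst[j]):
                lst[j + 1] = lst[j]
                j -= 1
            lst[j + 1] = item

class DescendingSort(TimSort):
    def lt(self, a, b):
        return a > b

data = [3, 1, 2]
DescendingSort(data).sort()
print(data)  # -> [3, 2, 1]
```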
The pypy/rlib/nonconst.py module is useful mostly for tests. The flow object space and the annotator do quite some constant folding, which is sometimes not desired in a test. To prevent constant folding on a certain value, use the NonConst class. The constructor of NonConst takes an arbitrary value. The instance of NonConst will behave during annotation like that value, but no constant folding will happen.
The pypy/rlib/objectmodel.py module is a mixed bag of various functionality. Among the more useful pieces are we_are_translated (a function that returns False when running on top of CPython but is constant-folded to True after translation) and r_dict (a dict-like object that takes user-supplied hash and equality functions).
The pypy/rlib/rarithmetic.py module contains functionality to handle the small differences in the behaviour of arithmetic code between regular Python and RPython. Most of it is already described in the coding guide.
The pypy/rlib/rbigint.py module contains a full RPython implementation of the Python long type (which itself is not supported in RPython). The rbigint class contains that implementation. To construct rbigint instances use the static methods fromint, frombool, fromfloat and fromdecimalstr. To convert back to other types use the methods toint, tobool, touint and tofloat. Since RPython does not support operator overloading, all the special methods of rbigint that would normally start and end with “__” have these underscores left out for better readability (so a.add(b) can be used to add two rbigint instances).
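A pure-Python stand-in can sketch the interface described above (the real implementation, with the same method names, lives in pypy/rlib/rbigint.py):

```python
# Stand-in sketching the rbigint interface: static constructors such as
# fromint and fromdecimalstr, conversion methods such as toint, and
# special methods spelled without the "__" underscores (add, mul, ...).
class rbigint(object):
    def __init__(self, value):
        self._v = value

    @staticmethod
    def fromint(i):
        return rbigint(i)

    @staticmethod
    def fromdecimalstr(s):
        return rbigint(int(s))

    def add(self, other):   # would be __add__ outside RPython
        return rbigint(self._v + other._v)

    def mul(self, other):   # would be __mul__ outside RPython
        return rbigint(self._v * other._v)

    def toint(self):
        return self._v

a = rbigint.fromint(10)
b = rbigint.fromdecimalstr("32")
print(a.add(b).toint())  # -> 42
```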
The pypy/rlib/rrandom.py module contains an implementation of the Mersenne Twister random number generator. It contains one class, Random, whose most important method is random, which returns a pseudo-random floating point number between 0.0 and 1.0.
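CPython's own random module is also based on the Mersenne Twister, so it can illustrate the behaviour that rrandom.Random provides to RPython code:

```python
import random

# CPython's random module uses the Mersenne Twister as well, so it can
# illustrate rrandom.Random: the random() method returns pseudo-random
# floats in the interval [0.0, 1.0).
rng = random.Random(42)   # seedable generator
values = [rng.random() for _ in range(5)]
print(all(0.0 <= v < 1.0 for v in values))  # -> True
```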
The pypy/rlib/rsocket.py module contains an RPython implementation of the functionality of the socket standard library with a slightly different interface. The difficulty with the Python socket API is that addresses are not “well-typed” objects: depending on the address family they are tuples, or strings, and so on, which is not suitable for RPython. Instead, rsocket contains a hierarchy of Address classes, in a typical static-OO-programming style.
The pypy/rlib/streamio.py module contains an RPython stream I/O implementation (which was started by Guido van Rossum as sio.py in the CPython sandbox as a prototype for the upcoming new file implementation in Python 3000).
The pypy/rlib/unroll.py module most importantly contains the function unrolling_iterable which wraps an iterator. Looping over the iterator in RPython code will not produce a loop in the resulting flow graph but will unroll the loop instead.
The pypy/rlib/parsing/ package is still in development and can be used to generate tokenizers and parsers in RPython. It is highly experimental and only really used by the Prolog interpreter (although in slightly non-standard ways). The easiest way to specify a tokenizer/grammar is to write it down using regular expressions and a simple EBNF format.
The regular expressions are implemented using finite automata. The parsing engine uses packrat parsing, which has O(n) parsing time but is more powerful than LL(k) and LR(k) grammars.
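The key idea behind packrat parsing, memoizing the result of every (rule, position) pair so that each is computed at most once, can be sketched in a few lines of self-contained Python (an illustration of the technique, not the engine's actual code):

```python
# Sketch of the packrat idea: memoize every (rule, position) pair so
# each is evaluated at most once, giving O(n) parsing time for a PEG.
def parse(source):
    memo = {}

    def apply(rule, pos):
        key = (rule.__name__, pos)
        if key not in memo:
            memo[key] = rule(pos)   # result: new position, or None on failure
        return memo[key]

    def digit(pos):
        if pos < len(source) and source[pos].isdigit():
            return pos + 1
        return None

    def number(pos):                # number: digit number | digit
        new = apply(digit, pos)
        if new is None:
            return None
        rest = apply(number, new)
        return rest if rest is not None else new

    return apply(number, 0) == len(source)

print(parse("12345"))  # -> True
print(parse("12a45"))  # -> False
```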
The regular expression syntax is mostly a subset of the syntax of the re module. By default, non-special characters match themselves. If you concatenate regular expressions the result will match the concatenation of strings matched by the single regular expressions.
To parse a regular expression and to get a matcher for it, you can use the function make_runner(s) in the pypy.rlib.parsing.regexparse module. It returns an object with a recognize(input) method that returns True or False depending on whether input matches the regular expression.
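Since the syntax is mostly a subset of the re module's, the recognize behaviour corresponds to a full-string match and can be illustrated with CPython's re module standing in for the generated matcher:

```python
import re

# Stand-in for the matcher returned by make_runner(s): recognize(input)
# answers whether the *whole* input matches the regular expression,
# which corresponds to re.fullmatch in CPython.
def recognize(pattern, input):
    return re.fullmatch(pattern, input) is not None

print(recognize("0|[1-9][0-9]*", "120"))  # -> True
print(recognize("0|[1-9][0-9]*", "012"))  # -> False
```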
The pypy.rlib.parsing.ebnfparse module defines a syntax for describing both a tokenizer and a grammar.
The syntax file contains a sequence of rules. Every rule describes either a regular expression or a grammar rule.
Regular expression rules have the form:
NAME: "regex";
NAME is the name of the token that the regular expression produces (it has to consist of upper-case letters), regex is a regular expression with the syntax described above. One token name is special-cased: a token called IGNORE will be filtered out of the token stream before being passed on to the parser and can thus be used to match comments or non-significant whitespace.
Grammar rules have the form:
name: expansion_1 | expansion_2 | ... | expansion_n;
Where expansion_i is a sequence of nonterminal or token names:
symbol_1 symbol_2 symbol_3 ... symbol_n
This means that the nonterminal symbol name (which has to consist of lower-case letters) can be expanded into any of the expansions. The expansions can consist of a sequence of token names, nonterminal names or literals, which are strings in quotes that are matched literally.
An example to make this clearer:
IGNORE: " ";
DECIMAL: "0|[1-9][0-9]*";
additive: multitive "+" additive |
multitive;
multitive: primary "*" multitive |
primary;
primary: "(" additive ")" | DECIMAL;
This grammar describes the syntax of arithmetic expressions involving addition and multiplication. The tokenizer produces a stream of either DECIMAL tokens or tokens that have matched one of the literals “+”, “*”, “(” or “)”. Any space will be ignored. The grammar produces a syntax tree that follows the precedence of the operators. For example the expression 12 + 4 * 5 is parsed into the following tree:
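The precedence behaviour can be checked with a small self-contained recursive-descent evaluator that follows the same three grammar rules (a sketch for illustration; the real parser is generated from the EBNF file):

```python
import re

# Self-contained evaluator following the grammar above
# (additive / multitive / primary); a sketch, not the generated parser.
def tokenize(s):
    # DECIMAL tokens or the literals + * ( ); spaces are ignored.
    return re.findall(r"0|[1-9][0-9]*|[+*()]", s)

def evaluate(s):
    tokens = tokenize(s)
    pos = [0]

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None

    def advance():
        tok = tokens[pos[0]]
        pos[0] += 1
        return tok

    def additive():    # additive: multitive "+" additive | multitive
        left = multitive()
        if peek() == "+":
            advance()
            return left + additive()
        return left

    def multitive():   # multitive: primary "*" multitive | primary
        left = primary()
        if peek() == "*":
            advance()
            return left * multitive()
        return left

    def primary():     # primary: "(" additive ")" | DECIMAL
        if peek() == "(":
            advance()
            value = additive()
            advance()  # consume ")"
            return value
        return int(advance())

    return additive()

print(evaluate("12 + 4 * 5"))  # -> 32
```

Because multiplication is handled below addition in the rule hierarchy, 12 + 4 * 5 evaluates to 32, matching the precedence encoded in the tree.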
The parsing process builds up a tree consisting of instances of Symbol and Nonterminal, the former corresponding to tokens, the latter to nonterminal symbols. Both classes live in the pypy/rlib/parsing/tree.py module. You can use the view() method of Nonterminal instances to get a pygame view of the parse tree.
Symbol instances have the following attributes: symbol, which is the name of the token, and additional_info, which is the matched source.
Nonterminal instances have the following attributes: symbol, which is the name of the nonterminal, and children, which is a list of child nodes.
To write tree visitors for the parse trees that are RPython, there is a special base class RPythonVisitor in pypy/rlib/parsing/tree.py to use. If your class uses it, it will grow a dispatch(node) method that calls an appropriate visit_<symbol> method, depending on the node argument, where <symbol> is replaced by the symbol attribute of the visited node.
For the visitor to be RPython, the return values of all the visit methods need to be of the same type.
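The dispatch behaviour can be sketched with a self-contained stand-in that looks up visit_<symbol> methods by name (the real RPythonVisitor in pypy/rlib/parsing/tree.py constructs its dispatcher so that the visitor stays RPython):

```python
# Stand-in sketching RPythonVisitor's dispatch: dispatch(node) calls the
# visit_<symbol> method named after the node's symbol attribute.
class Symbol(object):
    def __init__(self, symbol, additional_info):
        self.symbol = symbol                  # name of the token
        self.additional_info = additional_info  # matched source

class Nonterminal(object):
    def __init__(self, symbol, children):
        self.symbol = symbol      # name of the nonterminal
        self.children = children  # list of child nodes

class Visitor(object):
    def dispatch(self, node):
        return getattr(self, "visit_" + node.symbol)(node)

class Summer(Visitor):
    # All visit methods return the same type (int), as required for the
    # visitor to be RPython.
    def visit_DECIMAL(self, node):
        return int(node.additional_info)

    def visit_additive(self, node):
        return sum(self.dispatch(c) for c in node.children)

tree = Nonterminal("additive", [Symbol("DECIMAL", "12"),
                                Symbol("DECIMAL", "20")])
print(Summer().dispatch(tree))  # -> 32
```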
As the arithmetic example above shows, by default the parse tree contains a lot of nodes that do not really convey useful information. To get rid of some of them, the grammar format has support for automatically creating a visitor that transforms the tree to remove the additional nodes. The simplest such transformation just removes nodes, but there are more complex ones.
The syntax for these transformations is to enclose symbols in expansions of a nonterminal by [...], <...> or >...<.
Enclosing symbols in [...] produces a transformer that completely removes the enclosed symbols from the tree.
Example:
IGNORE: " ";
n: "A" [","] n | "A";
Parsing the string “A, A, A” gives the tree:
After transformation the tree has the ”,” nodes removed:
Enclosing a symbol in <...> replaces the parent node with that symbol. Every expansion can contain at most one symbol enclosed in <...>, because the parent can only be replaced once.
Example:
IGNORE: " ";
n: "a" "b" "c" m;
m: "(" <n> ")" | "d";
Parsing the string “a b c (a b c d)” gives the tree:
After transformation the tree looks like this:
Enclosing nonterminals in >...< replaces the nodes nonterminal_1 to nonterminal_n by their children.
Example:
IGNORE: " ";
DECIMAL: "0|[1-9][0-9]*";
list: DECIMAL >list< | DECIMAL;
Parsing the string “1 2” gives the tree:
After the transformation the tree looks like:
Note that the transformation works recursively. That means that the following also works: if the string “1 2 3 4 5” is parsed the tree at first looks like this:
But after transformation the whole thing collapses to one node with a lot of children:
There are some extensions to the EBNF grammar format that are really only syntactic sugar but make writing grammars less tedious. These are the postfix repetition operators ?, * and +, with their usual regular-expression meaning: an optional element, zero or more repetitions, and one or more repetitions.
These are implemented by adding extra rules to the grammar in the appropriate way. For example, the grammar:
s: a b? c;
is transformed to look like this:
s: a >_maybe_symbol_0_< c | a c;
_maybe_symbol_0_: b;
The grammar:
s: a b* c;
is transformed to look like this:
s: a >_star_symbol_0< c | a c;
_star_symbol_0: b >_star_symbol_0< | b;
The grammar:
s: a b+ c;
is transformed to look like this:
s: a >_plus_symbol_0< c;
_plus_symbol_0: b >_plus_symbol_0< | b;
A semi-complete parser for the JSON format:
STRING: "\\"[^\\\\"]*\\"";
NUMBER: "\-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][\+\-]?[0-9]+)?";
IGNORE: " |\n";
value: <STRING> | <NUMBER> | <object> | <array> | <"null"> |
<"true"> | <"false">;
object: ["{"] (entry [","])* entry ["}"];
array: ["["] (value [","])* value ["]"];
entry: STRING [":"] value;
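The grammar accepts essentially the same documents as standard JSON (apart from escape handling inside STRING), so CPython's json module can serve as a reference while experimenting with it:

```python
import json

# CPython's json module as a reference point for the grammar above:
# it parses the same example document used in the text.
doc = '{"a": "5", "b": [1, null, 3, true, {"f": "g", "h": 6}]}'
data = json.loads(doc)
print(data["b"][4]["h"])  # -> 6
```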
The resulting tree for parsing the string:
{"a": "5", "b": [1, null, 3, true, {"f": "g", "h": 6}]}
looks like this: