Welcome to EBNFParser’s documentation!

Quick Start

Installing

EBNFParser currently supports Python 3.6+ only.

You can install it from PyPI:

pip install -U EBNFParser

Hello World

As a first attempt, let's parse a Lisp-like syntax into an AST (Abstract Syntax Tree).

(define add3 (x y z)
          (add x
              (add y z)))

Here is a source code example:

  • lisp.ruiko
ignore [space]  # ignore the tokens with this(these) name(s).

space   := R'\s';

Atom    := R'[^\(\)\s\`]+'; # use Regex

Expr    ::= Atom
        | Quote
        | '(' Expr* ')';


Quote   ::=  '`' Expr ;

Stmts   ::= Expr*;

Then use EBNFParser to generate a parser and a test script automatically from this grammar.

Finally, test it.

ruiko lisp.ruiko lisp_parser.py --test
python test_lang.py Stmts "(define f (x y z) (add (add x y) z))"
 =========================ebnfparser test script================================
Stmts[
    Expr[
        [name: auto_const, string: "("]
        Expr[
            [name: Atom, string: "define"]
...(omit)

Integrate EBNFParser Into Your Own Project

For example, suppose we have generated the Lisp parser file above as a module of your project, say MyProject.Lisp.parser (imported below simply as lisp_parser).

from Ruikowa.ObjectRegex.ASTDef import Ast
from Ruikowa.ErrorHandler import ErrorHandler
from Ruikowa.ObjectRegex.MetaInfo import MetaInfo
from Ruikowa.ObjectRegex.Tokenizer import Tokenizer

from lisp_parser import Stmts, token_table

import typing as t

def token_func(src_code: str) -> t.Iterable[Tokenizer]:
    return Tokenizer.from_raw_strings(
        src_code, token_table, ({"space"}, {}))

parser = ErrorHandler(Stmts.match, token_func)

def parse(filename: str) -> Ast:
    return parser.from_file(filename)

# just create a file `test.lisp` and write some lisp codes.
print(parse("./test.lisp"))

A Ruikowa.ObjectRegex.Ast is a nested list of Tokenizers, for instance:

AstName[
    AstName[
        Tokenizer1
        Tokenizer2
        AstName[
            ...
        ]
    ]
    Tokenizer3
]

You can use obj.name to get the name of an instance of Ast or Tokenizer.
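
For instance, here is a minimal sketch of walking such a result. It assumes only what is documented above: an Ast behaves like a (nested) list of its children, and both Ast and Tokenizer expose name (a Tokenizer also has string):

def dump(node, depth=0):
    # `node` is either an Ast (assumed here to behave like a list) or a Tokenizer leaf.
    indent = "    " * depth
    if isinstance(node, list):
        print(indent + node.name + "[")
        for child in node:
            dump(child, depth + 1)
        print(indent + "]")
    else:
        print('{}[name: {}, string: "{}"]'.format(indent, node.name, node.string))

dump(parse("./test.lisp"))

The printed shape is the same nested AstName[...] layout shown above.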

Parsing in EBNFParser

EBNFParser is a parser generator framework that parses a raw string into a structured nested list (an AST).

Parsing in EBNFParser has the following steps:

Tokenizing

Tokenizing is the very first step: it splits the input string into a sequence of Ruikowa.ObjectRegex.Tokenizer objects.

A Ruikowa.ObjectRegex.Tokenizer has the following read-only attributes:

  • name : str
    the type of the tokenizer.
  • string : str
    the string content (from the input raw string) of the tokenizer.
  • colno : int
    the column number in the current file.
  • lineno : int
    the line number in the current file.

Example:

  • parsing_tokenizing.ruiko
MyTokenType  := 'abc' '233';
# The above syntax defines a literal parser to parse strings like "abc" or "233".
# "abc", "233" will be added into `token_table` to generate automatical tokenizing function.

parserToTest ::= MyTokenType+;
# The above syntax defines a combined parser with `MyTokenType`.

# A combined parser is composed of several literal parsers and other combined parsers,
#   and can handle very complicated sequences of `Ruikowa.ObjectRegex.Tokenizer` objects.
  • compile it
ruiko parsing_tokenizing.ruiko parsing_tokenizing.py --test
  • test it
python test_lang.py parserToTest "abc233233"
=========================ebnfparser test script================================
parserToTest[
    [name: MyTokenType, string: "abc"]
    [name: MyTokenType, string: "233"]
    [name: MyTokenType, string: "233"]
]
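
If you'd rather inspect the tokenizing step from Python instead of the test script, a small sketch like the following should work. It reuses the Tokenizer.from_raw_strings call from the integration example above; the module name parsing_tokenizing comes from the ruiko command, and the empty ignore sets are an assumption since this grammar ignores nothing:

from Ruikowa.ObjectRegex.Tokenizer import Tokenizer
from parsing_tokenizing import token_table   # generated by the `ruiko` command above

for tk in Tokenizer.from_raw_strings("abc233233", token_table, (set(), {})):
    # each item is a Ruikowa.ObjectRegex.Tokenizer with the read-only
    # attributes listed above: name, string, lineno, colno
    print(tk.name, tk.string, tk.lineno, tk.colno)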

Take care: if you use an anonymous literal pattern when defining a combined parser, like the following:

Just ::= 'just'+;

Then the name of all the anonymous tokenizers is just "auto_const":

ruiko just.ruiko just --test
python test_lang.py Just "justjustjust"
=========================ebnfparser test script================================
Just[
    [name: auto_const, string: "just"]
    [name: auto_const, string: "just"]
    [name: auto_const, string: "just"]
]
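
Since those anonymous constants all keep the name "auto_const", one easy post-processing step is to skip them while walking the result. A minimal sketch, assuming the nested-list Ast shape described earlier:

def meaningful_tokens(node):
    # Yield every Tokenizer in the tree except the anonymous "auto_const" ones,
    # e.g. the '(' and ')' tokens in the Lisp example above.
    if isinstance(node, list):                 # an Ast node
        for child in node:
            yield from meaningful_tokens(child)
    elif node.name != "auto_const":            # a Tokenizer leaf
        yield node

Alternatively, the throw clause described in the Grammar section below can drop such terms at parse time.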

CastMap (Optional)

Sometimes we need to handle special cases; a vivid instance is the keyword.

The string content of a keyword can also be matched by identifier (most programming languages have identifiers), as in the following case:

  • parsing_CastMap.ruiko
ignore [space]
space        := R'\s+';
# ignore the whitespace characters.


identifier   := R'[a-zA-Z_]{1}[a-zA-Z_0-9]*';
keyword      := 'def' 'for' 'public';

parserToTest ::= (identifier | keyword)+;

There is no doubt that identifier will cover the cases of keyword:

ruiko parsing_CastMap.ruiko parsing_CastMap.py --test
python test.py parserToTest "def for public"
=========================ebnfparser test script================================
parserToTest[
    [name: identifier, string: "def"]
    [name: identifier, string: "for"]
    [name: identifier, string: "public"]
]

Take care that all of these Tokenizers have the name identifier, not keyword! As a result, keywords could be used in some illegal places, like:

for = 1
for for <- [for] do
    for

The above example might not trouble you, but of course there could be something worse.

I'd like to give a solution adopted by EBNFParser's auto-token.

Modify parsing_CastMap.ruiko:

identifier   := R'[a-zA-Z_]{1}[a-zA-Z_0-9]*';
keyword cast := 'def' 'for' 'public';

Here we define a cast map that maps the strings tokenized by identifier (like "def", "for" and "public") to a const string, and outputs a Ruikowa.ObjectRegex.Tokenizer whose name is the const string "keyword".

ruiko parsing_CastMap.ruiko parsing_CastMap.py --test
python test.py parserToTest "def for public other"
=========================ebnfparser test script================================
parserToTest[
    [name: keyword, string: "def"]
    [name: keyword, string: "for"]
    [name: keyword, string: "public"]
    [name: identifier, string: "other"]
]

Perfect!
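
Conceptually, the cast map is just a lookup applied after identifier matches: if the matched string is one of the cast constants, the resulting tokenizer is renamed. The following is only a rough sketch of that idea, not the actual Ruikowa implementation:

# Hypothetical illustration: the strings listed in `keyword cast := ...`
cast_map = {"def", "for", "public"}

def classify(word):
    # a string caught by `identifier` is renamed to "keyword" if it is in the cast map
    return "keyword" if word in cast_map else "identifier"

assert classify("def") == "keyword"
assert classify("other") == "identifier"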

ReStructure Tokenizers

This is what the word “parsing” accurately means.

Maybe you've heard about sequence operations like flatMap (Scala-flatMap), collect (FSharp-collect), or selectMany (Linq-SelectMany); that's great, because parsing is their inverse!

raw words:

    ["def", "f", "(", "x", ")", "=", "x"]

after parsing there is an AST:

    FunctionDef[
        "f"
              # "def" is thrown away because it's useless to semantics, but you can
              # preserve it, causing noises. The same below.
        ArgList[
            "x"
        ],

        Expression[
            "x"
        ]
    ]

And the structure of the parse result matches exactly what you defined with the EBNF.

Here is an example that generates the above AST by using ruiko, an EBNF idiom proposed by EBNFParser to extend primary EBNF.

keyword     cast as K       := 'def';
identifier                  := R'[a-zA-Z_]{1}[a-zA-Z_0-9]*';
FunctionDef throw ['def']   ::= K'def' identifier '(' ArgList ')' '=' Expression;
Expression                  ::= ... # omit
ArgList                     ::= ... # omit
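
Once the parser produces such a tree, downstream code only needs to follow the structure you declared. Here is a hedged sketch (the function name is hypothetical, and it assumes the three-child FunctionDef shape shown above, with the thrown-away tokens already removed):

def read_function_def(node):
    # node is FunctionDef[ identifier, ArgList[...], Expression[...] ]
    assert node.name == "FunctionDef"
    name, arg_list, expression = node
    return {
        "name": name.string,                     # the identifier tokenizer
        "args": [tk.string for tk in arg_list],  # the ArgList children
        "body": expression,                      # keep the Expression subtree as-is
    }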

What's more, EBNFParser supports unlimited left recursion.

Ruiko EBNF

Grammar

ignore [token1, token2, ...]
# optional, discard some tokenizers with specific names.
# it only takes effect when you're using EBNFParser's automatic tokenizing function.

deftoken directory1.directory2...directoryn.filename
# your custom token function; it cannot be applied when you're using auto-token.


token1  := ...;
token2  := ...;

token3 cast := ...;
# define a cast map

token4 cast as K := ...;
# def cast map and custom prefix

token5 as K := ...;
# only def custom prefix


token6 := ...;

token7 of token5 := ...;
# add more patterns to token5

parser1 ::=  token3 token5+ | [token6] token4* (parser2 parser3){3, 10};
# define a combined parser
/*
   `|` means `or`,
   `[<patterns>]` means `optional`,
   `(<patterns>)` means `make a new pattern by several patterns`,
   `pattern+` means one or more,
   `pattern*` means zero or more,
   `pattern{a, b}` means matching this pattern at least `a` times and at most `b` times;
   `pattern{a}` means matching this pattern at least `a` times.
*/

parser2 throw [parser3 ';'] ::= parser3 parser1 ';';
/*
the result from `parser2` will not contain
    a term (`Tokenizer` or `Ast`) with name=parser3 or string=";"
*/

For a more precise definition, see EBNFParser's bootstrap grammar.

Regex Prefix

A regex prefix in ruiko EBNF adds a regex pattern to token_table, which is used to generate the automatic tokenizing function (unless you use your own custom tokenizing function).

To use the regex prefix, just write R'<your regex pattern>'.

  • url.ruiko
url := R'https.*?\.(com|cn|org|net)';
other := R'.';
parserToTest throw [other] ::= (url | other)+;

Test it:

ruiko url.ruiko url --test
python test_lang.py parserToTest "https://github.comasdas https://123.net"
=========================ebnfparser test script================================
parserToTest[
    [name: url, string: "https://github.com"]
    [name: url, string: "https://123.net"]
]

You should take care that regex matching happens only in the tokenizing process. When literal parsers and combined parsers parse tokenizers, they only check whether the name is what they expect (in fact, parsers compare not the name's characters but its memory address, so EBNFParser is very fast in this process).

Cast Map

SomeToken cast as S := 'abc';
Alpha               := R'[a-z]+';
F                   ::= S'abc' | Alpha;

The ruiko code above defines a token named SomeToken with the prefix S.

When the input source is split into a sequence of tokenizers, however, even though the literal parser Alpha is supposed to match every string matched by the regex pattern "[a-z]+", it cannot match a tokenizer with string="abc" generated by EBNFParser's automatic tokenizing. That's because every "abc" has been cast into a unique string in a buffer pool, and all such tokenizers have the name SomeToken, not Alpha.

In other words, there is a single string with value "abc" located at a unique memory address, and every literal parser defined by "abc" matches only that string.

Just as explained in the Regex Prefix section, the literal parser defined as Alpha := R'[a-z]+' only matches tokenizers whose name is Alpha.
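
The unique buffer pool above is why parsers can compare tokens by memory address instead of by characters, as noted in the Regex Prefix section. Python's own string interning is a handy analogy (this is only an illustration, not the Ruikowa implementation):

import sys

a = sys.intern("".join(["ab", "c"]))   # an "abc" built at runtime...
b = sys.intern("abc")
assert a is b                          # ...shares one memory address after interning,
                                       # so an identity check is enough and very cheap.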

Custom Prefix

If you're using custom tokenizing, several Ruikowa.ObjectRegex.Tokenizer objects with the same attribute string="abc" (and the same memory address) could have different names.

To distinguish them from each other, you can do the following:

  • Grammar
SomeToken as S := 'abc';
Alpha          := R'[a-z]+';
F              ::= S'abc' | Alpha;
G              ::= 'abc';
H              ::= G | F ;
and suppose your custom tokenizing produces tokenizers like:

[name: SomeToken, string: "abc"]
...

If you use the combined parser G to match the above tokenizers, you'll fail, because in the grammar G is defined as G ::= 'abc', which means G only accepts a tokenizer whose name is "auto_const" and whose string is "abc" (taken from the unique buffer pool, not a string created by regex matching).