Welcome to pent!¶
A common frustration in data analysis is software tooling that only generates its output in human-readable fashion. Thus, even if there is visible structure to the data, that structure is embedded in a format that can be awkward to parse.
Take the following toy data:
>>> text = """{lots of content}
...
... $data1
... 0 0.000
... 1 -3.853
... 2 1.219
...
... $data2
... 0 3.142
... 1 2.718
... 2 6.022
...
... {lots more content}"""
Say you need to extract the list of decimal values in $data1, without the accompanying integers. Further, say that in any given output file, this list of values can be of any length.
One could write a line-by-line search to parse out the values, but that’s a slow way to go about it if there are many such data blocks that need to be extracted.
Regex is a pretty natural tool to use here, but writing the regex to retrieve these values is a non-trivial task: because of the way regex capture groups work, you really have to write two regexes. The first regex captures the whole chunk of text of interest, and the second searches within that chunk to capture the values from the individual lines.
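For illustration, here is a sketch of what that manual two-regex approach might look like with Python’s re module (the specific patterns are assumptions tailored to this toy data, not the regex pent generates):

```python
import re

text = """$data1
0 0.000
1 -3.853
2 1.219

$data2
0 3.142
1 2.718
2 6.022"""

# Regex 1: capture the whole chunk of interest -- the $data1 marker plus
# every "int decimal" line that follows it.
block_pat = re.compile(r"^\$data1\n((?:\d+ +-?\d+\.\d+\n?)+)", re.M)

# Regex 2: capture the decimal value from each line within that chunk.
value_pat = re.compile(r"^\d+ +(-?\d+\.\d+)", re.M)

block = block_pat.search(text).group(1)
values = value_pat.findall(block)
# values == ['0.000', '-3.853', '1.219']
```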
pent writes all this regex for you. All you have to do is provide pent with the structure of the text using its custom mini-language, including which parts should be captured for output, and it will scrape the data directly from the text:
>>> prs = pent.Parser(
... head="@.$data1",
... body="#.+i #!..d",
... )
>>> prs.capture_body(text)
[[['0.000'], ['-3.853'], ['1.219']]]
This is just one example of pent’s parsing capabilities: it’s an extremely flexible tool, which can retrieve just about anything you want from just about any surrounding text.
Usage instructions for pent are provided in the tutorial, broken up into (1) an explanation of the basics of the syntax and (2) exposition of a number of (more-)realistic examples. For those so inclined, a formal grammar of the mini-language is also provided.
What pent is not¶
pent is not well suited for parsing data with an extensively nested or recursive structure, especially if that structure is defined by clear rules. Have JSON, XML, or YAML? There are other libraries specifically made for those formats, and you should use them.
pent ultimately is just a fancy regex generator, and thus it carries the same functional constraints. If you build a Parser that is too complex, it will run until approximately the heat death of the universe!
pent Parser Tutorial¶
There is almost always more than one way to construct a pent Parser to capture a given dataset. Sometimes, if the data format is complex or contains irrelevant content interspersed with the data of interest, significant pre- or post-processing may be required. As well, it’s important to inspect your starting data carefully, often by loading it into a Python string, to be sure there aren’t, say, a bunch of unprintable characters floating around and fouling the regex matches.
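One quick way to run that kind of sanity check is to scan the loaded string for characters outside the printable set (the stray form feed below is a made-up contaminant for illustration):

```python
# Hypothetical data string contaminated with a stray form-feed character
text = "0 0.000\n1 -3.853\x0c\n2 1.219"

# Flag anything that is neither printable nor expected whitespace
suspect = [
    (i, repr(ch))
    for i, ch in enumerate(text)
    if not (ch.isprintable() or ch in "\n\t")
]
# Flags the form feed at index 16
```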
This tutorial starts by describing the basic structure of the semantic components of pent’s parsing model: tokens, patterns, and Parsers. It then lays out some approaches to constructing Parsers for realistic datasets, with the goal of enabling new users to get up to speed quickly building their own Parsers.
For a formal description of the grammar of the tokens used herein, see the pent Mini-Language Grammar.
Basic Usage¶
pent searches text in a line-by-line fashion, where a line of text is delimited by the start/end of the string and/or by newline(s). Each line of text to be matched by pent is represented by a pattern, passed into a Parser.
Each pattern is a string composed of zero or more whitespace-separated tokens,
which define in a structured way what the overall pattern should match.
Both patterns and tokens can also include flags,
which modify the semantics of how they are processed.
At present, whitespace is hardcoded to include only spaces and tab characters (\t). Various options for user-configurable whitespace definition are planned (#26).
Basic Usage: Tokens¶
pent understands four kinds of tokens, which match varying types of content. One is an ‘any’ token, which matches an arbitrary span of whitespace and/or non-whitespace content. The other three types are intended to match specific kinds of content within the line of text that are often, but not always, separated from surrounding content by whitespace.
All four kinds of tokens accept a flag that instructs the encapsulating Parser to capture the content matching the token for output. A subset of the tokens accepts a flag that alters how the Parser handles the presence or absence of whitespace following the content matching the token.
The ‘Any’ Token: ~¶
The ‘any’ token will match anything, including a completely blank line. It behaves essentially the same as “.*” in regex.
Currently, the ‘any’ token only accepts the ‘capture’ flag (becoming “~!”). Addition of support for the ‘space-after’ flags is planned (#78).
Note that any content matched by a capturing ‘any’ token will be split at whitespace in Parser output.
The ‘Misc’ Token: &¶
The ‘misc’ token matches any sequence of non-whitespace characters. Its uses are similar to the ‘any’ token, except that its match is confined to a single whitespace-delimited piece of content. It is mainly intended for use on non-numerical data whose content is not constant, and thus the ‘literal’ token cannot be used.
The ‘misc’ token has one required argument, indicating whether it should match exactly one piece of content (&.) or one-or-more pieces of content (&+). When matching one-or-more, the ‘misc’ token interleaves required whitespace between each repetition.
At this time, the functional difference between “~” and “&+” is minimal.
The ‘misc’ token accepts both the capture flag and the space-after modifier flags.
The ‘Literal’ Token: @¶
The ‘literal’ token matches an exact sequence of one or more whitespace-delimited characters, which is provided as a required argument in the token definition.
Similar to the ‘misc’ token, the ‘literal’ token also has the quantity specifier as a required argument: either “@.” for exactly one match or “@+” for one-or-more matches.
The argument for the string to be matched follows the quantity argument. Thus, to match the text foo exactly once a suitable token might be “@.foo”.
In the situation where it’s needed to match a literal string containing a space, the entire token can be enclosed in quotes: “'@.this has spaces'”.
The ‘literal’ token differs from the ‘misc’ and ‘number’ tokens in that when the one-or-more argument is used, it prohibits whitespace between the repetitions. This allows, e.g., a long sequence of hyphens to be represented by a token like “@+-”. Similarly, a long sequence of alternating hyphens and spaces could be represented by “'@+- '”.
The ‘literal’ token accepts both the capture flag and the space-after modifier flags.
The ‘Number’ Token: #¶
The ‘number’ token allows for selectively matching numbers of varying types in the text being parsed; in particular, matches can be constrained by sign (positive, negative, or either) or by format (integer, decimal, or scientific notation; or, combinations of these).
The ‘number’ token takes three required, single-character arguments:
Quantity:
#. for exactly one, or
#+ for one-or-more.
Sign:
#[.+]+ for positive,
#[.+]- for negative, or
#[.+]. for either sign.
Number Format:
#[.+][.-+]i for integer,
#[.+][.-+]d for decimal,
#[.+][.-+]s for scientific notation,
#[.+][.-+]f for float (decimal or scinot), or
#[.+][.-+]g for general (integer or float).
The ability to specify different types of number formatting was implemented for this token because it is often the case that numbers printed in different formats have different semantic significance, and it’s thus useful to be able to filter/capture based on that format. This example illustrates a simplified case of this.
As with the ‘misc’ token, when matching in one-or-more quantity mode, the ‘number’ token interleaves required whitespace between each repetition.
The ‘number’ token accepts both the capture flag and the space-after modifier flags.
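As a rough analogy for intuition (these regex fragments are illustrative assumptions, not pent’s actual generated patterns), the base number formats behave along these lines:

```python
import re

int_pat = r"[-+]?\d+"                        # i: integer, no decimal point
dec_pat = r"[-+]?\d*\.\d+"                   # d: decimal, point required
sci_pat = r"[-+]?\d+(?:\.\d+)?[eE][-+]?\d+"  # s: scientific notation

assert re.fullmatch(int_pat, "-5")
assert not re.fullmatch(int_pat, "3.14")     # a decimal is not an integer
assert re.fullmatch(dec_pat, "3.14")
assert not re.fullmatch(dec_pat, "2")        # an integer is not a decimal
assert re.fullmatch(sci_pat, "6.022e23")
```

The ‘f’ (float) and ‘g’ (general) formats would then correspond to alternations of these fragments.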
Token Flags¶
Currently, two types of flags can be passed to tokens: the capture flag and the space-after modifier flags.
If both flags are used in a given token, the space-after modifier flag must precede the capture flag.
In most cases, not all of the data in a block of text is of interest for downstream processing. Thus, pent provides the token-level ‘capture’ flag, “!”, which marks the content of that token for inclusion in the output of capture_body() and capture_struct(). The ‘capture’ flag is an integral part of all of the tutorial examples.
With no space-after flag provided, all tokens REQUIRE the presence of trailing whitespace (or EOL) in order to match. This is because most content is anticipated to be whitespace-delineated, and thus this default leads to more concise Parser definitions.
However, there are situations where changing this behavior is useful for defining a well-targeted Parser, and some where changing it is necessary in order to compose a functional Parser at all.
As an example, take the following line of text:
The foo is in the foo.
The token “@.foo” would match the first occurrence of the word “foo”, because it has whitespace after it, but it would not match the second occurrence, since it is immediately followed by a period.
In order to match both occurrences, the ‘optional trailing whitespace flag’, “o”, could be added, leading to the token “@o.foo”.
If it were desired only to match the second occurrence, the ‘prohibited trailing whitespace flag’, “x”, could be added, yielding “@x.foo”.
This tutorial example provides further illustration of the use of these flags in more-realistic situations.
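In regex terms, the default and the two modifier flags behave roughly like trailing-whitespace lookaheads. The patterns below are an analogy for intuition only (pent’s actual generated regex is not shown here, and EOL handling is omitted):

```python
import re

text = "The foo is in the foo."

# Default: trailing whitespace required -- only the first 'foo' matches,
# since the second is trailed by a period
assert len(re.findall(r"foo(?=\s)", text)) == 1

# 'o' flag analogue: trailing whitespace optional -- both occurrences match
assert len(re.findall(r"foo", text)) == 2

# 'x' flag analogue: trailing whitespace prohibited -- only the second
# 'foo' matches
assert len(re.findall(r"foo(?!\s)", text)) == 1
```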
Basic Usage: Patterns¶
A pent pattern is a series of whitespace-delimited tokens that represents all non-whitespace content on a given line of text.
A blank line—one that is empty, or contains only whitespace—can be matched with an empty pattern string:
>>> check_pattern(pattern="", text="")
MATCH
>>> check_pattern(pattern="", text=" ")
MATCH
>>> check_pattern(pattern="", text=" \t ")
MATCH
If a line contains one piece of non-whitespace text, a single token will suffice to match the whole line:
>>> check_pattern(pattern="&.", text="foo")
MATCH
>>> check_pattern(pattern="&.", text=" foo")
MATCH
>>> check_pattern(pattern="#..i", text="-5")
MATCH
>>> check_pattern(pattern="#..i", text=" 50000 ")
MATCH
>>> check_pattern(pattern="#..f", text="2") # Wrong number type
NO MATCH
>>> check_pattern(pattern="#.-i", text="2") # Wrong number sign
NO MATCH
>>> check_pattern(pattern="", text="42") # Line is not blank
NO MATCH
If a line contains more than one piece of non-whitespace text, all pieces must be matched by a token in the pattern:
>>> check_pattern(pattern="&+", text="foo bar baz") # One-or-more gets all three
MATCH
>>> check_pattern(pattern="&. &.", text="foo bar baz") # Only 2/3 words matched
NO MATCH
>>> check_pattern(pattern="&. #..i", text="foo 42")
MATCH
>>> check_pattern(pattern="&+ #..i", text="foo bar baz 42")
MATCH
>>> check_pattern(pattern="#+.i", text="-2 -1 0 1 2")
MATCH
>>> check_pattern(pattern="#+.i", text="-2 -1 foo 1 2") # 'foo' is not an int
NO MATCH
>>> check_pattern(pattern="#+.i &. #+.i", text="-2 -1 foo 1 2")
MATCH
Be careful when using “~” and “&+”, as they may match more aggressively than expected:
>>> check_pattern(pattern="~ #+.i", text="foo bar 42 34")
MATCH
>>> show_capture(pattern="~! #+.i", text="foo bar 42 34")
[[['foo', 'bar']]]
>>> check_pattern(pattern="&+ #+.i", text="foo bar 42 34")
MATCH
>>> show_capture(pattern="&!+ #+.i", text="foo bar 42 34")
[[['foo', 'bar', '42']]]
>>> check_pattern(pattern="&+ #+.i", text="foo 42 bar 34")
MATCH
>>> show_capture(pattern="&!+ #+.i", text="foo 42 bar 34")
[[['foo', '42', 'bar']]]
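The underlying cause is ordinary regex greediness; the same behavior can be reproduced with plain re (the patterns here are illustrative assumptions, not what pent actually generates):

```python
import re

# ".*" (like "~") grabs as much as it can, backtracking only far enough
# for the trailing integer pattern to succeed:
m = re.match(r"(.*)\s(\d+\s\d+)$", "foo bar 42 34")
assert m.group(1) == "foo bar"
assert m.group(2) == "42 34"

# With interior integers, the greedy prefix swallows them too:
m = re.match(r"(.*)\s(\d+)$", "foo 42 bar 34")
assert m.group(1) == "foo 42 bar"
assert m.group(2) == "34"
```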
Punctuation will foul matches unless explicitly accounted for:
>>> check_pattern(pattern="#+.i", text="1 2 ---- 3 4")
NO MATCH
>>> check_pattern(pattern="#+.i &. #+.i", text="1 2 ---- 3 4")
MATCH
In situations where punctuation is directly adjacent to the content to be captured, the space-after flags must be used to modify pent’s expectations for whitespace:
>>> check_pattern(pattern="~ #..d @..", text="The value is 3.1415.") # No space between number and '.'
NO MATCH
>>> check_pattern(pattern="~ #x..d @..", text="The value is 3.1415.")
MATCH
In situations where some initial content will definitely appear on a line, but some additional trailing content may or may not appear at the end of the line, it’s important to use one of the space-after modifier flags in order for pent to find a match when the trailing content is absent. This is because the default required trailing whitespace will (naturally) require whitespace to be present between the end of the matched content and the end of the line, and if EOL immediately follows the content, the pattern match will fail, since the required whitespace is absent:
>>> check_pattern(pattern="&. #.+i ~", text="always 42 sometimes")
MATCH
>>> check_pattern(pattern="&. #.+i ~", text="always 42")
NO MATCH
>>> check_pattern(pattern="&. #.+i ~", text="always 42 ")
MATCH
>>> check_pattern(pattern="&. #x.+i ~", text="always 42")
MATCH
>>> check_pattern(pattern="&. #x.+i ~", text="always 42 sometimes")
MATCH
Optional Line Flag: ?¶
In some cases, an entire line of text will be present in some occurrences of a desired Parser match within a block of text, but absent in others. To accommodate such situations, pent recognizes an ‘optional-line flag’ in a pattern. This flag is a sole “?”, occurring as the first “token” in the pattern. Inclusion of this flag will cause the pattern to match in the following three cases:
A line is present that completely matches the optional pattern (per usual behavior).
A blank line (no non-whitespace content) is present where the optional pattern would match.
NO line is present where the optional pattern would match.
It is difficult to construct meaningful examples of this behavior without using a full Parser construction; as such, see this tutorial page for more details.
Basic Usage: Parsers¶
The Parser is the main user-facing interface to pent, where the patterns matching the data of interest are defined. Parsers are created with three arguments, head, body, and tail. All Parsers must have a body; head and tail are optional.
A section of text matched by a given Parser will have the following structure:
If head is defined, it will be matched exactly once, and its content must immediately precede the body content.
body will be matched one or more times.
If tail is defined, it will be matched exactly once, and its content must immediately follow the body content.
Each of head, body, and tail can be one of three things:
A single pent pattern, matching a single line of text
An ordered iterable (tuple, list, etc.) of patterns, matching a number of lines of text equal to the length of the iterable
A Parser, matching its entire contents
The syntax and matching structure of Parsers using these three kinds of arguments are illustrated below using trivial examples. Application of pent to more-realistic situations is demonstrated in the Examples section of the tutorial.
In the below examples, most illustrations are of the use of head, rather than tail. However, the principles apply equally well to both.
Matching with Single Patterns¶
The simplest possible Parser only has body defined, containing a single pent pattern:
>>> prs = pent.Parser(body="@!.bar")
>>> text = """foo
... bar
... baz"""
>>> prs.capture_body(text)
[[['bar']]]
As noted, body will match multiple times in a row:
>>> text = """foo
... bar
... bar
... bar
... baz"""
>>> prs.capture_body(text)
[[['bar'], ['bar'], ['bar']]]
Multiple occurrences of body in the text will match independently:
>>> text = """foo
... bar
... baz
... bar
... baz"""
>>> prs.capture_body(text)
[[['bar']], [['bar']]]
If only that first bar is of interest, the Parser match can be constrained with a head:
>>> prs_head = pent.Parser(head="@.foo", body="@!.bar")
>>> prs_head.capture_body(text)
[[['bar']]]
Adding just a tail doesn’t really help, since baz follows both instances of bar:
>>> prs_tail = pent.Parser(body="@!.bar", tail="@.baz")
>>> prs_tail.capture_body(text)
[[['bar']], [['bar']]]
Matching with Iterables of Patterns¶
Sometimes data is structured in such a way that it’s necessary to associate more than one line of text with a given portion of a Parser. This is most common with head and tail, but it can occur with body as well. These situations are addressed by using iterables of patterns when instantiating a Parser.
The following is a situation where the header portion of the data contains two lines, one being a string label and the other being a series of integers, and it’s important to capture only the “wanted” data block:
>>> text = """WANTED_DATA
... 1 2 3
... 1.5 2.1 1.1
...
... UNWANTED_DATA
... 1 2 3
... 0.1 0.4 0.2
... """
>>> pent.Parser(
... head=("@.WANTED_DATA", "#++i"),
... body="#!++d"
... ).capture_body(text)
[[['1.5', '2.1', '1.1']]]
Note that even though WANTED_DATA appears in the header line of the ‘unwanted’ data block, since the @.WANTED_DATA token does not match the complete contents of UNWANTED_DATA, the Parser does not match that second block.
If head were left out, or defined just to match the rows of integers, both datasets would be retrieved:
>>> pent.Parser(head="#++i", body="#!++d").capture_body(text)
[[['1.5', '2.1', '1.1']], [['0.1', '0.4', '0.2']]]
Situations calling for passing an iterable into body are less common, but can occur if there is a strictly repeating, cyclic pattern to the text to be parsed:
>>> text_good = """DATA
... foo
... bar
... foo
... bar
... foo
... bar"""
>>> prs = pent.Parser(
... head="@.DATA",
... body=("@!.foo", "@!.bar")
... )
>>> prs.capture_body(text_good)
[[['foo', 'bar'], ['foo', 'bar'], ['foo', 'bar']]]
Note in the .capture_body() output that even though each foo and bar appears on a separate line in the text, because the capture of each pair is defined as the body of a single Parser, they end up being treated as though they had been on the same line. Another example of this behavior can be found in this tutorial example.
If the lines of body text are not strictly cyclic-repeating, this approach won’t work:
>>> text_bad = """DATA
... foo
... bar
...
... foo
... bar"""
>>> prs.capture_body(text_bad)
[[['foo', 'bar']]]
There are other approaches that can handle such situations, such as the optional-line pattern flag:
>>> pent.Parser(
... head="? @.DATA",
... body=("@!.foo", "@!.bar")
... ).capture_body(text_bad)
[[['foo', 'bar']], [['foo', 'bar']]]
Matching with a Nested Parser¶
For data with more complex internal structure, often the best way to match it is to pass a Parser to one or more of head, body, or tail.
In situations where the header or footer content has a variable number of lines that all match the same pattern, passing a Parser is often the most concise approach, as it exploits the implicit matching of one-or-more lines by the body of that internal Parser:
>>> text_head = """foo
... 1 2 3
... bar
... bar
...
... foo
... 1 2 3
... 4 5 6 7 8
... 9 10
... bar
... bar
... bar"""
>>> prs_head = pent.Parser(
... head=pent.Parser(
... head="@.foo",
... body="#++i",
... ),
... body="@!.bar",
... )
>>> prs_head.capture_body(text_head)
[[['bar'], ['bar']], [['bar'], ['bar'], ['bar']]]
Another common use of an internal Parser is when the main data content itself has a header/body/footer structure, but it is also necessary to specify an overall header for the data in order to avoid capturing multiple times within the broader text:
>>> text_body = """WANTED
... foo
... bar
... bar
...
... UNWANTED
... foo
... bar
... bar
... bar
... bar"""
>>> prs_body = pent.Parser(
... head="@.WANTED",
... body=pent.Parser(
... head="@.foo",
... body="@!.bar",
... ),
... )
>>> prs_body.capture_body(text_body)
[[[['bar'], ['bar']]]]
A clearer description of this approach is provided in this tutorial example.
Examples¶
This section of the tutorial contains examples of applications of pent to parsing of “real” (or, at least, “real-like”) datasets.
Capturing with a Single Parser¶
This first example is a modified version of the dataset used in the first half of the project README, drawn from a .hess file generated by ORCA:
>>> text = dedent("""\
... $vibrational_frequencies
... 6
... 0 0.000000
... 1 0.000000
... 2 -194.490162
... 3 -198.587114
... 4 389.931897
... 5 402.713910
... """)
A Minimal Parser Body¶
Focusing first on the main section of the data, the goal here is to retrieve the floats in the right-hand column; the rest of the content is irrelevant. However, the integers in the left-hand column still have to be represented in the pattern, even if they’re not captured.
So, to represent those leading integers, the first token of the body pattern needs to be a single number (#.) that’s not captured (omit !), with a positive sign (+) and integer format (i), leading to #.+i.
Then, to match the second, decimal value on each line, the second token needs to also be a single number (#.) of decimal format (d). But, since we want these values to be captured in output, it’s necessary to insert ! after #. And, since some of the values in this list are negative and some are positive, the token should allow any sign (.). Thus, the second token should be #!..d.
So, a first stab at the body of the Parser would be:
>>> prs = pent.Parser(body="#.+i #!..d")
>>> prs.capture_body(text)
[[['0.000000'], ['0.000000'], ['-194.490162'], ['-198.587114'], ['389.931897'], ['402.713910']]]
Works nicely! There are two things to note about the data returned here, though:
First, all of the numerical values are returned as strings. pent tries to maximize flexibility by making no assumptions about what needs to be done with the data. Thus, some post-processing will always be required. For example, to get the captured values from data into a numpy array, one could do the following:
>>> arr = np.asarray(prs.capture_body(text), dtype=float).squeeze()
>>> print(arr)
[ 0. 0. -194.490162 -198.587114 389.931897 402.71391 ]
Second, the captured data is always returned as a nested series of lists. In situations like this one, where a single Parser is used, the nesting will be three levels deep. This is because each matching block of data is returned as a matrix (a list of lists), and each of these matrices is then in turn a member of the outermost list.
In this particular instance, since the body captures exactly one value per line of text parsed, the innermost lists are length-one. And, since there are six lines that match the body pattern, the matrix that is returned is of size 6x1 (a list containing six length-one lists).
This means that if there had been a gap in the data, the outermost list would have had length greater than one:
>>> text2 = dedent("""\
... 0 0.000000
... 1 0.000000
...
... 2 -194.490162
... 3 -198.587114
... """)
>>> prs.capture_body(text2)
[[['0.000000'], ['0.000000']], [['-194.490162'], ['-198.587114']]]
There are two blocks of data here, each with two rows of one value each, so the return value from capture_body() is a length-two list, where each item of that list represents a 2x1 matrix.
Capturing Multiple Values per Line¶
If one wanted to also capture the integer indices in each row, the only change needed would be to add the ! capturing flag to that first token:
>>> pent.Parser(body="#!.+i #!..d").capture_body(text2)
[[['0', '0.000000'], ['1', '0.000000']], [['2', '-194.490162'], ['3', '-198.587114']]]
Constraining the Parser Match with a head¶
However, what if there are other datasets in the file that have this same format, but that we don’t want to capture?
>>> text3 = dedent("""\
... $vibrational_frequencies
... 6
... 0 0.000000
... 1 0.000000
... 2 -194.490162
... 3 -198.587114
... 4 389.931897
... 5 402.713910
...
... $unrelated_data
... 3
... 0 3.316
... 1 -4.311
... 2 12.120
... """)
The original Parser will grab both of these blocks of data:
>>> prs.capture_body(text3)
[[['0.000000'], ['0.000000'], ['-194.490162'], ['-198.587114'], ['389.931897'], ['402.713910']], [['3.316'], ['-4.311'], ['12.120']]]
The Parser can be constrained to only the data we want by introducing a head pattern:
>>> prs2 = pent.Parser(
... head=["@.$vibrational_frequencies", "#!.+i"],
... body="#.+i #!..d"
... )
>>> prs2.capture_body(text3)
[[['0.000000'], ['0.000000'], ['-194.490162'], ['-198.587114'], ['389.931897'], ['402.713910']]]
This use of head introduces two concepts: (1) the ‘literal string’ token, @, in combination with the “.” quantity marker telling the Parser to match the literal string exactly once; and (2) the pent feature wherein a length-n ordered iterable of pattern strings (here, length-two) will match n lines from the data string. In this case, the first string in the iterable matches the “$vibrational_frequencies” marker in the first line of the header, and the second captures the single positive integer in the second line of the header.
Capturing in head and tail with capture_struct()¶
In the example immediately above, note that even though the “!” capturing flag is specified in the second element of the head, that captured value does not show up in the capture_body() output. Captures in head and tail must be retrieved using capture_struct():
>>> prs2.capture_struct(text3)
[{<ParserField.Head: 'head'>: [['6']], <ParserField.Body: 'body'>: [['0.000000'], ['0.000000'], ['-194.490162'], ['-198.587114'], ['389.931897'], ['402.713910']], <ParserField.Tail: 'tail'>: None}]
>>> prs2.capture_struct(text3)[0][pent.ParserField.Head]
[['6']]
The return value from capture_struct() has length equal to the number of times the Parser matched within the text. Here, since the pattern only matched once, the return value is of length one.
As a convenience, the lists returned by capture_struct() are actually of type ThruList, a custom subclass of list, which will silently pass through indices/keys to their first element if and only if they are of length one.
Thus, the following would also work for prs2 operating on text3:
>>> prs2.capture_struct(text3)[pent.ParserField.Head]
[['6']]
But, it would break for the original prs, where the overall pattern matched twice:
>>> prs.capture_struct(text3)
[{<ParserField.Head: 'head'>: None, <ParserField.Body: 'body'>: [['0.000000'], ['0.000000'], ['-194.490162'], ['-198.587114'], ['389.931897'], ['402.713910']], <ParserField.Tail: 'tail'>: None}, {<ParserField.Head: 'head'>: None, <ParserField.Body: 'body'>: [['3.316'], ['-4.311'], ['12.120']], <ParserField.Tail: 'tail'>: None}]
>>> prs.capture_struct(text3)[pent.ParserField.Head]
Traceback (most recent call last):
...
pent.errors.ThruListError: Invalid ThruList index: Numeric index required for len != 1
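The pass-through idea can be sketched with a toy list subclass. This is a simplified illustration only, not pent’s actual ThruList implementation (which, as shown above, raises its own ThruListError rather than a bare IndexError):

```python
class ThruListSketch(list):
    """Toy list that forwards non-numeric keys to its sole element."""

    def __getitem__(self, key):
        # Numeric indices and slices behave like a normal list
        if isinstance(key, (int, slice)):
            return super().__getitem__(key)
        # Other keys pass through -- but only when length is exactly one
        if len(self) == 1:
            return super().__getitem__(0)[key]
        raise IndexError("Numeric index required for len != 1")

single = ThruListSketch([{"head": [["6"]]}])
assert single["head"] == [["6"]]   # key passes through to the sole dict

double = ThruListSketch([{"head": None}, {"head": None}])
# double["head"] would raise, since len(double) != 1
```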
As a final note, consider the difference between the head and tail results for the below Parser, where head is defined but has no capturing tokens present (yields [[]]), but tail is not specified (yields None):
>>> pent.Parser(head="#.+i", body="#.+i #!..d").capture_struct(text)
[{<ParserField.Head: 'head'>: [[]], <ParserField.Body: 'body'>: [['0.000000'], ['0.000000'], ['-194.490162'], ['-198.587114'], ['389.931897'], ['402.713910']], <ParserField.Tail: 'tail'>: None}]
Capturing with Nested Parsers¶
pent is also able to parse and capture higher-dimensional data stored as free text. Take the following data string:
>>> text = dedent("""\
... $hessian
... 4
... 0 1
... 0 0.473532 0.004379
... 1 0.004785 0.028807
... 2 0.004785 -0.022335
... 3 -0.418007 0.008333
... 2 3
... 0 0.004379 -0.416666
... 1 -0.022335 0.008067
... 2 0.028807 0.008067
... 3 0.008333 0.420926
... """)
text represents a 4x4 matrix, with the first two columns printed in one section,
and the second two columns printed in a separate, following section.
Each row and column is marked with its respective index.
In order to import this data successfully, the body of the main Parser will have to be set to a different, inner Parser.
Defining the Inner Parser¶
Each section of data columns starts with a row containing only positive integers, which does not need to be captured. After that leading row are multiple rows with data, each of which leads with a single positive integer, followed by decimal-format data of any sign:
>>> text_inner = dedent("""\
... 0 1
... 0 0.473532 0.004379
... 1 0.004785 0.028807
... 2 0.004785 -0.022335
... 3 -0.418007 0.008333
... """)
One way to construct a Parser for this internal block is as follows:
>>> prs_inner = pent.Parser(
... head="#++i",
... body="#.+i #!+.d",
... )
>>> prs_inner.capture_body(text_inner)
[[['0.473532', '0.004379'], ['0.004785', '0.028807'], ['0.004785', '-0.022335'], ['-0.418007', '0.008333']]]
Note that even though the multiple decimal values in each row of the data block were matched by the single “#!+.d” token in body, they were reported as separate values in the output. As currently implemented, pent will always split captured content at any internal whitespace; a further example of this with the ‘any’ token can be seen here.
Defining the Outer Parser¶
The outer Parser then makes use of the inner Parser as its body, with the two header lines defined in head:
>>> prs_outer = pent.Parser(
... head=("@.$hessian", "#.+i"),
... body=prs_inner,
... )
>>> data = prs_outer.capture_body(text)
>>> data
[[[['0.473532', '0.004379'], ['0.004785', '0.028807'], ['0.004785', '-0.022335'], ['-0.418007', '0.008333']], [['0.004379', '-0.416666'], ['-0.022335', '0.008067'], ['0.028807', '0.008067'], ['0.008333', '0.420926']]]]
Structure of the Returned data¶
The structure of the list returned by capture_body() nests four levels deep:
>>> arr = np.asarray(data, dtype=float)
>>> arr.shape
(1, 2, 4, 2)
This is because:
Each block of data is returned as a matrix (adds two levels);
The body of prs_outer is a Parser (adds one level); and
The capture_body() method wraps everything in a list (adds one level).
So, working from left to right, the (1, 2, 4, 2) shape of the data arises because:
The overall prs_outer matched 1 time;
The inner prs_inner, as the body of prs_outer, matched 2 times; and
Both blocks of data matched by prs_inner have 4 rows and 2 columns.
Reassembling the Full 4x4 Matrix¶
In cases like this, numpy’s column_stack() provides a simple way to reassemble the full 4x4 matrix of data, though it is necessary to convert each matrix to an ndarray separately:
>>> np.column_stack([np.asarray(block, dtype=float) for block in data[0]])
array([[ 0.473532, 0.004379, 0.004379, -0.416666],
[ 0.004785, 0.028807, -0.022335, 0.008067],
[ 0.004785, -0.022335, 0.028807, 0.008067],
[-0.418007, 0.008333, 0.008333, 0.420926]])
data[0] is used instead of data in the list comprehension so that the two inner 4x2 blocks of data are passed separately to asarray().
Coping with Mismatched Data Block Sizes¶
Nothing guarantees that the data in a chunk of text will have properly matched internal dimensions, however. pent will still import the data, but it may not be possible to pull it directly into a numpy array as was done above:
>>> text2 = dedent("""\
... $hessian
... 4
... 0 1
... 0 0.473532 0.004379
... 1 0.004785 0.028807
... 2 0.004785 -0.022335
... 3 -0.418007 0.008333
... 2 3
... 0 0.004379 -0.416666
... 1 -0.022335 0.008067
... """)
>>> data2 = prs_outer.capture_body(text2)
>>> data2
[[[['0.473532', '0.004379'], ['0.004785', '0.028807'], ['0.004785', '-0.022335'], ['-0.418007', '0.008333']], [['0.004379', '-0.416666'], ['-0.022335', '0.008067']]]]
>>> np.asarray(data2, dtype=float)
Traceback (most recent call last):
...
ValueError: setting an array element with a sequence.
>>> np.column_stack([np.asarray(block, dtype=float) for block in data2[0]])
Traceback (most recent call last):
...
ValueError: all the input array dimensions except for the concatenation axis must match exactly
In situations like this, the returned data structure either must be processed
with methods that can accommodate the missing data, or the missing data must be explicitly
filled in before conversion to ndarray
.
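As an example of the latter approach, the short block in data2 above can be padded with rows of NaN before stacking (a minimal sketch; the block values are copied from the data2 capture shown above):

```python
import numpy as np

# The two mismatched blocks captured into data2[0] above
blocks = [
    [['0.473532', '0.004379'], ['0.004785', '0.028807'],
     ['0.004785', '-0.022335'], ['-0.418007', '0.008333']],
    [['0.004379', '-0.416666'], ['-0.022335', '0.008067']],
]

arrays = [np.asarray(b, dtype=float) for b in blocks]
n_rows = max(a.shape[0] for a in arrays)

# Pad each short block with rows of NaN so every block has the same height
padded = [
    np.vstack([a, np.full((n_rows - a.shape[0], a.shape[1]), np.nan)])
    for a in arrays
]
stacked = np.column_stack(padded)
```

Whether NaN is an appropriate fill value depends, of course, on the semantics of the missing data.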
The Misc Token¶
Sometimes, data is laid out in text in a fashion
where it cannot be matched using only numerical values.
Either some elements of the data of interest are themselves
non-numeric, or there are non-numeric portions of content
interspersed with the numeric data of interest.
pent
provides the “misc” token
(&) to handle these kinds of situations.
Take the following data, which is an example of the XYZ format for representing the atomic coordinates of a chemical system:
>>> text_xyz = dedent("""
... 5
... Coordinates from MeCl2F_2
... C -3.081564 2.283942 0.044943
... Cl -1.303141 2.255173 0.064645
... Cl -3.706406 3.411601 -1.180577
... F -3.541771 2.647036 1.270358
... H -3.439068 1.277858 -0.199370
... """)
In this case, pretty much everything in the text block is of interest. The first number indicates how many atoms are present (useful for cross-checking the data import); the second line is an arbitrary string describing the chemical system; and the data block provides the atomic symbol of each atom along with its x, y, and z position in space.
The following Parser
will enable capture of the entire contents
of the string:
>>> prs_xyz = pent.Parser(
... head=("#!..i", "~!"),
... body="&!. #!+.d",
... )
The atomic symbols and coordinates are most easily retrieved
with capture_body()
:
>>> data_atoms = prs_xyz.capture_body(text_xyz)
>>> data_atoms
[[['C', '-3.081564', '2.283942', '0.044943'], ['Cl', '-1.303141', '2.255173', '0.064645'], ['Cl', '-3.706406', '3.411601', '-1.180577'], ['F', '-3.541771', '2.647036', '1.270358'], ['H', '-3.439068', '1.277858', '-0.199370']]]
The atom count and description can be retrieved with
capture_struct()
:
>>> data_struct = prs_xyz.capture_struct(text_xyz)
>>> data_struct[pent.ParserField.Head][0]
['5', 'Coordinates', 'from', 'MeCl2F_2']
Unlike in body, where two-dimensional structure is inferred in captured data,
in head and tail all captures are returned as elements of a single, flat list
.
Currently, it is not possible to avoid the splitting of all captured content at whitespace, even if it was captured from a single ‘any’ or ‘literal’ token. #26 and/or #62 are planned and will provide mechanism(s) to change this behavior.
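In the meantime, if the description string is needed intact, the whitespace-split head capture can be rejoined manually (a minimal sketch, using the capture result shown above; note that rejoining with single spaces assumes the original content was single-space delimited):

```python
# Head capture as returned by capture_struct() above
head_capture = ['5', 'Coordinates', 'from', 'MeCl2F_2']

# First element is the atom count; the rest is the split description line
n_atoms = int(head_capture[0])
description = ' '.join(head_capture[1:])
# n_atoms == 5
# description == 'Coordinates from MeCl2F_2'
```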
As an aside, in this particular case the ‘misc’ token was not strictly necessary in the body, as the capturing ‘any’ token (~!) would also have worked:
>>> prs_any = pent.Parser(
... head=("#.+i", "~"),
... body="~! #!+.d",
... )
>>> prs_any.capture_body(text_xyz)
[[['C', '-3.081564', '2.283942', '0.044943'], ['Cl', '-1.303141', '2.255173', '0.064645'], ['Cl', '-3.706406', '3.411601', '-1.180577'], ['F', '-3.541771', '2.647036', '1.270358'], ['H', '-3.439068', '1.277858', '-0.199370']]]
However, there are situations where the ability
of the ‘misc’ token to match
only a single, arbitrary piece of whitespace-delimited
content is useful in order to narrow the specificity of
the Parser
match.
Another example of the use of the ‘misc’ token is given at *Post-Processing of Captured Data.
*Post-Processing of Captured Data¶
Sometimes, data in text is laid out in a way such that
pent
can’t retrieve only the data of interest
using a Parser
. In these cases, post-processing
of the data obtained from capture_body()
is the
simplest approach.
Multiwfn LI
*Internal Spaces in One-Or-More Matches¶
Illustration of how misc/number and literal token types handle them differently.
The Optional-Line Token¶
In some situations, data is output in a fashion such that a line of, e.g., header text is present in some parts of the content of interest, but not others. Take the following fictitious example:
>>> text = dedent("""
... $DATA
... ITERATION 1
... 0 1 2
... 1.5 3.1 2.4
... 3 4 5
... -0.1 2.7 -9.3
... ITERATION 2
... 0 1 2
... 1.6 2.9 1.8
... 3 4 5
... -0.4 2.1 -8.7
... """)
This data block could be matched with triply nested Parsers
:
>>> prs_3x = pent.Parser(
... head="@.$DATA",
... body=pent.Parser(
... head="@.ITERATION #..i",
... body=pent.Parser(
... head="#++i",
... body="#!+.d",
... ),
... ),
... )
>>> prs_3x.capture_body(text)
[[[[['1.5', '3.1', '2.4']], [['-0.1', '2.7', '-9.3']]], [[['1.6', '2.9', '1.8']], [['-0.4', '2.1', '-8.7']]]]]
However, that definition is quite bulky, and for more complex patterns and larger text inputs the three layers of nesting can sometimes lead to problematically slow parsing times.
The optional-line
pattern flag allows for a simpler Parser
structure here:
>>> prs_opt = pent.Parser(
... head=("? @.$DATA", "@.ITERATION #..i"),
... body=pent.Parser(
... head="#++i",
... body="#!+.d",
... ),
... )
>>> prs_opt.capture_body(text)
[[[['1.5', '3.1', '2.4']], [['-0.1', '2.7', '-9.3']]], [[['1.6', '2.9', '1.8']], [['-0.4', '2.1', '-8.7']]]]
The $DATA line is now wrapped into the head
of the outer of just two Parsers
, flagged as optional so that
the head can still match at ITERATION 2, where no $DATA line is present.
This approach also returns the data with one fewer level of
list
enclosure, which may be convenient in
downstream processing.
Since in this example the lines containing integers and the lines containing decimals are strictly alternating, yet another alternative would be to include the integer ‘header’ lines as a non-captured portion of the body:
>>> prs_opt = pent.Parser(
... head=("? @.$DATA", "@.ITERATION #..i"),
... body=pent.Parser(
... body=("#++i", "#!+.d"),
... )
... )
>>> prs_opt.capture_body(text)
[[[['1.5', '3.1', '2.4'], ['-0.1', '2.7', '-9.3']]], [[['1.6', '2.9', '1.8'], ['-0.4', '2.1', '-8.7']]]]
Doing it this way results in each ITERATION’s data being grouped into a two-dimensional matrix, instead of each individual line of decimal values occurring in its own matrix. This may or may not be desirable, depending on the semantics of the data being captured.
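With this grouping, each ITERATION’s captured block converts directly to a numpy array (a minimal sketch, using the capture result shown just above):

```python
import numpy as np

# Capture result from prs_opt.capture_body(text), as shown above
data = [
    [[['1.5', '3.1', '2.4'], ['-0.1', '2.7', '-9.3']]],
    [[['1.6', '2.9', '1.8'], ['-0.4', '2.1', '-8.7']]],
]

# One 2x3 array per ITERATION block
arrays = [np.asarray(block[0], dtype=float) for block in data]
```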
The Three Cases of Optional-Line Matches¶
More generally, as noted at the ‘pattern’ basic usage page, a pattern with the optional flag will match in three situations:
When a line is present matching the optional pattern:
>>> prs = pent.Parser(body=("@!.a", "? @!.b", "@!.c"))
>>> prs.capture_body("""a
... b
... c""")
[[['a', 'b', 'c']]]
When a blank line is present where the optional pattern would match:
>>> prs.capture_body("""a
...
... c""")
[[['a', None, 'c']]]
When there is no line present where the optional pattern would match:
>>> prs.capture_body("""a
... c""")
[[['a', None, 'c']]]
If a line is present that does not match the optional pattern,
the entire Parser
will fail to match:
>>> prs.capture_body("""a
... foo
... c""")
[]
Required/Optional/Prohibited Trailing Whitespace¶
By default, number (#), misc (&), and literal (@) tokens require trailing whitespace to be present in the text in order to match:
>>> text_space = dedent("""\
... foo: 5
... bar: 8
... """)
>>> text_nospace = dedent("""\
... foo:5
... bar:8
... """)
>>> prs_req = pent.Parser(body="&. #!.+i")
>>> prs_req.capture_body(text_space)
[[['5'], ['8']]]
>>> prs_req.capture_body(text_nospace)
[]
pent
provides a means to make this trailing whitespace
either optional or prohibited, if needed,
via a token-level flag.
Optional trailing whitespace is indicated with an “o” flag in the token:
>>> prs_opt = pent.Parser(body="&o. #!.+i")
>>> prs_opt.capture_body(text_space)
[[['5'], ['8']]]
>>> prs_opt.capture_body(text_nospace)
[[['5'], ['8']]]
Similarly, prohibited trailing whitespace is indicated with an “x” flag in the token:
>>> prs_prohib = pent.Parser(body="&x. #!.+i")
>>> prs_prohib.capture_body(text_space)
[]
>>> prs_prohib.capture_body(text_nospace)
[[['5'], ['8']]]
If used in combination with the capturing “!” flag, the trailing-space flag is placed before the capturing flag; e.g., as “&x!.”.
One common situation where this capability is needed is when a number of interest is contained in prose text and falls at the end of a sentence:
>>> text_prose = dedent("""\
... pi is approximately 3.14159.
... """)
>>> pent.Parser(body="~ #!..d &.").capture_body(text_prose)
[]
>>> pent.Parser(body="~ #x!..d &.").capture_body(text_prose)
[[['3.14159']]]
Don’t forget to include a token for that trailing period!
The Parser
won’t find a match, otherwise:
>>> pent.Parser(body="~ #x!..d").capture_body(text_prose)
[]
Limitations of the “Any” Token¶
Note that, as currently implemented, the ‘any’ token
(~) does not allow specification of
optional or prohibited trailing whitespace; any
content that it matches must be followed by
whitespace for the Parser
to work:
>>> text_sandwich = dedent("""\
... This number3.14159is sandwiched in text.
... """)
>>> pent.Parser(body="~ #x!..d ~").capture_body(text_sandwich)
[]
In order to match this value, the preceding text must be matched either by a literal or a misc token:
>>> pent.Parser(body="~ @x.number #x!..d ~").capture_body(text_sandwich)
[[['3.14159']]]
>>> pent.Parser(body="~ &x. #x!..d ~").capture_body(text_sandwich)
[[['3.14159']]]
This deficiency will be addressed in #78.
*Pre-Processing/Data Cleanup Example¶
pending
* Incomplete
pent Mini-Language Grammar¶
As discussed here, a pent
Parser
is constructed by passing it patterns composed of tokens. The grammar below
specifies what constitutes a valid pent
token.
For completeness, even though the
optional-line pattern flag
is called a flag and not a token, internally pent
parses this flag
as though it were a token, and thus it is included here.
This grammar is expressed in an approximation of extended Backus-Naur form. Content in double quotes represents a literal string, the pipe character indicates alternatives, square brackets indicate optional token flags, and parentheses indicate required token flags.
Grammar
token ::= optional_line_flag | content_token
optional_line_flag ::= "?"
content_token ::= any_token | literal_token | misc_token | number_token
any_token ::= "~"[capture]
literal_token ::= "@"[space_after][capture](quantity)(literal_content)
misc_token ::= "&"[space_after][capture](quantity)
number_token ::= "#"[space_after][capture](quantity)(sign)(num_type)
space_after ::= optional_space_after | no_space_after
optional_space_after ::= "o"
no_space_after ::= "x"
capture ::= "!"
quantity ::= match_one | match_one_or_more
match_one ::= "."
match_one_or_more ::= "+"
sign ::= any_sign | positive_sign | negative_sign
any_sign ::= "."
positive_sign ::= "+"
negative_sign ::= "-"
num_type ::= integer | decimal | sci_not | float | general
integer ::= "i"
decimal ::= "d"
sci_not ::= "s"
float ::= "f"
general ::= "g"
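As an illustration only (this is not pent’s internal implementation), the number_token production above can be rendered as a single Python regex:

```python
import re

# Illustrative recognizer for number_token per the grammar above:
# "#" [space_after] [capture] (quantity) (sign) (num_type)
NUMBER_TOKEN = re.compile(
    r"^#"        # number token marker
    r"[ox]?"     # optional space_after flag ("o" optional, "x" prohibited)
    r"!?"        # optional capture flag
    r"[.+]"      # quantity: "." one, "+" one-or-more
    r"[.+-]"     # sign: "." any, "+" positive, "-" negative
    r"[idsfg]$"  # num_type: integer/decimal/sci_not/float/general
)

assert NUMBER_TOKEN.match("#!..d")    # capturing, any-sign decimal
assert NUMBER_TOKEN.match("#x!..d")   # no-space-after, capturing decimal
assert NUMBER_TOKEN.match("#++i")     # one-or-more positive integers
assert not NUMBER_TOKEN.match("#..")  # invalid: missing num_type
```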
API (draft page)¶
Unstructured API dump, to provide cross-reference targets for other portions of the docs.
Any of the objects/attributes/methods documented here may
become private implementation details in future
versions of pent
.
Mini-language parser for pent
.
pent
Extracts Numerical Text.
- Author
Brian Skinn (bskinn@alum.mit.edu)
- File Created
8 Sep 2018
- Copyright
(c) Brian Skinn 2018-2019
- Source Repository
- Documentation
- License
The MIT License; see LICENSE.txt for full license terms
Members
-
class
pent.parser.
Parser
(head=None, body=None, tail=None)¶ Mini-language parser for structured numerical data.
-
capture_body
(text)¶ Capture all values from the pattern body, recursing if needed.
-
classmethod
capture_parser
(prs, text)¶ Perform capture of a Parser pattern.
-
classmethod
capture_section
(sec, text)¶ Perform capture of a str, iterable, or Parser section.
-
classmethod
capture_str_pattern
(pat_str, text)¶ Perform capture of string/iterable-of-str pattern.
-
capture_struct
(text)¶ Perform capture of marked groups to nested dict(s).
-
classmethod
convert_line
(line, *, capture_groups=True, group_id=0)¶ Convert line of tokens to regex.
The constructed regex is required to match the entirety of a line of text, using lookbehind and lookahead at the start and end of the pattern, respectively.
group_id indicates the starting value of the index for any capture groups added.
-
classmethod
convert_section
(sec, capture_groups=False, capture_sections=True)¶ Convert the head, body or tail to regex.
-
static
generate_captures
(m)¶ Generate captures from a regex match.
-
pattern
(capture_sections=True)¶ Return the regex pattern for the entire parser.
The individual capture groups are NEVER inserted when regex is generated this way.
Instead, head/body/tail capture groups are inserted, in order to subdivide matched text by these subsets. These ‘section’ capture groups are ONLY inserted for the top-level Parser, though – they are suppressed for inner nested Parsers.
-
Token handling for mini-language parser for pent
.
pent
Extracts Numerical Text.
- Author
Brian Skinn (bskinn@alum.mit.edu)
- File Created
20 Sep 2018
- Copyright
(c) Brian Skinn 2018-2019
- Source Repository
- Documentation
- License
The MIT License; see LICENSE.txt for full license terms
Members
-
class
pent.token.
Token
(token, do_capture=True)¶ Encapsulates transforming mini-language pattern tokens into regex.
-
property
capture
¶ Return flag for whether a regex capture group should be created.
-
do_capture
¶ Whether group capture should be added or not
-
property
is_any
¶ Return flag for whether the token is an “any content” token.
-
property
is_misc
¶ Return flag for whether the token is a misc token.
-
property
is_num
¶ Return flag for whether the token matches a number.
-
property
is_optional_line
¶ Return flag for whether the token flags an optional line.
-
property
is_str
¶ Return flag for whether the token matches a literal string.
-
property
match_quantity
¶ Return match quantity.
None for pent.enums.Content.Any or pent.enums.Content.OptionalLine
-
needs_group_id
¶ Flag for whether group ID substitution needs to be done
-
property
space_after
¶ Return Enum value for handling of post-match whitespace.
-
token
¶ Mini-language token string to be parsed
-
Regex patterns for pent
.
pent
Extracts Numerical Text.
- Author
Brian Skinn (bskinn@alum.mit.edu)
- File Created
2 Sep 2018
- Copyright
(c) Brian Skinn 2018-2019
- Source Repository
- Documentation
- License
The MIT License; see LICENSE.txt for full license terms
Members
-
pent.patterns.
number_patterns
= {(<Number.Decimal: 'd'>, <Sign.Positive: '+'>): '[+]?(\\d+\\.\\d*|\\d*\\.\\d+)', (<Number.Decimal: 'd'>, <Sign.Negative: '-'>): '-(\\d+\\.\\d*|\\d*\\.\\d+)', (<Number.Decimal: 'd'>, <Sign.Any: '.'>): '[+-]?(\\d+\\.\\d*|\\d*\\.\\d+)', (<Number.Float: 'f'>, <Sign.Positive: '+'>): '[+]?((\\d+\\.\\d*|\\d*\\.\\d+)|(\\d+\\.?\\d*[deDE][+-]?\\d+|\\d*\\.\\d+[deDE][+-]?\\d+))', (<Number.Float: 'f'>, <Sign.Negative: '-'>): '-((\\d+\\.\\d*|\\d*\\.\\d+)|(\\d+\\.?\\d*[deDE][+-]?\\d+|\\d*\\.\\d+[deDE][+-]?\\d+))', (<Number.Float: 'f'>, <Sign.Any: '.'>): '[+-]?((\\d+\\.\\d*|\\d*\\.\\d+)|(\\d+\\.?\\d*[deDE][+-]?\\d+|\\d*\\.\\d+[deDE][+-]?\\d+))', (<Number.General: 'g'>, <Sign.Positive: '+'>): '[+]?((\\d+\\.\\d*|\\d*\\.\\d+)|(\\d+\\.?\\d*[deDE][+-]?\\d+|\\d*\\.\\d+[deDE][+-]?\\d+)|\\d+)', (<Number.General: 'g'>, <Sign.Negative: '-'>): '-((\\d+\\.\\d*|\\d*\\.\\d+)|(\\d+\\.?\\d*[deDE][+-]?\\d+|\\d*\\.\\d+[deDE][+-]?\\d+)|\\d+)', (<Number.General: 'g'>, <Sign.Any: '.'>): '[+-]?((\\d+\\.\\d*|\\d*\\.\\d+)|(\\d+\\.?\\d*[deDE][+-]?\\d+|\\d*\\.\\d+[deDE][+-]?\\d+)|\\d+)', (<Number.Integer: 'i'>, <Sign.Positive: '+'>): '[+]?\\d+', (<Number.Integer: 'i'>, <Sign.Negative: '-'>): '-\\d+', (<Number.Integer: 'i'>, <Sign.Any: '.'>): '[+-]?\\d+', (<Number.SciNot: 's'>, <Sign.Positive: '+'>): '[+]?(\\d+\\.?\\d*[deDE][+-]?\\d+|\\d*\\.\\d+[deDE][+-]?\\d+)', (<Number.SciNot: 's'>, <Sign.Negative: '-'>): '-(\\d+\\.?\\d*[deDE][+-]?\\d+|\\d*\\.\\d+[deDE][+-]?\\d+)', (<Number.SciNot: 's'>, <Sign.Any: '.'>): '[+-]?(\\d+\\.?\\d*[deDE][+-]?\\d+|\\d*\\.\\d+[deDE][+-]?\\d+)'}¶ dict
of regex patterns matching single numbers.
-
pent.patterns.
std_num_punct
= 'deDE+.-'¶ str
with the standard numerical punctuation to include as not marking word boundaries. de is included to account for scientific notation.
-
pent.patterns.
std_scinot_markers
= 'deDE'¶ str
with the standard allowed scientific notation exponent marker characters
-
pent.patterns.
std_word_chars
= 'a-zA-Z0-9deDE+.-'¶ Standard word marker characters for pent
-
pent.patterns.
std_wordify
(p)¶ Wrap a token in the
pent
standard word start/end markers.
-
pent.patterns.
std_wordify_close
(p)¶ Append the standard word end markers.
-
pent.patterns.
std_wordify_open
(p)¶ Prepend the standard word start markers.
-
pent.patterns.
wordify_close
(p, word_chars)¶ Append the word end markers.
-
pent.patterns.
wordify_open
(p, word_chars)¶ Prepend the word start markers.
-
pent.patterns.
wordify_pattern
(p, word_chars)¶ Wrap pattern with word start/end markers using arbitrary word chars.
Enums
for pent
.
pent
Extracts Numerical Text.
- Author
Brian Skinn (bskinn@alum.mit.edu)
- File Created
3 Sep 2018
- Copyright
(c) Brian Skinn 2018-2019
- Source Repository
- Documentation
- License
The MIT License; see LICENSE.txt for full license terms
Members
-
class
pent.enums.
Content
¶ Enumeration for the possible types of content.
-
Any
= '~'¶ Arbitrary match, including whitespace
-
Misc
= '&'¶ Arbitrary single-“word” match, no whitespace
-
Number
= '#'¶ Number
-
OptionalLine
= '?'¶ Flag to mark pattern line as optional
-
String
= '@'¶ Literal string
-
-
class
pent.enums.
Number
¶ Enumeration for the different kinds of recognized number primitives.
-
Decimal
= 'd'¶ Decimal floating-point value; no scientific/exponential notation
-
Float
= 'f'¶ Floating-point value with or without an exponent
-
General
= 'g'¶ “General” value; integer, float, or scientific notation
-
Integer
= 'i'¶ Integer value; no decimal or scientific/exponential notation
-
SciNot
= 's'¶ Scientific/exponential notation, where exponent is required
-
-
class
pent.enums.
ParserField
¶ Enumeration for the fields/subsections of a Parser pattern.
-
Body
= 'body'¶ Body
-
Head
= 'head'¶ Header
-
Tail
= 'tail'¶ Tail/footer
-
-
class
pent.enums.
Quantity
¶ Enumeration for the various match quantities.
-
OneOrMore
= '+'¶ One-or-more match
-
Single
= '.'¶ Single value match
-
-
class
pent.enums.
Sign
¶ Enumeration for the different kinds of recognized numerical signs.
-
Any
= '.'¶ Any sign
-
Negative
= '-'¶ Negative value only (leading ‘-‘ required; includes negative zero)
-
Positive
= '+'¶ Positive value only (leading ‘+’ optional; includes zero)
-
-
class
pent.enums.
SpaceAfter
¶ Enumeration for the various constraints on space after tokens.
-
Optional
= 'o'¶ Optional following space
-
Prohibited
= 'x'¶ Following space prohibited
-
Required
= ''¶ Default is required following space; no explicit enum value
-
-
class
pent.enums.
TokenField
¶ Enumeration for fields within a mini-language number token.
-
Capture
= 'capture'¶ Flag to ignore matched content when collecting into regex groups
-
Number
= 'number'¶ Format of the numerical value (int, float, scinot, decimal, general)
-
Quantity
= 'quantity'¶ Match quantity of the field (single value, optional, one-or-more, zero-or-more, etc.)
-
Sign
= 'sign'¶ Sign of acceptable values (any, positive, negative)
-
SignNumber
= 'sign_number'¶ Combined sign and number, for initial pattern group retrieval
-
SpaceAfter
= 'space_after'¶ Flag to change the space-after behavior of a token
-
Str
= 'str'¶ Literal content, for a string match
-
Type
= 'type'¶ Content type (any, string, number)
-
Custom exceptions for pent
.
pent
Extracts Numerical Text.
- Author
Brian Skinn (bskinn@alum.mit.edu)
- File Created
10 Sep 2018
- Copyright
(c) Brian Skinn 2018-2019
- Source Repository
- Documentation
- License
The MIT License; see LICENSE.txt for full license terms
Members
-
exception
pent.errors.
LineError
(line)¶ Raised during attempts to parse invalid token sequences.
-
exception
pent.errors.
PentError
¶ Superclass for all custom pent errors.
-
exception
pent.errors.
SectionError
(msg='')¶ Raised from failed attempts to parse a Parser section.
-
exception
pent.errors.
ThruListError
(msg='')¶ Raised from failed ThruList indexing attempts.
-
exception
pent.errors.
TokenError
(token)¶ Raised during attempts to parse an invalid token.
Custom list object for pent
.
pent
Extracts Numerical Text.
- Author
Brian Skinn (bskinn@alum.mit.edu)
- File Created
3 Oct 2018
- Copyright
(c) Brian Skinn 2018-2019
- Source Repository
- Documentation
- License
The MIT License; see LICENSE.txt for full license terms
Members
-
class
pent.thrulist.
ThruList
¶ List that passes through key if len == 1.
Utility functions for pent
.
pent
Extracts Numerical Text.
- Author
Brian Skinn (bskinn@alum.mit.edu)
- File Created
14 Oct 2018
- Copyright
(c) Brian Skinn 2018-2019
- Source Repository
- Documentation
- License
The MIT License; see LICENSE.txt for full license terms
Members
-
pent.utils.
column_stack_2d
(data)¶ Perform column-stacking on a list of 2d data blocks.