Welcome to TERNIP’s documentation!¶
Contents:
ternip Package¶
timex Module¶
Subpackages¶
formats Package¶
gate Module¶
tempeval2 Module¶
class ternip.formats.tempeval2.TempEval2Document(file, docid='', dct='XXXXXXXX')[source]¶
Bases: object
A class which uses the stand-off format of TempEval-2.
static create(sents, docid='')[source]¶
Creates a TempEval-2 document from the internal representation.
sents is in the [[(word, pos, timexes), ...], ...] format.
get_sents()[source]¶
Returns a representation of this document in the [[(word, pos, timexes), ...], ...] format.
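The internal [[(word, pos, timexes), ...], ...] representation used throughout these classes can be sketched in plain Python; the timex objects themselves are elided here as empty sets:

```python
# A document is a list of sentences; each sentence is a list of
# (word, POS, timexes) tuples, with timexes as a set (empty here).
sents = [
    [("We", "PRP", set()), ("met", "VBD", set()),
     ("yesterday", "NN", set()), (".", ".", set())],
]

# Pulling the plain words back out of each sentence.
words = [[tok[0] for tok in sent] for sent in sents]
```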
tern Module¶
class ternip.formats.tern.TernDocument(file, nodename='TEXT', has_S=False, has_LEX=False, pos_attr=False)[source]¶
Bases: ternip.formats.timex2.Timex2XmlDocument
A class which can handle TERN documents.
static create(sents, docid, tok_offsets=None, add_S=False, add_LEX=False, pos_attr=False, dct='')[source]¶
Creates a TERN document from the internal representation.
sents is in the [[(word, pos, timexes), ...], ...] format.
tok_offsets is used to correctly reinsert whitespace lost in tokenisation. It is a list of lists of integers, where each integer is the offset of that token from the start of the sentence. If set to None (the default), a single space is assumed between all tokens.
If add_S is set to anything other than False, tags indicating sentence boundaries are added, with the name of the tag being the value of add_S.
add_LEX is similar, but for token boundaries.
pos_attr is similar, but refers to the name of the attribute on the LEX (or equivalent) tag that holds the POS tag.
dct is the document creation time string.
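As a sketch of how tok_offsets recovers whitespace, the following hypothetical helper (not part of TERNIP) rebuilds a sentence string from tokens and their recorded start offsets, falling back to the documented single-space default:

```python
def detokenise(tokens, offsets=None):
    """Rebuild sentence text from tokens. offsets gives each token's start
    offset from the start of the sentence; with None, a single space is
    assumed between tokens, mirroring the documented default."""
    if offsets is None:
        return " ".join(tokens)
    out = []
    for tok, off in zip(tokens, offsets):
        out.append(" " * (off - len("".join(out))))  # pad to the recorded offset
        out.append(tok)
    return "".join(out)
```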
timeml Module¶
class ternip.formats.timeml.TimeMlDocument(file, nodename=None, has_S=False, has_LEX=False, pos_attr=False)[source]¶
Bases: ternip.formats.timex3.Timex3XmlDocument
A class which holds a TimeML representation of a document.
Suitable for use with the AQUAINT dataset.
static create(sents, tok_offsets=None, add_S=False, add_LEX=False, pos_attr=False)[source]¶
Creates a TimeML document from the internal representation.
sents is in the [[(word, pos, timexes), ...], ...] format.
tok_offsets is used to correctly reinsert whitespace lost in tokenisation. It is a list of lists of integers, where each integer is the offset of that token from the start of the sentence. If set to None (the default), a single space is assumed between all tokens.
If add_S is set to anything other than False, tags indicating sentence boundaries are added, with the name of the tag being the value of add_S.
add_LEX is similar, but for token boundaries.
pos_attr is similar, but refers to the name of the attribute on the LEX (or equivalent) tag that holds the POS tag.
timex2 Module¶
class ternip.formats.timex2.Timex2XmlDocument(file, nodename=None, has_S=False, has_LEX=False, pos_attr=False)[source]¶
Bases: ternip.formats.xml_doc.XmlDocument
A class which takes an arbitrary XML document and adds TIMEX2 tags to it.
timex3 Module¶
class ternip.formats.timex3.Timex3XmlDocument(file, nodename=None, has_S=False, has_LEX=False, pos_attr=False)[source]¶
Bases: ternip.formats.xml_doc.XmlDocument
A class which takes an arbitrary XML document and adds TIMEX3 tags to it.
Suitable for use with Timebank, which contains many superfluous tags that aren't in the TimeML spec, even though it claims to be TimeML.
xml_doc Module¶
class ternip.formats.xml_doc.XmlDocument(file, nodename=None, has_S=False, has_LEX=False, pos_attr=False)[source]¶
Bases: object
An abstract base class from which all XML document types inherit. It implements almost everything apart from the conversion of timex objects to and from timex tags in the XML, which is done by the child classes.
static create(sents, tok_offsets=None, add_S=False, add_LEX=False, pos_attr=False)[source]¶
This is an abstract function for building XML documents from the internal representation only. You are not guaranteed to get out of get_sents what you put in here: sentences and words will be retokenised and retagged unless you explicitly add S and LEX tags and the POS attribute to the document using the optional arguments.
sents is in the [[(word, pos, timexes), ...], ...] format.
tok_offsets is used to correctly reinsert whitespace lost in tokenisation. It is a list of lists of integers, where each integer is the offset of that token from the start of the sentence. If set to None (the default), a single space is assumed between all tokens.
If add_S is set to anything other than False, tags indicating sentence boundaries are added, with the name of the tag being the value of add_S.
add_LEX is similar, but for token boundaries.
pos_attr is similar, but refers to the name of the attribute on the LEX (or equivalent) tag that holds the POS tag.
get_sents()[source]¶
Returns a representation of this document in the [[(word, pos, timexes), ...], ...] format.
If there are any TIMEXes in the input document that cross sentence boundaries (and the input is not already broken up into sentences with the S tag), then those TIMEXes are disregarded.
reconcile(sents, add_S=False, add_LEX=False, pos_attr=False)[source]¶
Reconciles this document against the new internal representation. If add_S is set to anything other than False, tags are added to indicate sentence boundaries, with the tag names being the value of add_S. add_LEX is the same, but for marking token boundaries, and pos_attr is the name of the attribute which holds the POS tag for that token. This is mainly useful for transforming TERN documents into something that GUTime can parse.
If your document already contains S and LEX tags, and add_S/add_LEX is set to add them, the old S/LEX tags will be stripped first. If pos_attr is set and the attribute name differs from the old POS attribute name on the LEX tag, the old attribute will be removed.
Sentence/token boundaries will not be altered in the final document unless add_S/add_LEX is set. If you have changed the token boundaries in the internal representation from the original form, but are not then adding them back in, reconciliation may give undefined results.
Some inputs would produce invalid XML. For example, if this document has elements which span multiple sentences, but not whole parts of them, then you will be unable to add XML tags and get valid XML, so failure will occur in unexpected ways.
If you are adding LEX tags and your XML document contains tags internal to tokens, reconciliation will fail, as it expects each token to sit in a continuous piece of whitespace.
rule_engine Package¶
expressions Module¶
normalisation_rule Module¶
class ternip.rule_engine.normalisation_rule.NormalisationRule(match, type=None, id='', value=None, change_type=None, freq=None, quant=None, mod=None, guards=None, after_guards=None, before_guards=None, sent_guards=None, after=None, tokenise=True, deliminate_numbers=False)[source]¶
Bases: ternip.rule_engine.rule.Rule
A class that represents normalisation rules.
apply(timex, cur_context, dct, body, before, after)[source]¶
Applies this rule to this timex, where body is the full extent covered by the timex, before is the preceding text in the sentence, and after is the following text in the sentence, all in the [(token, POS), ...] form.
Returns a boolean indicating whether application was successful. The timex may also be modified, so it should be passed in by reference.
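The body/before/after arguments can be pictured with a hypothetical helper that splits a POS-tagged sentence around a timex extent; the index bounds and helper name are illustrative, not part of the TERNIP API:

```python
def split_extent(tagged_sent, start, end):
    """Split a [(token, POS), ...] sentence into the text before a timex
    extent, the extent itself (body), and the text after it."""
    before = tagged_sent[:start]
    body = tagged_sent[start:end]
    after = tagged_sent[end:]
    return before, body, after
```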
normalisation_rule_block Module¶
class ternip.rule_engine.normalisation_rule_block.NormalisationRuleBlock(id, after, type, rules)[source]¶
Bases: ternip.rule_engine.rule_block.RuleBlock
A block of normalisation rules.
normalisation_rule_engine Module¶
class ternip.rule_engine.normalisation_rule_engine.NormalisationRuleEngine[source]¶
Bases: ternip.rule_engine.rule_engine.RuleEngine
A class which does normalisation using a rule engine.
Complex rules must have a string member called 'id', used for 'after' ordering, and a list of strings called 'after' (which may be empty) consisting of the IDs of rules that must run before this one. They must also have a function called 'apply' which takes a list of (token, pos, timexes) tuples and returns them in the same form, with potentially modified timexes.
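Under the contract above, a minimal complex rule might look like the following sketch; the class is named 'rule' as the loader documentation requires, but the tagging logic and the string stand-in for a timex object are invented for illustration:

```python
class rule:
    # Used for 'after' ordering; ID string is invented for this example.
    id = "example-weekday"
    # IDs of rules that must run before this one (empty is allowed).
    after = []

    def apply(self, sent):
        """Take a list of (token, pos, timexes) tuples and return them in
        the same form, with potentially modified timexes."""
        out = []
        for token, pos, timexes in sent:
            if token in ("Monday", "Tuesday"):
                timexes = timexes | {"weekday"}  # stand-in for a timex object
            out.append((token, pos, timexes))
        return out
```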
recognition_rule Module¶
class ternip.rule_engine.recognition_rule.RecognitionRule(match, type, id, guards=None, after_guards=None, before_guards=None, after=None, squelch=False, case_sensitive=False, deliminate_numbers=False)[source]¶
Bases: ternip.rule_engine.rule.Rule
A class that represents identification rules.
apply(sent)[source]¶
Applies this rule to the tokenised sentence. The 'after' ordering must be checked by the caller to ensure correct rule application.
sent is a list of (token, POS, [timexes]) tuples.
Returns a tuple: the first element is a list in the same form as sent, with additional timexes added to the third element where needed; the second element indicates whether this rule matched anything.
recognition_rule_block Module¶
class ternip.rule_engine.recognition_rule_block.RecognitionRuleBlock(id, after, type, rules)[source]¶
Bases: ternip.rule_engine.rule_block.RuleBlock
A block of recognition rules.
recognition_rule_engine Module¶
class ternip.rule_engine.recognition_rule_engine.RecognitionRuleEngine[source]¶
Bases: ternip.rule_engine.rule_engine.RuleEngine
A class which does recognition using a rule engine.
Complex rules must have a string member called 'id', used for 'after' ordering, and a list of strings called 'after' (which may be empty) consisting of the IDs of rules that must run before this one. They must also have a function called 'apply' which takes a list of (token, pos, timexes) tuples and returns them in the same form, with potentially modified timexes.
tag(sents)[source]¶
This function does the actual recognition. It expects content that has been split into tokenised, POS-tagged sentences, i.e., a list of lists of tuples ([[(token, pos-tag, timexes), ...], ...]). Rules are applied one at a time.
The return value is in the same form, with the set of timexes associated with each token stored in the third element of its tuple.
rule Module¶
rule_block Module¶
rule_engine Module¶
class ternip.rule_engine.rule_engine.RuleEngine[source]¶
Bases: object
A base class for rule engines to use.
load_block(filename)[source]¶
Load a block of rules, then check for consistency.
Throws rule_load_errors if a rule fails to load.
load_rule(filename)[source]¶
Load a rule, then check for consistency.
Throws rule_load_error if a rule fails to load.
load_rules(path)[source]¶
Loads all rules under a path: files ending in .pyrule are loaded as 'complex' rules (direct Python code), files ending in .rule use the documented rule format, and files ending in .ruleblock are blocks which contain sequences of rules. For direct Python code, the rule must be a class called 'rule'.
Throws rule_load_errors containing errors for all rules that failed to load.
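The extension-to-loader mapping described above can be sketched as follows; the classification labels are illustrative, and real TERNIP loading is more involved than this:

```python
import os

def classify(filenames):
    """Map rule filenames to the kind of loader their extension selects:
    .pyrule -> complex (direct Python), .rule -> documented rule format,
    .ruleblock -> block of rules. Other files are ignored."""
    kinds = {".pyrule": "complex", ".rule": "rule", ".ruleblock": "block"}
    return {f: kinds[os.path.splitext(f)[1]]
            for f in filenames if os.path.splitext(f)[1] in kinds}
```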
Subpackages¶
normalisation_functions Package¶
date_functions Module¶
ternip.rule_engine.normalisation_functions.date_functions.convert_to_24_hours(time, ap)[source]¶
Given an hour and an a/p specifier, convert the hour to the 24-hour clock if need be.
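A sketch of that conversion in plain Python, assuming standard clock conventions for 12 a.m. and 12 p.m.; the real function's edge-case handling may differ:

```python
def to_24_hours(hour, ap):
    """Convert a 12-hour clock hour plus an 'a'/'p' specifier to 24-hour.
    12 a.m. becomes 0 and 12 p.m. stays 12, per standard conventions."""
    if ap.lower().startswith("p"):
        return hour if hour == 12 else hour + 12
    return 0 if hour == 12 else hour
```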
ternip.rule_engine.normalisation_functions.date_functions.date_to_dow(y, m, d)[source]¶
Gets the integer day of week for a date. Sunday is 0.
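With Python's datetime, where Monday is 0, the Sunday-is-0 convention amounts to a shift by one modulo 7; a sketch, not TERNIP's actual implementation:

```python
import datetime

def dow(y, m, d):
    """Day of week for a date with Sunday as 0, shifting from
    date.weekday(), which has Monday as 0."""
    return (datetime.date(y, m, d).weekday() + 1) % 7
```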
ternip.rule_engine.normalisation_functions.date_functions.date_to_iso(string)[source]¶
A translation of GUTime's Date2ISO function. Given a date/time string representing an absolute date, returns a date string in the basic ISO format.
ternip.rule_engine.normalisation_functions.date_functions.date_to_week(y, m, d)[source]¶
Converts a date into a week number string, with year.
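One plausible sketch of such a week-with-year string uses the ISO calendar; note the ISO year can differ from the calendar year around New Year, and TERNIP's exact output format is an assumption here:

```python
import datetime

def week_str(y, m, d):
    """Format a date as an ISO year plus ISO week number, e.g. '2004W01'."""
    iso = datetime.date(y, m, d).isocalendar()
    return "%04dW%02d" % (iso[0], iso[1])
```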
ternip.rule_engine.normalisation_functions.date_functions.easter_date(y)[source]¶
Returns the date of Easter for that year as a string.
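The date of Easter can be computed with the anonymous Gregorian computus; whether TERNIP uses exactly this algorithm, or this YYYYMMDD output format, is an assumption:

```python
def easter(y):
    """Gregorian Easter Sunday for year y, as a YYYYMMDD string
    (anonymous Gregorian computus)."""
    a = y % 19
    b, c = divmod(y, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return "%04d%02d%02d" % (y, month, day + 1)
```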
ternip.rule_engine.normalisation_functions.date_functions.extract_timezone(string)[source]¶
Given some string, tries to extract the timezone it refers to. Returns a string.
relative_date_functions Module¶
ternip.rule_engine.normalisation_functions.relative_date_functions.compute_offset_base(ref_date, expression, current_direction)[source]¶
Given a reference date, a simple expression (yesterday/tomorrow or a day of week) and the direction of the relative expression, returns the base date from which to compute the offset, as a date string.
ternip.rule_engine.normalisation_functions.relative_date_functions.offset_from_date(v, offset, gran='D', exact=False)[source]¶
Given a date string, a numeric offset and a unit, computes the date offset from that value by offset grans. gran defaults to 'D'. If exact is set to True, the exact date is computed; otherwise the level of granularity given by gran is used. Returns a date string.
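A simplified sketch of day- and week-granularity offsets; month and year arithmetic in the real function is more involved, and the YYYYMMDD format is an assumption:

```python
import datetime

def offset_date(v, offset, gran="D"):
    """Offset a YYYYMMDD date string by a signed count of granules,
    supporting only day ('D') and week ('W') granularities."""
    date = datetime.datetime.strptime(v, "%Y%m%d").date()
    days = offset * (7 if gran == "W" else 1)
    return (date + datetime.timedelta(days=days)).strftime("%Y%m%d")
```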
string_conversions Module¶
Functions which convert strings to some number index.
ternip.rule_engine.normalisation_functions.string_conversions.build_duration_value(num, unit)[source]¶
ternip.rule_engine.normalisation_functions.string_conversions.day_to_num(day)[source]¶
Given the name of a day, returns the number of that day as an integer. Sunday is 0. Invalid data gets 7.
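A sketch of that mapping with the documented conventions, Sunday as 0 and invalid input mapped to 7:

```python
_DAYS = {"sunday": 0, "monday": 1, "tuesday": 2, "wednesday": 3,
         "thursday": 4, "friday": 5, "saturday": 6}

def day_num(day):
    """Day name to number, Sunday=0; anything unrecognised gets 7."""
    return _DAYS.get(day.lower(), 7)
```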
ternip.rule_engine.normalisation_functions.string_conversions.decade_nums(dec)[source]¶
Given the decade component of a year (less the 'ty' suffix), returns the number of that year as an integer.
ternip.rule_engine.normalisation_functions.string_conversions.fixed_holiday_date(hol)[source]¶
Gets the date string MMDD of a holiday.
ternip.rule_engine.normalisation_functions.string_conversions.month_to_num(m)[source]¶
Given the name of a month, gets the number of that month as an integer. Invalid data gets 0.
ternip.rule_engine.normalisation_functions.string_conversions.nth_dow_holiday_date(hol)[source]¶
Given the name of a holiday which always occurs on the Nth X of some month, where X is a day of week, returns a tuple of the form (month, dow, n) representing the information about that holiday.
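Such a (month, dow, n) tuple can be resolved to a concrete day of the month for a given year; a sketch using the same Sunday-is-0 convention as day_to_num, with a hypothetical helper name:

```python
import datetime

def nth_dow_to_day(month, dow, n, y):
    """Day of month (in year y) of the nth occurrence of weekday dow
    (Sunday=0) in the given month."""
    first = (datetime.date(y, month, 1).weekday() + 1) % 7  # dow of the 1st
    return 1 + (dow - first) % 7 + (n - 1) * 7
```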
ternip.rule_engine.normalisation_functions.string_conversions.season(s)[source]¶
Transforms a season name into an identifier from TIDES. Invalid data is returned as-is.