Chana: An NLP toolkit for the Shipibo-Konibo language of Peru.¶
Welcome to Chana’s documentation. Get started with Installation and then get an overview with the Quickstart. The rest of the docs describe each component of Chana in detail, with a full reference in the API section.
User’s Guide¶
This part of the documentation focuses on a quick set-up and instructions for using the basic NLP tools of Chana.
Installation¶
Python Version¶
We recommend using the version of Python 3. Chana supports Python 3.4 and newer.
Dependencies¶
To make chana work it is needed to have the following packages.
- NumPy the fundamental package for scientific computing with Python.
- Scikit-learn a package for machine learning in Python.
- Python-crfsuite a python binding to CRFsuite.
Install Chana¶
If you already have a working installation of numpy, scipy, and python-crfsuite
the easiest way to install chana is using pip
pip install chana
Quickstart¶
This page gives a quick introduction to Chana. It assumes you already have Chana installed. If you do not, head over to the Installation section.
Using the Chana NLP Toolkit¶
A minimal code to use the Chana Toolkit looks something like this:
import chana.lemmatizer as lem
import chana.ner as ner
import chana.syllabificator as syl
import chana.pos_tagger as pos
lemmatizer = lem.ShipiboLemmatizer()
lemma = lemmatizer.lemmatize('pikanwe')
ship_ner = ner.ShipiboNER()
ner_tags = ship_ner.crf_tag('Enero Limanko atsa enra piawe')
tagger = pos.ShipiboPosTagger()
tags = tagger.pos_tag('Atsa enra piai')
syllables = syl.syllabify('atsabo')
So what did that code do?
- First we imported the Chana tools (Lemmatizer, NER, Syllabificator and Pos-Tagger).
- Next we create an instance of the Shipibo Lemmatizer and then we used it to get the lemma of a Shipibo word.
- We then create an instance of the Shipibo NER and use it to get the NER tags of a Shipibo sentence.
- Next we create an instance of the Shipibo Pos-Tagger and use it to get the pos-tags of a Shipibo sentence.
- Finally we use the syllabify function of the Shipibo Syllabificator in order to get the syllables of a Shipibo word.
Just save it as test.py
or something similar, to start using the chana tools
with your own input.
API Reference¶
Information about the functions, classes or methods of Chana.
API¶
This part of the documentation covers all the functions of Chana.
Modules¶
Lemmatizer¶
Lemmatizer for shipibo-konibo Source model is from the Chana project and a use KNeighborsClassifier from scikit-learn
-
class
chana.lemmatizer.
GeneralLemmatizer
(features_length=10, n_neighbors=5)[source]¶ Instance of a new lemmatizer to be trained and used
-
get_lemma
(rule, word)[source]¶ Method that returns the lemma of a word given a possible rule
Parameters: - rule (list) – a rule to transform a word
- word (str) – a word to be transformed
Returns: word transformed
Return type: str
Example: >>> import chana.lemmatizer >>> lemmatizer = chana.lemmatizer.GeneralLemmatizer() >>> lemmatizer.get_lemma(['bo>'],'shipibobo') 'shipibo'
-
get_rule
(word)[source]¶ Method that returns the transformation rule for a word
Parameters: word (str) – a word to get the rule Returns: numpy array with the rule Return type: array Example: >>> import chana.lemmatizer >>> lemmatizer = chana.lemmatizer.GeneralLemmatizer() >>> lemmatizer.get_rule('perrito') array(['ito>0'], dtype='<U16')
-
lemmatize
(word)[source]¶ Method that predicts the lemma of a word with the trained model
Parameters: word (str) – a word to get the lemma Returns: lemma of the word Return type: str Example: >>> import chana.lemmatizer >>> lemmatizer = chana.lemmatizer.GeneralLemmatizer() >>> lemmatizer.lemmatize('perrito') 'perro'
-
preprocess_word
(word)[source]¶ Method that turns a word in an array of features for the classifier according to its features_length
Parameters: word (str) – a word to be transformed Returns: list with the features Return type: list Example: >>> import chana.lemmatizer >>> lemmatizer = chana.lemmatizer.GeneralLemmatizer() >>> lemmatizer.preprocess_word('perritos') [115, 111, 116, 105, 114, 114, 101, 112, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
-
train
(words, lemmas)[source]¶ Method that trains a new lemmatizer with a list of words and a list of lemmas of the same size
Parameters: - words (list) – list of words
- lemmas (list) – list of lemmas
Returns: none
Return type: None
Example: >>> import chana.lemmatizer >>> lemmatizer = chana.lemmatizer.GeneralLemmatizer() >>> lemmas = ['perro','gato','mono'] >>> words = ['perritos','gatitos','monotes'] >>> lemmatizer.train(words,lemmas)
-
-
class
chana.lemmatizer.
ShipiboLemmatizer
[source]¶ Instance of the pre-trained shipibo lemmatizer
-
get_lemma
(rule, word)[source]¶ Method that returns the lemma of a shipibo word given a possible rule
Parameters: - rule (list) – a rule to transform a word
- word (str) – a word to be transformed
Returns: word transformed
Return type: str
Example: >>> import chana.lemmatizer >>> lemmatizer = chana.lemmatizer.ShipiboLemmatizer() >>> lemmatizer.get_lemma(['bo>'],'shipibobo') 'shipibo'
-
get_rule
(word)[source]¶ Method that returns the transformation rule for a shipibo word
Parameters: word (str) – a word to get the rule Returns: numpy array with the rule Return type: array Example: >>> import chana.lemmatizer >>> lemmatizer = chana.lemmatizer.ShipiboLemmatizer() >>> lemmatizer.get_rule('pikanwe') array(['anwe>i'], dtype='<U16')
-
lemmatize
(word)[source]¶ Method that predicts the lemma of a shipibo word
Parameters: word (str) – a word to get the lemma Returns: lemma of the word Return type: str Example: >>> import chana.lemmatizer >>> lemmatizer = chana.lemmatizer.ShipiboLemmatizer() >>> lemmatizer.lemmatize('pikanwe') 'piki'
-
preprocess_word
(word)[source]¶ Method that turns a word in an array of features for the classifier
Parameters: word (str) – a word to be transformed Returns: list with the features Return type: list Example: >>> import chana.lemmatizer >>> lemmatizer = chana.lemmatizer.ShipiboLemmatizer() >>> lemmatizer.preprocess_word('shipibobo') [111, 98, 111, 98, 105, 112, 105, 104, 115, 0, 0, 0, 0, 0, 0, 0, 0, 0]
-
-
chana.lemmatizer.
has_shipibo_suffix
(str)[source]¶ Function that returns the possible existence of a shipo suffix in a a word
Parameters: str (str) – word to evaluate Returns: True or False Return type: bool Example: >>> import chana.lemmatizer >>> chana.lemmatizer.has_shipibo_suffix('pianra') True
-
chana.lemmatizer.
longest_common_substring
(string1, string2)[source]¶ Function to find the longest common substring of two strings
Parameters: - string1 (str) – string1
- string2 (str) – string2
Returns: longest common substring
Return type: str
Example: >>> import chana.lemmatizer >>> chana.lemmatizer.longest_common_substring('limanko','limanra') 'liman'
-
chana.lemmatizer.
replace_last
(source_string, replace_what, replace_with)[source]¶ Function that replaces the last ocurrence of a string in a word
Parameters: - source_string (str) – the source string
- replace_what (str) – the substring to be replaced
- replace_with (str) – the string to be inserted
Returns: string with the replacement
Return type: str
Example: >>> import chana.lemmatizer >>> chana.lemmatizer.replace_last('piati','ti','ra') 'piara'
NER¶
Named-entity recognizer for shipibo-konibo Source model is from the Chana project and use predefined rules for the language as well as a crf from pycrfsuite
-
class
chana.ner.
ShipiboNER
[source]¶ Instance of the rule based NER for shipibo
-
check_dates
(words, entity_tag)[source]¶ Inner method that tags the dates of a sentence with ‘FEC’
Parameters: - words (list) – a list of words to be evaluated
- entity_tag (list) – a list of words to be evaluated
Returns: none
Return type: None
-
check_locations
(words, entity_tag)[source]¶ Inner method that tags the locations of a sentence with ‘LOC’
Parameters: - words (list) – a list of words to be evaluated
- entity_tag (list) – a list of words to be evaluated
Returns: none
Return type: None
-
check_names
(words, entity_tag)[source]¶ Inner method that tags the names/persons of a sentence with ‘PER’
Parameters: - words (list) – a list of words to be evaluated
- entity_tag (list) – a list of words to be evaluated
Returns: none
Return type: None
-
check_numbers
(words, entity_tag)[source]¶ Inner method that tags the numbers of a sentence with ‘NUM’
Parameters: - words (list) – a list of words to be evaluated
- entity_tag (list) – a list of words to be evaluated
Returns: none
Return type: None
-
check_organizations
(words, entity_tag)[source]¶ Inner method that tags the organizations of a sentence with ‘ORG’
Parameters: - words (list) – a list of words to be evaluated
- entity_tag (list) – a list of words to be evaluated
Returns: none
Return type: None
-
crf_tag
(sentence)[source]¶ Method that tags a sentence with the rule based method and then with the crf model
Parameters: sentence (str) – a sentence to be evaluated Returns: list with the ner tags Return type: list Example: >>> import chana.ner >>> ner = chana.ner.ShipiboNer() >>> ner.crf_tag('Limanko enra atsawe') ['LOC', 'O', 'O']
-
rule_tag
(sentence)[source]¶ Method that tags a sentence with the rule based system
Parameters: sentence (str) – a sentence to be evaluated Returns: list with the ner tags Return type: list Example: >>> import chana.ner >>> ner = chana.ner.ShipiboNer() >>> ner.rule_tag('Limanko enra atsawe') ['LOC', 'O', 'O']
-
sent2features
(sent)[source]¶ Inner method that add features to a sentence to be tagged by the crf model
Parameters: sent (list) – a sentence in list form to be transformed into features Returns: list with features Return type: list
-
word2features
(sent, i)[source]¶ Inner method that add features to the words of a sentence to be tagged by the crf model
Parameters: - sent (list) – a sentence in list form to be transformed into features
- i (int) – index of the word to be evaluated
Returns: list with the features for the indexed word
Return type: list
-
-
chana.ner.
is_date
(word)[source]¶ Function that returns ‘FEC’ if a shipo word is a date or False if not
Parameters: word (str) – a word to be evaluated Returns: ‘FEC’ if a shipo word is a date or False if not Return type: str Example: >>> import chana.ner >>> chana.ner.is_date('Agosto') 'FEC'
-
chana.ner.
is_location
(word)[source]¶ Function that returns ‘LOC’ if a shipo word is a location or False if not
Parameters: word (str) – a word to be evaluated Returns: ‘LOC’ if a shipo word is a location or False if not Return type: str Example: >>> import chana.ner >>> chana.is_location.is_name('Limanko') 'LOC'
-
chana.ner.
is_name
(word)[source]¶ Function that returns ‘PER’ if a shipo word is a proper name/person or False if not
Parameters: word (str) – a word to be evaluated Returns: ‘PER’ if a shipo word is a proper name/person or False if not Return type: str Example: >>> import chana.ner >>> chana.ner.is_name('Adriano') 'PER'
-
chana.ner.
is_number
(word)[source]¶ Function that returns ‘NUM’ if a shipo word is a number or False if not
Parameters: word (str) – a word to be evaluated Returns: ‘NUM’ if a shipo word is a number or False if not Return type: str Example: >>> import chana.ner >>> chana.ner.is_number('kimisha') 'NUM'
-
chana.ner.
is_organization
(word)[source]¶ Function that returns ‘ORG’ if a shipo word is an organization or False if not
Parameters: word (str) – a word to be evaluated Returns: ‘ORG’ if a shipo word is an organization or False if not Return type: str Example: >>> import chana.ner >>> chana.ner.is_organization('AUT') 'ORG'
Pos_Tagger¶
Part-of-Speech (POS) Tagger for shipibo-konibo. Source model is from the Chana project
-
class
chana.pos_tagger.
ShipiboPosTagger
[source]¶ Instance of the pre-trained shipibo part-of-speech tagger
-
features
(sentence, tags, index)[source]¶ Method that returns the features of a word in a sentence to be used by the model
Parameters: - sentence (str) – a sentence in shipibo-konibo
- tags (list) – tags to be returned for the word
- index (int) – position of the word in the sentence
Returns: dict of features for the indexed word
Return type: dict
Example: >>> import chana.pos_tagger >>> tagger = chana.pos_tagger.ShipiboPosTagger() >>> tagger.features('Atsa ea piai',['','',''],2) {'word': 's', 'prevWord': 't', 'nextWord': 'a', 'isFirst': False, 'isLast': False, 'isCapitalized': False, 'isAllCaps': False, 'isAllLowers': True, 'prefix-1': 's', 'prefix-2': 's', 'prefix-3': 's', 'prefix-4': 's', 'suffix-1': 's', 'suffix-2': 's', 'suffix-3': 's', 'suffix-4': 's', 'tag-1': '', 'tag-2': ''}
-
full_pos_tag
(sentence)[source]¶ Method that predict the pos-tags of a shipibo sentence and returns the full tag in spanish
Parameters: sentence (str) – a sentence in shipibo-konibo Returns: list of the tags in spanish Return type: list Example: >>> import chana.pos_tagger >>> tagger = chana.pos_tagger.ShipiboPosTagger() >>> tagger.full_pos_tag('Atsa ea piai') ['Nombre', 'Pronombre', 'Verbo']
-
get_complete_tag
(pos)[source]¶ Method that returns the full tag in spanish of a tag
Parameters: pos (str) – a pos tag in the UD format Returns: str with the tag in spanish Return type: str Example: >>> import chana.pos_tagger >>> tagger = chana.pos_tagger.ShipiboPosTagger() >>> tagger.get_complete_tag('ADJ') 'Adjetivo'
-
pos_tag
(sentence)[source]¶ Method that predict the pos-tags of a shipibo sentence in the UD format
Parameters: sentence (str) – a sentence in shipibo-konibo Returns: list of the tags in UD format Return type: list Example: >>> import chana.pos_tagger >>> tagger = chana.pos_tagger.ShipiboPosTagger() >>> tagger.pos_tag('Atsa ea piai') ['NOUN', 'PRON', 'VERB']
-
Syllabificator¶
Syllabificator for shipibo-konibo. General functions and rules to syllabify a shipibo-konibo word
-
chana.syllabificator.
accentuate
(letter)[source]¶ Function that adds the accentuation mark of a letter:
Parameters: letter (str) – a letter to be accentuated Returns: letter accentuated Return type: str Example: >>> import chana.syllabificator >>> chana.syllabificator.accentuate('a') á
-
chana.syllabificator.
change
(syllable)[source]¶ Function that returns the original form of a syllable
Parameters: syllable (str) – a syllable to be transformed Returns: syllable with its original form Return type: str Example: >>> import chana.syllabificator >>> chana.syllabificator.change('1a') cha
-
chana.syllabificator.
get_vc
(word)[source]¶ Function that returns all the vowels and consonants of a word
Parameters: word (str) – word to get its vowels and consonants Returns: list of ‘V’ and ‘C’ for each letter of the word Return type: list Example: >>> import chana.syllabificator >>> chana.syllabificator.get_vc('piti') [['p', 'C'], ['i', 'V'], ['t', 'C'], ['i', 'V']]
Additional Notes¶
Legal information and changelog.
License¶
Chana is licensed under a three clause MIT License. It basically means: A permissive license that lets people do anything they want with the code as long as they provide attribution back to the owner and don’t hold the owner liable. The full license text can be found below (Chana License).
Authors¶
Chana is written and maintained by Jose Pereira and various contributors:
Development Lead¶
Jose Pereira <jpereira@pucp.edu.pe>
Contributors¶
- Jose Pereira (Lemmatizer)
- Rodolfo Mercado (Pos-Tagger)
- Vivian Gongora (NER)
- Carlo Alva (Syllabificator)
General License Definitions¶
The following section contains the full license texts for Flask and the documentation.
- “AUTHORS” hereby refers to all the authors listed in the Authors section.
- The “Chana License” applies to all the source code shipped as part of Chana as well as documentation.
Chana License¶
Copyright (c) 2018 Chana PUCP
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.