Welcome to polyglot’s documentation!¶
polyglot¶
Polyglot is a natural language pipeline that supports massive multilingual applications.
- Free software: GPLv3 license
- Documentation: http://polyglot.readthedocs.org.
Features¶
- Tokenization (165 Languages)
- Language detection (196 Languages)
- Named Entity Recognition (40 Languages)
- Part of Speech Tagging (16 Languages)
- Sentiment Analysis (136 Languages)
- Word Embeddings (137 Languages)
- Morphological analysis (135 Languages)
- Transliteration (69 Languages)
Developer¶
- Rami Al-Rfou @ rmyeid gmail com
Quick Tutorial¶
import polyglot
from polyglot.text import Text, Word
Language Detection¶
text = Text("Bonjour, Mesdames.")
print("Language Detected: Code={}, Name={}\n".format(text.language.code, text.language.name))
Language Detected: Code=fr, Name=French
Tokenization¶
zen = Text("Beautiful is better than ugly. "
"Explicit is better than implicit. "
"Simple is better than complex.")
print(zen.words)
[u'Beautiful', u'is', u'better', u'than', u'ugly', u'.', u'Explicit', u'is', u'better', u'than', u'implicit', u'.', u'Simple', u'is', u'better', u'than', u'complex', u'.']
print(zen.sentences)
[Sentence("Beautiful is better than ugly."), Sentence("Explicit is better than implicit."), Sentence("Simple is better than complex.")]
Part of Speech Tagging¶
text = Text(u"O primeiro uso de desobediência civil em massa ocorreu em setembro de 1906.")
print("{:<16}{}".format("Word", "POS Tag")+"\n"+"-"*30)
for word, tag in text.pos_tags:
print(u"{:<16}{:>2}".format(word, tag))
Word POS Tag
------------------------------
O DET
primeiro ADJ
uso NOUN
de ADP
desobediência NOUN
civil ADJ
em ADP
massa NOUN
ocorreu ADJ
em ADP
setembro NOUN
de ADP
1906 NUM
. PUNCT
Named Entity Recognition¶
text = Text(u"In Großbritannien war Gandhi mit dem westlichen Lebensstil vertraut geworden")
print(text.entities)
[I-LOC([u'Gro\xdfbritannien']), I-PER([u'Gandhi'])]
Polarity¶
print("{:<16}{}".format("Word", "Polarity")+"\n"+"-"*30)
for w in zen.words[:6]:
print("{:<16}{:>2}".format(w, w.polarity))
Word Polarity
------------------------------
Beautiful 0
is 0
better 1
than 0
ugly -1
. 0
Embeddings¶
word = Word("Obama", language="en")
print("Neighbors (Synonms) of {}".format(word)+"\n"+"-"*30)
for w in word.neighbors:
print("{:<16}".format(w))
print("\n\nThe first 10 dimensions out the {} dimensions\n".format(word.vector.shape[0]))
print(word.vector[:10])
Neighbors (Synonyms) of Obama
------------------------------
Bush
Reagan
Clinton
Ahmadinejad
Nixon
Karzai
McCain
Biden
Huckabee
Lula
The first 10 dimensions out of the 256 dimensions
[-2.57382345 1.52175975 0.51070285 1.08678675 -0.74386948 -1.18616164
2.92784619 -0.25694436 -1.40958667 -2.39675403]
Morphology¶
word = Text("Preprocessing is an essential step.").words[0]
print(word.morphemes)
[u'Pre', u'process', u'ing']
Transliteration¶
from polyglot.transliteration import Transliterator
transliterator = Transliterator(source_lang="en", target_lang="ru")
print(transliterator.transliterate(u"preprocessing"))
препрокессинг
Contents:¶
Installation¶
Installing/Upgrading From the PyPI¶
$ pip install polyglot
Dependencies¶
polyglot depends on numpy and libicu-dev. On Ubuntu/Debian Linux distributions, you can install these packages by executing the following command:
sudo apt-get install python-numpy libicu-dev
Language Detection¶
Polyglot depends on the pycld2 library, which in turn depends on the cld2 library, for detecting the language(s) used in plain text.
from polyglot.detect import Detector
Example¶
arabic_text = u"""
أفاد مصدر امني في قيادة عمليات صلاح الدين في العراق بأن " القوات الامنية تتوقف لليوم
الثالث على التوالي عن التقدم الى داخل مدينة تكريت بسبب
انتشار قناصي التنظيم الذي يطلق على نفسه اسم "الدولة الاسلامية" والعبوات الناسفة
والمنازل المفخخة والانتحاريين، فضلا عن ان القوات الامنية تنتظر وصول تعزيزات اضافية ".
"""
detector = Detector(arabic_text)
print(detector.language)
name: Arabic code: ar confidence: 99.0 read bytes: 907
Mixed Text¶
mixed_text = u"""
China (simplified Chinese: 中国; traditional Chinese: 中國),
officially the People's Republic of China (PRC), is a sovereign state located in East Asia.
"""
If the text contains snippets from different languages, the detector is able to find the most probable languages used in the text. For each language, we can query the model's confidence level:
for language in Detector(mixed_text).languages:
print(language)
name: English code: en confidence: 87.0 read bytes: 1154
name: Chinese code: zh_Hant confidence: 5.0 read bytes: 1755
name: un code: un confidence: 0.0 read bytes: 0
To take a closer look, we can inspect the text line by line. Notice that the detection confidence is lower for the first line:
for line in mixed_text.strip().splitlines():
print(line + u"\n")
for language in Detector(line).languages:
print(language)
print("\n")
China (simplified Chinese: 中国; traditional Chinese: 中國),
name: English code: en confidence: 71.0 read bytes: 887
name: Chinese code: zh_Hant confidence: 11.0 read bytes: 1755
name: un code: un confidence: 0.0 read bytes: 0
officially the People's Republic of China (PRC), is a sovereign state located in East Asia.
name: English code: en confidence: 98.0 read bytes: 1291
name: un code: un confidence: 0.0 read bytes: 0
name: un code: un confidence: 0.0 read bytes: 0
Best Effort Strategy¶
Sometimes, there is no enough text to make a decision, like detecting a
language from one word. This forces the detector to switch to a best
effort strategy, a warning will be thrown and the attribute reliable
will be set to False
.
detector = Detector("pizza")
print(detector)
WARNING:polyglot.detect.base:Detector is not able to detect the language reliably.
Prediction is reliable: False
Language 1: name: English code: en confidence: 85.0 read bytes: 1194
Language 2: name: un code: un confidence: 0.0 read bytes: 0
Language 3: name: un code: un confidence: 0.0 read bytes: 0
If the detection is not reliable even under the best effort strategy, an UnknownLanguage exception will be thrown.
print(Detector("4"))
---------------------------------------------------------------------------
UnknownLanguage Traceback (most recent call last)
<ipython-input-9-de43776398b9> in <module>()
----> 1 print(Detector("4"))
/usr/local/lib/python2.7/dist-packages/polyglot-15.04.17-py2.7.egg/polyglot/detect/base.pyc in __init__(self, text, quiet)
63 self.quiet = quiet
64 """If true, exceptions will be silenced."""
---> 65 self.detect(text)
66
67 @staticmethod
/usr/local/lib/python2.7/dist-packages/polyglot-15.04.17-py2.7.egg/polyglot/detect/base.pyc in detect(self, text)
89
90 if not reliable and not self.quiet:
---> 91 raise UnknownLanguage("Try passing a longer snippet of text")
92 else:
93 logger.warning("Detector is not able to detect the language reliably.")
UnknownLanguage: Try passing a longer snippet of text
Such an exception may not be desirable, especially for trivial cases like single characters that could belong to many languages. In this case, we can silence the exception by setting quiet to True:
print(Detector("4", quiet=True))
WARNING:polyglot.detect.base:Detector is not able to detect the language reliably.
Prediction is reliable: False
Language 1: name: un code: un confidence: 0.0 read bytes: 0
Language 2: name: un code: un confidence: 0.0 read bytes: 0
Language 3: name: un code: un confidence: 0.0 read bytes: 0
Command Line¶
!polyglot detect --help
usage: polyglot detect [-h] [--input [INPUT [INPUT ...]]]
optional arguments:
-h, --help show this help message and exit
--input [INPUT [INPUT ...]]
The subcommand detect tries to identify the language code for each line in a text file. This is convenient if each line represents a document or a sentence, as might be produced by a tokenizer.
!polyglot detect --input testdata/cricket.txt
English Australia posted a World Cup record total of 417-6 as they beat Afghanistan by 275 runs.
English David Warner hit 178 off 133 balls, Steve Smith scored 95 while Glenn Maxwell struck 88 in 39 deliveries in the Pool A encounter in Perth.
English Afghanistan were then dismissed for 142, with Mitchell Johnson and Mitchell Starc taking six wickets between them.
English Australia's score surpassed the 413-5 India made against Bermuda in 2007.
English It continues the pattern of bat dominating ball in this tournament as the third 400 plus score achieved in the pool stages, following South Africa's 408-5 and 411-4 against West Indies and Ireland respectively.
English The winning margin beats the 257-run amount by which India beat Bermuda in Port of Spain in 2007, which was equalled five days ago by South Africa in their victory over West Indies in Sydney.
Supported Languages¶
cld2 can detect up to 165 languages.
from polyglot.utils import pretty_list
print(pretty_list(Detector.supported_languages()))
1. Abkhazian 2. Afar 3. Afrikaans
4. Akan 5. Albanian 6. Amharic
7. Arabic 8. Armenian 9. Assamese
10. Aymara 11. Azerbaijani 12. Bashkir
13. Basque 14. Belarusian 15. Bengali
16. Bihari 17. Bislama 18. Bosnian
19. Breton 20. Bulgarian 21. Burmese
22. Catalan 23. Cebuano 24. Cherokee
25. Nyanja 26. Corsican 27. Croatian
28. Croatian 29. Czech 30. Chinese
31. Chinese 32. Chinese 33. Chinese
34. Chineset 35. Chineset 36. Chineset
37. Chineset 38. Chineset 39. Chineset
40. Danish 41. Dhivehi 42. Dutch
43. Dzongkha 44. English 45. Esperanto
46. Estonian 47. Ewe 48. Faroese
49. Fijian 50. Finnish 51. French
52. Frisian 53. Ga 54. Galician
55. Ganda 56. Georgian 57. German
58. Greek 59. Greenlandic 60. Guarani
61. Gujarati 62. Haitian_creole 63. Hausa
64. Hawaiian 65. Hebrew 66. Hebrew
67. Hindi 68. Hmong 69. Hungarian
70. Icelandic 71. Igbo 72. Indonesian
73. Interlingua 74. Interlingue 75. Inuktitut
76. Inupiak 77. Irish 78. Italian
79. Ignore 80. Javanese 81. Javanese
82. Japanese 83. Kannada 84. Kashmiri
85. Kazakh 86. Khasi 87. Khmer
88. Kinyarwanda 89. Krio 90. Kurdish
91. Kyrgyz 92. Korean 93. Laothian
94. Latin 95. Latvian 96. Limbu
97. Limbu 98. Limbu 99. Lingala
100. Lithuanian 101. Lozi 102. Luba_lulua
103. Luo_kenya_and_tanzania 104. Luxembourgish 105. Macedonian
106. Malagasy 107. Malay 108. Malayalam
109. Maltese 110. Manx 111. Maori
112. Marathi 113. Mauritian_creole 114. Romanian
115. Mongolian 116. Montenegrin 117. Montenegrin
118. Montenegrin 119. Montenegrin 120. Nauru
121. Ndebele 122. Nepali 123. Newari
124. Norwegian 125. Norwegian 126. Norwegian_n
127. Nyanja 128. Occitan 129. Oriya
130. Oromo 131. Ossetian 132. Pampanga
133. Pashto 134. Pedi 135. Persian
136. Polish 137. Portuguese 138. Punjabi
139. Quechua 140. Rajasthani 141. Rhaeto_romance
142. Romanian 143. Rundi 144. Russian
145. Samoan 146. Sango 147. Sanskrit
148. Scots 149. Scots_gaelic 150. Serbian
151. Serbian 152. Seselwa 153. Seselwa
154. Sesotho 155. Shona 156. Sindhi
157. Sinhalese 158. Siswant 159. Slovak
160. Slovenian 161. Somali 162. Spanish
163. Sundanese 164. Swahili 165. Swedish
166. Syriac 167. Tagalog 168. Tajik
169. Tamil 170. Tatar 171. Telugu
172. Thai 173. Tibetan 174. Tigrinya
175. Tonga 176. Tsonga 177. Tswana
178. Tumbuka 179. Turkish 180. Turkmen
181. Twi 182. Uighur 183. Ukrainian
184. Urdu 185. Uzbek 186. Venda
187. Vietnamese 188. Volapuk 189. Waray_philippines
190. Welsh 191. Wolof 192. Xhosa
193. Yiddish 194. Yoruba 195. Zhuang
196. Zulu
Tokenization¶
Tokenization is the process that identifies the text boundaries of words and sentences. We can identify the boundaries of sentences first and then tokenize each sentence to identify the words that compose it. Of course, we can also do word tokenization first and then segment the token sequence into sentences. Tokenization in polyglot relies on the Unicode Text Segmentation algorithm as implemented by the ICU Project.
You can use the C/C++ ICU library by installing the required package libicu-dev. For example, on Ubuntu/Debian systems you can use the apt-get utility as follows:
sudo apt-get install libicu-dev
from polyglot.text import Text
Word Tokenization¶
To call our word tokenizer, first we need to construct a Text object.
blob = u"""
两个月前遭受恐怖袭击的法国巴黎的犹太超市在装修之后周日重新开放,法国内政部长以及超市的管理者都表示,这显示了生命力要比野蛮行为更强大。
该超市1月9日遭受枪手袭击,导致4人死亡,据悉这起事件与法国《查理周刊》杂志社恐怖袭击案有关。
"""
text = Text(blob)
The property words will call the word tokenizer.
text.words
WordList(['两', '个', '月', '前', '遭受', '恐怖', '袭击', '的', '法国', '巴黎', '的', '犹太', '超市', '在', '装修', '之后', '周日', '重新', '开放', ',', '法国', '内政', '部长', '以及', '超市', '的', '管理者', '都', '表示', ',', '这', '显示', '了', '生命力', '要', '比', '野蛮', '行为', '更', '强大', '。', '该', '超市', '1', '月', '9', '日', '遭受', '枪手', '袭击', ',', '导致', '4', '人', '死亡', ',', '据悉', '这', '起', '事件', '与', '法国', '《', '查理', '周刊', '》', '杂志', '社', '恐怖', '袭击', '案', '有关', '。'])
Since the ICU boundary break algorithms are language aware, polyglot detects the language of the text before calling the tokenizer:
print(text.language)
name: code: zh confidence: 99.0 read bytes: 1920
Sentence Segmentation¶
If we are interested in segmenting the text into sentences first, we can query the sentences property:
text.sentences
[Sentence("两个月前遭受恐怖袭击的法国巴黎的犹太超市在装修之后周日重新开放,法国内政部长以及超市的管理者都表示,这显示了生命力要比野蛮行为更强大。"),
Sentence("该超市1月9日遭受枪手袭击,导致4人死亡,据悉这起事件与法国《查理周刊》杂志社恐怖袭击案有关。")]
The Sentence class inherits from Text; therefore, we can tokenize each sentence into words using the same words property:
first_sentence = text.sentences[0]
first_sentence.words
WordList(['两', '个', '月', '前', '遭受', '恐怖', '袭击', '的', '法国', '巴黎', '的', '犹太', '超市', '在', '装修', '之后', '周日', '重新', '开放', ',', '法国', '内政', '部长', '以及', '超市', '的', '管理者', '都', '表示', ',', '这', '显示', '了', '生命力', '要', '比', '野蛮', '行为', '更', '强大', '。'])
Command Line¶
By default, the subcommand tokenize performs sentence segmentation followed by word tokenization.
! polyglot tokenize --help
usage: polyglot tokenize [-h] [--only-sent | --only-word] [--input [INPUT [INPUT ...]]]
optional arguments:
-h, --help show this help message and exit
--only-sent Segment sentences without word tokenization
--only-word Tokenize words without sentence segmentation
--input [INPUT [INPUT ...]]
Each line represents a sentence where the words are split by spaces.
!polyglot --lang en tokenize --input testdata/cricket.txt
Australia posted a World Cup record total of 417 - 6 as they beat Afghanistan by 275 runs .
David Warner hit 178 off 133 balls , Steve Smith scored 95 while Glenn Maxwell struck 88 in 39 deliveries in the Pool A encounter in Perth .
Afghanistan were then dismissed for 142 , with Mitchell Johnson and Mitchell Starc taking six wickets between them .
Australia's score surpassed the 413 - 5 India made against Bermuda in 2007 .
It continues the pattern of bat dominating ball in this tournament as the third 400 plus score achieved in the pool stages , following South Africa's 408 - 5 and 411 - 4 against West Indies and Ireland respectively .
The winning margin beats the 257 - run amount by which India beat Bermuda in Port of Spain in 2007 , which was equalled five days ago by South Africa in their victory over West Indies in Sydney .
Command Line Interface¶
The polyglot package offers a command line interface alongside library access. For each task in polyglot, there is a subcommand with options specific to that task. Common options are gathered under the main polyglot command:
!polyglot --help
usage: polyglot [-h] [--lang LANG] [--delimiter DELIMITER] [--workers WORKERS] [-l LOG] [--debug]
{detect,morph,tokenize,download,count,cat,ner,pos,transliteration,sentiment} ...
optional arguments:
-h, --help show this help message and exit
--lang LANG Language to be processed
--delimiter DELIMITER
Delimiter that seperates documents, records or even sentences.
--workers WORKERS Number of parallel processes.
-l LOG, --log LOG log verbosity level
--debug drop a debugger if an exception is raised.
tools:
multilingual tools for all languages
{detect,morph,tokenize,download,count,cat,ner,pos,transliteration,sentiment}
detect Detect the language(s) used in text.
tokenize Tokenize text into sentences and words.
download Download polyglot resources and models.
count Count words frequency in a corpus.
cat Print the contents of the input file to the screen.
ner Named entity recognition chunking.
pos Part of Speech tagger.
transliteration Rewriting the input in the target language script.
sentiment Classify text to positive and negative polarity.
Notice that most of the operations are language specific. For example, tokenization rules and part of speech taggers differ between languages. Therefore, it is important that the language of the input is detected or given. The --lang option allows you to tell polyglot which language the input is written in.
!polyglot --lang en tokenize --input testdata/cricket.txt | head -n 3
Australia posted a World Cup record total of 417 - 6 as they beat Afghanistan by 275 runs .
David Warner hit 178 off 133 balls , Steve Smith scored 95 while Glenn Maxwell struck 88 in 39 deliveries in the Pool A encounter in Perth .
Afghanistan were then dismissed for 142 , with Mitchell Johnson and Mitchell Starc taking six wickets between them .
In case the user did not supply the language code, polyglot will peek ahead and read the first 1KB of data to detect the language used in the input.
!polyglot tokenize --input testdata/cricket.txt | head -n 3
2015-03-15 17:06:45 INFO __main__.py: 276 Language English is detected while reading the first 1128 bytes.
Australia posted a World Cup record total of 417 - 6 as they beat Afghanistan by 275 runs .
David Warner hit 178 off 133 balls , Steve Smith scored 95 while Glenn Maxwell struck 88 in 39 deliveries in the Pool A encounter in Perth .
Afghanistan were then dismissed for 142 , with Mitchell Johnson and Mitchell Starc taking six wickets between them .
Input formats¶
Polyglot will process the input contents line by line, assuming that the lines are separated by “\n”. If the file is formatted differently, you can use the polyglot main command option delimiter to specify any string other than “\n”.
You can pass text to the polyglot subcommands in several ways:
- Standard input: This is, usually, useful for building processing pipelines.
- Text file: The file contents will be processed line by line.
- Collection of text files: Polyglot will iterate over the files one by one. If the polyglot main command option workers is activated, the execution will be parallelized and each file will be processed by a different process (see the sketch after this list).
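For example, here is a minimal sketch of the standard input and file collection modes, reusing the sample file from earlier (passing the same path twice only simulates a collection of files):
!polyglot --lang en tokenize --input testdata/cricket.txt | polyglot count --min-count 5
!polyglot --workers 2 count --input testdata/cricket.txt testdata/cricket.txt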
Word Count Example¶
This example will demonstrate how to use the polyglot main command options and the subcommand count to generate a count of the words appearing in a collection of text files.
First, let us examine the options of the count subcommand:
!polyglot count --help
usage: polyglot count [-h] [--min-count MIN_COUNT | --most-freq MOST_FREQ] [--input [INPUT [INPUT ...]]]
optional arguments:
-h, --help show this help message and exit
--min-count MIN_COUNT
Ignore all words that appear <= min_freq.
--most-freq MOST_FREQ
Consider only the most frequent k words.
--input [INPUT [INPUT ...]]
To avoid long output, we will restrict the count to words that appeared at least twice:
!polyglot count --input testdata/cricket.txt --min-count 2
in 10
the 6
by 3
and 3
of 3
Bermuda 2
West 2
Mitchell 2
South 2
Indies 2
against 2
beat 2
as 2
India 2
which 2
score 2
Afghanistan 2
Let us consider the scenario where we have hundreds of files that contain words we want to count. Notice that we can parallelize the process by passing a number higher than 1 to the polyglot main command option workers.
!polyglot --log debug --workers 5 count --input testdata/cricket.txt testdata/cricket.txt --min-count 3
in 20
the 12
of 6
by 6
and 6
West 4
Afghanistan 4
India 4
beat 4
which 4
Indies 4
Bermuda 4
as 4
South 4
Mitchell 4
against 4
score 4
Building Pipelines¶
The previous subcommand count assumed that the words are separated by spaces. Given that we never tokenized the text file, that may result in suboptimal word counting. Let us take a closer look at the tail of the word counts:
!polyglot count --input testdata/cricket.txt | tail -n 10
Ireland 1
surpassed 1
amount 1
equalled 1
a 1
The 1
413-5 1
Africa's 1
tournament 1
Johnson 1
Observe that a token like “2007.” should have been counted as two words, “2007” and “.”, and the same applies to “Africa’s”. To fix this issue, we can use the polyglot subcommand tokenize to deal with these cases. We can stage the counting to happen after tokenization by using stdin to build a simple pipe.
!polyglot --lang en tokenize --input testdata/cricket.txt | polyglot count --min-count 2
in 10
the 6
. 6
- 5
, 4
of 3
and 3
by 3
South 2
5 2
2007 2
Bermuda 2
which 2
score 2
against 2
Mitchell 2
as 2
West 2
India 2
beat 2
Afghanistan 2
Indies 2
Notice that the word “2007” now appears in the word counts list.
Downloading Models¶
Polyglot requires a model for each task and language. These models are essential for the library to function. Given the large size of some of the models, we distribute them separately through a download manager. The download manager has several modes of operation.
Modes of Operation¶
Command Line Mode¶
The subcommand download takes one or more packages as arguments and downloads the specified packages into the polyglot_data directory.
!polyglot download --help
usage: polyglot download [-h] [--dir DIR] [--quiet] [--force] [--exit-on-error] [--url SERVER_INDEX_URL] [packages [packages ...]]
positional arguments:
packages packages to be downloaded
optional arguments:
-h, --help show this help message and exit
--dir DIR download package to directory DIR
--quiet work quietly
--force download even if already installed
--exit-on-error exit if an error occurs
--url SERVER_INDEX_URL
download server index url
!polyglot download morph2.en
[polyglot_data] Downloading package morph2.en to
[polyglot_data] /home/rmyeid/polyglot_data...
[polyglot_data] Package morph2.en is already up-to-date!
Interactive Mode¶
You can reach this mode by not supplying any arguments to the command line.
!polyglot download
Polyglot Downloader
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader>
Library Interface¶
from polyglot.downloader import downloader
downloader.download("embeddings2.en")
Collections¶
You have noticed by now that we can install a specific model by specifying its task name and the target language.
Package name format is task_name.language_code
Packages are grouped by language. For example, if we want to download all the models that are specific to Arabic, the collection name is LANG: followed by the language code of Arabic, which is ar.
Therefore, we can just run:
!polyglot download LANG:ar
[polyglot_data] Downloading collection u'LANG:ar'
[polyglot_data] |
[polyglot_data] | Downloading package tsne2.ar to
[polyglot_data] | /home/rmyeid/polyglot_data...
[polyglot_data] | Package tsne2.ar is already up-to-date!
[polyglot_data] | Downloading package transliteration2.ar to
[polyglot_data] | /home/rmyeid/polyglot_data...
[polyglot_data] | Package transliteration2.ar is already up-to-
[polyglot_data] | date!
[polyglot_data] | Downloading package morph2.ar to
[polyglot_data] | /home/rmyeid/polyglot_data...
[polyglot_data] | Package morph2.ar is already up-to-date!
[polyglot_data] | Downloading package counts2.ar to
[polyglot_data] | /home/rmyeid/polyglot_data...
[polyglot_data] | Package counts2.ar is already up-to-date!
[polyglot_data] | Downloading package sentiment2.ar to
[polyglot_data] | /home/rmyeid/polyglot_data...
[polyglot_data] | Package sentiment2.ar is already up-to-date!
[polyglot_data] | Downloading package embeddings2.ar to
[polyglot_data] | /home/rmyeid/polyglot_data...
[polyglot_data] | Package embeddings2.ar is already up-to-date!
[polyglot_data] | Downloading package ner2.ar to
[polyglot_data] | /home/rmyeid/polyglot_data...
[polyglot_data] | Package ner2.ar is already up-to-date!
[polyglot_data] |
[polyglot_data] Done downloading collection LANG:ar
Packages are also grouped by task. For example, if we want to download all the models that perform transliteration, the collection name is TASK: followed by the task name.
Therefore, we can just run:
downloader.download("TASK:transliteration2", quiet=True)
True
Langauge & Task Support¶
We can query our download manager for which tasks are supported by polyglot, as follows:
downloader.supported_tasks(lang="en")
[u'embeddings2',
u'counts2',
u'pos2',
u'ner2',
u'sentiment2',
u'morph2',
u'tsne2']
We can query our download manager for which languages are supported by polyglot's named entity recognition subsystem, as follows:
print(downloader.supported_languages_table(task="ner2"))
1. Polish 2. Turkish 3. Russian
4. Indonesian 5. Czech 6. Arabic
7. Korean 8. Catalan; Valencian 9. Italian
10. Thai 11. Romanian, Moldavian, ... 12. Tagalog
13. Danish 14. Finnish 15. German
16. Persian 17. Dutch 18. Chinese
19. French 20. Portuguese 21. Slovak
22. Hebrew (modern) 23. Malay 24. Slovene
25. Bulgarian 26. Hindi 27. Japanese
28. Hungarian 29. Croatian 30. Ukrainian
31. Serbian 32. Lithuanian 33. Norwegian
34. Latvian 35. Swedish 36. English
37. Greek, Modern 38. Spanish; Castilian 39. Vietnamese
40. Estonian
You can view all the available and/or installed collections or packages through the list function:
downloader.list(show_packages=False)
Using default data directory (/home/rmyeid/polyglot_data)
=========================================
Data server index for <polyglot-models>
=========================================
Collections:
[ ] LANG:af............. Afrikaans packages and models
[ ] LANG:als............ als packages and models
[ ] LANG:am............. Amharic packages and models
[ ] LANG:an............. Aragonese packages and models
[ ] LANG:ar............. Arabic packages and models
[ ] LANG:arz............ arz packages and models
[ ] LANG:as............. Assamese packages and models
[ ] LANG:ast............ Asturian packages and models
[ ] LANG:az............. Azerbaijani packages and models
[ ] LANG:ba............. Bashkir packages and models
[ ] LANG:bar............ bar packages and models
[ ] LANG:be............. Belarusian packages and models
[ ] LANG:bg............. Bulgarian packages and models
[ ] LANG:bn............. Bengali packages and models
[ ] LANG:bo............. Tibetan packages and models
[ ] LANG:bpy............ bpy packages and models
[ ] LANG:br............. Breton packages and models
[ ] LANG:bs............. Bosnian packages and models
[ ] LANG:ca............. Catalan packages and models
[ ] LANG:ce............. Chechen packages and models
[ ] LANG:ceb............ Cebuano packages and models
[ ] LANG:cs............. Czech packages and models
[ ] LANG:cv............. Chuvash packages and models
[ ] LANG:cy............. Welsh packages and models
[ ] LANG:da............. Danish packages and models
[ ] LANG:de............. German packages and models
[ ] LANG:diq............ diq packages and models
[ ] LANG:dv............. Divehi packages and models
[ ] LANG:el............. Greek packages and models
[P] LANG:en............. English packages and models
[ ] LANG:eo............. Esperanto packages and models
[ ] LANG:es............. Spanish packages and models
[ ] LANG:et............. Estonian packages and models
[ ] LANG:eu............. Basque packages and models
[ ] LANG:fa............. Persian packages and models
[ ] LANG:fi............. Finnish packages and models
[ ] LANG:fo............. Faroese packages and models
[ ] LANG:fr............. French packages and models
[ ] LANG:fy............. Western Frisian packages and models
[ ] LANG:ga............. Irish packages and models
[ ] LANG:gan............ gan packages and models
[ ] LANG:gd............. Scottish Gaelic packages and models
[ ] LANG:gl............. Galician packages and models
[ ] LANG:gu............. Gujarati packages and models
[ ] LANG:gv............. Manx packages and models
[ ] LANG:he............. Hebrew packages and models
[ ] LANG:hi............. Hindi packages and models
[ ] LANG:hif............ hif packages and models
[ ] LANG:hr............. Croatian packages and models
[ ] LANG:hsb............ Upper Sorbian packages and models
[ ] LANG:ht............. Haitian packages and models
[ ] LANG:hu............. Hungarian packages and models
[ ] LANG:hy............. Armenian packages and models
[ ] LANG:ia............. Interlingua packages and models
[ ] LANG:id............. Indonesian packages and models
[ ] LANG:ilo............ Iloko packages and models
[ ] LANG:io............. Ido packages and models
[ ] LANG:is............. Icelandic packages and models
[ ] LANG:it............. Italian packages and models
[ ] LANG:ja............. Japanese packages and models
[ ] LANG:jv............. Javanese packages and models
[ ] LANG:ka............. Georgian packages and models
[ ] LANG:kk............. Kazakh packages and models
[ ] LANG:km............. Khmer packages and models
[ ] LANG:kn............. Kannada packages and models
[ ] LANG:ko............. Korean packages and models
[ ] LANG:ku............. Kurdish packages and models
[ ] LANG:ky............. Kyrgyz packages and models
[ ] LANG:la............. Latin packages and models
[ ] LANG:lb............. Luxembourgish packages and models
[ ] LANG:li............. Limburgish packages and models
[ ] LANG:lmo............ lmo packages and models
[ ] LANG:lt............. Lithuanian packages and models
[ ] LANG:lv............. Latvian packages and models
[ ] LANG:mg............. Malagasy packages and models
[ ] LANG:mk............. Macedonian packages and models
[ ] LANG:ml............. Malayalam packages and models
[ ] LANG:mn............. Mongolian packages and models
[ ] LANG:mr............. Marathi packages and models
[ ] LANG:ms............. Malay packages and models
[ ] LANG:mt............. Maltese packages and models
[ ] LANG:my............. Burmese packages and models
[ ] LANG:ne............. Nepali packages and models
[ ] LANG:nl............. Dutch packages and models
[ ] LANG:nn............. Norwegian Nynorsk packages and models
[ ] LANG:no............. Norwegian packages and models
[ ] LANG:oc............. Occitan packages and models
[ ] LANG:or............. Oriya packages and models
[ ] LANG:os............. Ossetic packages and models
[ ] LANG:pa............. Punjabi packages and models
[ ] LANG:pam............ Pampanga packages and models
[ ] LANG:pl............. Polish packages and models
[ ] LANG:pms............ pms packages and models
[ ] LANG:ps............. Pashto packages and models
[ ] LANG:pt............. Portuguese packages and models
[ ] LANG:qu............. Quechua packages and models
[ ] LANG:rm............. Romansh packages and models
[ ] LANG:ro............. Romanian packages and models
[ ] LANG:ru............. Russian packages and models
[ ] LANG:sa............. Sanskrit packages and models
[ ] LANG:sah............ Sakha packages and models
[ ] LANG:scn............ Sicilian packages and models
[ ] LANG:sco............ Scots packages and models
[ ] LANG:se............. Northern Sami packages and models
[ ] LANG:sh............. Serbo-Croatian packages and models
[ ] LANG:si............. Sinhala packages and models
[ ] LANG:sk............. Slovak packages and models
[ ] LANG:sl............. Slovenian packages and models
[ ] LANG:sq............. Albanian packages and models
[ ] LANG:sr............. Serbian packages and models
[ ] LANG:su............. Sundanese packages and models
[ ] LANG:sv............. Swedish packages and models
[ ] LANG:sw............. Swahili packages and models
[ ] LANG:szl............ szl packages and models
[ ] LANG:ta............. Tamil packages and models
[ ] LANG:te............. Telugu packages and models
[ ] LANG:tg............. Tajik packages and models
[ ] LANG:th............. Thai packages and models
[ ] LANG:tk............. Turkmen packages and models
[ ] LANG:tl............. Tagalog packages and models
[ ] LANG:tr............. Turkish packages and models
[ ] LANG:tt............. Tatar packages and models
[ ] LANG:ug............. Uyghur packages and models
[ ] LANG:uk............. Ukrainian packages and models
[ ] LANG:ur............. Urdu packages and models
[ ] LANG:uz............. Uzbek packages and models
[ ] LANG:vec............ vec packages and models
[ ] LANG:vi............. Vietnamese packages and models
[ ] LANG:vls............ vls packages and models
[ ] LANG:vo............. Volapük packages and models
[ ] LANG:wa............. Walloon packages and models
[ ] LANG:war............ Waray packages and models
[ ] LANG:yi............. Yiddish packages and models
[ ] LANG:yo............. Yoruba packages and models
[ ] LANG:zh............. Chinese packages and models
[ ] LANG:zhc............ Chinese Character packages and models
[ ] LANG:zhw............ zhw packages and models
[ ] TASK:counts2........ counts2
[ ] TASK:embeddings2.... embeddings2
[ ] TASK:ner2........... ner2
[P] TASK:sentiment2..... sentiment2
[ ] TASK:tsne2.......... tsne2
([*] marks installed packages; [P] marks partially installed collections)
Word Embeddings¶
A word embedding is a mapping of a word to a d-dimensional vector space. This real-valued vector representation captures semantic and syntactic features. Polyglot offers a simple interface to load several formats of word embeddings.
from polyglot.mapping import Embedding
Formats¶
The Embedding class can read word embeddings from different sources:
- Gensim word2vec objects (from_gensim method)
- Word2vec binary/text models (from_word2vec method)
- GloVe models (from_glove method)
- polyglot pickle files (load method)
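For instance, here is a minimal sketch of loading embeddings from the word2vec and GloVe formats; the file paths are hypothetical, and the binary keyword is assumed to distinguish the word2vec binary format from its text format:
from polyglot.mapping import Embedding

# Hypothetical model files, downloaded separately.
w2v_embeddings = Embedding.from_word2vec("vectors.bin", binary=True)
glove_embeddings = Embedding.from_glove("glove.6B.100d.txt")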
embeddings = Embedding.load("/home/rmyeid/polyglot_data/embeddings2/en/embeddings_pkl.tar.bz2")
Nearest Neighbors¶
A common way to investigate the space captured by the embeddings is to query for the nearest neighbors of any word.
neighbors = embeddings.nearest_neighbors("green")
neighbors
[u'blue',
u'white',
u'red',
u'yellow',
u'black',
u'grey',
u'purple',
u'pink',
u'light',
u'gray']
To calculate the distances between a word and its neighbors, we can call the distances method:
embeddings.distances("green", neighbors)
array([ 1.34894466, 1.37864077, 1.39504588, 1.39524949, 1.43183875,
1.68007386, 1.75897062, 1.88401115, 1.89186132, 1.902614 ], dtype=float32)
The word embeddings are not unit vectors; in fact, the more frequent a word is, the larger the norm of its vector.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
norms = np.linalg.norm(embeddings.vectors, axis=1)
window = 300
smooth_line = np.convolve(norms, np.ones(window)/float(window), mode='valid')
plt.plot(smooth_line)
plt.xlabel("Word Rank"); _ = plt.ylabel("$L_2$ norm")
This could be problematic for some applications and training algorithms. We can normalize the vectors by their \(L_2\) norms to get unit vectors, reducing the effects of word frequency, as follows:
embeddings = embeddings.normalize_words()
neighbors = embeddings.nearest_neighbors("green")
for w,d in zip(neighbors, embeddings.distances("green", neighbors)):
print("{:<8}{:.4f}".format(w,d))
white 0.4261
blue 0.4451
black 0.4591
red 0.4786
yellow 0.4947
grey 0.6072
purple 0.6392
light 0.6483
pink 0.6574
colour 0.6824
Vocabulary Expansion¶
from polyglot.mapping import CaseExpander, DigitExpander
Not all words are available in the dictionary defined by the word embeddings. Sometimes it is useful to map new words to similar words for which we do have embeddings.
Case Expansion¶
For example, the word GREEN is not available in the embeddings:
"GREEN" in embeddings
False
We would like to return the vector that represents the word Green. To do that, we apply a case expansion:
embeddings.apply_expansion(CaseExpander)
"GREEN" in embeddings
True
embeddings.nearest_neighbors("GREEN")
[u'White',
u'Black',
u'Brown',
u'Blue',
u'Diamond',
u'Wood',
u'Young',
u'Hudson',
u'Cook',
u'Gold']
Digit Expansion¶
We reduce the size of the vocabulary while training the embeddings by grouping special classes of words. One common case of such grouping is digits: every digit in the training corpus gets replaced by the symbol #. For example, a number like 123.54 becomes ###.##. Therefore, querying the embedding for a new number like 434 will result in a failure:
"434" in embeddings
False
To fix that, we apply another type of vocabulary expansion, DigitExpander. It will map any number to a sequence of #s.
embeddings.apply_expansion(DigitExpander)
"434" in embeddings
True
As expected, the neighbors of the new number 434 will be other numbers:
embeddings.nearest_neighbors("434")
[u'##',
u'#',
u'3',
u'#####',
u'#,###',
u'##,###',
u'##EN##',
u'####',
u'###EN###',
u'n']
Demo¶
Demo is available here.
Citation¶
This work is a direct implementation of the research described in the paper Polyglot: Distributed Word Representations for Multilingual NLP. The author of this library strongly encourages you to cite the following paper if you are using this software.
@InProceedings{polyglot:2013:ACL-CoNLL,
author = {Al-Rfou, Rami and Perozzi, Bryan and Skiena, Steven},
title = {Polyglot: Distributed Word Representations for Multilingual NLP},
booktitle = {Proceedings of the Seventeenth Conference on Computational Natural Language Learning},
month = {August},
year = {2013},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {183--192},
url = {http://www.aclweb.org/anthology/W13-3520}
}
Part of Speech Tagging¶
The part of speech tagging task aims to assign to every word/token in plain text a category that identifies its syntactic function. Polyglot recognizes 17 parts of speech; this set is called the universal part of speech tag set:
- ADJ: adjective
- ADP: adposition
- ADV: adverb
- AUX: auxiliary verb
- CONJ: coordinating conjunction
- DET: determiner
- INTJ: interjection
- NOUN: noun
- NUM: numeral
- PART: particle
- PRON: pronoun
- PROPN: proper noun
- PUNCT: punctuation
- SCONJ: subordinating conjunction
- SYM: symbol
- VERB: verb
- X: other
Languages Coverage¶
The models were trained on a combination of:
- Original CONLL datasets after the tags were converted using the universal POS tables.
- Universal Dependencies 1.0 corpora whenever they are available.
from polyglot.downloader import downloader
print(downloader.supported_languages_table("pos2"))
1. German 2. Italian 3. Danish
4. Czech 5. Slovene 6. French
7. English 8. Swedish 9. Bulgarian
10. Spanish; Castilian 11. Indonesian 12. Portuguese
13. Finnish 14. Irish 15. Hungarian
16. Dutch
Download Necessary Models¶
%%bash
polyglot download embeddings2.en pos2.en
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data] /home/rmyeid/polyglot_data...
[polyglot_data] Package embeddings2.en is already up-to-date!
[polyglot_data] Downloading package pos2.en to
[polyglot_data] /home/rmyeid/polyglot_data...
[polyglot_data] Package pos2.en is already up-to-date!
Example¶
We tag each word in the text with one part of speech.
from polyglot.text import Text
blob = """We will meet at eight o'clock on Thursday morning."""
text = Text(blob)
# We can also specify language of that text by using
# text = Text(blob, hint_language_code='en')
We can query all the tagged words:
text.pos_tags
[(u'We', u'PRON'),
(u'will', u'AUX'),
(u'meet', u'VERB'),
(u'at', u'ADP'),
(u'eight', u'NUM'),
(u"o'clock", u'NOUN'),
(u'on', u'ADP'),
(u'Thursday', u'PROPN'),
(u'morning', u'NOUN'),
(u'.', u'PUNCT')]
After calling the pos_tags property once, the word objects will carry the POS tags:
text.words[0].pos_tag
u'PRON'
!polyglot --lang en tokenize --input testdata/cricket.txt | polyglot --lang en pos | tail -n 30
which DET
India PROPN
beat VERB
Bermuda PROPN
in ADP
Port PROPN
of ADP
Spain PROPN
in ADP
2007 NUM
, PUNCT
which DET
was AUX
equalled VERB
five NUM
days NOUN
ago ADV
by ADP
South PROPN
Africa PROPN
in ADP
their PRON
victory NOUN
over ADP
West PROPN
Indies PROPN
in ADP
Sydney PROPN
. PUNCT
This work is a direct implementation of the research described in the paper Polyglot: Distributed Word Representations for Multilingual NLP. The author of this library strongly encourages you to cite the following paper if you are using this software.
@InProceedings{polyglot:2013:ACL-CoNLL,
author = {Al-Rfou, Rami and Perozzi, Bryan and Skiena, Steven},
title = {Polyglot: Distributed Word Representations for Multilingual NLP},
booktitle = {Proceedings of the Seventeenth Conference on Computational Natural Language Learning},
month = {August},
year = {2013},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {183--192},
url = {http://www.aclweb.org/anthology/W13-3520}
}
Named Entity Extraction¶
The named entity extraction task aims to extract phrases from plain text that correspond to entities. Polyglot recognizes 3 categories of entities:
- Locations (Tag: I-LOC): cities, countries, regions, continents, neighborhoods, administrative divisions, …
- Organizations (Tag: I-ORG): sports teams, newspapers, banks, universities, schools, non-profits, companies, …
- Persons (Tag: I-PER): politicians, scientists, artists, athletes, …
Languages Coverage¶
The models were trained on datasets extracted automatically from Wikipedia. Polyglot currently supports 40 major languages.
from polyglot.downloader import downloader
print(downloader.supported_languages_table("ner2", 3))
1. Polish 2. Turkish 3. Russian
4. Indonesian 5. Czech 6. Arabic
7. Korean 8. Catalan; Valencian 9. Italian
10. Thai 11. Romanian, Moldavian, ... 12. Tagalog
13. Danish 14. Finnish 15. German
16. Persian 17. Dutch 18. Chinese
19. French 20. Portuguese 21. Slovak
22. Hebrew (modern) 23. Malay 24. Slovene
25. Bulgarian 26. Hindi 27. Japanese
28. Hungarian 29. Croatian 30. Ukrainian
31. Serbian 32. Lithuanian 33. Norwegian
34. Latvian 35. Swedish 36. English
37. Greek, Modern 38. Spanish; Castilian 39. Vietnamese
40. Estonian
Download Necessary Models¶
%%bash
polyglot download embeddings2.en ner2.en
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data] /home/rmyeid/polyglot_data...
[polyglot_data] Package embeddings2.en is already up-to-date!
[polyglot_data] Downloading package ner2.en to
[polyglot_data] /home/rmyeid/polyglot_data...
[polyglot_data] Package ner2.en is already up-to-date!
Example¶
Entities inside a text object or a sentence are represented as chunks. Each chunk identifies the start and the end indices of the word subsequence within the text.
from polyglot.text import Text
blob = """The Israeli Prime Minister Benjamin Netanyahu has warned that Iran poses a "threat to the entire world"."""
text = Text(blob)
# We can also specify language of that text by using
# text = Text(blob, hint_language_code='en')
We can query all entities mentioned in a text.
text.entities
[I-ORG([u'Israeli']), I-PER([u'Benjamin', u'Netanyahu']), I-LOC([u'Iran'])]
Or, we can query entities per sentence:
for sent in text.sentences:
print(sent, "\n")
for entity in sent.entities:
print(entity.tag, entity)
The Israeli Prime Minister Benjamin Netanyahu has warned that Iran poses a "threat to the entire world".
I-ORG [u'Israeli']
I-PER [u'Benjamin', u'Netanyahu']
I-LOC [u'Iran']
By inspecting the second entity, Benjamin Netanyahu, more closely, we can locate its position within the sentence.
benjamin = sent.entities[1]
sent.words[benjamin.start: benjamin.end]
WordList([u'Benjamin', u'Netanyahu'])
!polyglot --lang en tokenize --input testdata/cricket.txt | polyglot --lang en ner | tail -n 20
, O
which O
was O
equalled O
five O
days O
ago O
by O
South I-LOC
Africa I-LOC
in O
their O
victory O
over O
West I-ORG
Indies I-ORG
in O
Sydney I-LOC
. O
Citation¶
This work is a direct implementation of the research described in the paper Polyglot-NER: Multilingual Named Entity Recognition. The author of this library strongly encourages you to cite the following paper if you are using this software.
@article{polyglotner,
author = {Al-Rfou, Rami and Kulkarni, Vivek and Perozzi, Bryan and Skiena, Steven},
title = {{Polyglot-NER}: Massive Multilingual Named Entity Recognition},
journal = {{Proceedings of the 2015 {SIAM} International Conference on Data Mining, Vancouver, British Columbia, Canada, April 30 - May 2, 2015}},
month = {April},
year = {2015},
publisher = {SIAM}
}
Morphological Analysis¶
Polyglot offers trained morfessor models to generate morphemes from words. The goal of the Morpho project is to develop unsupervised data-driven methods that discover the regularities behind word forming in natural languages. In particular, the Morpho project focuses on the discovery of morphemes, which are the primitive units of syntax, the smallest individually meaningful elements in the utterances of a language. Morphemes are important in the automatic generation and recognition of a language, especially for languages in which words may have many different inflected forms.
Languages Coverage¶
Using polyglot vocabulary dictionaries, we trained morfessor models on the 50,000 most frequent words of each language.
from polyglot.downloader import downloader
print(downloader.supported_languages_table("morph2"))
1. Piedmontese language 2. Lombard language 3. Gan Chinese
4. Sicilian 5. Scots 6. Kirghiz, Kyrgyz
7. Pashto, Pushto 8. Kurdish 9. Portuguese
10. Kannada 11. Korean 12. Khmer
13. Kazakh 14. Ilokano 15. Polish
16. Panjabi, Punjabi 17. Georgian 18. Chuvash
19. Alemannic 20. Czech 21. Welsh
22. Chechen 23. Catalan; Valencian 24. Northern Sami
25. Sanskrit (Saṁskṛta) 26. Slovene 27. Javanese
28. Slovak 29. Bosnian-Croatian-Serbian 30. Bavarian
31. Swedish 32. Swahili 33. Sundanese
34. Serbian 35. Albanian 36. Japanese
37. Western Frisian 38. French 39. Finnish
40. Upper Sorbian 41. Faroese 42. Persian
43. Sinhala, Sinhalese 44. Italian 45. Amharic
46. Aragonese 47. Volapük 48. Icelandic
49. Sakha 50. Afrikaans 51. Indonesian
52. Interlingua 53. Azerbaijani 54. Ido
55. Arabic 56. Assamese 57. Yoruba
58. Yiddish 59. Waray-Waray 60. Croatian
61. Hungarian 62. Haitian; Haitian Creole 63. Quechua
64. Armenian 65. Hebrew (modern) 66. Silesian
67. Hindi 68. Divehi; Dhivehi; Mald... 69. German
70. Danish 71. Occitan 72. Tagalog
73. Turkmen 74. Thai 75. Tajik
76. Greek, Modern 77. Telugu 78. Tamil
79. Oriya 80. Ossetian, Ossetic 81. Tatar
82. Turkish 83. Kapampangan 84. Venetian
85. Manx 86. Gujarati 87. Galician
88. Irish 89. Scottish Gaelic; Gaelic 90. Nepali
91. Cebuano 92. Zazaki 93. Walloon
94. Dutch 95. Norwegian 96. Norwegian Nynorsk
97. West Flemish 98. Chinese 99. Bosnian
100. Breton 101. Belarusian 102. Bulgarian
103. Bashkir 104. Egyptian Arabic 105. Tibetan Standard, Tib...
106. Bengali 107. Burmese 108. Romansh
109. Marathi (Marāṭhī) 110. Malay 111. Maltese
112. Russian 113. Macedonian 114. Malayalam
115. Mongolian 116. Malagasy 117. Vietnamese
118. Spanish; Castilian 119. Estonian 120. Basque
121. Bishnupriya Manipuri 122. Asturian 123. English
124. Esperanto 125. Luxembourgish, Letzeb... 126. Latin
127. Uighur, Uyghur 128. Ukrainian 129. Limburgish, Limburgan...
130. Latvian 131. Urdu 132. Lithuanian
133. Fiji Hindi 134. Uzbek 135. Romanian, Moldavian, ...
Download Necessary Models¶
%%bash
polyglot download morph2.en morph2.ar
[polyglot_data] Downloading package morph2.en to
[polyglot_data] /home/rmyeid/polyglot_data...
[polyglot_data] Package morph2.en is already up-to-date!
[polyglot_data] Downloading package morph2.ar to
[polyglot_data] /home/rmyeid/polyglot_data...
[polyglot_data] Package morph2.ar is already up-to-date!
Example¶
from polyglot.text import Text, Word
words = ["preprocessing", "processor", "invaluable", "thankful", "crossed"]
for w in words:
w = Word(w, language="en")
print("{:<20}{}".format(w, w.morphemes))
preprocessing ['pre', 'process', 'ing']
processor ['process', 'or']
invaluable ['in', 'valuable']
thankful ['thank', 'ful']
crossed ['cross', 'ed']
If the text is not tokenized properly, morphological analysis can offer a smart way of splitting the text into its original units. Here is an example:
blob = "Wewillmeettoday."
text = Text(blob)
text.language = "en"
text.morphemes
WordList([u'We', u'will', u'meet', u'to', u'day', u'.'])
!polyglot --lang en tokenize --input testdata/cricket.txt | polyglot --lang en morph | tail -n 30
which which
India In_dia
beat beat
Bermuda Ber_mud_a
in in
Port Port
of of
Spain Spa_in
in in
2007 2007
, ,
which which
was wa_s
equalled equal_led
five five
days day_s
ago ago
by by
South South
Africa Africa
in in
their t_heir
victory victor_y
over over
West West
Indies In_dies
in in
Sydney Syd_ney
. .
Demo¶
This demo does not reflect the models supplied by polyglot; however, we think it is indicative of what you should expect from morfessor.
Citation¶
This is an interface to the implementation described in the technical report Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline.
@InProceedings{morfessor2,
  title = {Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline},
  author = {Virpioja, Sami and Smit, Peter and Grönroos, Stig-Arne and Kurimo, Mikko},
  year = {2013},
  publisher = {Department of Signal Processing and Acoustics, Aalto University},
  booktitle = {Aalto University publication series}
}
Transliteration¶
Transliteration is the conversion of a text from one script to another. For instance, a Latin transliteration of the Greek phrase “Ελληνική Δημοκρατία”, usually translated as ‘Hellenic Republic’, is “Ellēnikḗ Dēmokratía”.
from polyglot.transliteration import Transliterator
Languages Coverage¶
from polyglot.downloader import downloader
print(downloader.supported_languages_table("transliteration2"))
1. Haitian; Haitian Creole 2. Tamil 3. Vietnamese
4. Telugu 5. Croatian 6. Hungarian
7. Thai 8. Kannada 9. Tagalog
10. Armenian 11. Hebrew (modern) 12. Turkish
13. Portuguese 14. Belarusian 15. Norwegian Nynorsk
16. Norwegian 17. Dutch 18. Japanese
19. Albanian 20. Bulgarian 21. Serbian
22. Swahili 23. Swedish 24. French
25. Latin 26. Czech 27. Yiddish
28. Hindi 29. Danish 30. Finnish
31. German 32. Bosnian-Croatian-Serbian 33. Slovak
34. Persian 35. Lithuanian 36. Slovene
37. Latvian 38. Bosnian 39. Gujarati
40. Italian 41. Icelandic 42. Spanish; Castilian
43. Ukrainian 44. Georgian 45. Urdu
46. Indonesian 47. Marathi (Marāṭhī) 48. Korean
49. Galician 50. Khmer 51. Catalan; Valencian
52. Romanian, Moldavian, ... 53. Basque 54. Macedonian
55. Russian 56. Azerbaijani 57. Chinese
58. Estonian 59. Welsh 60. Arabic
61. Bengali 62. Amharic 63. Irish
64. Malay 65. Afrikaans 66. Polish
67. Greek, Modern 68. Esperanto 69. Maltese
Downloading Necessary Models¶
%%bash
polyglot download embeddings2.en transliteration2.ar
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data] /home/rmyeid/polyglot_data...
[polyglot_data] Package embeddings2.en is already up-to-date!
[polyglot_data] Downloading package transliteration2.ar to
[polyglot_data] /home/rmyeid/polyglot_data...
[polyglot_data] Package transliteration2.ar is already up-to-date!
Example¶
We transliterate each word in the text into the target script.
from polyglot.text import Text
blob = """We will meet at eight o'clock on Thursday morning."""
text = Text(blob)
We can query the transliteration of each word:
for x in text.transliterate("ar"):
print(x)
وي
ويل
ميت
ات
ييايت
أوكلوك
ون
ثورسداي
مورنينغ
!polyglot --lang en tokenize --input testdata/cricket.txt | polyglot --lang en transliteration --target ar | tail -n 30
which ويكه
India ينديا
beat بيت
Bermuda بيرمودا
in ين
Port بورت
of وف
Spain سباين
in ين
2007
,
which ويكه
was واس
equalled يكالليد
five فيفي
days دايس
ago اغو
by بي
South سووث
Africa افريكا
in ين
their ثير
victory فيكتوري
over وفير
West ويست
Indies يندييس
in ين
Sydney سيدني
.
Citation¶
This work is a direct implementation of the research described in the paper False-Friend Detection and Entity Matching via Unsupervised Transliteration. The author of this library strongly encourages you to cite the following paper if you are using this software.
@article{chen2016false,
title={False-Friend Detection and Entity Matching via Unsupervised Transliteration},
author={Chen, Yanqing and Skiena, Steven},
journal={arXiv preprint arXiv:1611.06722},
year={2016}}
Sentiment¶
Polyglot has polarity lexicons for 136 languages. The polarity of a word is scored on a three-degree scale: +1 for positive words, -1 for negative words, and 0 for neutral words.
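For a quick single-word check, here is a minimal sketch (assuming the English sentiment model is already installed):
from polyglot.text import Word

# "good" is listed as a positive word, so its polarity is +1.
print(Word("good", language="en").polarity)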
Languages Coverage¶
from polyglot.downloader import downloader
print(downloader.supported_languages_table("sentiment2", 3))
1. Turkmen 2. Thai 3. Latvian
4. Zazaki 5. Tagalog 6. Tamil
7. Tajik 8. Telugu 9. Luxembourgish, Letzeb...
10. Alemannic 11. Latin 12. Turkish
13. Limburgish, Limburgan... 14. Egyptian Arabic 15. Tatar
16. Lithuanian 17. Spanish; Castilian 18. Basque
19. Estonian 20. Asturian 21. Greek, Modern
22. Esperanto 23. English 24. Ukrainian
25. Marathi (Marāṭhī) 26. Maltese 27. Burmese
28. Kapampangan 29. Uighur, Uyghur 30. Uzbek
31. Malagasy 32. Yiddish 33. Macedonian
34. Urdu 35. Malayalam 36. Mongolian
37. Breton 38. Bosnian 39. Bengali
40. Tibetan Standard, Tib... 41. Belarusian 42. Bulgarian
43. Bashkir 44. Vietnamese 45. Volapük
46. Gan Chinese 47. Manx 48. Gujarati
49. Yoruba 50. Occitan 51. Scottish Gaelic; Gaelic
52. Irish 53. Galician 54. Ossetian, Ossetic
55. Oriya 56. Walloon 57. Swedish
58. Silesian 59. Lombard language 60. Divehi; Dhivehi; Mald...
61. Danish 62. German 63. Armenian
64. Haitian; Haitian Creole 65. Hungarian 66. Croatian
67. Bishnupriya Manipuri 68. Hindi 69. Hebrew (modern)
70. Portuguese 71. Afrikaans 72. Pashto, Pushto
73. Amharic 74. Aragonese 75. Bavarian
76. Assamese 77. Panjabi, Punjabi 78. Polish
79. Azerbaijani 80. Italian 81. Arabic
82. Icelandic 83. Ido 84. Scots
85. Sicilian 86. Indonesian 87. Chinese Word
88. Interlingua 89. Waray-Waray 90. Piedmontese language
91. Quechua 92. French 93. Dutch
94. Norwegian Nynorsk 95. Norwegian 96. Western Frisian
97. Upper Sorbian 98. Nepali 99. Persian
100. Ilokano 101. Finnish 102. Faroese
103. Romansh 104. Javanese 105. Romanian, Moldavian, ...
106. Malay 107. Japanese 108. Russian
109. Catalan; Valencian 110. Fiji Hindi 111. Chinese
112. Cebuano 113. Czech 114. Chuvash
115. Welsh 116. West Flemish 117. Kirghiz, Kyrgyz
118. Kurdish 119. Kazakh 120. Korean
121. Kannada 122. Khmer 123. Georgian
124. Sakha 125. Serbian 126. Albanian
127. Swahili 128. Chechen 129. Sundanese
130. Sanskrit (Saṁskṛta) 131. Venetian 132. Northern Sami
133. Slovak 134. Sinhala, Sinhalese 135. Bosnian-Croatian-Serbian
136. Slovene
from polyglot.text import Text
Polarity¶
To inquire about the polarity of a word, we can just query its polarity attribute:
text = Text("The movie was really good.")
print("{:<16}{}".format("Word", "Polarity")+"\n"+"-"*30)
for w in text.words:
print("{:<16}{:>2}".format(w, w.polarity))
Word Polarity
------------------------------
The 0
movie 0
was 0
really 0
good 1
. 0
Entity Sentiment¶
We can calculate a more sophisticated sentiment score for an entity that is mentioned in text, as follows:
blob = ("Barack Obama gave a fantastic speech last night. "
"Reports indicate he will move next to New Hampshire.")
text = Text(blob)
First, we need to split the text into sentences; this limits the words that affect the sentiment of an entity to the words mentioned in the same sentence.
first_sentence = text.sentences[0]
print(first_sentence)
Barack Obama gave a fantastic speech last night.
Second, we extract the entities:
first_entity = first_sentence.entities[0]
print(first_entity)
[u'Obama']
Finally, for each entity we identified, we can calculate the strength of its positive or negative sentiment on a scale from 0 to 1:
first_entity.positive_sentiment
0.9375
first_entity.negative_sentiment
0
Citation¶
This work is a direct implementation of the research described in the paper Building sentiment lexicons for all major languages. The author of this library strongly encourages you to cite the following paper if you are using this software.
@inproceedings{chen2014building,
title={Building sentiment lexicons for all major languages},
author={Chen, Yanqing and Skiena, Steven},
booktitle={Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Short Papers)},
pages={383--389},
year={2014}}
polyglot¶
polyglot package¶
Subpackages¶
polyglot.detect package¶
Submodules¶
polyglot.detect.base module¶
Detecting languages
class polyglot.detect.base.Detector(text, quiet=False)[source]¶
Bases: object
Detect the language used in a snippet of text.

detect(text)[source]¶
Decide which language is used to write the text.
The method first tries to detect the language with high reliability. If that is not possible, it switches to a best-effort strategy.
Parameters: text (string) – A snippet of text; the longer it is, the more reliably we can detect the language used to write it.

quiet = None¶
If true, exceptions will be silenced.

reliable = None¶
False if the detector used the best-effort strategy in detection.

exception polyglot.detect.base.Error[source]¶
Bases: exceptions.Exception
Base exception class for this module.

exception polyglot.detect.base.UnknownLanguage[source]¶
Bases: polyglot.detect.base.Error
Raised if we cannot detect the language of a text snippet.
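A minimal usage sketch of the detector (the sample sentence is arbitrary, and the language attribute is an assumption mirroring Text.language from the tutorial):
from polyglot.detect import Detector

detector = Detector(u"Das ist ein kleiner deutscher Text.")
# `language` is assumed to hold the most probable language.
print(detector.language.code, detector.language.name)
print(detector.reliable)  # False if the best-effort strategy was used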
polyglot.detect.langids module¶
Module contents¶
class polyglot.detect.Detector(text, quiet=False)[source]¶
Bases: object
Detect the language used in a snippet of text.

detect(text)[source]¶
Decide which language is used to write the text.
The method first tries to detect the language with high reliability. If that is not possible, it switches to a best-effort strategy.
Parameters: text (string) – A snippet of text; the longer it is, the more reliably we can detect the language used to write it.
polyglot.mapping package¶
Subpackages¶
Submodules¶
polyglot.mapping.base module¶
Supports word embeddings.
class polyglot.mapping.base.CountedVocabulary(word_count=None)[source]¶
Bases: polyglot.mapping.base.OrderedVocabulary
List of words and counts sorted according to word count.

classmethod from_textfile(textfile, workers=1, job_size=1000)[source]¶
Count the set of words that appear in a text file.
Parameters:
- textfile (string) – The name of the text file or a TextFile object.
- min_count (integer) – Minimum number of times a word/token must appear in the document to be considered part of the vocabulary.
- workers (integer) – Number of parallel workers reading the file simultaneously.
- job_size (integer) – Size of the batch sent to each worker.
- most_frequent (integer) – If no min_count is specified, consider the most frequent k words for the vocabulary.
Returns: A vocabulary of the most frequent words appearing in the document.

static from_vocabfile(filename)[source]¶
Construct a CountedVocabulary out of a vocabulary file.
Note
The file has the following format:
word1 count1
word2 count2
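A short sketch of building a counted vocabulary from a corpus (the file name corpus.txt is hypothetical):
from polyglot.mapping.base import CountedVocabulary

# Count tokens in a text file with two parallel workers.
vocab = CountedVocabulary.from_textfile("corpus.txt", workers=2)
# Words are sorted by count, so the head of the list holds the most frequent tokens.
print(vocab.words[:10])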
class polyglot.mapping.base.OrderedVocabulary(words=None)[source]¶
Bases: polyglot.mapping.base.VocabularyBase
An ordered list of words/tokens according to their frequency.
Note
The word order is assumed to be sorted by frequency; the most frequent words appear first in the list.

word_id¶
dictionary – Mapping from words to IDs.

id_word¶
dictionary – A reverse map of word_id.

class polyglot.mapping.base.VocabularyBase(words=None)[source]¶
Bases: object
A set of words/tokens that have consistent IDs.
Note
Words will be sorted according to their lexicographic order.

word_id¶
dictionary – Mapping from words to IDs.

id_word¶
dictionary – A reverse map of word_id.

classmethod from_vocabfile(filename)[source]¶
Construct a CountedVocabulary out of a vocabulary file.
Note
The file has the following format:
word1
word2

sanitize_words(words)[source]¶
Guarantees that all textual symbols are unicode.
Note
We convert only strings, not numbers, to unicode. We assume that the strings are encoded in utf-8.

words¶
Ordered list of words according to their IDs.
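A sketch of the ID mappings (that IDs follow the given word order is an assumption):
from polyglot.mapping.base import OrderedVocabulary

vocab = OrderedVocabulary(words=["the", "of", "and"])  # most frequent first
print(vocab.word_id["of"])  # expected: 1
print(vocab.id_word[2])     # expected: 'and'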
polyglot.mapping.embeddings module¶
Defines classes related to mapping vocabulary to n-dimensional points.
class polyglot.mapping.embeddings.Embedding(vocabulary, vectors)[source]¶
Bases: object
Maps a vocabulary to d-dimensional points.

distances(word, words)[source]¶
Calculate Euclidean pairwise distances between word and words.
Parameters:
- word (string) – a single word.
- words (list) – a list of strings.
Returns: A numpy array of the distances.
Note
The L2 metric is used to calculate distances.

static from_word2vec(fname, fvocab=None, binary=False)[source]¶
Load the input-hidden weight matrix from the original C word2vec-tool format.
Note that the information stored in the file is incomplete (the binary tree is missing), so while you can query for word similarity etc., you cannot continue training with a model loaded this way.
binary is a boolean indicating whether the data is in binary word2vec format. Word counts are read from the fvocab filename, if set (this is the file generated by the -save-vocab flag of the original C tool).

most_frequent(k, inplace=False)[source]¶
Keep only the most frequent k words in the embeddings.

nearest_neighbors(word, top_k=10)[source]¶
Return the nearest k words to the given word.
Parameters:
- word (string) – a single word.
- top_k (integer) – how many neighbors to report.
Returns: A list of words sorted by distance; the closest is first.
Note
The L2 metric is used to calculate distances.

normalize_words(ord=2, inplace=False)[source]¶
Normalize the embeddings matrix row-wise.
Parameters: ord – normalization order. Possible values: {1, 2, ‘inf’, ‘-inf’}.

shape¶

words¶
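A sketch of loading and querying embeddings (the file name vectors.bin and the query word are hypothetical):
from polyglot.mapping.embeddings import Embedding

# Load vectors stored in the original C word2vec binary format.
embeddings = Embedding.from_word2vec("vectors.bin", binary=True)
# Assumes a normalized copy is returned when inplace=False.
embeddings = embeddings.normalize_words(ord=2)
print(embeddings.shape)
print(embeddings.nearest_neighbors("king", top_k=5))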
polyglot.mapping.expansion module¶
Module contents¶
class polyglot.mapping.CountedVocabulary(word_count=None)[source]¶
Bases: polyglot.mapping.base.OrderedVocabulary
List of words and counts sorted according to word count.

classmethod from_textfile(textfile, workers=1, job_size=1000)[source]¶
Count the set of words that appear in a text file.
Parameters:
- textfile (string) – The name of the text file or a TextFile object.
- min_count (integer) – Minimum number of times a word/token must appear in the document to be considered part of the vocabulary.
- workers (integer) – Number of parallel workers reading the file simultaneously.
- job_size (integer) – Size of the batch sent to each worker.
- most_frequent (integer) – If no min_count is specified, consider the most frequent k words for the vocabulary.
Returns: A vocabulary of the most frequent words appearing in the document.

static from_vocabfile(filename)[source]¶
Construct a CountedVocabulary out of a vocabulary file.
Note
The file has the following format:
word1 count1
word2 count2
class polyglot.mapping.OrderedVocabulary(words=None)[source]¶
Bases: polyglot.mapping.base.VocabularyBase
An ordered list of words/tokens according to their frequency.
Note
The word order is assumed to be sorted by frequency; the most frequent words appear first in the list.

word_id¶
dictionary – Mapping from words to IDs.

id_word¶
dictionary – A reverse map of word_id.

class polyglot.mapping.VocabularyBase(words=None)[source]¶
Bases: object
A set of words/tokens that have consistent IDs.
Note
Words will be sorted according to their lexicographic order.

word_id¶
dictionary – Mapping from words to IDs.

id_word¶
dictionary – A reverse map of word_id.

classmethod from_vocabfile(filename)[source]¶
Construct a CountedVocabulary out of a vocabulary file.
Note
The file has the following format:
word1
word2

sanitize_words(words)[source]¶
Guarantees that all textual symbols are unicode.
Note
We convert only strings, not numbers, to unicode. We assume that the strings are encoded in utf-8.

words¶
Ordered list of words according to their IDs.
class polyglot.mapping.Embedding(vocabulary, vectors)[source]¶
Bases: object
Maps a vocabulary to d-dimensional points.

distances(word, words)[source]¶
Calculate Euclidean pairwise distances between word and words.
Parameters:
- word (string) – a single word.
- words (list) – a list of strings.
Returns: A numpy array of the distances.
Note
The L2 metric is used to calculate distances.

static from_word2vec(fname, fvocab=None, binary=False)[source]¶
Load the input-hidden weight matrix from the original C word2vec-tool format.
Note that the information stored in the file is incomplete (the binary tree is missing), so while you can query for word similarity etc., you cannot continue training with a model loaded this way.
binary is a boolean indicating whether the data is in binary word2vec format. Word counts are read from the fvocab filename, if set (this is the file generated by the -save-vocab flag of the original C tool).

most_frequent(k, inplace=False)[source]¶
Keep only the most frequent k words in the embeddings.

nearest_neighbors(word, top_k=10)[source]¶
Return the nearest k words to the given word.
Parameters:
- word (string) – a single word.
- top_k (integer) – how many neighbors to report.
Returns: A list of words sorted by distance; the closest is first.
Note
The L2 metric is used to calculate distances.

normalize_words(ord=2, inplace=False)[source]¶
Normalize the embeddings matrix row-wise.
Parameters: ord – normalization order. Possible values: {1, 2, ‘inf’, ‘-inf’}.

shape¶

words¶
polyglot.tokenize package¶
Subpackages¶
Submodules¶
polyglot.tokenize.base module¶
Basic text segmenters.
class polyglot.tokenize.base.SentenceTokenizer(locale='en')[source]¶
Bases: polyglot.tokenize.base.Breaker
Segment text into sentences.

class polyglot.tokenize.base.WordTokenizer(locale='en')[source]¶
Bases: polyglot.tokenize.base.Breaker
Segment text into words or tokens.

Module contents¶

class polyglot.tokenize.WordTokenizer(locale='en')[source]¶
Bases: polyglot.tokenize.base.Breaker
Segment text into words or tokens.

class polyglot.tokenize.SentenceTokenizer(locale='en')[source]¶
Bases: polyglot.tokenize.base.Breaker
Segment text into sentences.
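A usage sketch, under the assumption that tokenizers, as subclasses of Breaker, expose a transform() method operating on a Sequence (see polyglot.base below); the method name is an assumption:
from polyglot.base import Sequence
from polyglot.tokenize import SentenceTokenizer, WordTokenizer

text = Sequence(u"Beautiful is better than ugly. Explicit is better than implicit.")

# `transform` is assumed to return a new Sequence whose indices mark
# token (or sentence) boundaries.
words = WordTokenizer(locale="en").transform(text)
sentences = SentenceTokenizer(locale="en").transform(text)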
Submodules¶
polyglot.base module¶
Basic data types.
class polyglot.base.Sequence(text)[source]¶
Bases: object
Text with indices indicating boundaries.

text¶

class polyglot.base.TextFile(file, delimiter=u'\n')[source]¶
Bases: object
Wrapper around text files.
It uses io.open to guarantee reading text files with unicode encoding. It has an iterator that supports arbitrary delimiters instead of only new lines.

delimiter¶
string – A string that defines the limit of each chunk.

file¶
string – A path to a file.

buf¶
StringIO – A buffer to store the results of peeking into the file.

apply(func, workers=1, job_size=10000)[source]¶
Apply func to lines of text, in parallel or sequentially.
Parameters: func – a function that takes a list of lines.

iter_delimiter(byte_size=8192)[source]¶
Generalization of the default file iteration, which is delimited by '\n'.
Note
The newline string can be arbitrarily long; it need not be restricted to a single character. You can also set the read size and control whether or not the newline string is left on the end of the iterated lines. Setting newline to '' is particularly good for use with an input file created with something like "os.popen('find -print0')".
Args:
byte_size (integer): Number of bytes to be read each time.

class polyglot.base.TextFiles(files, delimiter=u'\n')[source]¶
Bases: polyglot.base.TextFile
Interface for a sequence of files.

names¶
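A sketch of streaming a file chunk by chunk (the file name corpus.txt is hypothetical):
from polyglot.base import TextFile

corpus = TextFile("corpus.txt", delimiter=u"\n")
# iter_delimiter yields chunks split on the delimiter, reading 8192 bytes at a time.
for chunk in corpus.iter_delimiter(byte_size=8192):
    print(len(chunk))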
polyglot.decorators module¶
polyglot.downloader module¶
The Polyglot corpus and module downloader. This module defines several interfaces which can be used to download corpora, models, and other data packages that can be used with polyglot.
Downloading Packages¶
If called with no arguments, download() will display an interactive interface which can be used to download and install new packages. If Tkinter is available, a graphical interface will be shown; otherwise, a simple text interface will be provided.
Individual packages can be downloaded by calling the download() function with a single argument, giving the package identifier for the package that should be downloaded:
>>> download('treebank')
[polyglot_data] Downloading package 'treebank'...
[polyglot_data] Unzipping corpora/treebank.zip.
Polyglot also provides a number of “package collections”, consisting of a group of related packages. To download all packages in a collection, simply call download() with the collection’s identifier:
>>> download('all-corpora')
[polyglot_data] Downloading package 'abc'...
[polyglot_data] Unzipping corpora/abc.zip.
[polyglot_data] Downloading package 'alpino'...
[polyglot_data] Unzipping corpora/alpino.zip.
...
[polyglot_data] Downloading package 'words'...
[polyglot_data] Unzipping corpora/words.zip.
Download Directory¶
By default, packages are installed either in a system-wide directory (if Python has sufficient access to write to it) or in the current user’s home directory. However, the download_dir argument may be used to specify a different installation target, if desired.
See Downloader.default_download_dir() for a more detailed description of how the default download directory is chosen.
Polyglot Download Server¶
Before downloading any packages, the corpus and module downloader contacts the Polyglot download server to retrieve an index file describing the available packages. By default, this index file is loaded from http://nltk.googlecode.com/svn/trunk/polyglot_data/index.xml.
If necessary, it is possible to create a new Downloader object, specifying a different URL for the package index file.
Usage:
python polyglot/downloader.py [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS
or:
python -m polyglot.downloader [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS
class polyglot.downloader.Collection(id, children, name=None, **kw)[source]¶
Bases: object
A directory entry for a collection of downloadable packages. These entries are extracted from the XML index file that is downloaded by Downloader.

children = None¶
A list of the Collections or Packages directly contained by this collection.

id = None¶
A unique identifier for this collection.

name = None¶
A string name for this collection.

packages = None¶
A list of Packages contained by this collection or any collections it recursively contains.
class polyglot.downloader.Downloader(server_index_url=None, source=None, download_dir=None)[source]¶
Bases: object
A class used to access the Polyglot data server, which can be used to download corpora and other data packages.

DEFAULT_SOURCE = u'mirror'¶
The source for the index and other data files. Two values are supported: ‘mirror’ or ‘google’.
For ‘mirror’, DEFAULT_URL should be set to the prefix of the mirrored directory, like ‘http://address.of.mirror/dir/’, and the downloader expects a file named ‘index.json’ as the index file.
For ‘google’, DEFAULT_URL should be the Google Cloud bucket, and the downloader obtains the index from the Google API.
Set DEFAULT_URL accordingly.

DEFAULT_URL = u'http://polyglot.cs.stonybrook.edu/~polyglot/'¶
The default URL for the Polyglot data server’s index. An alternative URL can be specified when creating a new Downloader object. For ‘google’ as DEFAULT_SOURCE, ‘polyglot-models’ is the default place. For ‘mirror’ as DEFAULT_SOURCE, use a proper mirror.

INDEX_TIMEOUT = 3600¶
The amount of time after which the cached copy of the data server index will be considered ‘stale’ and will be re-downloaded.

INSTALLED = u'installed'¶
A status string indicating that a package or collection is installed and up-to-date.

LANG_PREFIX = u'LANG:'¶
Collection ID prefix for collections that gather the models of a specific language.

NOT_INSTALLED = u'not installed'¶
A status string indicating that a package or collection is not installed.

PARTIAL = u'partial'¶
A status string indicating that a collection is partially installed (i.e., only some of its packages are installed).

STALE = u'out of date'¶
A status string indicating that a package or collection is corrupt or out-of-date.

TASK_PREFIX = u'TASK:'¶
Collection ID prefix for collections that gather the models of a specific task.

default_download_dir()[source]¶
Return the directory to which packages will be downloaded by default. This value can be overridden using the constructor, or on a case-by-case basis using the download_dir argument when calling download(). The default directory is ~/polyglot_data.

download(info_or_id=None, download_dir=None, quiet=False, force=False, prefix=u'[polyglot_data] ', halt_on_error=True, raise_on_error=False)[source]¶

download_dir¶
The default directory to which packages will be downloaded. This defaults to the value returned by default_download_dir(). To override this default on a case-by-case basis, use the download_dir argument when calling download().

get_collection(lang=None, task=None)[source]¶
Return the collection that represents a specific language or task.
Parameters:
- lang (string) – Language code.
- task (string) – Task name.

index()[source]¶
Return the XML index describing the packages available from the data server. If necessary, this index will be downloaded from the data server.

list(download_dir=None, show_packages=False, show_collections=True, header=True, more_prompt=False, skip_installed=False)[source]¶

status(info_or_id, download_dir=None)[source]¶
Return a constant describing the status of the given package or collection. Status can be one of INSTALLED, NOT_INSTALLED, STALE, or PARTIAL.

supported_language(lang)[source]¶
Return True if polyglot supports the language.
Parameters: lang (string) – Language code.

supported_languages(task=None)[source]¶
Languages that are covered by a specific task.
Parameters: task (string) – Task name.

supported_tasks(lang=None)[source]¶
Tasks that are covered by a specific language.
Parameters: lang (string) – Language code.

update(quiet=False, prefix=u'[polyglot_data] ')[source]¶
Re-download any packages whose status is STALE.

url¶
The URL for the data server’s index file.
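A minimal programmatic sketch (the task name and the package identifier embeddings2.en are assumptions):
from polyglot.downloader import Downloader

downloader = Downloader()
print(downloader.supported_languages(task="embeddings2"))  # languages covered by a task
print(downloader.status("embeddings2.en"))  # INSTALLED, NOT_INSTALLED, STALE, or PARTIAL
downloader.download("embeddings2.en")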
class polyglot.downloader.DownloaderMessage[source]¶
Bases: object
A status message object, used by incr_download to communicate its progress.

class polyglot.downloader.ErrorMessage(package, message)[source]¶
Bases: polyglot.downloader.DownloaderMessage
Data server encountered an error.

exception polyglot.downloader.ExceptionBase[source]¶
Bases: exceptions.Exception
General base exception for the downloader module.

class polyglot.downloader.FinishCollectionMessage(collection)[source]¶
Bases: polyglot.downloader.DownloaderMessage
Data server has finished working on a collection of packages.

class polyglot.downloader.FinishDownloadMessage(package)[source]¶
Bases: polyglot.downloader.DownloaderMessage
Data server has finished downloading a package.

class polyglot.downloader.FinishPackageMessage(package)[source]¶
Bases: polyglot.downloader.DownloaderMessage
Data server has finished working on a package.

class polyglot.downloader.FinishUnzipMessage(package)[source]¶
Bases: polyglot.downloader.DownloaderMessage
Data server has finished unzipping a package.

exception polyglot.downloader.LanguageNotSupported[source]¶
Bases: polyglot.downloader.ExceptionBase
Raised if the language is not covered by polyglot.

class polyglot.downloader.Package(id, url, name=None, subdir=u'', size=None, filename=u'', task=u'', language=u'', attrs=None, **kw)[source]¶
Bases: object
A directory entry for a downloadable package. These entries are extracted from the XML index file that is downloaded by Downloader. Each package consists of a single file; but if that file is a zip file, then it can be automatically decompressed when the package is installed.

attrs = None¶
Extra attributes generated by Google Cloud Storage.

filename = None¶
The filename that should be used for this package’s file.

id = None¶
A unique identifier for this package.

language = None¶
The language code this package belongs to.

name = None¶
A string name for this package.

size = None¶
The filesize (in bytes) of the package file.

subdir = None¶
The subdirectory where this package should be installed. E.g., 'corpora' or 'taggers'.

task = None¶
The task this package is serving.

url = None¶
A URL that can be used to download this package’s file.

class polyglot.downloader.ProgressMessage(progress)[source]¶
Bases: polyglot.downloader.DownloaderMessage
Indicates how much progress the data server has made.

class polyglot.downloader.SelectDownloadDirMessage(download_dir)[source]¶
Bases: polyglot.downloader.DownloaderMessage
Indicates what download directory the data server is using.

class polyglot.downloader.StaleMessage(package)[source]¶
Bases: polyglot.downloader.DownloaderMessage
The package download file is out-of-date or corrupt.

class polyglot.downloader.StartCollectionMessage(collection)[source]¶
Bases: polyglot.downloader.DownloaderMessage
Data server has started working on a collection of packages.

class polyglot.downloader.StartDownloadMessage(package)[source]¶
Bases: polyglot.downloader.DownloaderMessage
Data server has started downloading a package.

class polyglot.downloader.StartPackageMessage(package)[source]¶
Bases: polyglot.downloader.DownloaderMessage
Data server has started working on a package.

class polyglot.downloader.StartUnzipMessage(package)[source]¶
Bases: polyglot.downloader.DownloaderMessage
Data server has started unzipping a package.

exception polyglot.downloader.TaskNotSupported[source]¶
Bases: polyglot.downloader.ExceptionBase
Raised if the task is not covered by polyglot.

class polyglot.downloader.UpToDateMessage(package)[source]¶
Bases: polyglot.downloader.DownloaderMessage
The package download file is already up-to-date.
polyglot.downloader.build_index(root, base_url)[source]¶
Create a new data.xml index file by combining the xml description files for various packages and collections. root should be the path to a directory containing the package xml and zip files and the collection xml files. The root directory is expected to have the following subdirectories:

root/
  packages/ .................. subdirectory for packages
    corpora/ ................. zip & xml files for corpora
    grammars/ ................ zip & xml files for grammars
    taggers/ ................. zip & xml files for taggers
    tokenizers/ .............. zip & xml files for tokenizers
    etc.
  collections/ ............... xml files for collections

For each package, there should be two files: package.zip (where package is the package name), which contains the package itself as a compressed zip file, and package.xml, which is an xml description of the package. The zipfile package.zip should expand to a single subdirectory named package/. The base filename package must match the identifier given in the package’s xml file.
For each collection, there should be a single file collection.zip describing the collection, where collection is the name of the collection.
All identifiers (for both packages and collections) must be unique.
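A sketch of building and writing an index (the paths and URL are hypothetical, and that build_index returns an XML element is an assumption):
import xml.etree.ElementTree as ET
from polyglot.downloader import build_index

# Combine the package/collection xml files under the root into one index.
index = build_index("/data/polyglot_root", "http://example.com/polyglot_data/")
ET.ElementTree(index).write("index.xml")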
polyglot.load module¶
polyglot.mixins module¶
class polyglot.mixins.BlobComparableMixin[source]¶
Bases: polyglot.mixins.ComparableMixin
Allows blob objects to be comparable with both strings and blobs.

class polyglot.mixins.ComparableMixin[source]¶
Bases: object
Implements rich comparison operators for an object.

class polyglot.mixins.StringlikeMixin[source]¶
Bases: object
Makes blob objects behave like Python strings.
Expects classes that use this mixin to have a _strkey() method that returns the string to apply string methods to. Using _strkey() instead of __str__ ensures consistent behavior between Python 2 and 3.

ends_with(suffix, start=0, end=9223372036854775807)[source]¶
Returns True if the blob ends with the given suffix.

endswith(suffix, start=0, end=9223372036854775807)[source]¶
Returns True if the blob ends with the given suffix.

find(sub, start=0, end=9223372036854775807)[source]¶
Behaves like the built-in str.find() method. Returns an integer, the index of the first occurrence of the substring argument sub in the sub-string given by [start:end].

format(*args, **kwargs)[source]¶
Perform a string formatting operation, like the built-in str.format(*args, **kwargs). Returns a blob object.

index(sub, start=0, end=9223372036854775807)[source]¶
Like blob.find(), but raises ValueError when the substring is not found.

join(iterable)[source]¶
Behaves like the built-in str.join(iterable) method, except it returns a blob object.
Returns a blob which is the concatenation of the strings or blobs in the iterable.

replace(old, new, count=9223372036854775807)[source]¶
Return a new blob object with all occurrences of old replaced by new.

rfind(sub, start=0, end=9223372036854775807)[source]¶
Behaves like the built-in str.rfind() method. Returns an integer, the index of the last (right-most) occurrence of the substring argument sub in the sub-sequence given by [start:end].

rindex(sub, start=0, end=9223372036854775807)[source]¶
Like blob.rfind(), but raises ValueError when the substring is not found.

starts_with(prefix, start=0, end=9223372036854775807)[source]¶
Returns True if the blob starts with the given prefix.

startswith(prefix, start=0, end=9223372036854775807)[source]¶
Returns True if the blob starts with the given prefix.
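Because polyglot.text blobs mix in StringlikeMixin (an assumption consistent with the tutorial examples), the familiar string queries work directly on a Text:
from polyglot.text import Text

zen = Text("Beautiful is better than ugly.")
print(zen.startswith("Beautiful"))     # True
print(zen.find("better"))              # index of the first occurrence
print(zen.replace("ugly", "complex"))  # returns a new blob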
polyglot.text module¶
polyglot.utils module¶
Collection of general utilities.