KoNLPy is a Python package for natural language processing (NLP) of the Korean language. For installation directions, see the installation section below.
>>> from konlpy.tag import Kkma
>>> from konlpy.utils import pprint
>>> kkma = Kkma()
>>> pprint(kkma.sentences(u'저는 대학생이구요. 소프트웨어 관련학과 입니다.'))
[저는 대학생이구요.,
소프트웨어 관련학과 입니다.]
>>> pprint(kkma.nouns(u'대학에서 DB, 통계학, 이산수학 등을 배웠지만...'))
[대학,
통계학,
이산,
이산수학,
수학,
등]
>>> pprint(kkma.pos(u'자주 사용을 안하다보니 모두 까먹은 상태입니다.'))
[(자주, MAG),
(사용, NNG),
(을, JKO),
(안하, VV),
(다, ECS),
(보, VXV),
(니, ECD),
(모두, MAG),
(까먹, VV),
(은, ETD),
(상태, NNG),
(이, VCP),
(ㅂ니다, EFN),
(., SF)]
For more on how to use KoNLPy, see the API documentation.
Korean, the 13th most widely spoken language in the world, is a beautiful yet complex language. Numerous researchers have built a myriad of Korean NLP engines to computationally extract meaningful features from its labyrinthine text.
KoNLPy does not aim to create yet another engine, but to unify the existing ones, build upon their shoulders, and see one step further. It is written in Python not only because of the language’s simplicity and elegance, but also because of its powerful string-processing modules and its applicability to various tasks - including crawling, Web programming, and data analysis.
The three main philosophies of this project are:
Please report when you think any have gone stale.
KoNLPy isn’t perfect, but it will continuously evolve and you are invited to participate!
Found a bug? Have a good idea for improving KoNLPy? Visit the KoNLPy GitHub page and suggest an idea or make a pull request.
KoNLPy is not registered in PyPI yet. Please install from source until further notice.
$ pip install git+https://github.com/e9t/konlpy.git
or
$ git clone https://github.com/e9t/konlpy.git
$ cd konlpy
$ python setup.py install
In order to use the MeCab morpheme analyzer in KoNLPy, install the following:
$ wget https://bitbucket.org/eunjeon/mecab-ko/downloads/mecab-0.996-ko-0.9.1.tar.gz
$ tar zxfv mecab-0.996-ko-0.9.1.tar.gz
$ cd mecab-0.996-ko-0.9.1
$ ./configure
$ make
$ make check
$ sudo make install
$ wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-1.6.1-20140814.tar.gz
$ tar zxfv mecab-ko-dic-1.6.1-20140814.tar.gz
$ cd mecab-ko-dic-1.6.1-20140814
$ ./configure
$ sudo ldconfig
$ make
$ sudo sh -c 'echo "dicdir=/usr/local/lib/mecab/dic/mecab-ko-dic" > /usr/local/etc/mecabrc'
$ sudo make install
$ git clone https://github.com/HiroyukiHaga/mecab-python.git
$ cd mecab-python
$ python setup.py build
$ sudo python setup.py install
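After these steps, a quick sanity check can confirm that the MeCab binding and the Korean dictionary are visible to KoNLPy. This is a minimal sketch, assuming KoNLPy itself is already installed and the dictionary sits at the default location used above.
# -*- coding: utf-8 -*-
# Quick sanity check (a sketch): assumes mecab-ko, mecab-ko-dic and
# mecab-python were installed as above, and that KoNLPy is installed.
from konlpy.tag import Mecab

mecab = Mecab()   # dicpath defaults to /usr/local/lib/mecab/dic/mecab-ko-dic
print mecab.pos(u'안녕하세요')   # prints a list of (morpheme, tag) pairs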
KoNLPy’s compatibility with Windows is not stable yet. However, if you would still like to put in some effort, you can try the following.
Download the most recent release of KoNLPy from GitHub and extract the files. (Alternatively, you can just clone the repository.)
- Current release: 0.3.0
In a terminal, go to the KoNLPy directory, and type python setup.py install.
Note
If you have installation errors with dependency packages, try installing them with Christoph Gohlke’s Windows Binaries [1]:
(Optional) Download, extract [2], and install the most recent version of MeCab from the following links:
[1] win-amd64 for 64-bit Windows, win32 for 32-bit Windows.
[2] Having MinGW/MSYS or Cygwin installed may be more convenient. Otherwise, you can use 7zip for the extraction of tar files.
Morphological analysis is the identification of the structure of morphemes and other linguistic units, such as root words, affixes, or parts of speech.
POS (part-of-speech) tagging is the process of marking up morphemes in a phrase, based on their definitions and contexts. For example:
가방에 들어가신다 -> 가방/NNG + 에/JKM + 들어가/VV + 시/EPH + ㄴ다/EFN
In KoNLPy, there are several different options you can choose for POS tagging. All have the same input-output structure; the input is a phrase, and the output is a list of tagged morphemes.
For detailed usage instructions see the tag Package.
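For instance, because every tagger class exposes the same pos method, one class can be swapped for another without changing the surrounding code. The snippet below is a small illustration using the Kkma and Hannanum classes described later in this document; the exact tags returned differ per tagger, as the comparison below shows.
# -*- coding: utf-8 -*-
from konlpy.tag import Kkma, Hannanum

phrase = u'가방에 들어가신다'
for tagger in [Kkma(), Hannanum()]:
    # every tagger takes a phrase and returns a list of (morpheme, tag) pairs
    print tagger.__class__.__name__, tagger.pos(phrase)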
Below, we present a time and performance analysis of the pos method for each of the classes in the tag Package.
The performance evaluation here is replaced with a comparison of tagging results for several sample sentences.
“저는 대학생이구요. 소프트웨어 관련학과 입니다.”
“롯데마트의 흑마늘양념치킨이 논란이 되고 있는 가운데, 자연주의쇼핑몰에서부터만큼은 안정적으로 운영되고 있다.”
kkma:
롯데/NNP + 마트/NNG + 의/JKG + 흑/NNG + 마늘/NNG + 양념/NNG + 치킨/NNG + 이/JKS + 논란/NNG + 이/JKC + 되/VV + 고/ECE + 있/VXV + 는/ETD + 가운데/NNG + ,/SP + 자연주의/NNG + 쇼핑몰/NNG + 에서/JKM + 부터/JX + 만큼/NNG + 은/JX + 안정적/NNG + 으로/JKM + 운영/NNG + 되/XSV + 고/ECE + 있/VXV + 다/EFN + ./SF
hannanum:
롯데마트/N + 의/J + 흑마늘양념치킨/N + 이/J + 논란/N + 이/J + 되/P + 고/E + 있/P + 는/E + 가운데/N + ,/S + 자연주의쇼핑몰/N + 에서부터만큼은/J + 안정적/N + 으로/J + 운영/N + 되/X + 고/E + 있/P + 다/E + ./S
mecab:
롯데마트/NNP + 의/JKG + 흑마/NNG + 늘/MAG + 양념치킨/NNP + 이/JKS + 논란/NNG + 이/JKS + 되/VV + 고/EC + 있/VX + 는/ETM + 가운데/NNG + ,/SC + 자연/NNG + 주/NNG + 의/JKG + 쇼핑몰/NNG + 에서부터/JKB + 만큼/JKB + 은/JX + 안정/NNG + 적/XSN + 으로/JKB + 운영/NNG + 되/XSV + 고/EC + 있/VX + 다/EF + ./SF
[1] All time analyses in this document were performed with time on a Thinkpad X1 Carbon (2013) and KoNLPy v0.3.
[2] Average of five consecutive runs.
[3] Average of ten consecutive runs.
[4] The current hannanum module raises a java.lang.ArrayIndexOutOfBoundsException: 10000 exception if the number of characters is too large.
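If you would like to reproduce a rough timing comparison yourself, a sketch along the following lines can be used. This is not the script behind the numbers referenced above; results will vary with your machine and JVM warm-up.
# -*- coding: utf-8 -*-
# A rough timing sketch; not the script used for the cited measurements.
import time
from konlpy.tag import Kkma, Hannanum

phrase = u'롯데마트의 흑마늘양념치킨이 논란이 되고 있는 가운데, 자연주의쇼핑몰에서부터만큼은 안정적으로 운영되고 있다.'
for tagger in [Kkma(), Hannanum()]:
    tagger.pos(phrase)        # first call loads dictionaries (slow)
    start = time.time()
    tagger.pos(phrase)        # second call measures steady-state speed
    print tagger.__class__.__name__, time.time() - start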
See also
Korean POS tags comparison chart
Compare POS tags between several Korean analytic projects. (In Korean)
Dictionaries are used for morphological analysis and POS tagging, and are built from corpora.
A dictionary created with the KAIST corpus. (4.7MB)
Located at ./konlpy/java/data/kE/dic_system.txt. Part of this file is shown below:
...
나라경제 ncn
나라기획 nqq
나라기획회장 ncn
나라꽃 ncn
나라님 ncn
나라도둑 ncn
나라따르 pvg
나라링링프로덕션 ncn
나라말 ncn
나라망신 ncn
나라박물관 ncn
나라발전 ncpa
나라별 ncn
나라부동산 nqq
나라사랑 ncn
나라살림 ncpa
나라시 nqq
나라시마 ncn
...
To add your own terms, modify ./konlpy/java/data/kE/dic_user.txt, as sketched below.
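For example, assuming dic_user.txt follows the same one-entry-per-line "word tag" format as the dic_system.txt excerpt above, a new common noun could be appended like this (the word used is purely illustrative):
# -*- coding: utf-8 -*-
# Illustrative only: appends an entry in the same "word tag" format as the
# dic_system.txt excerpt above. The word itself is hypothetical.
with open('./konlpy/java/data/kE/dic_user.txt', 'a') as f:
    f.write(u'나라사랑꽃 ncn\n'.encode('utf-8'))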
A dictionary created with the Sejong corpus. (32MB)
It is included within the Kkma .jar file, so in order to see the dictionary files, check out KKMA’s mirror. Part of kcc.dic is shown below:
아니/IC
후우/IC
그래서/MAC
그러나/MAC
그러니까/MAC
그러면/MAC
그러므로/MAC
그런데/MAC
그리고/MAC
따라서/MAC
하지만/MAC
...
A CSV-formatted dictionary created with the Sejong corpus. (346MB)
The compiled version is located at /usr/local/lib/mecab/dic/mecab-ko-dic (or the path you assigned during installation), and you can see the original files in the source code. Part of CoinedWord.csv is shown below:
가오티,0,0,0,NNG,*,F,가오티,*,*,*,*,*
갑툭튀,0,0,0,NNG,*,F,갑툭튀,*,*,*,*,*
강퇴,0,0,0,NNG,*,F,강퇴,*,*,*,*,*
개드립,0,0,0,NNG,*,T,개드립,*,*,*,*,*
갠소,0,0,0,NNG,*,F,갠소,*,*,*,*,*
고퀄,0,0,0,NNG,*,T,고퀄,*,*,*,*,*
광삭,0,0,0,NNG,*,T,광삭,*,*,*,*,*
광탈,0,0,0,NNG,*,T,광탈,*,*,*,*,*
굉천,0,0,0,NNG,*,T,굉천,*,*,*,*,*
국을,0,0,0,NNG,*,T,국을,*,*,*,*,*
귀요미,0,0,0,NNG,*,F,귀요미,*,*,*,*,*
...
To add your own terms, see here.
Note
You can add new words either to the system dictionaries or the user dictionaries. However, there is a slight difference between the two choices:
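As an illustration only, a user entry following the same CSV pattern as the CoinedWord.csv excerpt above might look like the line below. The word is hypothetical, and the T/F field, which appears to mark whether the last syllable ends in a final consonant, must match the word you add.
먹방,0,0,0,NNG,*,T,먹방,*,*,*,*,*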
Below is a code example that crawls a National Assembly bill from the web, extracts nouns, and draws a word cloud - from head to tail in Python.
You can change the bill number (i.e., bill_num) and see how the word clouds differ per bill. (e.g., '1904882', '1904883', 'ZZ19098')
#! /usr/bin/python2.7
# -*- coding: utf-8 -*-

from collections import Counter
import random
import urllib
import webbrowser

from konlpy.tag import Hannanum
from lxml import html
import pytagcloud  # requires Korean font support

# random RGB color for each tag
r = lambda: random.randint(0, 255)
color = lambda: (r(), r(), r())

def get_bill_text(billnum):
    # fetch the full text of a bill from pokr.kr
    url = 'http://pokr.kr/bill/%s/text' % billnum
    response = urllib.urlopen(url).read().decode('utf-8')
    page = html.fromstring(response)
    text = page.xpath(".//div[@id='bill-sections']/pre/text()")[0]
    return text

def get_tags(text, ntags=50, multiplier=10):
    # extract nouns with Hannanum and convert the counts to pytagcloud tags
    h = Hannanum()
    nouns = h.nouns(text)
    count = Counter(nouns)
    return [{'color': color(), 'tag': n, 'size': c * multiplier}
            for n, c in count.most_common(ntags)]

def draw_cloud(tags, filename, fontname='Noto Sans CJK', size=(800, 600)):
    # render the word cloud to an image file and open it in a browser
    pytagcloud.create_tag_image(tags, filename, fontname=fontname, size=size)
    webbrowser.open(filename)

bill_num = '1904882'
text = get_bill_text(bill_num)
tags = get_tags(text)
print tags
draw_cloud(tags, 'wordcloud.png')
Note
The PyTagCloud release on PyPI may not be sufficient for drawing word clouds in Korean. You may manually add eligible fonts that support the Korean language, or install the Korean-supported version here.
KoNLPy has tests to evaluate its quality. To run the tests, use the code below.
$ pip install pytest
$ cd konlpy
$ py.test
Note
Please report if you know of any other NLP engines or corpora that are not included in this list. Last updated on August 24, 2014.
K-LIWC, Ajou University
Korean XTAG, University of Pennsylvania
Speller, Pusan National University
UTagger, University of Ulsan
(No name), Korea University
KAIST corpus, KAIST, 1997-2005.
Sejong corpus, National Institute of the Korean Language, 1998-2007.
Initializes the Java virtual machine (JVM).
Parameters: jvmpath – The path of the JVM. If left empty, it is inferred by jpype.getDefaultJVMPath().
Bases: pprint.PrettyPrinter
Overridden method to enable Unicode pretty printing.
Converts a unicode character to hex.
>>> char2hex(u'음')
'0xc74c'
Concatenates lines into a unified string.
Find concordances of a phrase in a text.
The leftmost numbers are indices that indicate the location of the phrase in the text (in terms of tokens). The string that follows is the part of the text surrounding the phrase at the given index.
>>> from konlpy.corpus import kolaw
>>> from konlpy.tag import Mecab
>>> from konlpy import utils
>>> constitution = kolaw.open('constitution.txt').read()
>>> idx = utils.concordance(u'대한민국', constitution)
0 대한민국헌법 유구한 역사와
9 대한국민은 3·1운동으로 건립된 대한민국임시정부의 법통과 불의에
98 총강 제1조 ① 대한민국은 민주공화국이다. ②대한민국의
100 ① 대한민국은 민주공화국이다. ②대한민국의 주권은 국민에게
110 나온다. 제2조 ① 대한민국의 국민이 되는
126 의무를 진다. 제3조 대한민국의 영토는 한반도와
133 부속도서로 한다. 제4조 대한민국은 통일을 지향하며,
147 추진한다. 제5조 ① 대한민국은 국제평화의 유지에
787 군무원이 아닌 국민은 대한민국의 영역안에서는 중대한
1836 파견 또는 외국군대의 대한민국 영역안에서의 주류에
3620 경제 제119조 ① 대한민국의 경제질서는 개인과
>>> idx
[0, 9, 98, 100, 110, 126, 133, 147, 787, 1836, 3620]
Converts a hex character to unicode.
>>> print hex2char('c74c')
음
>>> print hex2char('0xc74c')
음
Text file loader.
Partitions a list to several parts using indices.
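A quick doctest-style illustration of the intended behavior is given below; the exact output is an assumption (split before each given index), not taken from the library itself.
>>> from konlpy.utils import partition
>>> partition([1, 2, 3, 4, 5], [2, 4])   # assumed behavior: split before indices 2 and 4
[[1, 2], [3, 4], [5]]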
Unicode pretty printer.
>>> import pprint, konlpy
>>> pprint.pprint([u"Print", u"유니코드", u"easily"])
[u'Print', u'\uc720\ub2c8\ucf54\ub4dc', u'easily']
>>> konlpy.utils.pprint([u"Print", u"유니코드", u"easily"])
['Print', '유니코드', 'easily']
Replaces some ambiguous punctuation marks with simpler ones.
Note
Initial runs of each class method may require some time to load dictionaries (< 1 min). Second runs should be faster.
Wrapper for JHannanum.
JHannanum is a morphological analyzer and POS tagger written in Java, and developed by the Semantic Web Research Center (SWRC) at KAIST since 1999.
from konlpy.tag import Hannanum
hannanum = Hannanum()
print hannanum.morph(u'롯데마트의 흑마늘 양념 치킨이 논란이 되고 있다.')
print hannanum.nouns(u'다람쥐 헌 쳇바퀴에 타고파')
print hannanum.pos(u'웃으면 더 행복합니다!')
Parameters: jvmpath – The path of the JVM passed to init_jvm().
Morphological analyzer.
This analyzer consists of two parts: 1) Dictionary search (chart), 2) Unclassified term segmentation.
Noun extractor.
POS tagger.
This tagger is HMM based, and calculates the probability of tags.
Parameters: ntags – The number of tags. It can be either 9 or 22.
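For example, the richer 22-tag set can be requested by passing ntags when tagging. This is a sketch, assuming pos accepts ntags as a keyword argument as documented above.
# -*- coding: utf-8 -*-
from konlpy.tag import Hannanum

hannanum = Hannanum()
# assumed keyword usage; see the ntags parameter described above
print hannanum.pos(u'웃으면 더 행복합니다!', ntags=22)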
Wrapper for Kkma.
Kkma is a morphological analyzer and natural language processing system written in Java, developed by the Intelligent Data Systems (IDS) Laboratory at SNU.
from konlpy.tag import Kkma
kkma = Kkma()
print kkma.sentences(u'저는 대학생이구요. 소프트웨어 관련학과 입니다.')
print kkma.nouns(u'대학에서 DB, 통계학, 이산수학 등을 배웠지만...')
print kkma.pos(u'자주 사용을 안하다보니 모두 까먹은 상태입니다.')
Parameters: jvmpath – The path of the JVM passed to init_jvm().
Noun extractor.
POS tagger.
Sentence detection.
Wrapper for MeCab-ko morphological analyzer.
MeCab, originally a Japanese morphological analyzer and POS tagger developed by the Graduate School of Informatics at Kyoto University, was modified into MeCab-ko by the Eunjeon Project to adapt it to the Korean language.
In order to use MeCab-ko within KoNLPy, follow the directions in Optional installations.
from konlpy.tag import Mecab
# MeCab installation needed
mecab = Mecab()
print mecab.nouns(u'우리나라에는 무릎 치료를 잘하는 정형외과가 없는가!')
print mecab.pos(u'자연주의 쇼핑몰은 어떤 곳인가?')
Parameters: dicpath – The path of the MeCab-ko dictionary.
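If the dictionary was installed somewhere other than the default location, the path can be passed explicitly. The path below is the one used in the installation section above.
# -*- coding: utf-8 -*-
from konlpy.tag import Mecab

# dictionary path from the installation section above
mecab = Mecab(dicpath='/usr/local/lib/mecab/dic/mecab-ko-dic')
print mecab.pos(u'자연주의 쇼핑몰은 어떤 곳인가?')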
Noun extractor.
POS tagger.
See also
Korean POS tags comparison chart
Compare POS tags between several Korean analytic projects. (In Korean)
Loader for corpora. The following corpora are currently available:
>>> from konlpy.corpus import kolaw
>>> fids = kolaw.fileids()
>>> fobj = kolaw.open(fids[0])
>>> print fobj.read(140)
대한민국헌법
유구한 역사와 전통에 빛나는 우리 대한국민은 3·1운동으로 건립된 대한민국임시정부의 법통과 불의에 항거한 4·19민주이념을 계승하고, 조국의 민주개혁과 평화적 통일의 사명에 입각하여 정의·인도와 동포애로써 민족의 단결을 공고히 하고, 모든 사회적 폐습과 불의를 타파하며, 자율과 조화를 바
Absolute path of corpus file. If filename is None, returns absolute path of corpus.
Parameters: filename – Name of a particular file in the corpus.
List of file IDs in the corpus.
Method to open a file in the corpus. Returns a file object.
Parameters: filename – Name of a particular file in the corpus.
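For instance, fileids and abspath can be combined to inspect what is available before opening a file; the exact path returned depends on where KoNLPy is installed.
>>> from konlpy.corpus import kolaw
>>> fids = kolaw.fileids()          # names of the files in the corpus
>>> print kolaw.abspath(fids[0])    # absolute path of a particular file
>>> print kolaw.abspath()           # absolute path of the corpus itself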