unihan-etl¶
unihan-etl - ETL tool for Unicode’s Han Unification (UNIHAN) database releases. unihan-etl retrieves (downloads), extracts (unzips), and transforms the database from Unicode’s website into either a flat, tabular format or a structured, tree-like format.
unihan-etl can be used as a python library through its API to retrieve data as python objects, or through the CLI to retrieve a CSV, JSON, or YAML file.
Part of the cihai project. Similar project: libUnihan.
UNIHAN Version compatibility (as of unihan-etl v0.10.0): 11.0.0 (released 2018-05-08, revision 25).
UNIHAN’s data is dispersed across multiple files in the following format:
U+3400 kCantonese jau1
U+3400 kDefinition (same as U+4E18 丘) hillock or mound
U+3400 kMandarin qiū
U+3401 kCantonese tim2
U+3401 kDefinition to lick; to taste, a mat, bamboo bark
U+3401 kHanyuPinyin 10019.020:tiàn
U+3401 kMandarin tiàn
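The raw files are tab-delimited: each line carries a codepoint, a field name, and a value. Collating such lines into one record per codepoint can be sketched in a few lines of Python (parse_lines is a hypothetical helper for illustration, not unihan-etl’s actual implementation):

```python
def parse_lines(lines):
    """Collate raw UNIHAN-style lines into one dict per codepoint."""
    chars = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip comments and blanks
            continue
        ucn, field, value = line.split("\t", 2)
        chars.setdefault(ucn, {"ucn": ucn})[field] = value
    return list(chars.values())

records = parse_lines([
    "U+3400\tkCantonese\tjau1",
    "U+3400\tkMandarin\tqiū",
    "U+3401\tkMandarin\ttiàn",
])
# records[0] → {'ucn': 'U+3400', 'kCantonese': 'jau1', 'kMandarin': 'qiū'}
```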
Values vary in shape and structure depending on their field type. For example, kHanyuPinyin maps Unicode codepoints to Hànyǔ Dà Zìdiǎn, where 10019.020:tiàn represents an entry. Complicating it further, there are more variations:
U+5EFE kHanyuPinyin 10513.110,10514.010,10514.020:gǒng
U+5364 kHanyuPinyin 10093.130:xī,lǔ 74609.020:lǔ,xī
kHanyuPinyin supports multiple entries delimited by spaces. “:” (colon) separates locations in the work from pinyin readings. “,” (comma) separates multiple locations or multiple readings. This is just one of 90 fields contained in the database.
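Following those rules, a kHanyuPinyin value can be teased apart with plain string splitting; a rough sketch (parse_khanyupinyin is a name invented for illustration, not the library’s API):

```python
def parse_khanyupinyin(value):
    """Split a raw kHanyuPinyin value into location/reading groups."""
    entries = []
    for entry in value.split(" "):           # entries are space-delimited
        locations, _, readings = entry.partition(":")
        entries.append({
            "locations": locations.split(","),  # commas separate locations
            "readings": readings.split(","),    # and multiple readings
        })
    return entries

parse_khanyupinyin("10093.130:xī,lǔ 74609.020:lǔ,xī")
# → [{'locations': ['10093.130'], 'readings': ['xī', 'lǔ']},
#    {'locations': ['74609.020'], 'readings': ['lǔ', 'xī']}]
```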
Tabular, “Flat” output¶
CSV (default), via $ unihan-etl:
char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin
㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū
㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn
With $ unihan-etl -F yaml --no-expand:
- char: 㐀
kCantonese: jau1
kDefinition: (same as U+4E18 丘) hillock or mound
kHanyuPinyin: null
kMandarin: qiū
ucn: U+3400
- char: 㐁
kCantonese: tim2
kDefinition: to lick; to taste, a mat, bamboo bark
kHanyuPinyin: 10019.020:tiàn
kMandarin: tiàn
ucn: U+3401
With $ unihan-etl -F json --no-expand:
[
{
"char": "㐀",
"ucn": "U+3400",
"kDefinition": "(same as U+4E18 丘) hillock or mound",
"kCantonese": "jau1",
"kHanyuPinyin": null,
"kMandarin": "qiū"
},
{
"char": "㐁",
"ucn": "U+3401",
"kDefinition": "to lick; to taste, a mat, bamboo bark",
"kCantonese": "tim2",
"kHanyuPinyin": "10019.020:tiàn",
"kMandarin": "tiàn"
}
]
“Structured” output¶
Codepoints can pack a lot more detail, so unihan-etl carefully extracts these values in a uniform manner. Empty values are pruned.
To make this possible, unihan-etl exports to JSON, YAML, and python list/dicts.
Why not CSV?
Unfortunately, CSV is only suitable for storing table-like information. File formats such as JSON and YAML support key-value and hierarchical entries.
JSON, via $ unihan-etl -F json:
[
{
"char": "㐀",
"ucn": "U+3400",
"kDefinition": [
"(same as U+4E18 丘) hillock or mound"
],
"kCantonese": [
"jau1"
],
"kMandarin": {
"zh-Hans": "qiū",
"zh-Hant": "qiū"
}
},
{
"char": "㐁",
"ucn": "U+3401",
"kDefinition": [
"to lick",
"to taste, a mat, bamboo bark"
],
"kCantonese": [
"tim2"
],
"kHanyuPinyin": [
{
"locations": [
{
"volume": 1,
"page": 19,
"character": 2,
"virtual": 0
}
],
"readings": [
"tiàn"
]
}
],
"kMandarin": {
"zh-Hans": "tiàn",
"zh-Hant": "tiàn"
}
}
]
YAML, via $ unihan-etl -F yaml:
- char: 㐀
kCantonese:
- jau1
kDefinition:
- (same as U+4E18 丘) hillock or mound
kMandarin:
zh-Hans: qiū
zh-Hant: qiū
ucn: U+3400
- char: 㐁
kCantonese:
- tim2
kDefinition:
- to lick
- to taste, a mat, bamboo bark
kHanyuPinyin:
- locations:
- character: 2
page: 19
virtual: 0
volume: 1
readings:
- tiàn
kMandarin:
zh-Hans: tiàn
zh-Hant: tiàn
ucn: U+3401
Features¶
automatically downloads UNIHAN from the internet
strives for accuracy with the specifications described in UNIHAN’s database design
export to JSON, CSV and YAML (requires pyyaml) via -F
configurable to export specific fields via -f
accounts for encoding conflicts due to the Unicode-heavy content
designed as a technical proof for future CJK (Chinese, Japanese, Korean) datasets
core component and dependency of cihai, a CJK library
data package support
expansion of multi-value delimited fields in YAML, JSON and python dictionaries
supports Python 2.7, >= 3.5 and PyPy
If you encounter a problem or have a question, please create an issue.
Usage¶
unihan-etl offers customizable builds via its command line arguments.
See unihan-etl CLI arguments for information on how you can specify columns, files, download URLs, and output destination.
To download and build your own UNIHAN export:
$ pip install --user unihan-etl
To output CSV, the default format:
$ unihan-etl
To output JSON:
$ unihan-etl -F json
To output YAML:
$ pip install --user pyyaml
$ unihan-etl -F yaml
To only output the kDefinition field in a csv:
$ unihan-etl -f kDefinition
To output multiple fields, separate with spaces:
$ unihan-etl -f kCantonese kDefinition
To output to a custom file:
$ unihan-etl --destination ./exported.csv
To output to a custom file (templated file extension):
$ unihan-etl --destination ./exported.{ext}
See unihan-etl CLI arguments for advanced usage examples.
Code layout¶
# cache dir (Unihan.zip is downloaded, contents extracted)
{XDG cache dir}/unihan_etl/
# output dir
{XDG data dir}/unihan_etl/
unihan.json
unihan.csv
unihan.yaml # (requires pyyaml)
# package dir
unihan_etl/
process.py # argparse, download, extract, transform UNIHAN's data
constants.py # immutable data vars (field to filename mappings, etc)
expansion.py # extracting details baked inside of fields
_compat.py # python 2/3 compatibility module
util.py # utility / helper functions
# test suite
tests/*
About unihan-etl¶
unihan-etl provides configurable, self-serve data exports of the About UNIHAN database.
Retrieval¶
unihan-etl will download and cache the raw database files for the user.
No encoding headaches¶
Dealing with Unicode encodings can be cumbersome across platforms. unihan-etl handles the output encoding issues that could come up if you were to try to export the data yourself.
Python 2 and 3¶
Designed and tested to work across Python versions. View the travis test matrix for what this software is tested against.
Customizable output¶
“Structured” output¶
JSON, YAML, and python dict only
Support for structured output of information in fields. unihan-etl refers to this as expansion.
Users can opt out via --no-expand. This preserves the values in each field as they are in the raw database.
Empty values are filtered out by default; opt out via --no-prune.
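Pruning behaves roughly like the sketch below, assuming the index fields (ucn, char) are always kept; the library’s actual implementation may differ:

```python
def prune_empty(record, keep=("ucn", "char")):
    """Drop keys whose values are empty or None, keeping index fields."""
    return {k: v for k, v in record.items() if v or k in keep}

prune_empty({"char": "㐀", "ucn": "U+3400",
             "kHanyuPinyin": None, "kMandarin": "qiū"})
# → {'char': '㐀', 'ucn': 'U+3400', 'kMandarin': 'qiū'}
```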
Filtering¶
Support for filtering by fields and files.
To specify which fields to output, use -f / --fields and separate them with spaces, e.g. -f kDefinition kCantonese kHanyuPinyin.
For files, use -i / --input-files, e.g. -i Unihan_DictionaryLikeData.txt Unihan_Readings.txt.
About UNIHAN¶
Languages, Computers, and You¶
There are many languages and writing systems around the world. Computers internally use numbers to represent characters in writing systems. As computers became more prominent, hundreds of encoding systems were developed to handle writing systems from different regions.
No single encoding system covered all languages. Adding to the complexity, encodings conflicted with each other on the numbers assigned to characters. Any data decoded with the wrong standard would turn up as gibberish.
Unicode is a standard devised to provide a unique number for every character.
This entails pulling together minds from around the world to assign codepoints.
The Unicode Consortium is a non-profit organization founded to develop, extend and promote use of the Unicode Standard.
What is UNIHAN?¶
UNIHAN, short for Han unification, is the consortium’s effort to assign codepoints to CJK characters. Any single Han character can have multiple historical or regional variants to account for, hence “unification”.

To do this, various sources of information are pulled together and cross-referenced to detail characteristics of the glyphs, which are then vetted through a thorough proofreading process. It’s an international effort, hallmarked by collaboration between researchers and groups like the Ideographic Rapporteur Group. Glyphs once only noted in dictionaries and antiquity are set in stone with their own codepoints, carefully cross-referenced with information from often multiple, distinct sources.
The advantage that UNIHAN provides to East Asian researchers, including sinologists, japanologists, linguists, analysts, language learners, and hobbyists, cannot be overstated. Unbeknownst to most users, it’s used under the hood in many applications and websites.
The resulting standard has industrial ramifications downstream to software developers and computer users. When a version of Unicode is released, it is then incorporated downstream in software projects.
The database¶
UNIHAN provides a database of its information, which is the culmination of CJK information that has been vetted and proofed painstakingly over years.
You can view the UNIHAN Database documentation to see where the information in each field is derived from. For instance:
kCantonese: The Cantonese pronunciation(s) for this character using the jyutping romanization.
Bibliography:
Casey, G. Hugh, S.J. Ten Thousand Characters: An Analytic Dictionary. Hong Kong: Kelley and Walsh, 1980 (kPhonetic).
Cheung Kwan-hin and Robert S. Bauer, The Representation of Cantonese with Chinese Characters, Journal of Chinese Linguistics Monograph Series Number 18, 2002.
Roy T. Cowles, A Pocket Dictionary of Cantonese, Hong Kong: University Press, 1999 (kCowles).
Sidney Lau, A Practical Cantonese-English Dictionary, Hong Kong: Government Printer, 1977 (kLau).
Bernard F. Meyer and Theodore F. Wempe, Student’s Cantonese-English Dictionary, Maryknoll, New York: Catholic Foreign Mission Society of America, 1947 (kMeyerWempe).
饒秉才, ed. 廣州音字典, Hong Kong: Joint Publishing (H.K.) Co., Ltd., 1989.
中華新字典, Hong Kong:中華書局, 1987.
黃港生, ed. 商務新詞典, Hong Kong: The Commercial Press, 1991.
朗文初級中文詞典, Hong Kong: Longman, 2001.
kHanYu: The position of this character in the Hanyu Da Zidian (HDZ) Chinese character dictionary.
Bibliography:
<Hanyu Da Zidian> [‘Great Chinese Character Dictionary’ (in 8 Volumes)]. XU Zhongshu (Editor in Chief). Wuhan, Hubei Province (PRC): Hubei and Sichuan Dictionary Publishing Collectives, 1986-1990. ISBN: 7-5403-0030-2/H.16.
kHanyuPinyin: The 漢語拼音 Hànyǔ Pīnyīn reading(s) appearing in the edition of 《漢語大字典》 Hànyǔ Dà Zìdiǎn (HDZ) specified in the “kHanYu” property description (q.v.).
Bibliography:
This data was originally input by 井作恆 Jǐng Zuòhéng
proofed by 聃媽歌 Dān Māgē (Magda Danish, using software donated by 文林 Wénlín Institute, Inc. and tables prepared by 曲理查 Qū Lǐchá),
and proofed again and prepared for the Unicode Consortium by 曲理查 Qū Lǐchá (2008-01-14).
Han Unification is a global effort. And it’s available free to the world.
The problem¶
It’s difficult to readily take advantage of the UNIHAN database in its raw form.
UNIHAN comprises over 20 MB of character information, separated across multiple files. Within these files are 90 fields, spanning 8 general categories of data. Within some of the fields, there are specific considerations to take into account to use the data correctly. For instance:
UNIHAN’s values can reference its own codepoints, such as in kDefinition:
U+3400 kDefinition (same as U+4E18 丘) hillock or mound
Values are also delimited by spaces, such as in kCantonese:
U+342B kCantonese gun3 hung1 zung1
Space-delimited values can also specify different sources by position, as with kMandarin: “When there are two values, then the first is preferred for zh-Hans (CN) and the second is preferred for zh-Hant (TW). When there is only one value, it is appropriate for both.”:
U+7E43 kMandarin běng bēng
In addition, fields delimit values by their own rules, like kDefinition: “Major definitions are separated by semicolons, and minor definitions by commas.”:
U+3402 kDefinition (J) non-standard form of U+559C 喜, to like, love, enjoy; a joyful thing
More complicated yet is kHanyuPinyin: “multiple locations for a given pīnyīn reading are separated by “,” (comma). The list of locations is followed by “:” (colon), followed by a comma-separated list of one or more pīnyīn readings. Where multiple pīnyīn readings are associated with a given mapping, these are ordered as in HDZ (for the most part reflecting relative commonality). The following are representative records.”:
U+3FCE kHanyuPinyin 42699.050:fèn,fén
U+34D8 kHanyuPinyin 10278.080,10278.090:sù
U+5364 kHanyuPinyin 10093.130:xī,lǔ 74609.020:lǔ,xī
U+5EFE kHanyuPinyin 10513.110,10514.010,10514.020:gǒng
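Applying the rules quoted above can be sketched in a few lines of Python (the function names here are illustrative, not unihan-etl’s API):

```python
def expand_kdefinition(value):
    """Major definitions are separated by ';'; minor ones by ','."""
    return [d.strip() for d in value.split(";")]

def expand_kmandarin(value):
    """First reading is preferred for zh-Hans, second for zh-Hant;
    a single reading applies to both."""
    readings = value.split(" ")
    return {"zh-Hans": readings[0], "zh-Hant": readings[-1]}

expand_kdefinition("to lick; to taste, a mat, bamboo bark")
# → ['to lick', 'to taste, a mat, bamboo bark']
expand_kmandarin("běng bēng")
# → {'zh-Hans': 'běng', 'zh-Hant': 'bēng'}
```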
Data could be exported to a CSV, but users wouldn’t be able to handle the delimited values and structured information held within. Since CSV does not support structured information, another format that does needs to be found.
Even then, users may not want an export that expands the structured output of fields. So if a tool exists, its exports should be configurable. Users could then export a field with gun3 hung1 zung1 pristine, without turning it into list form.
Command Line Interface¶
Export UNIHAN to Python, Data Package, CSV, JSON and YAML
usage: unihan-etl [-h] [-v] [-s SOURCE] [-z ZIP_PATH] [-d DESTINATION]
[-w WORK_DIR] [-F {json,csv}] [--no-expand] [--no-prune]
[-f [FIELDS [FIELDS ...]]]
[-i [INPUT_FILES [INPUT_FILES ...]]]
[-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
Named Arguments¶
- -v, --version
show program’s version number and exit
- -s, --source
URL or path of zipfile. Default: http://www.unicode.org/Public/UNIDATA/Unihan.zip
- -z, --zip-path
Path the zipfile is downloaded to. Default: /home/docs/.cache/unihan_etl/downloads/Unihan.zip
- -d, --destination
Output of .csv. Default: /home/docs/.local/share/unihan_etl/unihan.{json,csv,yaml}
- -w, --work-dir
Default: /home/docs/.cache/unihan_etl/downloads
- -F, --format
Possible choices: json, csv
Default: csv
- --no-expand
Don’t expand values to lists in multi-value UNIHAN fields. Doesn’t apply to CSVs.
Default: True
- --no-prune
Don’t prune fields with empty keys. Doesn’t apply to CSVs.
Default: True
- -f, --fields
Fields to use in export. Separated by spaces. All fields used by default. Fields: kAccountingNumeric, kBigFive, kCCCII, kCNS1986, kCNS1992, kCangjie, kCantonese, kCheungBauer, kCheungBauerIndex, kCihaiT, kCompatibilityVariant, kCowles, kDaeJaweon, kDefinition, kEACC, kFenn, kFennIndex, kFourCornerCode, kFrequency, kGB0, kGB1, kGB3, kGB5, kGB7, kGB8, kGSR, kGradeLevel, kHDZRadBreak, kHKGlyph, kHKSCS, kHanYu, kHangul, kHanyuPinlu, kHanyuPinyin, kIBMJapan, kIICore, kIRGDaeJaweon, kIRGDaiKanwaZiten, kIRGHanyuDaZidian, kIRGKangXi, kIRG_GSource, kIRG_HSource, kIRG_JSource, kIRG_KPSource, kIRG_KSource, kIRG_MSource, kIRG_TSource, kIRG_USource, kIRG_VSource, kJIS0213, kJa, kJapaneseKun, kJapaneseOn, kJinmeiyoKanji, kJis0, kJis1, kJoyoKanji, kKPS0, kKPS1, kKSC0, kKSC1, kKangXi, kKarlgren, kKorean, kKoreanEducationHanja, kKoreanName, kLau, kMainlandTelegraph, kMandarin, kMatthews, kMeyerWempe, kMorohashi, kNelson, kOtherNumeric, kPhonetic, kPrimaryNumeric, kPseudoGB1, kRSAdobe_Japan1_6, kRSJapanese, kRSKanWa, kRSKangXi, kRSKorean, kRSUnicode, kSBGY, kSemanticVariant, kSimplifiedVariant, kSpecializedSemanticVariant, kTGH, kTaiwanTelegraph, kTang, kTotalStrokes, kTraditionalVariant, kVietnamese, kXHC1983, kXerox, kZVariant
- -i, --input-files
Files inside zip to pull data from. Separated by spaces. All files used by default. Files: Unihan_DictionaryIndices.txt, Unihan_DictionaryLikeData.txt, Unihan_IRGSources.txt, Unihan_NumericValues.txt, Unihan_OtherMappings.txt, Unihan_RadicalStrokeCounts.txt, Unihan_Readings.txt, Unihan_Variants.txt
- -l, --log_level
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL
API¶
Build Unihan into tabular / structured format and export it.
class unihan_etl.process.Packager(options)¶
Download and generate a tabular release of UNIHAN.
unihan_etl.process.ALLOWED_EXPORT_TYPES = ['json', 'csv']¶
Allowed export types
unihan_etl.process.DESTINATION_DIR = '/home/docs/.local/share/unihan_etl'¶
Filepath to output built CSV file to.
unihan_etl.process.UNIHAN_FIELDS
= ('kAccountingNumeric', 'kBigFive', 'kCCCII', 'kCNS1986', 'kCNS1992', 'kCangjie', 'kCantonese', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCompatibilityVariant', 'kCowles', 'kDaeJaweon', 'kDefinition', 'kEACC', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kFrequency', 'kGB0', 'kGB1', 'kGB3', 'kGB5', 'kGB7', 'kGB8', 'kGSR', 'kGradeLevel', 'kHDZRadBreak', 'kHKGlyph', 'kHKSCS', 'kHanYu', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_TSource', 'kIRG_USource', 'kIRG_VSource', 'kJIS0213', 'kJa', 'kJapaneseKun', 'kJapaneseOn', 'kJinmeiyoKanji', 'kJis0', 'kJis1', 'kJoyoKanji', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kKangXi', 'kKarlgren', 'kKorean', 'kKoreanEducationHanja', 'kKoreanName', 'kLau', 'kMainlandTelegraph', 'kMandarin', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kPseudoGB1', 'kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKanWa', 'kRSKangXi', 'kRSKorean', 'kRSUnicode', 'kSBGY', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTGH', 'kTaiwanTelegraph', 'kTang', 'kTotalStrokes', 'kTraditionalVariant', 'kVietnamese', 'kXHC1983', 'kXerox', 'kZVariant')¶ Default Unihan fields
unihan_etl.process.UNIHAN_FILES
= dict_keys(['Unihan_DictionaryIndices.txt', 'Unihan_DictionaryLikeData.txt', 'Unihan_IRGSources.txt', 'Unihan_NumericValues.txt', 'Unihan_OtherMappings.txt', 'Unihan_RadicalStrokeCounts.txt', 'Unihan_Readings.txt', 'Unihan_Variants.txt'])¶ Default Unihan Files
unihan_etl.process.UNIHAN_URL = 'http://www.unicode.org/Public/UNIDATA/Unihan.zip'¶
URI of Unihan.zip data.
unihan_etl.process.UNIHAN_ZIP_PATH = '/home/docs/.cache/unihan_etl/downloads/Unihan.zip'¶
Filepath to download Zip file to.
unihan_etl.process.WORK_DIR = '/home/docs/.cache/unihan_etl/downloads'¶
Directory to use for processing intermittent files.
unihan_etl.process.download(url, dest, urlretrieve_fn=<function urlretrieve>, reporthook=None)¶
Download file at URL to a destination.
unihan_etl.process.expand_delimiters(normalized_data)¶
Return expanded multi-value fields in UNIHAN.
Parameters: normalized_data (list of dict) – Expects data in a list of dicts, per process.normalize()
Returns: Items which have fields with delimiters and custom separation rules will be expanded, including multi-value fields (so all fields stay consistent).
Return type: list of dict
unihan_etl.process.extract_zip(zip_path, dest_dir)¶
Extract zip file. Return zipfile.ZipFile instance.
Returns: The extracted zip.
Return type: zipfile.ZipFile
unihan_etl.process.files_exist(path, files)¶
Return True if all files exist in specified path.
unihan_etl.process.filter_manifest(files)¶
Return filtered UNIHAN_MANIFEST from list of file names.
unihan_etl.process.get_fields(d)¶
Return list of fields from dict of {filename: [‘field’, ‘field1’]}.
unihan_etl.process.get_parser()¶
Return argparse.ArgumentParser instance for CLI.
Returns: argument parser for CLI use.
Return type: argparse.ArgumentParser
unihan_etl.process.normalize(raw_data, fields)¶
Return normalized data from UNIHAN data files.
unihan_etl.process.setup_logger(logger=None, level='DEBUG')¶
Set up logging for CLI use.
Parameters: logger (Logger) – instance of logger; level (str) – logging level, e.g. ‘DEBUG’
unihan_etl.process.zip_has_files(files, zip_file)¶
Return True if zip has the files inside.
Parameters: zip_file (zipfile.ZipFile)
Returns: True if files are inside zipfile.ZipFile.namelist()
Return type: bool
Constants¶
unihan_etl.constants.CUSTOM_DELIMITED_FIELDS
= ('kDefinition', 'kDaeJaweon', 'kHDZRadBreak', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_TSource', 'kIRG_USource', 'kIRG_VSource')¶ FIELDS with multiple values via custom delimiters
unihan_etl.constants.INDEX_FIELDS = ('ucn', 'char')¶
Default index fields for unihan CSVs. You probably want these.
unihan_etl.constants.SPACE_DELIMITED_DICT_FIELDS = ('kHanYu', 'kXHC1983', 'kMandarin', 'kTotalStrokes')¶
Fields with multiple values UNIHAN delimits by spaces -> dict
unihan_etl.constants.SPACE_DELIMITED_FIELDS
= ('kAccountingNumberic', 'kCantonese', 'kCCCII', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCowles', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kGSR', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kHKGlyph', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kJa', 'kJapaneseKun', 'kJapaneseOn', 'kJinmeiyoKanji', 'kJis0', 'kJIS0213', 'kJis1', 'kJoyoKanji', 'kKangXi', 'kKarlgren', 'kKorean', 'kKoreanEducationHanja', 'kKoreanName', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kLua', 'kMainlandTelegraph', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKangXi', 'kRSKanWa', 'kRSKorean', 'kRSUnicode', 'kSBGY', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTaiwanTelegraph', 'kTang', 'kTGH', 'kTraditionalVariant', 'kVietnamese', 'kXerox', 'kZVariant', 'kHanYu', 'kXHC1983', 'kMandarin', 'kTotalStrokes')¶ Any space delimited field regardless of expanded form
unihan_etl.constants.SPACE_DELIMITED_LIST_FIELDS
= ('kAccountingNumberic', 'kCantonese', 'kCCCII', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCowles', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kGSR', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kHKGlyph', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kJa', 'kJapaneseKun', 'kJapaneseOn', 'kJinmeiyoKanji', 'kJis0', 'kJIS0213', 'kJis1', 'kJoyoKanji', 'kKangXi', 'kKarlgren', 'kKorean', 'kKoreanEducationHanja', 'kKoreanName', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kLua', 'kMainlandTelegraph', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKangXi', 'kRSKanWa', 'kRSKorean', 'kRSUnicode', 'kSBGY', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTaiwanTelegraph', 'kTang', 'kTGH', 'kTraditionalVariant', 'kVietnamese', 'kXerox', 'kZVariant')¶ Fields with multiple values UNIHAN delimits by spaces -> list
unihan_etl.constants.UNIHAN_MANIFEST
= {'Unihan_DictionaryIndices.txt': ('kCheungBauerIndex', 'kCowles', 'kDaeJaweon', 'kFennIndex', 'kGSR', 'kHanYu', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kKangXi', 'kKarlgren', 'kLau', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kSBGY'), 'Unihan_DictionaryLikeData.txt': ('kCangjie', 'kCheungBauer', 'kCihaiT', 'kFenn', 'kFourCornerCode', 'kFrequency', 'kGradeLevel', 'kHDZRadBreak', 'kHKGlyph', 'kPhonetic', 'kTotalStrokes'), 'Unihan_IRGSources.txt': ('kCompatibilityVariant', 'kIICore', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_TSource', 'kIRG_USource', 'kIRG_VSource'), 'Unihan_NumericValues.txt': ('kAccountingNumeric', 'kOtherNumeric', 'kPrimaryNumeric'), 'Unihan_OtherMappings.txt': ('kBigFive', 'kCCCII', 'kCNS1986', 'kCNS1992', 'kEACC', 'kGB0', 'kGB1', 'kGB3', 'kGB5', 'kGB7', 'kGB8', 'kHKSCS', 'kIBMJapan', 'kJa', 'kJinmeiyoKanji', 'kJis0', 'kJis1', 'kJIS0213', 'kJoyoKanji', 'kKoreanEducationHanja', 'kKoreanName', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kMainlandTelegraph', 'kPseudoGB1', 'kTaiwanTelegraph', 'kTGH', 'kXerox'), 'Unihan_RadicalStrokeCounts.txt': ('kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKangXi', 'kRSKanWa', 'kRSKorean', 'kRSUnicode'), 'Unihan_Readings.txt': ('kCantonese', 'kDefinition', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kJapaneseKun', 'kJapaneseOn', 'kKorean', 'kMandarin', 'kTang', 'kVietnamese', 'kXHC1983'), 'Unihan_Variants.txt': ('kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTraditionalVariant', 'kZVariant')}¶ Dictionary of tuples mapping locations of files to fields
Expansion¶
Functions to uncompact details inside field values.
Notes
re.compile() operations are inside of expand functions because:
readability
module-level function bytecode is cached in python
the last used compiled regexes are cached
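The upshot is that calling re.compile() on the same pattern repeatedly stays cheap, since the re module memoizes recently compiled patterns. A sketch of an expansion-style function (the location regex is illustrative, inferred from the volume/page/character/virtual structure shown in the structured output earlier):

```python
import re

def expand_location(value):
    """Parse an HDZ location like '10019.020' into its parts."""
    # Compiled inside the function: re caches recently compiled patterns,
    # so repeated calls don't pay the compilation cost each time.
    pattern = re.compile(
        r"(?P<volume>\d)(?P<page>\d{4})\.(?P<character>\d{2})(?P<virtual>\d)"
    )
    match = pattern.match(value)
    return {k: int(v) for k, v in match.groupdict().items()}

expand_location("10019.020")
# → {'volume': 1, 'page': 19, 'character': 2, 'virtual': 0}
```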
unihan_etl.expansion.N_DIACRITICS = 'ńňǹ'¶
Diacritics from kHanyuPinlu
Utilities and test helpers¶
Utility and helper methods for script.
util¶
unihan_etl.util.ucn_to_unicode(ucn)¶
Return a python unicode value from a UCN.
Converts a Unicode Universal Character Number (e.g. “U+4E00” or “4E00”) to Python unicode (u'\u4e00').
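On Python 3 this conversion reduces to chr(); a simplified sketch (ucn_to_char is a hypothetical name, and this ignores the Python 2 and narrow-build handling the real function deals with):

```python
def ucn_to_char(ucn):
    """Convert 'U+4E00' or '4E00' to the character '一'."""
    if ucn.startswith("U+"):
        ucn = ucn[2:]
    return chr(int(ucn, 16))

ucn_to_char("U+4E00")  # → '一'
```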
Test helper functions for downloading and processing Unihan data.
Frequently Asked Questions¶
- Why are some fields, e.g. kTotalStrokes, in lists when there’s seemingly no multi-value data?
The word back from the developers of UNIHAN is they keep some fields multi-valued for future use.
Apparently at the moment there is only one record with two values for the kTotalStrokes field in the Unihan database. However, the maintainers of the data intend to populate the kTotalStrokes field as needed in the future, and as documented in UAX #38.
May 30, 2017 (Unicode 9.0)
unihan-etl is designed to handle fields correctly and consistently according to the documentation in the database.
History¶
unihan-etl 0.10.4 (2020-08-05)¶
Update CHANGES headings to produce working links
Relax appdirs version constraint
#228 Move from Pipfile to poetry
unihan-etl 0.10.3 (2019-08-18)¶
Fix flicker in download progress bar
unihan-etl 0.10.2 (2019-08-17)¶
Add project_urls to setup.py
Use plain reStructuredText for CHANGES
Use collections imports that are compatible with python 2 and 3
PEP8 tweaks
unihan-etl 0.10.1 (2017-09-08)¶
Add code links in API
Add __version__ to unihan_etl
unihan-etl 0.10.0 (2017-08-29)¶
#91 New fields from UNIHAN Revision 25.
kJinmeiyoKanji
kJoyoKanji
kKoreanEducationHanja
kKoreanName
kTGH
UNIHAN Revision 25 was released 2018-05-18 and issued for Unicode 11.0.
Add tests and example corpus for kCCCII
Add configuration / make tests for isort, flake8
Switch tmuxp config to use pipenv
Add Pipfile
Add make sync_pipfile task to sync requirements/*.txt files with Pipfile
Update and sync Pipfile
Developer package updates (linting / docs / testing)
isort 4.2.15 to 4.3.4
flake8 3.3.0 to 3.5.0
vulture 0.14 to 0.27
sphinx 1.6.2 to 1.7.6
alagitpull 0.0.12 to 0.0.21
releases 1.3.1 to 1.6.1
sphinx-argparse 0.2.1 to 1.6.2
pytest 3.1.2 to 3.6.4
Move documentation over to numpy-style
Add sphinxcontrib-napoleon 0.6.1
Update LICENSE New BSD to MIT
All future commits and contributions are licensed to the cihai software foundation. This includes commits by Tony Narlock (creator).
unihan-etl 0.9.5 (2017-06-26)¶
Enhance support for locations on kHDZRadBreak fields.
unihan-etl 0.9.4 (2017-06-05)¶
Fix kIRG_GSource without location
Fix kFenn output
Fix kHanyuPinlu support output for n diacritics
unihan-etl 0.9.3 (2017-05-31)¶
Add expansion for kIRGKangXi
unihan-etl 0.9.2 (2017-05-31)¶
Normalize Radical-Stroke expansion for kRSUnicode
Migrate more fields to regular expressions
Normalize character field for kDaeJaweon, kHanyuPinyin, and kCheungBauer, kFennIndex, kCheungBauerIndex, kIICore, kIRGHanyuDaZidian
unihan-etl 0.9.1 (2017-05-27)¶
Support for expanding kGSR
Convert some field expansions to use regexes
unihan-etl 0.9.0 (2017-05-26)¶
Fix bug where destination file was made into directory on first run
Rename from unihan-tabular to unihan-etl
Support for expanding multi-value fields
Support for pruning empty fields
Improve help dialog
Added a page about UNIHAN and the project to documentation
Split constant values into their own module
Split functionality for expanding unstructured values into its own module
unihan-etl 0.8.1 (2017-05-20)¶
Update to add kJa and adjust source file of kCompatibilityVariant per Unicode 8.0.0.
unihan-etl 0.8.0 (2017-05-17)¶
Support for configuring logging via options and CLI
Convert all print statements to use logger
unihan-etl 0.7.4 (2017-05-14)¶
Allow for local / file system sources for Unihan.zip
Only extract zip if unextracted
unihan-etl 0.7.3 (2017-05-13)¶
Update package classifiers
unihan-etl 0.7.2 (2017-05-13)¶
Add back datapackage
unihan-etl 0.7.1 (2017-05-12)¶
Fix python 2 CSV output
Default to CSV output
unihan-etl 0.7.0 (2017-05-12)¶
Move unicodecsv module to dependency package
Support for XDG directory specification
Support for custom destination output, including replacing template variable
{ext}
unihan-etl 0.6.3 (2017-05-11)¶
Move __about__.py to module level
unihan-etl 0.6.2 (2017-05-11)¶
Fix python package import
unihan-etl 0.6.1 (2017-05-10)¶
Fix readme bug on pypi
unihan-etl 0.6.0 (2017-05-10)¶
Support for exporting in YAML and JSON
More internal factoring and simplification
Return data as list
unihan-etl 0.5.1 (2017-05-08)¶
Drop python 3.3 and 3.4 support
unihan-etl 0.5.0 (2017-05-08)¶
Rename from cihaidata_unihan to unihan_tabular
Drop datapackages in favor of a universal JSON, YAML and CSV export.
Only use UnicodeWriter in Python 2; fixes issue where python would encode b'' in front of values
unihan-etl 0.4.2 (2017-05-07)¶
Rename scripts/ to cihaidata_unihan/
unihan-etl 0.4.1 (2017-05-07)¶
Enable invoking tool via $ cihaidata_unihan
unihan-etl 0.4.0 (2017-05-07)¶
Major internal refactor and simplification
Convert to pytest assert statements
Convert full test suite to pytest functions and fixtures
Get CLI documentation up again
Improve test coverage
Lint code, remove unused imports
Switch license BSD -> MIT
unihan-etl 0.3.0 (2017-04-17)¶
Rebooted
Modernize Makefile in docs
Add Makefile to main project
Modernize package metadata to use __about__.py
Update requirements to use requirements/ folder for base, testing and doc dependencies.
Update sphinx theme to alabaster with new logo.
Update travis to use coveralls
Update links on README to use https
Update travis to test up to python 3.6
Add support for pypy (why not)
Lock base dependencies
Add dev dependencies for isort, vulture and flake8