unihan-etl

unihan-etl - ETL tool for Unicode’s Han Unification (UNIHAN) database releases. unihan-etl retrieves (downloads), extracts (unzips), and transforms the database from Unicode’s website into a flat, tabular format or a structured, tree-like format.

unihan-etl can be used as a python library through its API, to retrieve data as a python object, or through the CLI to retrieve a CSV, JSON, or YAML file.

Part of the cihai project. Similar project: libUnihan.

UNIHAN Version compatibility (as of unihan-etl v0.10.0): 11.0.0 (released 2018-05-08, revision 25).


UNIHAN’s data is dispersed across multiple files in the format of:

U+3400      kCantonese      jau1
U+3400      kDefinition     (same as U+4E18 丘) hillock or mound
U+3400      kMandarin       qiū
U+3401      kCantonese      tim2
U+3401      kDefinition     to lick; to taste, a mat, bamboo bark
U+3401      kHanyuPinyin    10019.020:tiàn
U+3401      kMandarin       tiàn
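
Each line is a codepoint, a field name, and a value, separated by tabs. Grouping lines by codepoint is a natural first step; here is a minimal sketch (`parse_lines` is a hypothetical helper, not the library’s actual loader):

```python
import collections

def parse_lines(lines):
    """Group tab-separated UNIHAN lines into one dict per codepoint."""
    chars = collections.defaultdict(dict)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # UNIHAN files contain blank lines and # comments
        ucn, field, value = line.rstrip("\n").split("\t", 2)
        chars[ucn][field] = value
    return dict(chars)

raw = [
    "U+3400\tkCantonese\tjau1",
    "U+3400\tkMandarin\tqiū",
    "U+3401\tkMandarin\ttiàn",
]
chars = parse_lines(raw)
print(chars["U+3400"]["kCantonese"])  # jau1
```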

Values vary in shape and structure depending on their field type. kHanyuPinyin maps Unicode codepoints to entries in the Hànyǔ Dà Zìdiǎn, where 10019.020:tiàn represents one entry. Further variations complicate things:

U+5EFE      kHanyuPinyin    10513.110,10514.010,10514.020:gǒng
U+5364      kHanyuPinyin    10093.130:, 74609.020:,

kHanyuPinyin supports multiple entries delimited by spaces. “:” (colon) separates locations in the work from pinyin readings. “,” (comma) separates multiple entries/readings. This is just one of 90 fields contained in the database.

Tabular, “Flat” output

CSV (default), $ unihan-etl:

char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin
㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū
㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn
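
The flat export can be read back with the standard csv module. A sketch, with inline data standing in for the real exported file (field set depends on your -f selection):

```python
import csv
import io

# Inline sample standing in for an exported unihan.csv.
sample = io.StringIO(
    "char,ucn,kCantonese,kMandarin\n"
    "\u3400,U+3400,jau1,qiū\n"
    "\u3401,U+3401,tim2,tiàn\n"
)
rows = list(csv.DictReader(sample))
print(rows[0]["kMandarin"])  # qiū
```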

With $ unihan-etl -F yaml --no-expand:

- char: 㐀
  kCantonese: jau1
  kDefinition: (same as U+4E18 丘) hillock or mound
  kHanyuPinyin: null
  kMandarin: qiū
  ucn: U+3400
- char: 㐁
  kCantonese: tim2
  kDefinition: to lick; to taste, a mat, bamboo bark
  kHanyuPinyin: 10019.020:tiàn
  kMandarin: tiàn
  ucn: U+3401

With $ unihan-etl -F json --no-expand:

[
  {
    "char": "㐀",
    "ucn": "U+3400",
    "kDefinition": "(same as U+4E18 丘) hillock or mound",
    "kCantonese": "jau1",
    "kHanyuPinyin": null,
    "kMandarin": "qiū"
  },
  {
    "char": "㐁",
    "ucn": "U+3401",
    "kDefinition": "to lick; to taste, a mat, bamboo bark",
    "kCantonese": "tim2",
    "kHanyuPinyin": "10019.020:tiàn",
    "kMandarin": "tiàn"
  }
]

“Structured” output

Codepoints can pack a lot more detail, so unihan-etl carefully extracts these values in a uniform manner. Empty values are pruned.

To make this possible, unihan-etl exports to JSON, YAML, and python list/dicts.

Why not CSV?

Unfortunately, CSV is only suitable for storing table-like information. File formats such as JSON and YAML support key-value mappings and hierarchical entries.

JSON, $ unihan-etl -F json:

[
  {
    "char": "㐀",
    "ucn": "U+3400",
    "kDefinition": [
      "(same as U+4E18 丘) hillock or mound"
    ],
    "kCantonese": [
      "jau1"
    ],
    "kMandarin": {
      "zh-Hans": "qiū",
      "zh-Hant": "qiū"
    }
  },
  {
    "char": "㐁",
    "ucn": "U+3401",
    "kDefinition": [
      "to lick",
      "to taste, a mat, bamboo bark"
    ],
    "kCantonese": [
      "tim2"
    ],
    "kHanyuPinyin": [
      {
        "locations": [
          {
            "volume": 1,
            "page": 19,
            "character": 2,
            "virtual": 0
          }
        ],
        "readings": [
          "tiàn"
        ]
      }
    ],
    "kMandarin": {
      "zh-Hans": "tiàn",
      "zh-Hant": "tiàn"
    }
  }
 ]

YAML $ unihan-etl -F yaml:

- char: 㐀
  kCantonese:
  - jau1
  kDefinition:
  - (same as U+4E18 丘) hillock or mound
  kMandarin:
    zh-Hans: qiū
    zh-Hant: qiū
  ucn: U+3400
- char: 㐁
  kCantonese:
  - tim2
  kDefinition:
  - to lick
  - to taste, a mat, bamboo bark
  kHanyuPinyin:
  - locations:
    - character: 2
      page: 19
      virtual: 0
      volume: 1
    readings:
    - tiàn
  kMandarin:
    zh-Hans: tiàn
    zh-Hant: tiàn
  ucn: U+3401

Features

  • automatically downloads UNIHAN from the internet

  • strives for accuracy with the specifications described in UNIHAN’s database design

  • export to JSON, CSV and YAML (requires pyyaml) via -F

  • configurable to export specific fields via -f

  • accounts for encoding conflicts due to the Unicode-heavy content

  • designed as a technical proof for future CJK (Chinese, Japanese, Korean) datasets

  • core component and dependency of cihai, a CJK library

  • data package support

  • expansion of multi-value delimited fields in YAML, JSON and python dictionaries

  • supports python 2.7, >= 3.5 and pypy

If you encounter a problem or have a question, please create an issue.

Usage

unihan-etl offers customizable builds via its command line arguments.

See unihan-etl CLI arguments for information on how you can specify columns, files, download URL’s, and output destination.

To download and build your own UNIHAN export:

$ pip install --user unihan-etl

To output CSV, the default format:

$ unihan-etl

To output JSON:

$ unihan-etl -F json

To output YAML:

$ pip install --user pyyaml
$ unihan-etl -F yaml

To only output the kDefinition field in a csv:

$ unihan-etl -f kDefinition

To output multiple fields, separate with spaces:

$ unihan-etl -f kCantonese kDefinition

To output to a custom file:

$ unihan-etl --destination ./exported.csv

To output to a custom file (templated file extension):

$ unihan-etl --destination ./exported.{ext}

See unihan-etl CLI arguments for advanced usage examples.

Code layout

# cache dir (Unihan.zip is downloaded, contents extracted)
{XDG cache dir}/unihan_etl/

# output dir
{XDG data dir}/unihan_etl/
  unihan.json
  unihan.csv
  unihan.yaml   # (requires pyyaml)

# package dir
unihan_etl/
  process.py    # argparse, download, extract, transform UNIHAN's data
  constants.py  # immutable data vars (field to filename mappings, etc)
  expansion.py  # extracting details baked inside of fields
  _compat.py    # python 2/3 compatibility module
  util.py       # utility / helper functions

# test suite
tests/*

About unihan-etl

unihan-etl provides configurable, self-serve data exports of the UNIHAN database.

Retrieval

unihan-etl will download and cache the raw database files for the user.

No encoding headaches

Dealing with unicode encodings can be cumbersome across platforms. unihan-etl handles the output encoding issues that would come up if you were to export the data yourself.

Python 2 and 3

Designed and tested to work across Python versions. View the travis test matrix for what this software is tested against.

Customizable output

Formats
  • CSV

  • JSON

  • YAML (requires pyyaml)

  • Python dict (via API)

“Structured” output

JSON, YAML, and python dict only

Support for structured output of information in fields. unihan-etl refers to this as expansion.

Users can opt-out via --no-expand. This will preserve the values in each field as they are in the raw database.

Filters out empty values by default, opt-out via --no-prune.

Filtering

Support for filtering by fields and files.

To specify which fields to output, use -f / --fields and separate them with spaces. Example: -f kDefinition kCantonese kHanyuPinyin.

For files, -i / --input-files. Example: -i Unihan_DictionaryLikeData.txt Unihan_Readings.txt.

About UNIHAN

Languages, Computers, and You

There are many languages and writing systems around the world. Computers internally use numbers to represent characters in writing systems. As computers became more prominent, hundreds of encoding systems were developed to handle writing systems from different regions.

No single encoding system covered all languages. Adding to the complexity, encodings conflicted with each other on the numbers assigned to characters. Any data decoded with the wrong standard would turn up as gibberish.

Unicode is a standard devised to provide a unique number for every character.

This entails pulling together minds from around the world to assign codepoints.

The Unicode Consortium is a non-profit organization founded to develop, extend and promote use of the Unicode Standard.

What is UNIHAN?

UNIHAN, short for Han unification, is the consortium’s effort to assign codepoints to CJK characters. Any single Han character can have multiple historical or regional variants to account for, hence “unification”.

(Figure: variants of the character for “sword”)

To do this, various sources of information are pulled together and cross-referenced to detail characteristics of the glyphs, which are then vetted through a thorough proofreading process. It’s an international effort, hallmarked by collaboration between researchers and groups like the Ideographic Rapporteur Group. Glyphs once only noted in dictionaries and antiquity are set in stone with their own codepoints, carefully cross-referenced with information from, often multiple, distinct sources.

The advantage that UNIHAN provides to East Asian researchers, including sinologists, japanologists, linguists, analysts, language learners, and hobbyists, cannot be overstated. Unbeknownst to users, it’s used under the hood in many applications and websites.

The resulting standard has industrial ramifications downstream to software developers and computer users. When a version of Unicode is released, it is then incorporated downstream in software projects.

The database

UNIHAN provides a database of its information, which is the culmination of CJK information that has been vetted and proofed painstakingly over years.

You can view the UNIHAN Database documentation to see where the information in each field is derived from. For instance:

  • kCantonese: The Cantonese pronunciation(s) for this character using the jyutping romanization.

    Bibliography:

    1. Casey, G. Hugh, S.J. Ten Thousand Characters: An Analytic Dictionary. Hong Kong: Kelley and Walsh,1980 (kPhonetic).

    2. Cheung Kwan-hin and Robert S. Bauer, The Representation of Cantonese with Chinese Characters, Journal of Chinese Linguistics Monograph Series Number 18, 2002.

    3. Roy T. Cowles, A Pocket Dictionary of Cantonese, Hong Kong: University Press, 1999 (kCowles).

    4. Sidney Lau, A Practical Cantonese-English Dictionary, Hong Kong: Government Printer, 1977 (kLau).

    5. Bernard F. Meyer and Theodore F. Wempe, Student’s Cantonese-English Dictionary, Maryknoll, New York: Catholic Foreign Mission Society of America, 1947 (kMeyerWempe).

    6. 饒秉才, ed. 廣州音字典, Hong Kong: Joint Publishing (H.K.) Co., Ltd., 1989.

    7. 中華新字典, Hong Kong:中華書局, 1987.

    8. 黃港生, ed. 商務新詞典, Hong Kong: The Commercial Press, 1991.

    9. 朗文初級中文詞典, Hong Kong: Longman, 2001.

  • kHanYu: The position of this character in the Hanyu Da Zidian (HDZ) Chinese character dictionary.

    Bibliography:

    1. <Hanyu Da Zidian> [‘Great Chinese Character Dictionary’ (in 8 Volumes)]. XU Zhongshu (Editor in Chief). Wuhan, Hubei Province (PRC): Hubei and Sichuan Dictionary Publishing Collectives, 1986-1990. ISBN: 7-5403-0030-2/H.16.

  • kHanyuPinyin: The 漢語拼音 Hànyǔ Pīnyīn reading(s) appearing in the edition of 《漢語大字典》 Hànyǔ Dà Zìdiǎn (HDZ) specified in the “kHanYu” property description (q.v.).

    Bibliography:

    • This data was originally input by 井作恆 Jǐng Zuòhéng

    • proofed by 聃媽歌 Dān Māgē (Magda Danish, using software donated by 文林 Wénlín Institute, Inc. and tables prepared by 曲理查 Qū Lǐchá),

    • and proofed again and prepared for the Unicode Consortium by 曲理查 Qū Lǐchá (2008-01-14).

Han Unification is a global effort. And it’s available free to the world.

The problem

It’s difficult to readily take advantage of the UNIHAN database in its raw form.

UNIHAN comprises over 20 MB of character information, separated across multiple files. Within these files are 90 fields, spanning 8 general categories of data. Within some of the fields, there are specific considerations to take into account to use the data correctly. For instance:

UNIHAN’s values embed references to its own codepoints, such as in kDefinition:

U+3400       kDefinition     (same as U+4E18 丘) hillock or mound

Values are also delimited by spaces, such as in kCantonese:

U+342B       kCantonese      gun3 hung1 zung1

Spaces can also separate values from different sources, as in kMandarin: “When there are two values, then the first is preferred for zh-Hans (CN) and the second is preferred for zh-Hant (TW). When there is only one value, it is appropriate for both.”:

U+7E43        kMandarin       běng bēng
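
That rule can be encoded in a few lines. This is an illustrative sketch, not the library’s exact implementation:

```python
def expand_kmandarin(value):
    """Split a kMandarin value into zh-Hans / zh-Hant readings.

    Two values: first is preferred for zh-Hans, second for zh-Hant.
    One value: appropriate for both.
    """
    readings = value.split(" ")
    hant = readings[1] if len(readings) == 2 else readings[0]
    return {"zh-Hans": readings[0], "zh-Hant": hant}

print(expand_kmandarin("běng bēng"))  # {'zh-Hans': 'běng', 'zh-Hant': 'bēng'}
```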

Other fields follow their own delimiter rules, like kDefinition: “Major definitions are separated by semicolons, and minor definitions by commas.”:

U+3402       kDefinition     (J) non-standard form of U+559C 喜, to like, love, enjoy; a joyful thing

More complicated yet, kHanyuPinyin: “multiple locations for a given pīnyīn reading are separated by “,” (comma). The list of locations is followed by “:” (colon), followed by a comma-separated list of one or more pīnyīn readings. Where multiple pīnyīn readings are associated with a given mapping, these are ordered as in HDZ (for the most part reflecting relative commonality). The following are representative records.”:

U+3FCE  kHanyuPinyin    42699.050:fèn,fén
U+34D8  kHanyuPinyin    10278.080,10278.090:
U+5364  kHanyuPinyin    10093.130:, 74609.020:,
U+5EFE  kHanyuPinyin    10513.110,10514.010,10514.020:gǒng
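
Those rules, combined with the HDZ location layout seen in the structured output earlier (one volume digit, four page digits, then character and virtual digits after the dot), can be sketched as follows. This is an illustration under those assumptions, not the library’s expansion code:

```python
import re

# Assumed location layout: 1 volume digit, 4 page digits,
# then 2 character digits and 1 virtual digit after the dot.
LOCATION_RE = re.compile(r"^(\d)(\d{4})\.(\d{2})(\d)$")

def expand_khanyupinyin(value):
    """Expand a raw kHanyuPinyin value into locations and readings."""
    entries = []
    for entry in value.split(" "):
        locations, _, readings = entry.partition(":")
        entries.append({
            "locations": [
                {
                    "volume": int(m.group(1)),
                    "page": int(m.group(2)),
                    "character": int(m.group(3)),
                    "virtual": int(m.group(4)),
                }
                for m in map(LOCATION_RE.match, locations.split(","))
            ],
            "readings": [r for r in readings.split(",") if r],
        })
    return entries
```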

Data could be exported to CSV, but users would be left to untangle the delimited values and structured information held within.

Since CSV does not support structured information, another format that does needs to be found.

Even then, users may not want an export that expands fields into structured output, so exports should be configurable. Users could then export a field value like gun3 hung1 zung1 pristinely, without turning it into list form.

Command Line Interface

Export UNIHAN to Python, Data Package, CSV, JSON and YAML

usage: unihan-etl [-h] [-v] [-s SOURCE] [-z ZIP_PATH] [-d DESTINATION]
                  [-w WORK_DIR] [-F {json,csv}] [--no-expand] [--no-prune]
                  [-f [FIELDS [FIELDS ...]]]
                  [-i [INPUT_FILES [INPUT_FILES ...]]]
                  [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

Named Arguments

-v, --version

show program’s version number and exit

-s, --source

URL or path of zipfile. Default: http://www.unicode.org/Public/UNIDATA/Unihan.zip

-z, --zip-path

Path the zipfile is downloaded to. Default: /home/docs/.cache/unihan_etl/downloads/Unihan.zip

-d, --destination

Output of .csv. Default: /home/docs/.local/share/unihan_etl/unihan.{json,csv,yaml}

-w, --work-dir

Default: /home/docs/.cache/unihan_etl/downloads

-F, --format

Possible choices: json, csv

Default: csv

--no-expand

Don’t expand values to lists in multi-value UNIHAN fields. Doesn’t apply to CSVs.

Default: True

--no-prune

Don’t prune fields with empty keys. Doesn’t apply to CSVs.

Default: True

-f, --fields

Fields to use in export. Separated by spaces. All fields used by default. Fields: kAccountingNumeric, kBigFive, kCCCII, kCNS1986, kCNS1992, kCangjie, kCantonese, kCheungBauer, kCheungBauerIndex, kCihaiT, kCompatibilityVariant, kCowles, kDaeJaweon, kDefinition, kEACC, kFenn, kFennIndex, kFourCornerCode, kFrequency, kGB0, kGB1, kGB3, kGB5, kGB7, kGB8, kGSR, kGradeLevel, kHDZRadBreak, kHKGlyph, kHKSCS, kHanYu, kHangul, kHanyuPinlu, kHanyuPinyin, kIBMJapan, kIICore, kIRGDaeJaweon, kIRGDaiKanwaZiten, kIRGHanyuDaZidian, kIRGKangXi, kIRG_GSource, kIRG_HSource, kIRG_JSource, kIRG_KPSource, kIRG_KSource, kIRG_MSource, kIRG_TSource, kIRG_USource, kIRG_VSource, kJIS0213, kJa, kJapaneseKun, kJapaneseOn, kJinmeiyoKanji, kJis0, kJis1, kJoyoKanji, kKPS0, kKPS1, kKSC0, kKSC1, kKangXi, kKarlgren, kKorean, kKoreanEducationHanja, kKoreanName, kLau, kMainlandTelegraph, kMandarin, kMatthews, kMeyerWempe, kMorohashi, kNelson, kOtherNumeric, kPhonetic, kPrimaryNumeric, kPseudoGB1, kRSAdobe_Japan1_6, kRSJapanese, kRSKanWa, kRSKangXi, kRSKorean, kRSUnicode, kSBGY, kSemanticVariant, kSimplifiedVariant, kSpecializedSemanticVariant, kTGH, kTaiwanTelegraph, kTang, kTotalStrokes, kTraditionalVariant, kVietnamese, kXHC1983, kXerox, kZVariant

-i, --input-files

Files inside zip to pull data from. Separated by spaces. All files used by default. Files: Unihan_DictionaryIndices.txt, Unihan_DictionaryLikeData.txt, Unihan_IRGSources.txt, Unihan_NumericValues.txt, Unihan_OtherMappings.txt, Unihan_RadicalStrokeCounts.txt, Unihan_Readings.txt, Unihan_Variants.txt

-l, --log_level

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL

API

Build Unihan into tabular / structured format and export it.

class unihan_etl.process.Packager(options)[source]

Download and generate a tabular release of UNIHAN.

download(urlretrieve_fn=<function urlretrieve>)[source]

Download raw UNIHAN data if not exists.

Parameters

urlretrieve_fn (function) – function to download file

export()[source]

Extract zip and process information into CSV’s.

classmethod from_cli(argv)[source]

Create Packager instance from CLI argparse arguments.

Parameters

argv (list) – Arguments passed in via CLI.

Returns

builder

Return type

Packager

unihan_etl.process.ALLOWED_EXPORT_TYPES = ['json', 'csv']

Allowed export types

unihan_etl.process.DESTINATION_DIR = '/home/docs/.local/share/unihan_etl'

Filepath to output built CSV file to.


unihan_etl.process.UNIHAN_FIELDS = ('kAccountingNumeric', 'kBigFive', 'kCCCII', 'kCNS1986', 'kCNS1992', 'kCangjie', 'kCantonese', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCompatibilityVariant', 'kCowles', 'kDaeJaweon', 'kDefinition', 'kEACC', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kFrequency', 'kGB0', 'kGB1', 'kGB3', 'kGB5', 'kGB7', 'kGB8', 'kGSR', 'kGradeLevel', 'kHDZRadBreak', 'kHKGlyph', 'kHKSCS', 'kHanYu', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_TSource', 'kIRG_USource', 'kIRG_VSource', 'kJIS0213', 'kJa', 'kJapaneseKun', 'kJapaneseOn', 'kJinmeiyoKanji', 'kJis0', 'kJis1', 'kJoyoKanji', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kKangXi', 'kKarlgren', 'kKorean', 'kKoreanEducationHanja', 'kKoreanName', 'kLau', 'kMainlandTelegraph', 'kMandarin', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kPseudoGB1', 'kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKanWa', 'kRSKangXi', 'kRSKorean', 'kRSUnicode', 'kSBGY', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTGH', 'kTaiwanTelegraph', 'kTang', 'kTotalStrokes', 'kTraditionalVariant', 'kVietnamese', 'kXHC1983', 'kXerox', 'kZVariant')

Default Unihan fields

unihan_etl.process.UNIHAN_FILES = dict_keys(['Unihan_DictionaryIndices.txt', 'Unihan_DictionaryLikeData.txt', 'Unihan_IRGSources.txt', 'Unihan_NumericValues.txt', 'Unihan_OtherMappings.txt', 'Unihan_RadicalStrokeCounts.txt', 'Unihan_Readings.txt', 'Unihan_Variants.txt'])

Default Unihan Files

unihan_etl.process.UNIHAN_URL = 'http://www.unicode.org/Public/UNIDATA/Unihan.zip'

URI of Unihan.zip data.

unihan_etl.process.UNIHAN_ZIP_PATH = '/home/docs/.cache/unihan_etl/downloads/Unihan.zip'

Filepath to download Zip file.

unihan_etl.process.WORK_DIR = '/home/docs/.cache/unihan_etl/downloads'

Directory to use for processing intermittent files.

unihan_etl.process.download(url, dest, urlretrieve_fn=<function urlretrieve>, reporthook=None)[source]

Download file at URL to a destination.

Parameters
  • url (str) – URL to download from.

  • dest (str) – file path where download is to be saved.

  • urlretrieve_fn (callable) – function to download file

  • reporthook (function) – Function to write progress bar to stdout buffer.

Returns

destination where file downloaded to.

Return type

str

unihan_etl.process.expand_delimiters(normalized_data)[source]

Return expanded multi-value fields in UNIHAN.

Parameters

normalized_data (list of dict) – Expects data in list of hashes, per process.normalize()

Returns

Items which have fields with delimiters and custom separation rules will be expanded. This includes multi-value fields not using both fields (so all fields stay consistent).

Return type

list of dict

unihan_etl.process.extract_zip(zip_path, dest_dir)[source]

Extract zip file. Return zipfile.ZipFile instance.

Parameters
  • zip_path (str) – filepath to extract.

  • dest_dir (str) – directory to extract to.

Returns

The extracted zip.

Return type

zipfile.ZipFile

unihan_etl.process.files_exist(path, files)[source]

Return True if all files exist in specified path.

unihan_etl.process.filter_manifest(files)[source]

Return filtered UNIHAN_MANIFEST from list of file names.

unihan_etl.process.get_fields(d)[source]

Return list of fields from dict of {filename: [‘field’, ‘field1’]}.

unihan_etl.process.get_parser()[source]

Return argparse.ArgumentParser instance for CLI.

Returns

argument parser for CLI use.

Return type

argparse.ArgumentParser

unihan_etl.process.has_valid_zip(zip_path)[source]

Return True if valid zip exists.

Parameters

zip_path (str) – absolute path to zip

Returns

True if valid zip exists at path

Return type

bool

unihan_etl.process.in_fields(c, fields)[source]

Return True if string is in the default fields.

unihan_etl.process.listify(data, fields)[source]

Convert tabularized data to a CSV-friendly list.

Parameters
  • data (list of dict) –

  • fields (list of str) – keys/columns, e.g. [‘kDictionary’]

unihan_etl.process.load_data(files)[source]

Extract zip and process information into CSV’s.

Parameters

files (list of str) –

Returns

combined data from files

Return type

str

unihan_etl.process.normalize(raw_data, fields)[source]

Return normalized data from UNIHAN data files.

Parameters
  • raw_data (str) – combined text files from UNIHAN

  • fields (list of str) – list of columns to pull

Returns

list of unihan character information

Return type

list

unihan_etl.process.not_junk(line)[source]

Return False on newlines and C-style comments.

unihan_etl.process.setup_logger(logger=None, level='DEBUG')[source]

Setup logging for CLI use.

Parameters
  • logger (Logger) – instance of logger

  • level (str) – logging level, e.g. ‘DEBUG’

unihan_etl.process.zip_has_files(files, zip_file)[source]

Return True if zip has the files inside.

Parameters
  • files (list of str) – files to check for
  • zip_file (zipfile.ZipFile) – zip file whose contents to check

Returns

True if the files are inside of zipfile.ZipFile.namelist()

Return type

bool

Constants

unihan_etl.constants.CUSTOM_DELIMITED_FIELDS = ('kDefinition', 'kDaeJaweon', 'kHDZRadBreak', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_TSource', 'kIRG_USource', 'kIRG_VSource')

FIELDS with multiple values via custom delimiters

unihan_etl.constants.INDEX_FIELDS = ('ucn', 'char')

Default index fields for unihan csv’s. You probably want these.

unihan_etl.constants.SPACE_DELIMITED_DICT_FIELDS = ('kHanYu', 'kXHC1983', 'kMandarin', 'kTotalStrokes')

Fields with multiple values UNIHAN delimits by spaces -> dict

unihan_etl.constants.SPACE_DELIMITED_FIELDS = ('kAccountingNumberic', 'kCantonese', 'kCCCII', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCowles', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kGSR', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kHKGlyph', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kJa', 'kJapaneseKun', 'kJapaneseOn', 'kJinmeiyoKanji', 'kJis0', 'kJIS0213', 'kJis1', 'kJoyoKanji', 'kKangXi', 'kKarlgren', 'kKorean', 'kKoreanEducationHanja', 'kKoreanName', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kLua', 'kMainlandTelegraph', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKangXi', 'kRSKanWa', 'kRSKorean', 'kRSUnicode', 'kSBGY', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTaiwanTelegraph', 'kTang', 'kTGH', 'kTraditionalVariant', 'kVietnamese', 'kXerox', 'kZVariant', 'kHanYu', 'kXHC1983', 'kMandarin', 'kTotalStrokes')

Any space delimited field regardless of expanded form

unihan_etl.constants.SPACE_DELIMITED_LIST_FIELDS = ('kAccountingNumberic', 'kCantonese', 'kCCCII', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCowles', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kGSR', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kHKGlyph', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kJa', 'kJapaneseKun', 'kJapaneseOn', 'kJinmeiyoKanji', 'kJis0', 'kJIS0213', 'kJis1', 'kJoyoKanji', 'kKangXi', 'kKarlgren', 'kKorean', 'kKoreanEducationHanja', 'kKoreanName', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kLua', 'kMainlandTelegraph', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKangXi', 'kRSKanWa', 'kRSKorean', 'kRSUnicode', 'kSBGY', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTaiwanTelegraph', 'kTang', 'kTGH', 'kTraditionalVariant', 'kVietnamese', 'kXerox', 'kZVariant')

Fields with multiple values UNIHAN delimits by spaces -> list

unihan_etl.constants.UNIHAN_MANIFEST = {'Unihan_DictionaryIndices.txt': ('kCheungBauerIndex', 'kCowles', 'kDaeJaweon', 'kFennIndex', 'kGSR', 'kHanYu', 'kIRGDaeJaweon', 'kIRGDaiKanwaZiten', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kKangXi', 'kKarlgren', 'kLau', 'kMatthews', 'kMeyerWempe', 'kMorohashi', 'kNelson', 'kSBGY'), 'Unihan_DictionaryLikeData.txt': ('kCangjie', 'kCheungBauer', 'kCihaiT', 'kFenn', 'kFourCornerCode', 'kFrequency', 'kGradeLevel', 'kHDZRadBreak', 'kHKGlyph', 'kPhonetic', 'kTotalStrokes'), 'Unihan_IRGSources.txt': ('kCompatibilityVariant', 'kIICore', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_TSource', 'kIRG_USource', 'kIRG_VSource'), 'Unihan_NumericValues.txt': ('kAccountingNumeric', 'kOtherNumeric', 'kPrimaryNumeric'), 'Unihan_OtherMappings.txt': ('kBigFive', 'kCCCII', 'kCNS1986', 'kCNS1992', 'kEACC', 'kGB0', 'kGB1', 'kGB3', 'kGB5', 'kGB7', 'kGB8', 'kHKSCS', 'kIBMJapan', 'kJa', 'kJinmeiyoKanji', 'kJis0', 'kJis1', 'kJIS0213', 'kJoyoKanji', 'kKoreanEducationHanja', 'kKoreanName', 'kKPS0', 'kKPS1', 'kKSC0', 'kKSC1', 'kMainlandTelegraph', 'kPseudoGB1', 'kTaiwanTelegraph', 'kTGH', 'kXerox'), 'Unihan_RadicalStrokeCounts.txt': ('kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKangXi', 'kRSKanWa', 'kRSKorean', 'kRSUnicode'), 'Unihan_Readings.txt': ('kCantonese', 'kDefinition', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kJapaneseKun', 'kJapaneseOn', 'kKorean', 'kMandarin', 'kTang', 'kVietnamese', 'kXHC1983'), 'Unihan_Variants.txt': ('kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTraditionalVariant', 'kZVariant')}

Dictionary of tuples mapping locations of files to fields

Expansion

Functions to uncompact details inside field values.

Notes

re.compile() operations are performed inside the expand functions, for three reasons:

  1. readability

  2. module-level function bytecode is cached in python

  3. the last used compiled regexes are cached
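
Point 3 refers to the re module’s internal cache: re.compile() (and the module-level re functions) memoize recently used patterns, so compiling inside a function body costs little on repeated calls. An illustrative expand function in that style (the field format here is invented for the example, not taken from UNIHAN):

```python
import re

def expand_example_field(value):
    """Compile at call time; re's internal cache avoids recompiling."""
    pattern = re.compile(r"^(?P<index>\d+)\.(?P<sub>\d+):(?P<reading>.+)$")
    match = pattern.match(value)
    return match.groupdict() if match else value
```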

unihan_etl.expansion.N_DIACRITICS = 'ńňǹ'

diacritics from kHanyuPinlu

unihan_etl.expansion.expand_field(field, fvalue)[source]

Return structured value of information in UNIHAN field.

Parameters
  • field (str) – field name

  • fvalue (str) – value of field

Returns

expanded field information per UNIHAN’s documentation

Return type

list or dict

Utilities and test helpers

Utility and helper methods for the script.

util
unihan_etl.util.ucn_to_unicode(ucn)[source]

Return a python unicode value from a UCN.

Converts a Unicode Universal Character Number (e.g. “U+4E00” or “4E00”) to Python unicode (u'\u4e00').
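
The conversion itself is a one-liner in modern Python; a sketch (`ucn_to_char` is a hypothetical stand-in, and the packaged ucn_to_unicode also accommodates python 2):

```python
def ucn_to_char(ucn):
    """Convert a UCN such as 'U+4E00' (or bare '4E00') to its character."""
    codepoint = ucn[2:] if ucn.startswith("U+") else ucn
    return chr(int(codepoint, 16))

print(ucn_to_char("U+4E00"))  # 一
```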

unihan_etl.util.ucnstring_to_python(ucn_string)[source]

Return string with Unicode UCN (e.g. “U+4E00”) converted to native Python Unicode (u'\u4e00').

unihan_etl.util.ucnstring_to_unicode(ucn_string)[source]

Return ucnstring as Unicode.

Test helper functions for downloading and processing Unihan data.

unihan_etl.test.assert_dict_contains_subset(subset, dictionary, msg=None)[source]

Ported assertion for dict subsets in py.test.

Parameters
  • subset (dict) – needle

  • dictionary (dict) – haystack

  • msg (str, optional) – message display if assertion fails

Frequently Asked Questions

… Why are some fields, e.g. kTotalStrokes, in lists when there’s seemingly not any multi-value data?

The word back from the developers of UNIHAN is they keep some fields multi-valued for future use.

Apparently at the moment there is only one record with two values for the kTotalStrokes field in the Unihan database. However, the maintainers of the data intend to populate the kTotalStrokes field as needed in the future, and as documented in UAX #38.

May 30, 2017 (Unicode 9.0)

unihan-etl is designed to handle fields correctly and consistently according to the documentation in the database.

History

unihan-etl 0.10.4 (2020-08-05)

  • Update CHANGES headings to produce working links

  • Relax appdirs version constraint

  • #228 Move from Pipfile to poetry

unihan-etl 0.10.3 (2019-08-18)

  • Fix flicker in download progress bar

unihan-etl 0.10.2 (2019-08-17)

  • Add project_urls to setup.py

  • Use plain reStructuredText for CHANGES

  • Use collections that’s compatible with python 2 and 3

  • PEP8 tweaks

unihan-etl 0.10.1 (2017-09-08)

  • Add code links in API

  • Add __version__ to unihan_etl

unihan-etl 0.10.0 (2017-08-29)

  • #91 New fields from UNIHAN Revision 25.

    • kJinmeiyoKanji

    • kJoyoKanji

    • kKoreanEducationHanja

    • kKoreanName

    • kTGH

    UNIHAN Revision 25 was released 2018-05-18 and issued for Unicode 11.0.

  • Add tests and example corpus for kCCCII

  • Add configuration / make tests for isort, flake8

  • Switch tmuxp config to use pipenv

  • Add Pipfile

  • Add make sync_pipfile task to sync requirements/*.txt files with Pipfile

  • Update and sync Pipfile

  • Developer package updates (linting / docs / testing)

    • isort 4.2.15 to 4.3.4

    • flake8 3.3.0 to 3.5.0

    • vulture 0.14 to 0.27

    • sphinx 1.6.2 to 1.7.6

    • alagitpull 0.0.12 to 0.0.21

    • releases 1.3.1 to 1.6.1

    • sphinx-argparse 0.2.1 to 1.6.2

    • pytest 3.1.2 to 3.6.4

  • Move documentation over to numpy-style

  • Add sphinxcontrib-napoleon 0.6.1

  • Update LICENSE New BSD to MIT

  • All future commits and contributions are licensed to the cihai software foundation. This includes commits by Tony Narlock (creator).

unihan-etl 0.9.5 (2017-06-26)

  • Enhance support for locations on kHDZRadBreak fields.

unihan-etl 0.9.4 (2017-06-05)

  • Fix kIRG_GSource without location

  • Fix kFenn output

  • Fix kHanyuPinlu support output for n diacritics

unihan-etl 0.9.3 (2017-05-31)

  • Add expansion for kIRGKangXi

unihan-etl 0.9.2 (2017-05-31)

  • Normalize Radical-Stroke expansion for kRSUnicode

  • Migrate more fields to regular expressions

  • Normalize character field for kDaeJaweon, kHanyuPinyin, and kCheungBauer, kFennIndex, kCheungBauerIndex, kIICore, kIRGHanyuDaZidian

unihan-etl 0.9.1 (2017-05-27)

  • Support for expanding kGSR

  • Convert some field expansions to use regexes

unihan-etl 0.9.0 (2017-05-26)

  • Fix bug where destination file was made into directory on first run

  • Rename from unihan-tabular to unihan-etl

  • Support for expanding multi-value fields

  • Support for pruning empty fields

  • Improve help dialog

  • Added a page about UNIHAN and the project to documentation

  • Split constant values into their own module

  • Split functionality for expanding unstructured values into its own module

unihan-etl 0.8.1 (2017-05-20)

  • Update to add kJa and adjust source file of kCompatibilityVariant per Unicode 8.0.0.

unihan-etl 0.8.0 (2017-05-17)

  • Support for configuring logging via options and CLI

  • Convert all print statements to use logger

unihan-etl 0.7.4 (2017-05-14)

  • Allow for local / file system sources for Unihan.zip

  • Only extract zip if unextracted

unihan-etl 0.7.3 (2017-05-13)

  • Update package classifiers

unihan-etl 0.7.2 (2017-05-13)

  • Add back datapackage

unihan-etl 0.7.1 (2017-05-12)

  • Fix python 2 CSV output

  • Default to CSV output

unihan-etl 0.7.0 (2017-05-12)

  • Move unicodecsv module to dependency package

  • Support for XDG directory specification

  • Support for custom destination output, including replacing template variable {ext}

unihan-etl 0.6.3 (2017-05-11)

  • Move __about__.py to module level

unihan-etl 0.6.2 (2017-05-11)

  • Fix python package import

unihan-etl 0.6.1 (2017-05-10)

  • Fix readme bug on pypi

unihan-etl 0.6.0 (2017-05-10)

  • Support for exporting in YAML and JSON

  • More internal factoring and simplification

  • Return data as list

unihan-etl 0.5.1 (2017-05-08)

  • Drop python 3.3 and 3.4 support

unihan-etl 0.5.0 (2017-05-08)

  • Rename from cihaidata_unihan to unihan_tabular

  • Drop datapackages in favor of a universal JSON, YAML and CSV export.

  • Only use UnicodeWriter in Python 2; fixes issue where python would prepend b to values

unihan-etl 0.4.2 (2017-05-07)

  • Rename scripts/ to cihaidata_unihan/

unihan-etl 0.4.1 (2017-05-07)

  • Enable invoking tool via $ cihaidata_unihan

unihan-etl 0.4.0 (2017-05-07)

  • Major internal refactor and simplification

  • Convert to pytest assert statements

  • Convert full test suite to pytest functions and fixtures

  • Get CLI documentation up again

  • Improve test coverage

  • Lint code, remove unused imports

  • Switch license BSD -> MIT

unihan-etl 0.3.0 (2017-04-17)

  • Rebooted

  • Modernize Makefile in docs

  • Add Makefile to main project

  • Modernize package metadata to use __about__.py

  • Update requirements to use requirements/ folder for base, testing and doc dependencies.

  • Update sphinx theme to alabaster with new logo.

  • Update travis to use coverall

  • Update links on README to use https

  • Update travis to test up to python 3.6

  • Add support for pypy (why not)

  • Lock base dependencies

  • Add dev dependencies for isort, vulture and flake8