Abkhazia’s documentation

Note

The source code is available at https://www.github.com/bootphon/abkhazia.

The abkhazia project makes it easy to obtain simple baselines for supervised ASR (using Kaldi) and ABX tasks (using ABXpy) on the large corpora of speech recordings typically used in speech engineering, linguistics or cognitive science research. To this end, abkhazia provides the following:

  • the definition of a standard format for speech corpora, largely inspired by the typical format used in Kaldi recipes
  • an abkhazia command-line tool for importing speech corpora to that standard format and performing various tasks on them, such as:
    • verifying the internal consistency of the data
    • extracting standard statistics about the composition of the corpus
    • training supervised ASR models on the corpus with Kaldi
    • computing ABX discriminability scores on various ABX tasks defined on the corpus
  • a Python library that can be extended to new corpora and new ASR models

Abkhazia also comes with a set of recipes for specific corpora, which can be applied directly to the raw distribution of these corpora to obtain a version in standard format. The only requirement is to have access to the raw distributions. Unfortunately, unlike in most other domains, large speech corpora are most of the time not freely available. The list of corpora supported by abkhazia is available with abkhazia prepare --help.

Installation and configuration

Note

First of all, clone the Abkhazia github repository and go to the created abkhazia directory:

git clone git@github.com:bootphon/abkhazia.git
cd ./abkhazia

Note

Abkhazia has been successfully installed on various Unix flavors (Debian, Ubuntu, CentOS) and on Mac OS. It should be possible to install it on a Windows system as well (through CygWin), but this has not been tested.

Use in docker

The simplest way to deploy abkhazia is to run it in a Docker container. Once you have Docker installed on your machine, build the image with:

docker build -t abkhazia .

Then you can run it for instance with:

docker run -it --rm abkhazia bash

You need to mount your corpus data inside the container (using the -v option of docker run) and modify the abkhazia configuration accordingly (see Configuration files). Read the Docker documentation here.
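For instance, assuming your raw corpora live in /data/corpora on the host (a hypothetical path), you could mount them read-only inside the container:

docker run -it --rm -v /data/corpora:/corpora:ro abkhazia bash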

Install dependencies

Before deploying Abkhazia on your system, you need to install the following dependencies: Kaldi, flac, sox, shorten and festival.

C++ compilers

You need to have both gcc and clang-3.9 installed. On Debian/Ubuntu, run:

sudo apt-get install gcc gfortran clang-3.9

Flac, sox and festival

  • Abkhazia relies on flac and sox for audio conversion from various file formats to wav.

    They should be in the repositories of every standard Unix distribution, for example in Debian/Ubuntu:

    sudo apt-get install flac sox
    
  • Abkhazia also needs festival to phonemize the transcriptions of the Childes Brent corpus. Visit this link for installation guidelines, or on Ubuntu/Debian use:

    sudo apt-get install festival
    

Shorten

shorten is used to convert the original shn audio files to wav; it must be installed manually. Follow these steps to download, compile and install it:

wget http://shnutils.freeshell.org/shorten/dist/src/shorten-3.6.1.tar.gz
tar xzf shorten-3.6.1.tar.gz
cd shorten-3.6.1
./configure
make
sudo make install

Kaldi

  • Kaldi is an open-source toolkit for HMM-based ASR. It is a collection of low-level C++ programs and high-level bash scripts. See here.

  • In brief: use install_kaldi.sh

    The ./install_kaldi.sh script will download, configure and compile Kaldi to ./kaldi. It should work on any standard Unix distribution, stopping at the first error encountered. If it fails, install Kaldi manually as detailed in the following steps.
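    Assuming you run it from the abkhazia root directory (where the script lives), the invocation is simply:

      ./install_kaldi.sh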

  • If install_kaldi.sh failed

    If the script failed, you need to install Kaldi manually with the following steps:

    • Because Kaldi is developed under continuous integration, there is no published release to rely on. To ensure abkhazia's stability, we therefore maintain a Kaldi fork that is guaranteed to work. Clone it in a directory of your choice and ensure the active branch is abkhazia (this should be the default):

      git clone git@github.com:bootphon/kaldi.git
      
    • Once cloned, you have to configure and compile Kaldi. Follow these instructions. Basically, you have to run (from the kaldi directory):

      cd tools
      ./extras/check_dependencies.sh
      make -j  # -j does a parallel build on multiple cores
      cd ../src
      ./configure
      make depend -j
      make -j
      
    • Install the Kaldi extra tools (the SRILM and IRSTLM libraries) required by abkhazia. From your local kaldi directory, type:

      cd ./tools
      ./extras/install_irstlm.sh
      ./extras/install_srilm.sh
      

Install Abkhazia

  • Run the configure script: it will check the dependencies for you and initialize a default configuration file in share/abkhazia.conf:

    ./configure
    

    The install_kaldi.sh script sets the KALDI_PATH environment variable to point to the installed Kaldi root directory. If you have installed Kaldi manually, or if the configure script complains about a missing KALDI_PATH, you need to specify it explicitly, for example:

    KALDI_PATH=./kaldi ./configure
    

    Rerun this script and correct the reported configuration errors until it succeeds. At minimum, you will be asked to specify the path to Kaldi (as installed in the previous step) in the configuration file.

  • Finally install abkhazia:

    python setup.py build
    [sudo] python setup.py install
    

    In case you want to modify and test the code in place, replace the last step by python setup.py develop.

  • To build the documentation (the one you are actually reading), install sphinx (with pip install Sphinx) and, from the abkhazia root directory, run:

    sphinx-build -b html ./docs/source ./docs/html
    

    Then open the file ./docs/html/index.html with your favorite browser.

Configuration files

Abkhazia relies on two configuration files, abkhazia.conf and queue.conf. These files are generated at install time (during the configuration step) and written to the share directory, but they are usually copied to the installation directory (e.g. in /usr/bin).

Note

To find where the configuration files are located on your setup, run:

abkhazia --help

abkhazia.conf

The abkhazia.conf configuration file defines a set of general parameters that are used by Abkhazia.

  • abkhazia.data-directory is the directory where abkhazia writes its data (corpora, recipes and results are stored here). During installation, the data directory is configured to point to a data folder in the abkhazia source tree. You can specify a path to another directory, possibly on another partition.
  • kaldi.kaldi-directory is the path to an installed kaldi distribution. This path is configured during abkhazia installation.
  • kaldi.{train, decode, highmem}-cmd set up the parallelization used to run the Kaldi recipes. Choose either run.pl to run locally or queue.pl to use a cluster managed by Sun GridEngine.
  • raw corpora directories can be specified in the corpus section of the configuration file.
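For illustration, here is what the entries listed above might look like in abkhazia.conf (the values and the buckeye-directory entry are hypothetical examples; refer to the file generated on your setup for the exact layout):

[abkhazia]
data-directory: /home/me/abkhazia-data

[kaldi]
kaldi-directory: /home/me/kaldi
train-cmd: run.pl
decode-cmd: run.pl
highmem-cmd: run.pl

[corpus]
buckeye-directory: /data/raw/buckeye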

queue.conf

You should adapt this file to your cluster configuration.

The queue.conf configuration file is related to parallel processing in Kaldi when queue.pl is used. It allows you to forward options to the Sun GridEngine when submitting jobs. See this page for details.

Run the tests

Note

The tests are actually based on the Buckeye corpus, so you must provide the path to the raw Buckeye distribution before launching the tests. Enter this path in the buckeye-directory entry in the Abkhazia configuration file.

  • Abkhazia comes with a test suite. From the abkhazia root directory, run it using:

    pytest ./test
    
  • To install the pytest package, simply run:

    [sudo] pip install pytest
    
  • If you run the tests on a cluster and you configured Abkhazia to use Sun GridEngine, you must put the temporary directory on a shared filesystem with pytest ./test --basetemp=mydir/tmp.

Speech corpus

Format definition

A standardized corpus is stored as a directory composed of the following:

  • wavs: subfolder containing the speech recordings in wav, either as files or symbolic links
  • segments.txt: list of utterances with a description of their location in the wavefiles
  • utt2spk.txt: list giving the speaker associated with each utterance
  • text.txt: transcription of each utterance in word units
  • phones.txt: phone inventory mapped to IPA
  • lexicon.txt: phonetic dictionary using that inventory
  • silences.txt: list of silence symbols
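For example, a prepared corpus directory might look like this (the file names under wavs are hypothetical):

my_corpus/
    wavs/
        sp001.wav
        sp109.wav
        ...
    segments.txt
    utt2spk.txt
    text.txt
    phones.txt
    lexicon.txt
    silences.txt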

Supported corpora

Supported corpora are (see also abkhazia prepare --help):

  • Articulation Index Corpus LSCP
  • Buckeye Corpus of conversational speech
  • Child Language Data Exchange System (only Brent part for now)
  • Corpus of Interactional Data
  • Corpus of Spontaneous Japanese
  • GlobalPhone multilingual read speech corpus (Vietnamese and Mandarin)
  • LibriSpeech ASR Corpus
  • Wall Street Journal ASR Corpus
  • NCHLT Xitsonga Speech Corpus

Once you have the raw data, you can import any of the above corpora in the standardized Abkhazia format using the abkhazia prepare command, for example:

abkhazia prepare csj /path/to/raw/csj -o ./prepared_csj --verbose

Note that many corpora do not form a homogeneous whole, but are constituted of several homogeneous subparts. For example, in the core subset of the CSJ corpus, spontaneous presentations from academics (files whose names start with an ‘A’), spontaneous presentations from laymen (‘S’ files), readings (‘R’ files) and dialogs (‘D’ files) form homogeneous sub-corpora. If you expect the differences between the subparts to have an impact on the results of standard ABX and kaldi analyses, you should generate a separate standardized corpus for each of them.

Adding new corpora

  • Make a new Python class which inherits from abkhazia.corpus.prepare.abstract_preparator. You only need to implement a few methods to populate the transcriptions, lexicon, etc. See the section below and the abstract preparator code for detailed specifications, and the existing preparators for examples.
  • To access your new corpus from the command line, register it in abkhazia.commands.abkhazia_prepare. An intermediate factory class can be defined to add command-line arguments, or the default AbstractFactory class can be used (if your corpus preparation relies on the CMU dictionary, use AbstractFactoryWithCMU instead).

Detailed files format

Note

File formats are often, but not always, identical to Kaldi standard file formats.

1. Speech recordings

A folder called wavs containing all the wavefiles of the corpus in standard mono 16-bit PCM wav format sampled at 16 kHz. The standard kaldi and ABX analyses might work with other kinds of wavefiles (in particular other sampling frequencies), but this has not been tested. The wavs can be either actual files or symbolic links.

2. List of utterances

A text file called segments.txt containing the list of all utterances with the name of the associated wavefile (just the filename, not the entire path) and, if there is more than one utterance per file, the start and end of the utterance in that wavefile expressed in seconds (the designated segment of the audio file can include some silence before and after the utterance).

The file should contain one entry per line in the following format:

<utterance-id> <wav-filename> <segment-begin> <segment-end>

or if there is only one utterance in a given wavefile:

<utterance-id> <wav-filename>

Each utterance must have its own unique utterance-id. Moreover, all utterance ids must begin with a unique identifier (the speaker-id) for the speaker of the utterance. In addition, all speaker ids must have the same length.

Here is an example file with three utterances:

sp001-sentence001 sp001.wav 53.2 55.4
sp001-sentence005 sp001.wav 65.1 66.9
sp109-sentence003 sp109-sentence003.wav

3. List of speakers

A text file called utt2spk.txt containing the list of all utterances with a unique identifier for the associated speaker (the speaker-id mentioned in the previous section). As said previously, all utterance ids must be prefixed by the corresponding speaker-id and all speaker-ids must have the same length.

The file should contain one entry per line in the following format:

<utterance-id> <speaker-id>

Here is an example file with three utterances:

sp001-sentence001 sp001
sp001-sentence005 sp001
sp109-sentence003 sp109

If you don’t have this information, or wish to hide it from kaldi while still conforming to this dataset format, you should give each utterance its own unique speaker-id (as explained here), e.g.:

sentence001 sp001
sentence002 sp002
sentence003 sp003
sentence004 sp004
....

4. Transcription

A text file called text.txt, containing the transcription in word units for each utterance. Word units should correspond to elements in the phonetic dictionary (having a few out-of-vocabulary words is not a problem). The file should contain one entry per line in the following format:

<utterance-id> <word1> <word2> ... <wordn>

Here is an example file with two utterances:

sp001-sentence001 ON THE OTHER HAND
sp003-sentence002 YOU HAVE DIFFERENT FINGERS

5. Phone inventory

A UTF-8 encoded text file called phones.txt and an optional text file called silences.txt, also UTF-8 encoded.

phones.txt contains a list of each symbol used in the pronunciation dictionary (cf. next section) with the associated IPA transcription (https://en.wikipedia.org/wiki/International_Phonetic_Alphabet). The idea is to use IPA transcriptions as consistently as possible throughout the different corpora, speaking styles, languages, etc. To this effect, when mapping a new corpus to IPA you can take inspiration from previously mapped corpora.

In addition to the phonetic annotations, if noise or silence markers are used in your corpus (if you’re using a standard pronunciation dictionary with some read text, there won’t be any silence or noise markers in the transcriptions), you must provide the list of these markers in a file called silences.txt. Two markers will be added automatically in all cases if they aren’t already present: SIL for optional short pauses inside or between words and SPN for spoken noise (any out-of-vocabulary item encountered during training is automatically transcribed by kaldi to SPN). If your corpus already contains other markers for short pauses or for spoken noise, convert them to SIL and SPN; reciprocally, make sure that SIL and SPN aren’t already used for something else in your corpus.

The file phones.txt should contain one entry per line in the following format:

<phone-symbol> <ipa-symbol>

The file silences.txt should contain one entry per line in the following format:

<marker-symbol>

Here is an example for phones.txt:

a a
sh ʃ
q ʔ

An example for silences.txt:

SIL
Noise

In this example SIL could have been omitted since it would have been added automatically. SPN will be added automatically.

Phones with tonal, stress or other variants

If a given phone has variants, such as stress or tonal variants, an additional file is needed. By default, kaldi allows parameter-tying between the HMM states of all the contextual variants of a given phone when training triphone models. To also allow parameter-tying between the HMM states of other variants of a given phone, such as tonal or stress variants, you need two things:

  • First, all the variants must be listed as separate items in the phones.txt file

  • Second, you must provide a variants.txt file containing one line for each group of phones with tonal or stress variants in the following format:

    <phone_variant_1> <phone_variant_2> ... <phone_variant_n>
    

Note that you can also use the variants.txt file to allow parameter-tying between states of some or all of the HMM models for silences and noises.

For example here is a phones.txt containing 5 vowels, two of which have tonal variants:

a1 a˥
a2 a˥˩
e ə
i i
o1 o˧
o2 o˩
o3 o˥
u u

An associated silences.txt defining a marker for speechless singing (SIL and SPN markers will be added automatically):

SING

And the variants.txt grouping the tonal variants and also allowing parameter sharing between the models for spoken noise and speechless singing:

a1 a2
o1 o2 o3
SPN SING

6. Phonetic dictionary

A text file called lexicon.txt containing a list of words with their phonetic transcription. The words should correspond to the words used in the utterance transcriptions of the corpus; the phones should correspond to the phones used in the original phoneset (not IPA) of the corpus (see previous sections). The dictionary can contain more words than necessary. Any word from the transcriptions that is not in the dictionary will be ignored for ABX analyses and will be mapped by kaldi to an out-of-vocabulary special item <unk> transcribed as SPN (spoken noise, see previous section). If no entry <unk> is present in the dictionary it will be automatically added.

Depending on your purposes, the unit in the dictionary can be lexical words (e.g. for a corpus of read speech without detailed phonetic transcription), detailed pronunciation variants of words (e.g. for a corpus of spontaneous speech with detailed phonetic transcription), phonemes… The dictionary can also contain special entries for noise and silence if they are explicitly transcribed in the corpus, as in TIMIT for example.

Each line of the file contains the entry for a particular word, in the following format:

<word> <phone_1> <phone_2> ... <phone_n>

Here is an example lexicon containing two words and using the TIMIT phoneset:

anyone eh n iy w ah n
monitor m aa n ah t er

7. Time-alignments (Optional)

Not yet supported.

A text file called phone_alignment.txt, containing a beginning and end timestamp for each phone of each utterance in the corpus. The file should contain one entry per line in the following format:

<utterance-id> <phone_start> <phone_end> <phone_symbol>

The timestamps are in seconds and are given relative to the beginning of each utterance. The phone symbols correspond to those used in the pronunciation dictionary (not to the IPA transcriptions).

Here is an example file with two utterances containing three and two phones respectively:

sp001-sentence001 1.211 1.256 a1
sp001-sentence001 1.256 1.284 t
sp001-sentence001 1.284 1.340 o3
sp109-sentence003 0.331 0.371 u
sp109-sentence003 0.371 0.917 sh

8. Language model (Optional)

Not yet supported.

9. Syllabification (Optional)

Not yet supported.

Forced Alignment

This tutorial covers the usage of abkhazia to do phone-level forced alignment on your own corpus of annotated audio files.

Prerequisites

Here’s what you need to have before being able to follow this tutorial:

  • A set of audio files encoded in 16 kHz 16-bit PCM WAV on which to run the alignment
  • On these audio files, a set of segments corresponding to utterances. For each utterance, you’ll need to have a phonemic transcription (an easy way to get these is by using Phonemizer)

It’s also recommended (yet optional) to have some kind of reference file in which you can identify the speaker of each of your phonemized utterances.
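For instance, with the phonemizer package installed, a phonemic transcription of English text can be obtained along these lines (the file names are hypothetical and the exact options depend on your phonemizer version, so check phonemize --help):

phonemize --language en-us --backend espeak utterances.txt -o utterances_phonemized.txt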

Corpus format

The corpus format is the same as the one specified in the format definition above, except that two corpus files have a more specific format, namely text.txt and lexicon.txt. Here, text.txt is composed of the phonemic transcription of each utterance:

<utterance-id> <pho1> <pho2> ... <phoN>
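For example (utterance ids and phones are hypothetical):

sp001-utt001 h ə l oʊ
sp001-utt002 w ɜ l d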

and lexicon.txt is just a “phony” file containing phonemes mapped to themselves:

<pho1> <pho1>
<pho2> <pho2>
<pho3> <pho3>
...
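If you already have a phones.txt inventory as described in the format definition, a simple way to generate such a phony lexicon is the following sketch (it assumes the file layouts described above):

awk '{print $1, $1}' phones.txt > lexicon.txt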

Doing the Forced Alignment

Once you’ve gathered all the required files (cited above) in a corpus/ folder (the name is obviously arbitrary), you’ll want to validate the corpus to check that it conforms to Kaldi’s input format. Abkhazia luckily does that for us:

abkhazia validate corpus/

Then, we’ll compute the language model (actually, here, a phonetic model) for your dataset. Note that even though we set the model level (option -l) to “word”, this still works fine since all words here are phonemes:

abkhazia language corpus/ -l word -n 3 -v

We’ll now extract the MFCC features from the wav files:

abkhazia features mfcc corpus/ --cmvn

Then, using the language model and the extracted MFCCs, compute a triphone HMM-GMM acoustic model:

abkhazia acoustic monophone -v corpus/ --force --recipe
abkhazia acoustic triphone -v corpus/

If you specified the speaker for each utterance, you can adapt your model per speaker:

abkhazia acoustic triphone-sa -v corpus/

And then, at last, we can compute the forced phonetic alignments:

abkhazia align corpus -a corpus/triphone-sa # if you computed the speaker-adapted triphones
abkhazia align corpus -a corpus/triphone # if you didn't

If everything went right, you should be able to find your alignment in corpus/align/alignments.txt. The file will have the following row structure:

<utt_id> <pho_start> <pho_end> <pho_name> <pho_symbol>
...

Note that the phonemes’ start and end time markers (in seconds) are relative to the utterance in which they are contained, not to the entire audio file.
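If you need timestamps relative to the wavefiles instead, you can offset each phone by the start time of its utterance as given in segments.txt. Here is a minimal sketch, assuming the segments.txt and alignments.txt layouts described above (utterances spanning a whole wavefile have no start field, which awk treats as 0):

awk 'NR==FNR {start[$1] = $3; next}
     {printf "%s %.3f %.3f %s %s\n", $1, $2+start[$1], $3+start[$1], $4, $5}' \
    corpus/segments.txt corpus/align/alignments.txt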

Command-line tool

Once abkhazia is installed on your system, use it from a terminal with the abkhazia command. To begin with, type:

abkhazia --help

As you can see, the program is made of several subcommands, which are detailed below. It also reads a configuration file (the default one is installed with abkhazia); you can overload it by specifying the --config <config-file> option.

Commands

For more details on a specific command, type abkhazia <command> --help.

validate: [corpus]

Check if a corpus stored on disk is in a valid format and ready to be used by abkhazia.

prepare: [raw] -> [corpus]

Prepare a speech corpus from its raw distribution format to the abkhazia format. Write the directory <corpus>/data.

split: [corpus] -> [corpus], [corpus]

Split a speech corpus in train and test sets. Write the directories <corpus>/train and <corpus>/test.

language: [corpus] -> [lm]

Generate a language model from a prepared corpus. Write the directory <corpus>/language.

features: [corpus] -> [features]

Compute speech features on the given corpus. Write the directory <corpus>/features.

acoustic: [corpus], [features] -> [model]

Train an acoustic model from a prepared corpus and its features. Write the directory <corpus>/model.

align: [corpus], [model], [lm] -> [result]

Generate a forced-alignment from acoustic and language models. Write the directory <corpus>/align.

decode: [corpus], [model], [lm] -> [result]

Decode a prepared corpus from a HMM-GMM model and a language model. Write the directory <corpus>/decode.
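Putting it all together, a typical pipeline chains these commands; for instance (the corpus name and paths are hypothetical, and only command forms shown earlier in this documentation are used):

abkhazia prepare csj /path/to/raw/csj -o ./csj --verbose
abkhazia validate ./csj
abkhazia language ./csj -l word -n 3 -v
abkhazia features mfcc ./csj --cmvn
abkhazia acoustic monophone -v ./csj
abkhazia acoustic triphone -v ./csj
abkhazia align ./csj -a ./csj/triphone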

Python API