Abkhazia’s documentation¶
Note
The source code is available at https://www.github.com/bootphon/abkhazia.
The abkhazia project makes it easy to obtain simple baselines for supervised ASR (using Kaldi) and ABX tasks (using ABXpy) on the large corpora of speech recordings typically used in speech engineering, linguistics or cognitive science research. To this end, abkhazia provides the following:
- the definition of a standard format for speech corpora, largely inspired by the typical format used for kaldi recipes
- an abkhazia command-line tool for importing speech corpora to that standard format and performing various tasks on them
- a Python library that can be extended to new corpora and new ASR models

The tasks performed by abkhazia include:
- verifying the internal consistency of the data
- extracting some standard statistics about the composition of the corpus
- training supervised ASR models on the corpus with kaldi
- computing ABX discriminability scores in various ABX tasks defined on the corpus
Abkhazia also comes with a set of recipes for specific corpora, which
can be applied to the raw distribution of these corpora directly to
obtain a version in standard format. The only requirement is to have
access to the raw distributions. Unfortunately, unlike most other
domains, large speech corpora are most of the time not freely
available. List the corpora supported by abkhazia with:
abkhazia prepare --help
Installation and configuration¶
Note
First of all, clone the Abkhazia github repository and go to the created abkhazia directory:
git clone git@github.com:bootphon/abkhazia.git
cd ./abkhazia
Note
Abkhazia has been successfully installed on various Unix flavours (Debian, Ubuntu, CentOS) and on Mac OS. It should be possible to install it on a Windows system as well (through Cygwin), but this has not been tested.
Use in docker¶
The simplest way to deploy abkhazia is to run it in a Docker container. Once you have Docker installed on your machine, build the image with:
docker build -t abkhazia .
Then you can run it, for instance, with:
docker run -it --rm abkhazia bash
You need to mount your corpus data inside the container (using the -v option of docker run) and modify the abkhazia configuration accordingly (see Configuration files). Read the Docker documentation for details.
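For instance, assuming your raw corpora are stored in /path/to/corpora on the host (a hypothetical path), you could mount them under /corpora in the container with:

docker run -it --rm -v /path/to/corpora:/corpora abkhazia bash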
Install dependencies¶
Before deploying Abkhazia on your system, you need to install the following dependencies: Kaldi, flac, sox, shorten and festival.
C++ compilers¶
You need to have both gcc and clang-3.9 installed. On Debian/Ubuntu, install them with:
sudo apt-get install gcc gfortran clang-3.9
Flac, sox and festival¶
Abkhazia relies on flac and sox for audio conversion from various file formats to wav.
They should be available in the repositories of every standard Unix distribution, for example on Debian/Ubuntu:
sudo apt-get install flac sox
Abkhazia also needs festival to phonemize the transcriptions of the CHILDES Brent corpus. See the festival documentation for installation guidelines, or on Ubuntu/Debian use:
sudo apt-get install festival
Shorten¶
shorten is used to convert the original shn audio files to wav; it must be installed manually. Follow these steps to download, compile and install it:
wget http://shnutils.freeshell.org/shorten/dist/src/shorten-3.6.1.tar.gz
tar xzf shorten-3.6.1.tar.gz
cd shorten-3.6.1
./configure
make
sudo make install
Kaldi¶
Kaldi is an open-source toolkit for HMM-based ASR. It is a collection of low-level C++ programs and high-level bash scripts. See http://kaldi-asr.org for details.
In brief: use install_kaldi.sh

The ./install_kaldi.sh script will download, configure and compile Kaldi to ./kaldi. This should work on any standard Unix distribution and will fail on the first encountered error.

If install_kaldi.sh failed

In that case, you need to install Kaldi manually with the following steps:
Because Kaldi is developed under continuous integration, there is no published release to rely on. To ensure abkhazia's stability, we therefore maintain a Kaldi fork that is guaranteed to work. Clone it into a directory of your choice and ensure the active branch is abkhazia (this should be the default):
git clone git@github.com:bootphon/kaldi.git
Once cloned, you have to install Kaldi (configuration and compilation). Follow the official Kaldi installation instructions. Basically, you have to do (from the kaldi directory):

cd tools
./extras/check_dependencies.sh
make -j  # -j does a parallel build on multiple cores
cd ../src
./configure
make depend -j
make -j
Install the Kaldi extra tools (the SRILM and IRSTLM libraries) required by abkhazia. From your local kaldi directory, type:

cd ./tools
./extras/install_irstlm.sh
./extras/install_srilm.sh
Install Abkhazia¶
The configure script will check the dependencies for you and initialize a default configuration file in share/abkhazia.conf:

./configure
The install_kaldi.sh script sets the KALDI_PATH environment variable to point to the installed Kaldi root directory. If you have installed Kaldi manually, or if the configure script complains about a missing KALDI_PATH, you need to specify it yourself, for example:

KALDI_PATH=./kaldi ./configure
Rerun this script and correct the reported configuration errors until it succeeds. At the very least, you will be asked to specify the path to Kaldi (as installed in the previous step) in the configuration file.
Finally install abkhazia:
python setup.py build
[sudo] python setup.py install
In case you want to modify and test the code in place, replace the last step by python setup.py develop.

To build the documentation (the one you are actually reading), install sphinx (with pip install Sphinx) and, from the Abkhazia root directory, run:

sphinx-build -b html ./docs/source ./docs/html

Then open the file ./docs/html/index.html with your favorite browser.
Configuration files¶
Abkhazia relies on two configuration files, abkhazia.conf and queue.conf. Those files are generated at install time (during the configuration step) and written to the share directory, but they are usually copied to the installation directory (e.g. in /usr/bin).
Note
To know where the configuration files are located on your setup, run:
abkhazia --help
abkhazia.conf¶
The abkhazia.conf
configuration file defines a set of general
parameters that are used by Abkhazia.
- abkhazia.data-directory is the directory where abkhazia writes its data (corpora, recipes and results are stored here). During installation, the data directory is configured to point to a data folder in the abkhazia source tree. You can specify a path to another directory, maybe on another partition.
- kaldi.kaldi-directory is the path to an installed kaldi distribution. This path is configured during abkhazia installation.
- kaldi.{train, decode, highmem}-cmd set up the parallelization used to run the Kaldi recipes. Choose either run.pl to run locally or queue.pl to use a cluster managed with Sun GridEngine.
- raw corpora directories can be specified in the corpus section of the configuration file.
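As an illustration, here is a hypothetical excerpt of abkhazia.conf, assuming an INI-style layout with one section per parameter prefix (the exact syntax is given by the file generated at configuration time):

[abkhazia]
data-directory: /path/to/abkhazia/data

[kaldi]
kaldi-directory: /path/to/kaldi
train-cmd: run.pl
decode-cmd: run.pl
highmem-cmd: run.pl

[corpus]
buckeye-directory: /path/to/raw/buckeye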
Run the tests¶
Note
The tests are based on the Buckeye corpus, so you must provide the path to the raw Buckeye distribution before launching the tests. Enter this path in the buckeye-directory entry of the Abkhazia configuration file.
Abkhazia comes with a test suite; from the abkhazia root directory, run it using:
pytest ./test
To install the pytest package, simply run:

[sudo] pip install pytest
If you run the tests on a cluster and you configured Abkhazia to use Sun GridEngine, you must put the temp directory on a shared filesystem with pytest ./test --basetemp=mydir/tmp.
Speech corpus¶
Format definition¶
A standardized corpus is stored as a directory composed of the following:
- wavs/: subfolder containing the speech recordings in wav format, either as files or symbolic links
- segments.txt: list of utterances with a description of their location in the wav files
- utt2spk.txt: list containing the speaker associated with each utterance
- text.txt: transcription of each utterance in word units
- phones.txt: phone inventory mapped to IPA
- lexicon.txt: phonetic dictionary using that inventory
- silences.txt: list of silence symbols
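As an illustration, a standardized corpus directory (the root name my_corpus is arbitrary) could look like this, with wav filenames borrowed from the examples given later in this section:

my_corpus/
    wavs/
        sp001.wav
        sp109-sentence003.wav
    segments.txt
    utt2spk.txt
    text.txt
    phones.txt
    lexicon.txt
    silences.txt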
Supported corpora¶
Supported corpora are (see also abkhazia prepare --help):
- Articulation Index Corpus LSCP
- Buckeye Corpus of conversational speech
- Child Language Data Exchange System (only Brent part for now)
- Corpus of Interactional Data
- Corpus of Spontaneous Japanese
- GlobalPhone multilingual read speech corpus (Vietnamese and Mandarin)
- LibriSpeech ASR Corpus
- Wall Street Journal ASR Corpus
- NCHLT Xitsonga Speech Corpus
Once you have the raw data, you can import any of the above corpora into the standardized Abkhazia format using the abkhazia prepare command, for example:
abkhazia prepare csj /path/to/raw/csj -o ./prepared_csj --verbose
Note that many corpora do not form a homogeneous whole, but are made up of several homogeneous subparts. For example, in the core subset of the CSJ corpus, spontaneous presentations from academics (files whose names start with an ‘A’), spontaneous presentations from laymen (‘S’ files), readings (‘R’ files) and dialogs (‘D’ files) form homogeneous sub-corpora. If you expect the differences between these subparts to have an impact on the results of standard ABX and kaldi analyses, you should generate a separate standardized corpus for each of them.
Adding new corpora¶
- Make a new Python class which inherits from abkhazia.corpus.prepare.abstract_preparator. You need to implement a few methods to populate the transcriptions, lexicon, etc. See the section below and the abstract preparator code for detailed specifications, and the existing preparators for examples.
- To access your new corpus from the command line, register it in abkhazia.commands.abkhazia_prepare. An intermediate factory class can be defined to provide additional command-line arguments, or the default AbstractFactory class can be used (if your corpus preparation relies on the CMU dictionary, use AbstractFactoryWithCMU instead).
Detailed files format¶
Note
File formats are often, but not always, identical to Kaldi standard file formats.
1. Speech recordings¶
A folder called wavs
containing all the wavefiles of the corpus in
a standard mono 16-bit PCM wav format sampled at 16 kHz. The standard
kaldi and ABX analyses might work with other kinds of wavefiles (in
particular other sampling frequencies) but this has not been tested.
The wavs can be either links or files.
2. List of utterances¶
A text file called segments.txt
containing the list of all
utterances with the name of the associated wavefiles (just the
filename, not the entire path) and if there is more than one utterance
per file, the start and end of the utterance in that wavefile
expressed in seconds (the designated segment of the audio file can
include some silence before and after the utterance).
The file should contain one entry per line in the following format:
<utterance-id> <wav-filename> <segment-begin> <segment-end>
or if there is only one utterance in a given wavefile:
<utterance-id> <wav-filename>
Each utterance should have a unique utterance-id. Moreover, all utterance ids must begin with a unique identifier for the speaker of the utterance (the speaker-id). In addition, all speaker ids must have the same length.
Here is an example file with three utterances:
sp001-sentence001 sp001.wav 53.2 55.4
sp001-sentence005 sp001.wav 65.1 66.9
sp109-sentence003 sp109-sentence003.wav
3. List of speakers¶
A text file called utt2spk.txt
containing the list of all utterances
with a unique identifier for the associated speaker (the speaker-id
mentioned in the previous section). As said previously, all
utterance ids must be prefixed by the corresponding speaker-id
. In
addition, all speaker-ids must have the same length.
The file should contain one entry per line in the following format:
<utterance-id> <speaker-id>
Here is an example file with three utterances:
sp001-sentence001 sp001
sp001-sentence005 sp001
sp109-sentence003 sp109
If you don’t have this information, or wish to hide it from kaldi while still conforming to this dataset format, you should set each utterance to its own unique speaker-id (as recommended in the Kaldi documentation), e.g.:

sentence001 sp001
sentence002 sp002
sentence003 sp003
sentence004 sp004
....
4. Transcription¶
A text file called text.txt
, containing the transcription in word
units for each utterance. Word units should correspond to elements in
the phonetic dictionary (having a few out-of-vocabulary words is not a
problem). The file should contain one entry per line in the following
format:
<utterance-id> <word1> <word2> ... <wordn>
Here is an example file with two utterances:
sp001-sentence001 ON THE OTHER HAND
sp003-sentence002 YOU HAVE DIFFERENT FINGERS
5. Phone inventory¶
A UTF-8 encoded text file called phones.txt and an optional text file called silences.txt, also UTF-8 encoded.
phones.txt
contains a list of each symbol used in the pronunciation
dictionary (cf. next section) with the associated IPA transcription
(https://en.wikipedia.org/wiki/International_Phonetic_Alphabet). The
idea is to use IPA transcriptions as consistently as possible throughout the different corpora, speaking styles, languages, etc. To this effect, when mapping a new corpus to IPA you can take inspiration from previously mapped corpora.
In addition to the phonetic annotations, if noise or silence markers are used in your corpus (if you are using a standard pronunciation dictionary with some read text, there won’t be any silence or noise marker in the transcriptions), you must provide the list of these markers in a file called silences.txt. Two markers will be added automatically in all cases if they aren’t already present: SIL for optional short pauses inside or between words and SPN for spoken noise (any out-of-vocabulary item encountered during training will automatically be transcribed by kaldi to SPN). If your corpus already contains other markers for short pauses or for spoken noise, convert them to SIL and SPN; reciprocally, make sure that SIL and SPN aren’t already used for something else in your corpus.
The file phones.txt
should contain one entry per line in the
following format:
<phone-symbol> <ipa-symbol>
The file silences.txt
should contain one entry per line in the
following format:
<marker-symbol>
Here is an example for phones.txt:
a a
sh ʃ
q ʔ
An example for silences.txt:
SIL
Noise
In this example SIL could have been omitted, since it would have been added automatically. SPN will be added automatically.
Phones with tonal, stress or other variants¶
Having variants of a given phone, such as stress or tonal variants, requires an additional file. By default, kaldi allows parameter-tying between the HMM states of all the contextual variants of a given phone when training triphone models. To also allow parameter-tying between the HMM states of other variants of a given phone, such as tonal or stress variants, you need two things:
First, all the variants must be listed as separate items in the phones.txt file.

Second, you must provide a variants.txt file containing one line for each group of phones with tonal or stress variants, in the following format:

<phone_variant_1> <phone_variant_2> ... <phone_variant_n>
Note that you can also use the variants.txt
file to allow
parameter-tying between states of some or all of the HMM models for
silences and noises.
For example here is a phones.txt
containing 5 vowels, two of which
have tonal variants:
a1 a˥
a2 a˥˩
e ə
i i
o1 o˧
o2 o˩
o3 o˥
u u
An associated silences.txt
defining a marker for speechless singing
(SIL and SPN markers will be added automatically):
SING
And the variants.txt
grouping tonal variants and also allowing
parameter sharing between the models for spoken noise and speechless
singing:
a1 a2
o1 o2 o3
SPN SING
6. Phonetic dictionary¶
A text file lexicon.txt
containing a list of words with their
phonetic transcription. The words should correspond to the words used
in the utterance transcriptions of the corpus; the phones should
correspond to the phones used in the original phoneset (not IPA) of
the corpus (see previous sections). The dictionary can contain more
words than necessary. Any word from the transcriptions that is not in
the dictionary will be ignored for ABX analyses and will be mapped by
kaldi to an out-of-vocabulary special item <unk>
transcribed as
SPN
(spoken noise, see previous section). If no entry <unk>
is
present in the dictionary it will be automatically added.
Depending on your purposes, the unit in the dictionary can be lexical words (e.g. for a corpus of read speech without detailed phonetic transcription), detailed pronunciation variants of words (e.g. for a corpus of spontaneous speech with detailed phonetic transcription), phonemes… The dictionary can also contain special entries for noise and silence if they are explicitly transcribed in the corpus, as in TIMIT for example.
Each line of the file contains the entry for a particular word, in the following format:
<word> <phone_1> <phone_2> ... <phone_n>
Here is an example lexicon containing two words and using the TIMIT phoneset:
anyone eh n iy w ah n
monitor m aa n ah t er
7. Time-alignments (Optional)¶
Not yet supported.
A text file called phone_alignment.txt
, containing a beginning and
end timestamp for each phone of each utterance in the corpus. The file
should contain one entry per line in the following format:
<utterance-id> <phone_start> <phone_end> <phone_symbol>
The timestamps are in seconds and are given relative to the beginning of each utterance. The phone symbols correspond to those used in the pronunciation dictionary (not to the IPA transcriptions).
Here is an example file with two utterances containing three and two phones respectively:
sp001-sentence001 1.211 1.256 a1
sp001-sentence001 1.256 1.284 t
sp001-sentence001 1.284 1.340 o3
sp109-sentence003 0.331 0.371 u
sp109-sentence003 0.371 0.917 sh
8. Language model (Optional)¶
Not yet supported.
9. Syllabification (Optional)¶
Not yet supported.
Forced Alignment¶
This tutorial covers the usage of abkhazia to do phone-level forced alignment on your own corpus of annotated audio files.
Prerequisites¶
Here’s what you need to have before being able to follow this tutorial:
- A set of audio files encoded as 16 kHz, 16-bit PCM WAV on which to run the alignment
- On these audio files, a set of segments corresponding to utterances. For each utterance, you’ll need a phonemic transcription (an easy way to get these is by using Phonemizer; see the sketch below)
It’s also recommended (yet optional) to have some kind of reference file where you can identify the speaker of each of your phonemized utterances.
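As a sketch, assuming the phonemize command-line tool from the Phonemizer package is installed, a text transcription can be turned into a phonemic one with something like:

echo "hello world" | phonemize -l en-us -b espeak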
Corpus format¶
The corpus format is the same as the one specified in the Speech corpus section above, except that two corpus files have a more specific format, namely text.txt and lexicon.txt. Here, text.txt is composed of the phonemic transcription of each utterance:

<utterance-id> <pho1> <pho2> ... <phoN>

and lexicon.txt is just a “phony” file containing phonemes mapped to themselves:

<pho1> <pho1>
<pho2> <pho2>
<pho3> <pho3>
...
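For instance, reusing phone symbols from the inventory examples above (the utterance ids utt001 and utt002 are hypothetical), text.txt could contain:

utt001 a1 t o3
utt002 u sh

and lexicon.txt would then contain the identity entries a1 a1, t t, o3 o3, u u and sh sh.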
Doing the Forced Alignment¶
Once you’ve gathered all the required files (cited above) in a corpus/ folder (the name is obviously arbitrary), you’ll want to validate the corpus to check that it conforms to Kaldi’s input format. Abkhazia luckily does that for us:

abkhazia validate corpus/
Then, we’ll compute the language model (actually, here, a phonetic model) for your dataset. Note that even though we set the model level (option -l) to “word”, this still works fine since all our words are phonemes:
abkhazia language corpus/ -l word -n 3 -v
We’ll now extract the MFCC features from the wav files:
abkhazia features mfcc corpus/ --cmvn
Then, using the language model and the extracted MFCCs, compute a triphone HMM-GMM acoustic model (the monophone model is needed as a first step):
abkhazia acoustic monophone -v corpus/ --force --recipe
abkhazia acoustic triphone -v corpus/
If you specified the speaker for each utterance, you can adapt your model per speaker:
abkhazia acoustic triphone-sa -v corpus/
And then, at last, we can compute the forced phonetic alignments:
abkhazia align corpus -a corpus/triphone-sa # if you computed the speaker-adapted triphones
abkhazia align corpus -a corpus/triphone # if you didn't
If everything went right, you should be able to find your alignment in
corpus/align/alignments.txt
. The file will have the following row structure:
<utt_id> <pho_start> <pho_end> <pho_name> <pho_symbol>
...
Note that each phoneme’s start and end time markers (in seconds) are relative to the utterance that contains it, not to the entire audio file.
Command-line tool¶
Once abkhazia is installed on your system, use it from a terminal with
the abkhazia
command. For now, type:
abkhazia --help
As you can see, the program is made of several subcommands, which are detailed below. It also reads a configuration file (the default one is installed with abkhazia; you can overload it by specifying the --config <config-file> option).
Commands¶
For more details on a specific command, type abkhazia <command>
--help
.
validate: [corpus]¶
Check if a corpus stored on disk is in a valid format and ready to be used by abkhazia.
prepare: [raw] -> [corpus]¶
Prepare a speech corpus from its raw distribution format to the
abkhazia format. Write the directory <corpus>/data.
split: [corpus] -> [corpus], [corpus]¶
Split a speech corpus into train and test sets. Write the directories <corpus>/train and <corpus>/test.
language: [corpus] -> [lm]¶
Generate a language model from a prepared corpus. Write the directory
<corpus>/language.
features: [corpus] -> [features]¶
Compute speech features on the given corpus. Write the directory
<corpus>/features.
acoustic: [corpus], [features] -> [model]¶
Train an acoustic model from a prepared corpus and its features. Write
the directory <corpus>/model.
align: [corpus], [model], [lm] -> [result]¶
Generate a forced-alignment from acoustic and language models. Write
the directory <corpus>/align.
decode: [corpus], [model], [lm] -> [result]¶
Decode a prepared corpus from an HMM-GMM model and a language model. Write the directory <corpus>/decode.
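As an illustration, a minimal session chaining these commands could look as follows (the paths and the choice of corpus are hypothetical; the flags follow the examples given in the Forced Alignment tutorial above, see abkhazia <command> --help for the exact interface):

abkhazia prepare buckeye /path/to/raw/buckeye -o ./corpus --verbose
abkhazia validate ./corpus
abkhazia language ./corpus -l word -n 3
abkhazia features mfcc ./corpus --cmvn
abkhazia acoustic monophone ./corpus
abkhazia acoustic triphone ./corpus
abkhazia align ./corpus -a ./corpus/triphone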
Python API¶
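The functionality of the command-line tool is also available as a Python library. See in particular the abkhazia.corpus package and the preparator classes mentioned in the Adding new corpora section above.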
License and copyright¶
This package is developed within the Bootphon project.
Copyright 2015, 2016 Thomas Schatz, Mathieu Bernard, Roland Thiolliere, Xuan-Nga Cao
abkhazia is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
abkhazia is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with abkhazia. If not, see http://www.gnu.org/licenses/.