Welcome to Alex Dialogue Systems Framework’s documentation!¶
The Alex Dialogue Systems Framework, or simply Alex, is a set of algorithms, classes, and tools that facilitate building spoken dialogue systems.
Contents¶
Index of the documentation written manually in the source code tree¶
Building the language model for the Public Transport Info telephone service (Czech)¶
*WARNING* To build the language model, you will need a machine with a lot of memory (more than 16GB RAM).
The data¶
To build the domain-specific language model, we use the approach described in Approach to bootstrapping the domain specific language models. So far, we have collected the following data:
- selected out-of-domain data - more than 2000 sentences
- bootstrap text - 289 sentences
- in-domain data - more than 9000 sentences (out of which about 900 sentences are used as development data)
Building the models¶
The models are built using the build.py script. It requires the following variables to be set:
bootstrap_text = "bootstrap.txt"
classes = "../data/database_SRILM_classes.txt"
indomain_data_dir = "indomain_data"
The variables are described as follows:
- bootstrap_text - the bootstrap.txt file contains handcrafted in-domain sentences.
- classes - the ../data/database_SRILM_classes.txt file is created by the database.py script in the alex/applications/PublicTransportInfoCS/data directory.
- indomain_data_dir - should include links to directories containing asr_transcribed.xml files with transcribed audio data.
The process of building/re-building the LM is:
cd ../data
./database.py dump
cd ../lm
./build.py
Distributions of the models¶
The final.* models are large. Therefore, they should be distributed online on demand using the online_update function. Please do not forget to place the models generated by the ./build.py script on the distribution servers.
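The on-demand distribution amounts to "fetch the model from the server if it is missing locally". Below is a minimal sketch of that idea; the function name mirrors the real helper, but the signature and the server URL are illustrative assumptions, not the actual Alex API:

```python
import os
import urllib.request

def online_update(path, server_url="https://example.org/models/"):
    """Ensure `path` exists locally, downloading it from the
    distribution server when missing (the URL is a placeholder).
    The real helper also compares timestamps to pick up updates."""
    if not os.path.exists(path):
        dirname = os.path.dirname(path)
        if dirname:
            os.makedirs(dirname, exist_ok=True)
        urllib.request.urlretrieve(server_url + path, path)
    return path
```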
Reuse of build.py¶
The build.py script can be easily generalised to a different language or different text data, e.g. the in-domain data.
Description of resource files for ASR¶
This directory contains acoustic models for different languages and recording conditions. It is assumed that only one acoustic model per language will be built.
However, one can build different acoustic models for different recording settings, e.g. one for VOIP and another for desktop microphone recordings.
Up to now, only VOIP acoustic models have been trained.
Description of resource files for VAD¶
Please note that to simplify the deployment of SDSs, the VAD is trained to be language independent. That means that the VAD classifies silence (noise, etc.) versus all sounds in any language.
At this moment, the alex/resources/vad/ directory contains only VAD models built using the VOIP audio signal. The created models include:
- GMM models
- NN models
More information about the process of creating the VAD models is available in Building a voice activity detector (VAD).
Please note that the NN VAD performs much better than the GMM VAD. The alex/resources/vad/ directory stores the models, but they should not be checked into the repository anymore. Instead, they should be placed on the online_update server and downloaded from it when they are updated. More on the online update is available in
Online distribution of resource files such as ASR, SLU, NLG models.
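Frame-level VAD, whether GMM- or NN-based, boils down to labelling each short audio frame as speech or non-speech. The following is a toy sketch using frame energy as the score; real models classify spectral features, and the threshold here is an arbitrary assumption:

```python
def classify_frames(frames, threshold=0.01):
    """Label each frame (a list of samples in [-1, 1]) as speech (True)
    or silence/noise (False) by thresholding its mean energy."""
    labels = []
    for frame in frames:
        energy = sum(s * s for s in frame) / len(frame)
        labels.append(energy > threshold)
    return labels
```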
Public Transport Info, Czech - telephone service¶
Running the system at UFAL with the full UFAL access¶
There are multiple configurations that can be used to run the system. In general, it depends on what components you want to use and on what telephone extension you want to run the system.
Within UFAL, we run the system using the following commands:
- vhub_live - deployment of our live system on our toll-free phone number, with the default configuration
- vhub_live_b1 - a backup deployment of the system above
- vhub_live_b2 - a backup deployment of the system above
- vhub_live_kaldi - a version of our live system explicitly using Kaldi ASR
To test the system we use:
- vhub_test - default test version of our system deployed on our test extension, logging locally into ../call_logs
- vhub_test_google_only - test version of our system on our test extension, using Google ASR, TTS, and Directions, logging locally into ../call_logs
- vhub_test_google_kaldi - test version of our system on our test extension, using Google TTS, Directions, and Kaldi ASR, logging locally into ../call_logs
- vhub_test_hdc_slu - default test version of our system deployed on our test extension, using the HDC SLU, logging locally into ../call_logs
- vhub_test_kaldi - default test version of our system deployed on our test extension, using Kaldi ASR, logging locally into ../call_logs
- vhub_test_kaldi_nfs - default test version of our system deployed on our test extension, using Kaldi ASR and logging to NFS
Running the system without the full UFAL access¶
Users outside UFAL can run the system using the following commands:
- vhub_private_ext_google_only - default version of our system deployed on the private extension specified in private_ext.cfg, using Google ASR, TTS, and Directions, logging locally into ../call_logs
- vhub_private_ext_google_kaldi - default version of our system deployed on the private extension specified in private_ext.cfg, using Google TTS, Directions, and Kaldi ASR, logging locally into ../call_logs
If you want to test the system on your private extension, then modify the private_ext.cfg config. You must set your SIP domain including the port, your user login, and your password (you can obtain a free extension at http://www.sipgate.co.uk).
Please make sure that you do not commit your login information into the repository.
config = {
'VoipIO': {
# default testing extension
'domain': "*:5066",
'user': "*",
'password': "*",
},
}
Also, you will have to create a “private” directory where you can store your private configurations. As the private default configuration is not part of the Git repository, please make your own empty version of the private default configuration as follows.
mkdir alex/resources/private
echo "config = {}" > alex/resources/private/default.cfg
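Conceptually, the default configuration and the private configuration are layered, with later files overriding earlier ones. The sketch below illustrates that layering with a recursive dictionary merge; the function is illustrative, not the actual Alex config machinery:

```python
def merge_configs(*configs):
    """Merge config dictionaries left to right; later ones win.
    Nested dictionaries are merged recursively, so a private config
    can override a single key such as a password."""
    merged = {}
    for cfg in configs:
        for key, value in cfg.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_configs(merged[key], value)
            else:
                merged[key] = value
    return merged
```

For example, merging the default VoipIO section with a private one keeps the default domain but replaces the credentials.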
UFAL Dialogue act scheme¶
The purpose of this document is to describe the structure and function of dialogue acts used in spoken dialogue systems developed at UFAL, MFF, UK, Czech Republic.
Definition of dialogue acts¶
In a spoken dialogue system, the observations and the system actions are represented by dialogue acts. Dialogue acts represent basic intents (such as inform, request, etc.) and the semantic content of the input utterance (e.g. type=hotel, area=east). In some cases, the value can be omitted, for example, where the intention is to query the value of a slot, e.g. request(food).
In the UFAL Dialogue Act Scheme (UDAS), a dialogue act (DA) is composed of one or more dialogue act items (DAIs). A dialogue act item is defined as a tuple composed of a dialogue act type, a slot name, and a slot value. Slot names and slot values are domain dependent; therefore, there can be many of them. In the examples which follow, the names of the slots and their values are drawn from an information-seeking application about restaurants, bars, and hotels. For example, in a tourist information domain, the slots can include “food” or “pricerange”, and the values can be, for instance, “Italian” and “Indian”, or “cheap”, “midpriced”, and “expensive”.
This can be described more formally as follows:
DA = (DAI)+
DAI = (DAT, SN, SV)
DAT = (ack, affirm, apology, bye, canthearyou, confirm,
iconfirm, deny, hangup, hello, help, inform, negate,
notunderstood, null, repeat, reqalts, reqmore, request,
restart, select, thankyou)
where SN denotes a slot name and SV denotes a slot value.
The idea of dialogue acts comes from the information state update (ISU) approach to defining a dialogue state. In the ISU approach, a dialogue act is understood as a set of deterministic operations on a dialogue state which result in a new, updated state. In the UFAL dialogue act scheme, the update is performed on the slot level.
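The DA = (DAI)+ structure above can be sketched as a small string parser that splits a dialogue act into (dialogue act type, slot name, slot value) triples. This is an illustration of the scheme only; the real Alex DialogueAct/DialogueActItem classes are richer:

```python
import re

def parse_da(da_string):
    """Split a DA such as "inform(food='Chinese')&request(addr)" into
    (type, slot, value) triples; slot and value may be None."""
    items = []
    for dai in da_string.split('&'):
        m = re.match(r"(\w+)\((.*)\)$", dai.strip())
        dat, payload = m.group(1), m.group(2)
        if not payload:
            items.append((dat, None, None))       # e.g. bye()
        elif '=' in payload:
            slot, value = payload.split('=', 1)
            items.append((dat, slot.strip(), value.strip().strip("'\"")))
        else:
            items.append((dat, payload, None))    # e.g. request(food)
    return items
```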
The following explains each dialogue act type:
ack - "Ok" - back channel
affirm - simple "Yes"
apology - apology for misunderstanding
bye - end of a dialogue – simple "Goodbye"
confirm - user tries to confirm some information
canthearyou - system or user does not hear the other party
deny - user denies some information
hangup - the user hangs up
hello - start of a dialogue – simple "Hi"
help - request for help
inform - user provides some information or constraint
negate - simple "No"
null - silence, empty sentence, something that is not possible to
interpret; does nothing. It can also be used, when converting a
dialogue act item confusion network into an N-best list, to hold the
probability mass connected with all the dialogue acts which were not
added to the N-best list - in other words, the probability mass of
the pruned DA hypotheses.
notunderstood - informs that the last input was not understood
repeat - request to repeat the last utterance
irepeat - repeats the last utterance
reqalts - ask for alternatives
reqmore - ask for more details
request - user requests some information
restart - request a restart of the dialogue
select - user or the system wants the other party to select between
two values for one slot
thankyou - simple "Thank you"
NOTE: With this set of acts, we cannot confirm that something is not equal to something, e.g. confirm(x!=y)
→ confirm(pricerange != 'cheap')
→ “Isn’t it cheap?”. If we used confirm(pricerange = 'cheap'),
it would mean “Is it cheap?”. In both cases, it is appropriate to react in the same way, e.g. with inform(pricerange='cheap')
or deny(pricerange = 'cheap').
NOTE: Please note that all slot values are always placed in quotes (").
Dialogue act examples¶
This section presents examples of dialogue acts:
ack() 'ok give me that one'
'ok great'
affirm() 'correct'
'erm yeah'
apology() 'sorry'
'sorry I did not get that'
bye() 'allright bye'
'allright then bye'
canthearyou() 'hallo'
'are you still there'
confirm(addr='main square') 'erm is that near the central the main square'
'is it on main square'
iconfirm(addr='main square') 'Ack, on main square,'
iconfirm(near='cinema') 'You want something near cinema'
deny(name='youth hostel') 'not the youth hostel'
deny(near='cinema') 'ok it doesn't have to be near the cinema'
hello() 'hello'
'hi'
'hiya please'
help() 'can you help me'
inform(addr='main square') 'main square'
inform(addr='dontcare') 'i don't mind the address'
inform(food='chinese') 'chinese'
'chinese food'
'do you have chinese food'
negate() 'erm erm no i didn't say anything'
'neither'
'no'
null() '' (empty sentence)
'abraka dabra' (something not interpretable)
repeat() 'can you repeat'
'could you repeat that'
'could you repeat that please'
reqalts() 'and anything else'
'are there any other options'
'are there any others'
reqmore() 'can you give me more details'
request(food) 'do you know what food it serves'
'what food does it serve'
request(music) 'and what sort of music would it play'
'and what type of music do they play in these bars'
restart() 'can we start again please'
'could we start again'
select(food="Chinese")&select(food="Italian")
'do you want Chinese or Italian food'
thankyou() 'allright thank you then i'll have to look somewhere else'
'erm great thank you'
If the system wants to inform that no venue matches the provided constraints, e.g. “There is no Chinese restaurant in a cheap price range in the city centre”, the system uses the inform(name='none')
dialogue act, as in:
Utterance: There is no Chinese restaurant in a cheap price range in the city centre
Dialogue act: inform(name='none')&inform(venue_type='restaurant')&inform(food_type='Chinese')&inform(price_range='cheap')
The following are examples of dialogue acts composed of several DAIs:
reqalts()&thankyou() 'no thank you somewhere else please'
request(price)&thankyou() 'thank you and how much does it cost'
'thank you could you tell me the cost'
affirm()&inform(area='south')&inform(music='jazz')&inform(type='bar')&request(name)
'yes i'd like to know the name of the bar in the south part of town that plays jazz music'
'yes please can you give me the name of the bar in the south part of town that plays jazz music'
confirm(area='central')&inform(name='cinema')
'is the cinema near the centre of town'
deny(music='pop')&inform(music='folk')
'erm i don't want pop music i want folk folk music'
hello()&inform(area='east')&inform(drinks='cocktails')&inform(near='park')&inform(pricerange='dontcare')&inform(type='hotel')
'hi i'd like a hotel in the east of town by the park the price doesn't matter but i'd like to be able to order cocktails'
An example dialogue from the tourist information domain is in the following table:
Turn | Transcription | Dialogue act |
---|---|---|
System | Hello. How may I help you? | hello() |
User | Hi, I am looking for a restaurant. | inform(venue="restaurant") |
System | What type of food would you like? | request(food) |
User | I want Italian. | inform(food="Italian") |
System | Did you say Italian? | confirm(food="Italian") |
User | Yes | affirm() |
Semantic Decoding and Ambiguity¶
Very often, there are many ways to map (interpret) a natural utterance into a dialogue act: sometimes because of the natural ambiguity of the sentence, sometimes because of speech recognition errors. Therefore, a semantic parser will generate multiple hypotheses. In this case, each hypothesis is assigned a probability, i.e. the likelihood of it being correct, and the dialogue manager resolves this ambiguity in the context of the dialogue (e.g. the other sentences).
For example, the utterance “I want an Italian restaurant erm no Indian” can be interpreted as:
inform(venue="restaurant")&inform(food="Italian")&deny(food="Indian")
or:
inform(venue="restaurant")&inform(food="Indian")
In the first case, the utterance is interpreted as the user wanting an Italian restaurant and not wanting an Indian one. In the second case, however, the user corrected what he just mistakenly said, i.e. he actually wants an Indian restaurant.
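Such competing readings are typically passed on as a scored N-best list of dialogue acts. Below is a minimal sketch of that hand-over; the pair representation and the scores are illustrative, not the Alex data structures:

```python
def normalize_nbest(hypotheses):
    """Renormalize (score, dialogue_act) pairs so the probabilities
    sum to one, and sort them best-first; the dialogue manager then
    resolves the ambiguity in the dialogue context."""
    total = sum(score for score, _ in hypotheses)
    return sorted(((score / total, da) for score, da in hypotheses),
                  reverse=True)
```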
Please remember that a semantic parser should interpret an utterance based only on the information present in the sentence. It is up to the dialogue manager to interpret it in the context of the whole dialogue:
inform(type=restaurant)&inform(food='Chinese')
'I want a Chinese restaurant'
inform(food='Chinese')
'I would like some Chinese food'
In the first case, the user explicitly says that he/she is looking for a restaurant. However, in the second case, the user only says that he/she is looking for some venue serving Chinese food, which can be either a restaurant or only a take-away.
Building a statistical SLU parser for a new domain¶
From experience, it appears that the easiest approach to building a statistical parser for a new domain is to start by building a handcrafted (rule-based) parser. There are several practical reasons for that:
- a handcrafted parser can serve as a prototype module for a dialogue system when no data is available,
- a handcrafted parser can serve as a baseline for testing data driven parsers,
- a handcrafted parser in information-seeking applications, if well implemented, achieves about 95% accuracy on transcribed speech, which is close to the accuracy that human annotators achieve,
- a handcrafted parser can be used to obtain automatic SLU annotation which can be later hand corrected by humans.
To build a data driven SLU, the following approach is recommended:
- after some data is collected, e.g. with a prototype of the dialogue system using a handcrafted parser, the audio from the collected calls is manually transcribed and then parsed using the handcrafted parser,
- the advantage of using automatic SLU annotations is that they are easy to obtain and reasonably accurate, only several percent lower than what one can get from human annotators,
- if better accuracy is needed, it is better to have the automatic semantic annotations fixed by humans,
- then a data-driven parser is trained using this annotation.
Note that the main benefit of data driven SLU methods comes from the ability to robustly handle erroneous input. Therefore, the data driven SLU should be trained to map the recognised speech to the dialogue acts (e.g. obtained by the handcrafted parser on the transcribed speech and then corrected by human annotator).
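The pairing described above (recognised speech as input, handcrafted parse of the transcription as target) can be sketched as follows; hdc_parse stands in for the handcrafted parser, and the dictionary keys are illustrative:

```python
def make_training_pairs(utterances, hdc_parse):
    """Build (asr_hypothesis, dialogue_act) training pairs: the target
    act comes from parsing the manual transcription ('trn'), while the
    input is the recognised speech ('asr'), so the trained SLU learns
    to recover the correct act from erroneous ASR output."""
    return [(u['asr'], hdc_parse(u['trn'])) for u in utterances]
```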
Comments¶
The previous sections described the general set of dialogue acts used in UFAL dialogue systems. However, the exact set of dialogue acts depends on the specific application domain and is defined by the domain-specific semantic parser.
The only requirement is that all the output of the parser must be accepted by the dialogue manager developed for the particular domain.
Appendix A: UFAL Dialogue acts¶
Act | Description |
---|---|
ack() | back channel – simple OK |
affirm() | acknowledgement - simple “Yes” |
apology() | apology for misunderstanding |
bye() | end of a dialogue |
canthearyou() | signalling a problem with the communication channel or that there is an unexpected silence |
confirm(x=y) | confirm that x equals y |
iconfirm(x=y) | implicitly confirm that x equals y |
deny(x=y) | deny some information, equivalent to inform(x != y) |
hangup() | end of a call because someone hung up |
hello() | start of a dialogue |
help() | provide context-sensitive help |
inform(x=y) | inform that x equals y |
inform(name=none) | inform that “there is no such entity that ...” |
negate() | negation - simple “No” |
notunderstood() | informs that the last input was not understood |
null() | silence, empty sentence, something that is not possible to interpret; does nothing |
repeat() | asks to repeat the last utterance |
irepeat() | repeats the last sentence uttered by the system |
reqalts() | request for alternative options |
reqmore() | request for more details about the current option |
request(x) | request for information about x |
restart() | restart the dialogue, forget all provided information |
select(x=y)&select(x=z) | select between two values of the same slot |
silence() | the user or the system does not say anything and remains silent |
thankyou() | simple thank you |
RepeatAfterMe (RAM) for Czech - speech data collection¶
This application is useful for bootstrapping speech data collection. It asks the caller to repeat sentences which are randomly sampled from a set of preselected sentences.
- The Czech sentences (sentences_cs.txt) are from Karel Capek’s plays Matka and RUR, and from the Prague Dependency Treebank.
- The Spanish sentences (sentences_es.txt) are taken from the Internet.
If you want to run ram_hub.py on a specific phone number, then specify the appropriate extension config:
$ ./ram_hub.py -c ram_hub_LANG.cfg ../../resources/private/ext-PHONENUMBER.cfg
After collecting the desired number of calls, use copy_wavs_for_transcription.py to extract the wave files from the call_logs subdirectory for transcription. The files will be copied into the RAM-WAVs directory.
These calls must be transcribed by the Transcriber or some similar software.
Building an SLU for the PTIen domain¶
Available data¶
At this moment, we only have data which were automatically generated using our handcrafted SLU (HDC SLU) parser on the transcribed audio. In general, the quality of the automatic annotation is very good.
The data can be prepared using the prepare_data.py script. It assumes that the indomain_data directory exists, with links to directories containing asr_transcribed.xml files. It then uses these files to extract the transcriptions and to generate automatic SLU annotations using the PTIENHDCSLU parser from the hdc_slu.py file.
The script generates the following files:
- *.trn : contains manual transcriptions
- *.trn.hdc.sem : contains automatic annotations of the transcriptions using the handcrafted SLU
- *.asr : contains ASR 1-best results
- *.asr.hdc.sem : contains automatic annotations of the 1-best ASR output using the handcrafted SLU
- *.nbl : contains ASR N-best results
- *.nbl.hdc.sem : contains automatic annotations of the N-best ASR output using the handcrafted SLU
The script accepts the --uniq parameter for fast generation of unique HDC SLU annotations. This is useful when tuning the HDC SLU.
The script also accepts the --fast parameter for fast, approximate preparation of all data. It approximates the HDC SLU output on an N-best list using the output obtained by parsing the 1-best ASR result.
Building the models¶
First, prepare the data. Link the directories with the in-domain data into the indomain_data directory. Then run the following command:
./prepare_data.py
Second, train and test the models.
./train.py && ./test.py && ./test_bootstrap.py
Third, look at the *.score files or compute the interesting scores by running:
./print_scores.sh
Future work¶
- The prepare_data.py script will have to use ASR, NBLIST, and CONFNET data generated by the latest ASR system instead of the logged ASR results, because the ASR can change over time.
- Condition the SLU DialogueActItem decoding on the previous system dialogue act.
Evaluation¶
Evaluation of ASR from the call log files¶
The current ASR performance computed from the call logs is as follows:
Please note that the scoring is implicitly ignoring all non-speech events.
Ref: all.trn
Tst: all.asr
|==============================================================================================|
| | # Sentences | # Words | Corr | Sub | Del | Ins | Err |
|----------------------------------------------------------------------------------------------|
| Sum/Avg | 9111 | 24728 | 56.15 | 16.07 | 27.77 | 1.44 | 45.28 |
|==============================================================================================|
The results above were obtained using the Google ASR.
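The Err column is the standard word error rate, (Sub + Del + Ins) / #Words. The following sketches the computation via Levenshtein alignment; it is illustrative, not the scoring tool used here (which also handles non-speech events):

```python
def wer(ref, hyp):
    """Word error rate in percent between a reference and a
    hypothesis sentence, via Levenshtein alignment over words."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = minimum edits aligning r[:i] with h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(h) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match/substitution
                           dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1)         # insertion
    return 100.0 * dp[len(r)][len(h)] / len(r)
```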
Evaluation of the minimum number of feature counts¶
Using 9111 training examples, we found that pruning should be set to
- min feature count = 3
- min classifier count = 4
to prevent overfitting.
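The count-based pruning above can be sketched as follows; the feature representation (a list of feature names per example) is an illustrative assumption:

```python
from collections import Counter

def prune_features(examples, min_feature_count=3):
    """Drop features occurring fewer than `min_feature_count` times
    across all training examples; rare features tend to cause
    overfitting. `examples` is a list of feature lists."""
    counts = Counter(f for feats in examples for f in feats)
    kept = {f for f, c in counts.items() if c >= min_feature_count}
    return [[f for f in feats if f in kept] for feats in examples]
```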
Cheating experiment: train and test on all data¶
Due to data sparsity issues, evaluation on proper test and development sets suffers from sampling errors. Therefore, here we present results when all the data are used as training data and the metrics are evaluated on the training data!
Using the ./print_scores.sh script, one can get scores for assessing the quality of the trained models. The results from the experiments are stored in the old.scores.* files. Please look at the results marked as DATA ALL ASR - *.
If the automatic annotations were correct, we could conclude that the F-measure of the HDC SLU parser on the 1-best output is higher than the F-measure on the N-best lists. This is confusing, as it suggests that decoding from N-best lists gives worse results than decoding from the 1-best ASR hypothesis.
Evaluation of TRN model on test data¶
The TRN model is trained on transcriptions and evaluated on transcriptions from the test data. Please look at the results marked as DATA TEST TRN - *. One can see that the performance of the TRN model on the TRN test data is NOT 100% perfect. This is probably due to the mismatch between the train and test data sets. Once more training data is available, we can expect better results.
Evaluation of ASR model on test data¶
The ASR model is trained on 1-best ASR output and evaluated on the 1-best ASR output from the test data. Please look at the results marked as DATA TEST ASR - *. The ASR model scores significantly better on the ASR test data when compared to the HDC SLU parser evaluated on the same ASR data. The improvement is about 20% in F-measure (absolute). This shows that an SLU trained on ASR data can be beneficial.
Evaluation of NBL model on test data¶
The NBL model is trained on N-best ASR output and evaluated on the N-best ASR output from the test data. Please look at the results marked as DATA TEST NBL - *. One can see that using N-best lists, even from the Google ASR, can help, though only a little (about 1%). When more data is available, more tests and more feature engineering can be done. However, we are more interested in extracting features from lattices or confusion networks.
For now, we have to wait for a working decoder that generates good lattices. The OpenJulius decoder is not suitable, as it crashes unexpectedly and therefore cannot be used in a real system.
Handling non-speech events in Alex¶
The document describes handling non-speech events in Alex.
ASR¶
The ASR can generate either:
- a valid utterance
- the empty sentence ("") to denote that the input was silence
- the _noise_ word to denote that the input was some noise or other sound which is not a regular word
- the _laugh_ word to denote that the input was laughter
- the _ehm_hmm_ word to denote that the input was an ehm or hmm sound
- the _inhale_ word to denote that the input was an inhale sound
- the _other_ word to denote that the input was something else that was lost during speech-processing approximations such as N-best list enumeration, or that the ASR did not provide any result. This is because we do not know what the input was; it can be either something important or something worth ignoring. As such, it deserves special treatment in the system.
SLU¶
The SLU can generate either:
- an ordinary dialogue act
- the null() act, which should be ignored by the DM; the system should respond with silence()
- the silence() act, which denotes that the user was silent; a reasonable system response is probably silence() as well
- the other() act, which denotes that the input was something else that was lost during processing
The SLU should map:
- "" (the empty sentence) to silence() - silence will be processed in the DM
- _noise_, _laugh_, _ehm_hmm_, and _inhale_ to null() - noise can be ignored in general
- _other_ to other() - other hypotheses will be handled by the DM, mostly by responding “I did not get that. Can you ... ?”
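The mapping above is essentially a lookup table over the non-speech tokens; anything else goes to the regular parser. A simplified illustration, not the actual Alex SLU code:

```python
# ASR non-speech tokens mapped to SLU dialogue acts; the empty
# string stands for the "empty sentence" silence word.
NONSPEECH_TO_DA = {
    '': 'silence()',
    '_noise_': 'null()',
    '_laugh_': 'null()',
    '_ehm_hmm_': 'null()',
    '_inhale_': 'null()',
    '_other_': 'other()',
}

def map_nonspeech(asr_hyp):
    """Return the dialogue act for a non-speech ASR hypothesis, or
    None when the input is ordinary speech for the regular parser."""
    return NONSPEECH_TO_DA.get(asr_hyp.strip())
```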
DM¶
The DM can generate either:
- a normal dialogue act
- the silence() dialogue act
The DM should map:
- null() to silence() - because the null() act denotes that the input should be ignored; however, there is a problem with this, so read the note below for the current workaround
- silence() to silence() or a normal dialogue act - the DM should be silent or ask the user “Are you still there?”
- other() to notunderstood() - to show the user that we did not understand the input and that the input should be rephrased instead of just being repeated
PROBLEM: As of now, neither the handcrafted nor the trained SLU can correctly classify the other() dialogue act; both have very low recall for this DA. Instead of the other() DA, they return the null() DA. Therefore, the null() act is, for now, processed in DMs as if it were the other() DA.
Public Transport Info, English - telephone service¶
Description¶
This application provides information about public transport connections in New York, in English. Just say the origin and destination stops and the application will find and tell you about the available connections. You can also specify a departure or arrival time if necessary. It offers bus, tram, and metro city connections, and bus and train inter-city connections.
The application is available at the telephone number 1-855-528-7350.
You can also:
- ask for help
- ask for a “restart” of the dialogue and start the conversation again
- end the call - for example, by saying “Good bye.”
- ask for repetition of the last sentence
- confirm or reject questions
- ask about the departure or destination station, or confirm it
- ask for the number of transits
- ask for the departure or arrival time
- ask for an alternative connection
- ask for a repetition of the previous connection, the first connection, the second connection, etc.
In addition, the application also provides information about:
- weather forecast
- current time
Public Transport Info, Czech - telephone service¶
Description¶
This application provides information about public transport connections in the Czech Republic, in Czech. Just say (in Czech) your origin and destination stops and the application will find and tell you about the available connections. You can also specify a departure or arrival time if necessary. It offers bus, tram, and metro city connections, and bus and train inter-city connections.
The application is available at the toll-free telephone number +420 800 899 998.
You can also:
- ask for help
- ask for a “restart” of the dialogue and start the conversation again
- end the call - for example, by saying “Good bye.”
- ask for repetition of the last sentence
- confirm or reject questions
- ask about the departure or destination station, or confirm it
- ask for the number of transits
- ask for the departure or arrival time
- ask for an alternative connection
- ask for a repetition of the previous connection, the first connection, the second connection, etc.
In addition, the application also provides information about:
- weather forecast
- the current time
Representation of semantics¶
Suggestion (MK): It would be better to treat the specification of hours and minutes separately. When they are put together, all the ways the whole time expression can be said have to be enumerated manually in the CLDB.
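The reason for the suggestion is combinatorial: when hours and minutes are handled jointly, every surface form of every hour must be paired with every surface form of every minute. A sketch of the blow-up (the surface forms are made-up examples):

```python
def enumerate_time_phrases(hour_forms, minute_forms):
    """Enumerate all joint surface forms of a time expression; the
    list grows as len(hour_forms) * len(minute_forms), whereas
    separate hour and minute slots only need the two lists."""
    return [h + " " + m for h in hour_forms for m in minute_forms]
```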
Building of acoustic models using HTK¶
In this document, we describe building acoustic models with the HTK toolkit using the provided scripts. These acoustic models can be used with the OpenJulius ASR decoder.
We build a different acoustic model for each language and acoustic condition pair – LANG_RCOND. At this time, we provide two sets of scripts for building English and Czech acoustic models using the VOIP data.
In general, the scripts for the language and acoustic condition LANG_RCOND can be described as follows:
./env_LANG_RCOND.sh - includes all necessary training parameters: e.g. the train and test data directories,
training options including cross word or word internal triphones, language model weights
./train_LANG_RCOND.sh - performs the training of acoustic models
./nohup_train_LANG_RCOND.sh - calls the training script using nohup and redirecting the output into the .log_* file
The training process stores some configuration files, the intermediate files, and the final models and evaluations in the model_LANG_RCOND directory:
model_LANG_RCOND/config - config contains the language or recording specific configuration files
model_LANG_RCOND/temp
model_LANG_RCOND/log
model_LANG_RCOND/train
model_LANG_RCOND/test
Training models for a new language¶
Scripts for Czech and English are already created. If you need models for a new language, you can start by copying all the original scripts and renaming them so as to reflect the new language in their names (substitute _en or _cs with your new language code). You can do this by issuing the following command (we assume $OLDLANG is set to either en or cs and $NEWLANG to your new language code):
htk $ find . -name "*_$OLDLANG*" |
    xargs -n1 bash -c "cp -rvn \$1 \${1/_$OLDLANG/_$NEWLANG}" bash
Having done this, references to the new files’ names have to be updated, too:
htk $ find . -name "*_$NEWLANG*" -type f -execdir \
    sed --in-place s/_$OLDLANG/_$NEWLANG/g '{}' \;
Furthermore, you need to adjust language-specific resources to the new language in the following ways:
- htk/model_voip_$NEWLANG/monophones0 - list all the phones to be recognised, plus the special sil phone.
- htk/model_voip_$NEWLANG/monophones1 - list all the phones to be recognised, plus the special sil and sp phones.
- htk/model_voip_$NEWLANG/tree_ques.hed - specify the phonetic questions to be used for building the decision tree for phone clustering (see [HTKBook], Section 10.5).
- htk/bin/PhoneticTranscriptionCS.pl - you can start from this script or use a custom one. The goal is to implement the orthography-to-phonetics mapping to obtain sequences of phones from the transcriptions you have.
- htk/common/cmudict.0.7a and htk/common/cmudict.ext - an alternative approach to the previous point: instead of programming the orthography-to-phonetics mapping, you can list it explicitly in a pronouncing dictionary.
Depending on the way you want to implement the mapping, set $OLDLANG to either cs or en.
To make the scripts work with your new files, you will have to update the references to the scripts you created. All scripts are stored in the htk/bin, htk/common, and htk directories as immediate children, so you can make the substitutions only in these files.
Credits and the licence¶
The scripts are based on the HTK Wall Street Journal Training Recipe written by Keith Vertanen (http://www.keithv.com/software/htk/). His code is released under the new BSD licence. The licence note is at http://www.keithv.com/software/htk/. As a result, we can re-license the code under the Apache 2.0 license.
The results¶
- total training data for voip_en is about 20 hours
- total training data for voip_cs is about 8 hours
- mixtures: 16 mixtures perform slightly better than 8 mixtures for voip_en
- there is no significant difference in alignment of transcriptions with -t 150 and -t 250
- the Julius ASR performance is about the same as that of HDecode
- HDecode works well when cross-word triphones are trained; however, the performance of HVite decreases significantly
- when only word-internal triphones are trained, HDecode still works; however, its performance is worse than that of HVite with a bigram LM
- word-internal triphones work well with the Julius ASR; do not forget to disable CCD (it does not need context handling, though it still uses triphones)
- there is not much gain from using the trigram LM in the CamInfo domain (about 1 %)
[HTKBook] | The HTK Book, version 3.4 |
Public Transport Info (Czech) – data¶
This directory contains the database used by the Czech Public Transport Info system, i.e. a list of public transportation stops, time expressions etc. that are understood by the system.
The main database module is located in database.py. You may obtain a dump of the database by running ./database.py dump.
To build all the needed generated files that are not versioned, run build_data.sh.
Contents of additional data files¶
Some of the data (for the less populous slots) is included directly in the code of database.py, but most of the data (e.g., stops and cities) is located in additional list files.
Resources used by public transport direction finders¶
The sources of the data that are loaded by the application are:
- cities.expanded.txt – list of known cities and towns in the Czech Rep. (tab-separated: slot value name + possible surface forms separated by semicolons; lines starting with ‘#’ are ignored)
- stops.expanded.txt – list of known stop names (same format)
- cities_stops.tsv – “compatibility table”: lists compatible city–stop pairs, one entry per line (city and stop are separated by tabs). Only the primary stop and city names are used here.
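For illustration, the line format described above can be parsed with a few lines of Python. The helper below is hypothetical, not part of the distribution:

```python
# Sketch of parsing the cities.expanded.txt / stops.expanded.txt format:
# one slot value per line, a tab, then surface forms separated by semicolons;
# lines starting with '#' are ignored.
def parse_expanded_line(line):
    """Return (slot_value, [surface forms]) or None for comments/blank lines."""
    line = line.strip()
    if not line or line.startswith('#'):
        return None
    value, _, forms = line.partition('\t')
    return value, [f.strip() for f in forms.split(';') if f.strip()]

example = "Praha\tpraha; praze; prahu"
print(parse_expanded_line(example))  # ('Praha', ['praha', 'praze', 'prahu'])
```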
The files cities.expanded.txt and stops.expanded.txt are generated from cities.txt and stops.txt using the expand_stops.py script (see the documentation in the file itself; you need to have the Morphodita Python bindings installed to successfully run this script). Please note that the surface forms in them are lowercased and do not include any punctuation (this is obtained by setting the -l and -p parameters of the expand_stops.py script).
Colloquial stop name variants that are added by hand are located in the stops-add.txt file and are appended to stops.txt before performing the expansion.
Additional resources for the CRWS/IDOS directions finder¶
Since the CRWS/IDOS directions finder uses abbreviated stop names that need to be spelled out in ALEX, there is an additional resource file loaded by the system:
idos_map.tsv
– a mapping from the slot value names (city + stop) to abbreviated CRWS/IDOS names (stop list + stop)
The convert_idos_stops.py script is used to expand all possible abbreviations and produce a mapping from/to the original CRWS/IDOS stop names as they appear, e.g., at the IDOS portal.
Resources used by the weather information service¶
The weather service uses one additional file:
cities_locations.tsv
– this file contains GPS locations of all cities in the Czech Republic.
Building a SLU for the PTIcs domain¶
Available data¶
At this moment, we only have data that was automatically generated by running our handcrafted SLU parser (HDC SLU) on the transcribed audio. In general, the quality of the automatic annotation is very good.
The data can be prepared using the prepare_data.py script. It assumes that the indomain_data directory exists and contains links to directories with asr_transcribed.xml files. It then uses these files to extract the transcriptions and to generate automatic SLU annotations using the PTICSHDCSLU parser from the hdc_slu.py file.
The script generates the following files:
- *.trn : contains manual transcriptions
- *.trn.hdc.sem : contains automatic annotations obtained from the transcriptions using the handcrafted SLU
- *.asr : contains ASR 1-best results
- *.asr.hdc.sem : contains automatic annotations obtained from the 1-best ASR output using the handcrafted SLU
- *.nbl : contains ASR N-best results
- *.nbl.hdc.sem : contains automatic annotations obtained from the N-best ASR output using the handcrafted SLU
The script accepts the --uniq parameter for fast generation of unique HDC SLU annotations. This is useful when tuning the HDC SLU.
Building the DAILogRegClassifier models¶
First, prepare the data. Link the directories with the in-domain data into the indomain_data directory. Then run the following command:

./prepare_data.py
Second, train and test the models.
cd ./dailogregclassifier
./train.py && ./test_trn.py && ./test_hdc.py && ./test_bootstrap_trn.py && ./test_bootsrap_hdc.py
Third, look at the *.score
files or compute the interesting scores by running:
./print_scores.sh
Future work¶
- Exploit ASR lattices instead of long N-best lists.
- Condition the SLU DialogueActItem decoding on the previous system dialogue act.
Evaluation¶
Evaluation of ASR from the call logs files¶
The current ASR performance computed from the call logs is as follows:
Please note that the scoring is implicitly ignoring all non-speech events.
Ref: all.trn
Tst: all.asr
|==============================================================================================|
| | # Sentences | # Words | Corr | Sub | Del | Ins | Err |
|----------------------------------------------------------------------------------------------|
| Sum/Avg | 9111 | 24728 | 56.15 | 16.07 | 27.77 | 1.44 | 45.28 |
|==============================================================================================|
The results above were obtained using the Google ASR.
Evaluation of the minimum number of feature counts¶
Using 9111 training examples, we found that pruning should be set to
- min feature count = 3
- min classifier count = 4
to prevent overfitting.
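To make the feature-count threshold concrete, here is a minimal sketch of count-based feature pruning; the function and variable names are illustrative, not the actual DAILogRegClassifier code, and only the min-feature-count threshold is shown:

```python
# Count-based feature pruning: drop features that occur in fewer than
# min_feature_count training examples to prevent overfitting.
from collections import Counter

def prune_features(feature_lists, min_feature_count=3):
    """Keep only features occurring in at least min_feature_count examples."""
    counts = Counter(f for feats in feature_lists for f in set(feats))
    kept = {f for f, c in counts.items() if c >= min_feature_count}
    return [[f for f in feats if f in kept] for feats in feature_lists]

data = [["a", "b"], ["a", "c"], ["a", "b"], ["a"]]
pruned = prune_features(data, min_feature_count=3)
# "a" occurs 4x, "b" 2x, "c" 1x -> only "a" survives the pruning
```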
Cheating experiment: train and test on all data¶
Due to data sparsity, the evaluation on proper test and development sets suffers from sampling errors. Therefore, we present here results where all data are used as training data and the metrics are evaluated on the training data!
Using the ./print_scores.sh script, one can get scores for assessing the quality of the trained models. The results from the experiments are stored in the old.scores.* files. Please look at the results marked as DATA ALL ASR - *.
If the automatic annotations were correct, we could conclude that the F-measure of the HDC SLU parser on the 1-best hypothesis is higher than its F-measure on N-best lists. This is confusing, as it suggests that decoding from N-best lists gives worse results than decoding from the 1-best ASR hypothesis.
Evaluation of TRN model on test data¶
The TRN model is trained on transcriptions and evaluated on transcriptions from the test data. Please look at the results marked as DATA TEST TRN - *. One can see that the performance of the TRN model on the TRN test data is NOT 100 % perfect. This is probably due to the mismatch between the training and test data sets. Once more training data is available, we can expect better results.
Evaluation of ASR model on test data¶
The ASR model is trained on 1-best ASR output and evaluated on the 1-best ASR output from the test data. Please look at the results marked as DATA TEST ASR - *. The ASR model scores significantly better on the ASR test data than the HDC SLU parser evaluated on the same data. The improvement is about 20 % in F-measure (absolute). This shows that an SLU model trained on ASR data can be beneficial.
Evaluation of NBL model on test data¶
The NBL model is trained on N-best ASR output and evaluated on the N-best ASR output from the test data. Please look at the results marked as DATA TEST NBL - *. One can see that using N-best lists, even from the Google ASR, can help, though only a little (about 1 %). When more data is available, more testing and more feature engineering can be done.
However, we are more interested in extracting features from lattices or confusion networks. For now, we have to wait for a working decoder that generates good lattices. The OpenJulius decoder is not suitable, as it crashes unexpectedly and therefore cannot be used in a real system.
Utils for building decoding graph HCLG¶
Summary¶
The build_hclg.sh script compiles the language model (LM) and acoustic model (AM) into files (e.g. HCLG) formatted for the Kaldi decoders.
The script extracts phone lists and sets from the lexicon, given the acoustic model (AM), the phonetic decision tree (tree) and the phonetic dictionary (lexicon).
The script silently assumes that the phone lists generated from the lexicon are the same as those used for training the AM. If they are not, the script crashes.
The use case: run the script with an AM trained on the full phonetic set for the given language, pass it the tree used for tying the phonetic set, and also give it your LM and the corresponding lexicon. The lexicon and the LM should also cover the full phonetic set for the given language.
The decode_indomain.py script uses HCLG.fst and the rest of the files generated by build_hclg.sh and performs decoding on pre-recorded wav files.
The reference speech transcriptions and the paths to the wav files are extracted from the collected call logs.
The wav files should come from one domain, and the LM used to build HCLG.fst should be from the same domain.
The decode_indomain.py script also evaluates the decoded transcriptions.
The Word Error Rate (WER), Real Time Factor (RTF) and other minor statistics are collected.
Dependencies of build_hclg.sh¶
The build_hclg.sh script requires the scripts listed below from $KALDI_ROOT/egs/wsj/s5/utils.
The utils scripts transitively use scripts from $KALDI_ROOT/egs/wsj/s5/steps.
The dependency is solved by the path.sh script, which creates the corresponding symlinks and adds the Kaldi binaries to your system path.
You just need to set the KALDI_ROOT variable and provide correct arguments.
The following scripts are needed from the symlinked utils directory:
* gen_topo.pl
* add_lex_disambig.pl
* apply_map.pl
* eps2disambig.pl
* find_arpa_oovs.pl
* make_lexicon_fst.pl
* remove_oovs.pl
* s2eps.pl
* sym2int.pl
* validate_dict_dir.pl
* validate_lang.pl
* parse_options.sh
The scripts in the list use Kaldi binaries, so you need Kaldi compiled on your system. The path.sh script adds the Kaldi binaries to the PATH and also creates symlinks to the utils and steps directories, where the helper scripts are located. You only need to set the $KALDI_ROOT variable.
Interactive tests and unit tests¶
Testing of Alex can be divided into interactive tests, which depend on some activity of a user, e.g. calling a specific phone number or listening to an audio file, and unit tests, which test specific properties of algorithms or libraries.
Interactive tests¶
This directory contains only (interactive) tests that cannot be automated and whose results must be verified by humans, e.g. playing or recording audio, or testing VoIP connections.
Unit tests¶
Note that unit tests should be placed in the same directory as the tested module and should be named test_*.py, e.g. test_module_name.py.
Using unittest module:
$ python -m unittest alex.test.test_string
This approach works everywhere but doesn’t support test discovery.
Using the nose test discovery framework, testing can be largely automated. Nose searches through the packages and runs every test. Tests must be named test_<something>.py and must not be executable. Tests don't have to be run from the project root; nose is able to find the project root on its own.
What should my unit tests look like?
- Use unittest module
- Name the test file
test_<something>.py
- Make the test file not executable
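Following these conventions, a minimal test file might look like the sketch below; the normalize function is a made-up stand-in for a module under test, which you would normally import instead:

```python
# test_normalize.py -- a minimal unit test following the conventions above.
# `normalize` is a hypothetical stand-in for the tested module's function.
import unittest

def normalize(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())

class TestNormalize(unittest.TestCase):
    def test_collapses_whitespace(self):
        self.assertEqual(normalize("hello   world"), "hello world")

    def test_strips_edges(self):
        self.assertEqual(normalize("  hi  "), "hi")
```

Run it with python -m unittest test_normalize (the file itself is not executable, as required by nose).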
Approach to bootstrapping the domain specific language models¶
**WARNING**: Please note that domain specific language models are built in ./alex/applications/*/lm
This text explains a simple approach to building domain specific language models, which can be different for every domain.
While an acoustic model can be built domain-independent, the language models (LMs) must be domain specific to ensure high ASR accuracy.
In general, building an in-domain LM is easy as long as one has enough in-domain training data. However, when the in-domain data is scarce, e.g. when deploying a new dialogue system, this task is difficult and some bootstrap solution is needed.
The approach described here builds on:
- some bootstrap text - probably handcrafted - which captures the main aspects of the domain
- LM classes - which cluster words into classes; these can be derived from some domain ontology. For example, all food types belong to the FOOD class and all public transport stops belong to the STOP class
- in-domain data - collected using some prototype or final system
- general out-of-domain data - for example Wikipedia - from which a subset of data similar to our in-domain data is selected
Then a simple process of building a domain specific language model can be described as follows:
- Append bootstrap text to the text extracted from the indomain data.
- Build a class based language model using the data generated in the previous step and the classes derived from the domain ontology.
- Score the general (domain independent) data using the LM build in the previous step.
- Select some sentences with the lowest perplexity given the class based language model.
- Append the selected sentences to the training data generated in the first step.
- Re-build the class based language model.
- Generate dictionaries.
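To make steps 3 and 4 concrete, here is a toy sketch of perplexity-based sentence selection. The real pipeline uses SRILM class-based n-gram models; the add-one-smoothed unigram model below only illustrates the selection principle, and all names are illustrative:

```python
# Toy sketch: score out-of-domain sentences with an in-domain LM and keep
# those with the lowest perplexity (here a smoothed unigram model stands in
# for the SRILM class-based n-gram model used in practice).
import math
from collections import Counter

def train_unigram(sentences):
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1                              # +1 for unseen words
    return lambda w: (counts[w] + 1) / (total + vocab)   # add-one smoothing

def perplexity(prob, sentence):
    words = sentence.split()
    logp = sum(math.log(prob(w)) for w in words)
    return math.exp(-logp / len(words))

indomain = ["from the main station", "to the main station please"]
general = ["the main station is closed", "stock prices fell sharply"]
prob = train_unigram(indomain)
selected = sorted(general, key=lambda s: perplexity(prob, s))[:1]
# the sentence sharing in-domain words ranks first
```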
Data for building general LMs¶
To get general out-of-domain text data, we use the free W2C – Web to Corpus – corpora available from the LINDAT project at: https://ufal-point.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0022-6133-9
Structure of each domain scripts¶
Each of the projects should contain:
- build.py - builds the final LMs, and computes perplexity of final LMs
Necessary files for the LM¶
For each domain the LM package should contain:
- ARPA trigram language model (final.tg.arpa)
- ARPA bigram language model (final.bg.arpa)
- HTK wordnet bigram language model (final.bg.wdnet)
- List of all words in the language model (final.vocab)
- Dictionary including all words in the language model, using a phone set compatible with the language specific acoustic model (final.dict - without pauses, and final.dict.sp_sil - with short and long pauses)
CamInfoRest¶
For more details please see alex.applications.CamInfoRest.lm.README
.
PTIcs¶
For more details please see Building the language model for the Public Transport Info telephone service (Czech).
Online distribution of resource files such as ASR, SLU, NLG models¶
Large binary files are difficult to store in git. Therefore, files such as resource files for ASR, SLU or NLG are distributed online and on-demand.
To use this functionality, you have to use the online_update(file_name) function from the alex.utils.config package.
The function checks whether the file exists locally and whether it is up-to-date. If it is missing or outdated, a new version is downloaded from the server.
The function returns the name of the downloaded file, which is equal to the input file name. As a result, the function is transparent: it can be used anywhere a file name must be entered.
The server is set to https://vystadial.ms.mff.cuni.cz/download/; however, it can be changed using the set_online_update_server(server_name) function from inside a config file, e.g. the (first) default config file.
Building of acoustic models using KALDI¶
In this document, we describe building of acoustic models
using the KALDI toolkit and the provided scripts.
These acoustic models can be used with the Kaldi decoders
and especially with the Python wrapper of LatgenFasterDecoder
which is integrated with Alex.
We build a different acoustic model for each language and acoustic condition pair – LANG_RCOND. At this time, we provide two sets of scripts for building English and Czech acoustic models using the VOIP data.
In general, the scripts can be described for the language and acoustic condition LANG_RCOND as follows:
Summary¶
- Requires a KALDI installation and a Linux environment. (Tested on Ubuntu 10.04, 12.04 and 12.10.) Note: we recommend the Pykaldi fork of Kaldi, because you will also need it for the Kaldi decoder integrated into Alex.
- Recipes deployed with the Kaldi toolkit are located at $KALDI_ROOT/egs/name_of_recipe/s[1-5]/. This recipe requires setting the $KALDI_ROOT variable so it can use Kaldi binaries and scripts from $KALDI_ROOT/egs/wsj/s5/.
Details¶
- The recommended settings are stored in env_LANG_RCOND.sh, e.g. env_voip_en.sh.
- We recommend adjusting the settings in the file env_LANG_RCOND_CUSTOM.sh, e.g. env_voip_en_CUSTOM.sh. See below. Do not commit this file to the git repository!
- Our scripts prepare the data in the expected format in the $WORK directory.
- Experiment files are stored in the $EXP directory.
- The symbolic links to $KALDI_ROOT/egs/wsj/s5/utils and $KALDI_ROOT/egs/wsj/s5/steps are created automatically.
- The files path.sh and cmd.sh are necessary for the utils and steps scripts. Do not relocate them!
- The language model (LM) is either built from the training data using SRILM or specified in env_LANG_RCOND.sh.
Example of env_voip_en_CUSTOM.sh
# use every utterance for the recipe; EVERY_N=10 is nice for debugging
export EVERY_N=1
# path to the compiled Kaldi library and scripts
export KALDI_ROOT=/net/projects/vystadial/lib/kronos/pykaldi/kaldi
export DATA_ROOT=/net/projects/vystadial/data/asr/cs/voip/
export LM_paths="build0 $DATA_ROOT/arpa_bigram"
export LM_names="build0 vystadialbigram"
export CUDA_VISIBLE_DEVICES=0 # only card 0 (Tesla on Kronos) will be used for DNN training
Running experiments¶
Before running the experiments, check that:

- you have the Kaldi toolkit compiled:
- http://github.com/UFAL-DSG/pykaldi (recommended Kaldi fork; tested, and necessary for further Alex integration)
- http://sourceforge.net/projects/kaldi/ (alternative, main Kaldi repository)

In order to compile Kaldi, we suggest:
# build openfst
pushd kaldi/tools
make openfst_tgt
popd
# download ATLAS headers
pushd kaldi/tools
make atlas
popd
# generate Kaldi makefile ``kaldi.mk`` and compile Kaldi
pushd kaldi/src
./configure
make && make test
popd
- you have SRILM compiled (this is needed for building a language model unless you supply your own LM in the ARPA format).
pushd kaldi/tools
# download the srilm.tgz archive from http://www.speech.sri.com/projects/srilm/download.html
./install_srilm.sh
popd
- the train_LANG_RCOND script will see the Kaldi scripts and binaries. Check, for example, that $KALDI_ROOT/egs/wsj/s5/utils/parse_options.sh is a valid path.
- in cmd.sh, you have switched to running the training on an SGE[*] grid if required (disabled by default) and njobs is less than the number of your CPU cores.
Start the recipe by running bash train_LANG_RCOND.sh.
[*] | Sun Grid Engine |
Extracting the results and trained models¶
The main script, bash train_LANG_RCOND.sh, performs not only the training of the acoustic models but also decoding.
The acoustic models are evaluated while the scripts run, and evaluation reports are printed to the standard output.
The local/results.py exp
command extracts the results from the $EXP
directory.
It is invoked at the end of the train_LANG_RCOND.sh
script.
If you want to use the trained acoustic model outside the prepared script,
you need to build the HCLG
decoding graph yourself. (See
http://kaldi.sourceforge.net/graph.html for general introduction to the FST
framework in Kaldi.)
The HCLG.fst
decoding graph is created by utils/mkgraph.sh
.
See run.sh
for details.
Credits and license¶
The scripts are based on the Voxforge KALDI recipe http://vpanayotov.blogspot.cz/2012/07/voxforge-scripts-for-kaldi.html. Both the original scripts and these scripts are licensed under the APACHE 2.0 license.
Building a voice activity detector (VAD)¶
This text describes how to build a voice activity detector (VAD) for Alex. This work builds a multilingual VAD; that means we do not have separate VADs for individual languages but rather only one. It appears that a NN VAD has the capacity to distinguish between non-speech and speech in any language.
As of now, we use a VAD based on neural networks (NNs) implemented in the Theano toolkit. The main advantage is that the same code can efficiently run on both CPUs and GPUs, and that Theano implements automatic differentiation. Automatic differentiation is very useful, especially when gradient descent techniques, such as stochastic gradient descent, are used for optimising the model parameters.
The old GMM code is still present, but it may not work, and its performance would be significantly worse than that of the current NN implementation.
Experiments and the notes for the NN VAD¶
- testing is performed on randomly sampled data points (20 %) from the entire set
- L2 regularisation must be very small; moreover, it does not help much
- instead of MFCCs, we use mel-filter bank coefficients only; the performance appears to be the same or even better
- as of 2014-09-19, the best compromise between model complexity and performance appears to be:
- 30 previous frames
- 15 next frames
- 512 hidden units
- 4 hidden layers
- tanh hidden layer activation
- 4x amplification of the central frame compared to the outer frames
- discriminative pre-training
- given this setup, we get about 95.3 % frame accuracy on about 27 million data points (the entire set)
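The input windowing listed above (previous and next frames stacked around an amplified central frame) can be sketched as follows; this pure-Python stand-in only illustrates the idea, not the actual Theano feature pipeline, and the parameter names are illustrative:

```python
# Assemble the NN input for frame t: concatenate n_prev previous frames,
# the current frame, and n_next next frames, scaling the central frame.
def stack_frames(frames, t, n_prev=30, n_next=15, central_gain=4.0):
    """Concatenate frames[t-n_prev .. t+n_next], amplifying the central frame."""
    window = []
    for i in range(t - n_prev, t + n_next + 1):
        j = min(max(i, 0), len(frames) - 1)      # clamp at the utterance edges
        gain = central_gain if i == t else 1.0
        window.extend(gain * x for x in frames[j])
    return window

frames = [[0.1], [0.2], [0.3]]                   # 3 frames, 1 coefficient each
vec = stack_frames(frames, t=1, n_prev=1, n_next=1)
# -> [0.1, 0.8, 0.3]: the central frame (0.2) amplified 4x
```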
Data¶
data_vad_sil # a directory with only silence, noise data and its mlf file
data_voip_cs # a directory where CS data reside and its MLF (phoneme alignment)
data_voip_en # a directory where EN data reside and its MLF (phoneme alignment)
model_voip # a directory where all the resulting models are stored.
Scripts¶
upload_models.sh # uploads all available models in ``model_voip`` onto the Alex online update server
train_voip_nn_theano_sds_mfcc.py # the main training script; see its help for more details
bulk_train_nn_theano_mbo_31M_sgd.sh # script with the currently ``optimal`` settings for VAD
Comments¶
To save time, especially across multiple experiments on the same data, we store the preprocessed speech parametrisation. The speech parametrisation is cached because it takes about 7 hours to produce, but only about 1 minute to load from a disk file. The model_voip directory stores this speech parametrisation in *.npc files.
Therefore, if new data is added, these NPC files must be deleted. If there are no NPC files, they are automatically generated from the available WAV files.
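The caching pattern can be sketched like this. The names are illustrative, and the actual serialisation format of the *.npc files is not necessarily pickle; the sketch only shows the compute-once, load-fast behaviour described above:

```python
# Cache the expensive speech parametrisation on disk: compute it once from
# the WAV data, then load the cached copy on subsequent runs.
import os
import pickle

def load_parametrisation(wav_path, parametrise):
    """Load cached features for wav_path, computing and caching them if absent."""
    npc_path = wav_path + ".npc"
    if os.path.exists(npc_path):        # cached: loads in about a minute
        with open(npc_path, "rb") as f:
            return pickle.load(f)
    feats = parametrise(wav_path)       # expensive: hours for the full data set
    with open(npc_path, "wb") as f:
        pickle.dump(feats, f)
    return feats
```

Deleting the cached file forces recomputation, which mirrors the rule above: delete the NPC files whenever new data is added.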
The data_voip_{cs,en} alignment files (MLF files) can be trained using the scripts in alex/alex/tools/htk or alex/alex/tools/kaldi. See the train_voip_{cs,en}.sh scripts in one of those directories. Note that the Kaldi scripts first store the alignment in ctm format and later convert it to mlf format.
Public Transport Info (English) – data¶
This directory contains the database used by the English Public Transport Info system, i.e. a list of public transportation stops, number expressions etc. that are understood by the system.
The main database module is located in database.py. You may obtain a dump of the database by running ./database.py dump.
To build all the needed generated files that are not versioned, run build_data.sh.
Contents of additional data files¶
Some of the data (for the less populous slots) is included directly in the code of database.py, but most of the data (e.g., stops and cities) is located in additional list files.
Resources used by public transport direction finders and weather service¶
The sources of the data that are loaded by the application are:
- cities.expanded.txt – list of known cities and towns in the USA (tab-separated: slot value name + possible forms separated by semicolons; lines starting with ‘#’ are ignored)
- states.expanded.txt – list of US state names (same format)
- stops.expanded.txt – list of known stop names (same format) in NY
- streets.expanded.txt – list of known street names (same format)
- boroughs.expanded.txt – list of known borough names (same format)
- cities.locations.csv – tab-separated list of known cities and towns, their state and geo location (longitude|latitude)
- stops.locations.csv – tab-separated list of stops, their cities and geo location (longitude|latitude)
- stops.borough.locations.csv – tab-separated list of stops, their boroughs and geo location (longitude|latitude)
- streets.types.locations.csv – tab-separated list of streets, their boroughs and type (Avenue, Street, Court etc.)
All of these files are generated from states-in.csv, cities-in.csv, stops-in.csv, streets-in.csv and boroughs-in.csv located at ./preprocessing/resources using the expand_states_script.py, expand_cities_script.py, expand_stops_script.py, expand_streets_script.py and expand_boroughs_script.py scripts, respectively.
Please note that all forms in the *.expanded.txt files are lowercased and do not include any punctuation.
Colloquial name variants that are added by hand are located in the ./preprocessing/resources/*-add.txt files for each slot and are appended during the expansion process.
The build_data.sh script combines all the expansion scripts mentioned above into one process.
Public Transport Info, English - telephone service¶
Running the system at UFAL with the full UFAL access¶
There are multiple configurations that can be used to run the system. In general, it depends on what components you want to use and on what telephone extension you want to run the system.
Within UFAL, we run the system using the following commands:
- vhub_mta1 - deployment of our live system on the 1-855-528-7350 phone number, with the default configuration
- vhub_mta2 - a system deployed to back up the system above
- vhub_mta3 - a system deployed to back up the system above
- vhub_mta_btn - a system deployed to back up the system above, accessible via the web page http://alex-ptien.com
To test the system we use:
- vhub_devel - default devel version of our system deployed on our test extension, logging locally into ../call_logs
Running the system without the full UFAL access¶
Users outside UFAL can run the system using the following commands:
- vhub_private_ext_google_only_hdc_slu - default version of our system deployed on the private extension specified in private_ext.cfg, using the HDC SLU, Google ASR, TTS and Directions, logging locally into ../call_logs
- vhub_private_ext_google_kaldi_hdc_slu - default version of our system deployed on the private extension specified in private_ext.cfg, using the HDC SLU, Google TTS and Directions, and KALDI ASR, logging locally into ../call_logs
If you want to test the system on your private extension, then modify the private_ext.cfg
config. You must set your
SIP domain including the port, user login, and password. Please make sure that you do not commit your login information
into the repository.
config = {
'VoipIO': {
# default testing extension
'domain': "*:5066",
'user': "*",
'password': "*",
},
}
Also, you will have to create a “private” directory where you can store your private configurations. As the private default configuration is not part of the Git repository, please make your own empty version of the private default configuration as follows.
mkdir alex/resources/private
echo "config = {}" > alex/resources/private/default.cfg
Alex modules¶
alex package¶
Subpackages¶
alex.applications package¶
Subpackages¶
A script that creates a compatibility table from a list of stops in a certain city and its neighborhood and a list of towns and cities.
Usage:
./add_cities_to_stops.py [-d “Main city”] stops.txt cities.txt cities_stops.tsv
alex.applications.PublicTransportInfoCS.data.add_cities_to_stops.add_cities_to_stops(cities, stops, main_city)[source]¶
alex.applications.PublicTransportInfoCS.data.add_cities_to_stops.get_city_for_stop(cities, stop, main_city)[source]¶
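For illustration, the kind of mapping such a compatibility-table builder produces can be sketched with a simple name-matching heuristic. This is NOT the actual Alex implementation, only a hypothetical stand-in with the same signature:

```python
# Hypothetical sketch: assign a stop to the city whose name prefixes the
# stop name, falling back to the main city otherwise.
def get_city_for_stop(cities, stop, main_city):
    """Pick the city a stop most likely belongs to."""
    for city in sorted(cities, key=len, reverse=True):   # prefer longest match
        if stop == city or stop.startswith(city + " "):
            return city
    return main_city      # no city name matches -> assume the main city

cities = ["Praha", "Brno"]
print(get_city_for_stop(cities, "Brno hlavni nadrazi", "Praha"))  # Brno
print(get_city_for_stop(cities, "Andel", "Praha"))                # Praha
```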
self cloning, automatic path configuration
copy this into any subdirectory of pypy from which scripts need to be run, typically all of the test subdirs. The idea is that any such script simply issues
import autopath
and this will make sure that the parent directory containing “pypy” is in sys.path.
If you modify the master “autopath.py” version (in pypy/tool/autopath.py) you can directly run it which will copy itself on all autopath.py files it finds under the pypy root directory.
This module always provides these attributes:
pypydir pypy root directory path this_dir directory where this autopath.py resides
Convert stops gathered from the IDOS portal into structures accepted by the PublicTransportInfoCS application.
Usage:
./convert_idos_stops.py cities.txt idos_stops.tsv stops.txt cites_stops.tsv idos_map.tsv
Input:
- cities.txt = list of all cities
- idos_stops.tsv = stops gathered from IDOS (format: “list_id<t>abbrev_stop”); the list ID is the name of the city for city public transit, “vlak” for trains and “bus” for buses
Output:
- stops.txt = list of all stops (unabbreviated)
- cities_stops.tsv = city-to-stop mapping
- idos_map.tsv = mapping from (city, stop) pairs into (list_id, abbrev_stop) used by IDOS
alex.applications.PublicTransportInfoCS.data.convert_idos_stops.expand_abbrevs(stop_name)[source]¶
Apply all abbreviation expansions to the given stop name and return all resulting variant names, starting with the ‘main’ variant.
alex.applications.PublicTransportInfoCS.data.convert_idos_stops.expand_numbers(stop_name)[source]¶
Spell out all numbers that appear as separate tokens in the name (separated by spaces).
A script that collects the locations of all the given cities using the Google Geocoding API.
Usage:
./get_cities_locations.py [-d delay] [-l limit] [-a] cities_locations-in.tsv cities_locations-out.tsv
-d = delay between requests in seconds (will be extended by a random period up to 1/2 of the original value)
-l = limit maximum number of requests
-a = retrieve all locations, even if they are already set
alex.applications.PublicTransportInfoCS.data.get_cities_location.get_google_coords(city)[source]¶
Retrieve (all possible) coordinates of a city using the Google Geocoding API.
alex.applications.PublicTransportInfoCS.data.get_cities_location.random() → x in the interval [0, 1).¶
alex.applications.PublicTransportInfoCS.data.ontology.add_slot_values_from_database(slot, category, exceptions=set([]))[source]¶
A simple script for adding new utterances along with their semantics to bootstrap.sem and bootstrap.trn.
Usage:
./add_to_bootsrap < input.tsv
The script expects input with tab-separated transcriptions + semantics (one utterance per line). It automatically generates the dummy ‘bootstrap_XXXX.wav’ identifiers and separates the transcription and semantics into two files.
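The splitting described above can be sketched as follows (the function name and the "wav-id => text" record layout are illustrative assumptions, not the script's actual API):

```python
# Split tab-separated "transcription<TAB>semantics" lines into parallel
# .trn and .sem records keyed by generated dummy wav identifiers.
def split_bootstrap(lines, start=1):
    trn, sem = [], []
    for i, line in enumerate(lines, start=start):
        transcription, semantics = line.rstrip("\n").split("\t", 1)
        wav_id = "bootstrap_%04d.wav" % i
        trn.append("%s => %s" % (wav_id, transcription))
        sem.append("%s => %s" % (wav_id, semantics))
    return trn, sem
```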
This script consolidates all input key files: it generates new key files ({old_name}.pruned) that contain only the entries common to all input key files.
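The consolidation step can be sketched like this (the key-file format is modelled here as a dict of key -> entry; the real files are parsed from disk):

```python
# An entry survives only if its key occurs in every input key file; each
# pruned result maps to a "<old_name>.pruned" output.
def consolidate(keyfiles):
    common = set.intersection(*(set(d) for d in keyfiles.values()))
    return {name + ".pruned": {k: d[k] for k in common}
            for name, d in keyfiles.items()}
```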
Various enums, semi-automatically adapted from the CHAPS CRWS enum list written in C#.
Comments come originally from the CRWS description and are in Czech.
-
alex.applications.PublicTransportInfoCS.crws_enums.
BEDS
¶ alias of
Enum
-
alex.applications.PublicTransportInfoCS.crws_enums.
CLIENTEXCEPTION_CODE
¶ alias of
Enum
-
alex.applications.PublicTransportInfoCS.crws_enums.
COMBFLAGS
¶ alias of
Enum
-
alex.applications.PublicTransportInfoCS.crws_enums.
COOR
¶ alias of
Enum
-
class
alex.applications.PublicTransportInfoCS.crws_enums.
CRCONST
[source]¶ -
DELAY_CD
= 'CD:'¶
-
DELAY_INTERN
= 'X{0}_{1}:'¶
-
DELAY_INTERN_EXT
= 'Y{0}_{1}:'¶
-
DELAY_TELMAX1
= 'TELMAX1:'¶
-
DELAY_ZSR
= 'ZSR:'¶
-
EXCEPTIONEXCLUSION_CD
= 'CD:'¶
-
-
alex.applications.PublicTransportInfoCS.crws_enums.
DELTAMAX
¶ alias of
Enum
-
alex.applications.PublicTransportInfoCS.crws_enums.
DEP_TABLE
¶ alias of
Enum
-
alex.applications.PublicTransportInfoCS.crws_enums.
EXFUNCTIONRESULT
¶ alias of
Enum
-
alex.applications.PublicTransportInfoCS.crws_enums.
FCS
¶ alias of
Enum
-
alex.applications.PublicTransportInfoCS.crws_enums.
LISTID
¶ alias of
Enum
-
alex.applications.PublicTransportInfoCS.crws_enums.
OBJECT_STATUS
¶ alias of
Enum
-
alex.applications.PublicTransportInfoCS.crws_enums.
REG
¶ alias of
Enum
-
alex.applications.PublicTransportInfoCS.crws_enums.
REMMASK
¶ alias of
Enum
-
alex.applications.PublicTransportInfoCS.crws_enums.
ROUTE_FLAGS
¶ alias of
Enum
-
alex.applications.PublicTransportInfoCS.crws_enums.
SEARCHMODE
¶ alias of
Enum
-
alex.applications.PublicTransportInfoCS.crws_enums.
ST
¶ alias of
Enum
-
alex.applications.PublicTransportInfoCS.crws_enums.
SVCSTATE
¶ alias of
Enum
-
alex.applications.PublicTransportInfoCS.crws_enums.
TIMETABLE_FLAGS
¶ alias of
Enum
-
alex.applications.PublicTransportInfoCS.crws_enums.
TRCAT
¶ alias of
Enum
-
alex.applications.PublicTransportInfoCS.crws_enums.
TRSUBCAT
¶ alias of
Enum
-
alex.applications.PublicTransportInfoCS.crws_enums.
TTDETAILS
¶ alias of
Enum
-
alex.applications.PublicTransportInfoCS.crws_enums.
TTERR
¶ alias of
Enum
-
alex.applications.PublicTransportInfoCS.crws_enums.
TTGP
¶ alias of
Enum
-
alex.applications.PublicTransportInfoCS.crws_enums.
TTINFODETAILS
¶ alias of
Enum
-
alex.applications.PublicTransportInfoCS.crws_enums.
TTLANG
¶ alias of
Enum
-
alex.applications.PublicTransportInfoCS.crws_enums.
VF
¶ alias of
Enum
-
exception
alex.applications.PublicTransportInfoCS.exceptions.
PTICSHDCPolicyException
[source]¶ Bases:
alex.components.dm.exceptions.DialoguePolicyException
-
class
alex.applications.PublicTransportInfoCS.hdc_slu.
DAIBuilder
(utterance, abutterance_lenghts=None)[source]¶ Bases:
object
Builds DialogueActItems with proper alignment to corresponding utterance words. When words are successfully matched using DAIBuilder, their indices in the utterance are added to alignment set of the DAI as a side-effect.
-
build
(act_type=None, slot=None, value=None)[source]¶ Produce DialogueActItem based on arguments and alignment from this DAIBuilder state.
-
ending_phrases_in
(phrases)[source]¶ Returns True if the utterance ends with one of the phrases
Parameters: phrases – a list of phrases to search for Return type: bool
-
-
class
alex.applications.PublicTransportInfoCS.hdc_slu.
PTICSHDCSLU
(preprocessing, cfg)[source]¶ Bases:
alex.components.slu.base.SLUInterface
-
abstract_utterance
(utterance)[source]¶ Return a list of possible abstractions of the utterance.
Parameters: utterance – an Utterance instance Returns: a list of abstracted utterance, form, value, category label tuples
-
handle_false_abstractions
(abutterance)[source]¶ Reverts false-positive abstractions.
Parameters: abutterance – the abstracted utterance Returns: the abstracted utterance without false positive abstractions
-
parse_1_best
(obs, verbose=False, *args, **kwargs)[source]¶ Parse an utterance into a dialogue act.
Return type: DialogueActConfusionNetwork
-
parse_ampm
(abutterance, cn)[source]¶ Detects AM/PM time expressions in the input abstract utterance.
Parameters: - abutterance – the input abstract utterance.
- cn – The output dialogue act item confusion network.
-
parse_city
(abutterance, cn)[source]¶ Detects cities in the input abstract utterance.
Parameters: - abutterance – the input abstract utterance.
- cn – The output dialogue act item confusion network.
-
parse_date_rel
(abutterance, cn)[source]¶ Detects the relative date in the input abstract utterance.
Parameters: - abutterance – the input abstract utterance.
- cn – The output dialogue act item confusion network.
-
parse_meta
(utterance, abutt_lenghts, cn)[source]¶ Detects all dialogue acts which do not generalise their slot values using CLDB.
NOTE: Use the DAIBuilder (the ‘dai’ variable) to match words and build the DialogueActItem, so that the DAI is aligned to the corresponding words. If matched words are not supposed to be aligned, use a PTICSHDCSLU matching method instead. Make sure to list negative conditions first, so that the following positive conditions are not added to the alignment when they should not be, e.g.: (not any_phrase_in(u, [‘dobrý den’, ‘dobrý večer’]) and dai.any_word_in(“dobrý”))
Parameters: - utterance – the input utterance
- cn – The output dialogue act item confusion network.
Returns: None
-
parse_non_speech_events
(utterance, cn)[source]¶ Processes non-speech events in the input utterance.
Parameters: - utterance – the input utterance
- cn – The output dialogue act item confusion network.
Returns: None
-
parse_number
(abutterance)[source]¶ Detects numbers in the input abstract utterance.
Number words that form a time expression are collapsed into a single TIME category word. Recognized time expressions (where FRAC, HOUR and MIN stand for fraction, hour and minute numbers respectively):
- FRAC [na] HOUR
- FRAC hodin*
- HOUR a FRAC hodin*
- HOUR hodin* a MIN minut*
- HOUR hodin* MIN
- HOUR hodin*
- HOUR [0]MIN
- MIN minut*
Words of the NUMBER category are assumed to be in a format parsable to int or float.
Parameters: abutterance (Utterance) – the input abstract utterance.
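A much simplified sketch of the collapsing step (the real parser operates on abstracted HOUR/MIN category words and covers all the patterns listed; here raw digits and only two of the patterns are used for illustration):

```python
import re

# Collapse "HOUR hodin* a MIN minut*" and "HOUR hodin*" into one TIME token.
def collapse_time(tokens):
    text = " ".join(tokens)
    text = re.sub(r"\b(\d+) hodin\w* a (\d+) minut\w*",
                  lambda m: "TIME=%s:%02d" % (m.group(1), int(m.group(2))),
                  text)
    text = re.sub(r"\b(\d+) hodin\w*", r"TIME=\1:00", text)
    return text.split()
```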
-
parse_stop
(abutterance, cn)[source]¶ Detects stops in the input abstract utterance.
Parameters: - abutterance – the input abstract utterance.
- cn – The output dialogue act item confusion network.
-
parse_task
(abutterance, cn)[source]¶ Detects the task in the input abstract utterance.
Parameters: - abutterance – the input abstract utterance.
- cn – The output dialogue act item confusion network.
-
parse_time
(abutterance, cn)[source]¶ Detects the time in the input abstract utterance.
Parameters: - abutterance – the input abstract utterance.
- cn – The output dialogue act item confusion network.
-
parse_train_name
(abutterance, cn)[source]¶ Detects the train name in the input abstract utterance.
Parameters: - abutterance –
- cn –
-
parse_vehicle
(abutterance, cn)[source]¶ Detects the vehicle (transport type) in the input abstract utterance.
Parameters: - abutterance – the input abstract utterance.
- cn – The output dialogue act item confusion network.
-
parse_waypoint
(abutterance, cn, wp_id, wp_slot_suffix, phr_wp_types, phr_in=None)[source]¶ Detects stops or cities in the input abstract utterance (called through parse_city or parse_stop).
Parameters: - abutterance – the input abstract utterance.
- cn – The output dialogue act item confusion network.
- wp_id – waypoint slot category label (e.g. “STOP=”, “CITY=”)
- wp_slot_suffix – waypoint slot suffix (e.g. “stop”, “city”)
- phr_wp_types – set of phrases for each waypoint type
- phr_in – phrases for ‘in’ waypoint type
-
-
alex.applications.PublicTransportInfoCS.hdc_slu.
ending_phrases_in
(utterance, phrases)[source]¶ Returns True if the utterance ends with one of the phrases
Parameters: - utterance – The utterance to search in
- phrases – a list of phrases to search for
Return type: bool
-
alex.applications.PublicTransportInfoCS.hdc_slu.
first_phrase_span
(utterance, phrases)[source]¶ Returns the span (start, end+1) of the first phrase from the given list that is found in the utterance. Returns (-1, -1) if no phrase is found.
Parameters: - utterance – The utterance to search in
- phrases – a list of phrases to be tried (in the given order)
Return type: tuple
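A minimal re-implementation sketch of the documented behaviour, taking the utterance and each phrase as lists of words:

```python
def first_phrase_span(utterance, phrases):
    # Try the phrases in the given order; return the word span (start, end+1)
    # of the first one found, or (-1, -1) when none matches.
    for phrase in phrases:
        n = len(phrase)
        for start in range(len(utterance) - n + 1):
            if utterance[start:start + n] == phrase:
                return (start, start + n)
    return (-1, -1)
```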
-
class
alex.applications.PublicTransportInfoCS.platform_info.
CRWSPlatformInfo
(crws_response, finder)[source]¶ Bases:
object
-
station_name_splitter
= <_sre.SRE_Pattern object>¶
-
A script that creates a CSV file containing the list of places from INPUT_FILE, with STRING_SAME_FOR_ALL in the second column; it can merge its output with an already existing OUTPUT_FILE unless the -c flag is set.
Usage: ./compatibility_script_manual --name OUTPUT_FILE --main-place STRING_SAME_FOR_ALL --list INPUT_FILE [-c]
-
alex.applications.PublicTransportInfoEN.data.preprocessing.compatibility_script_manual.
handle_compatibility
(file_in, file_out, main_place, no_cache=False)[source]¶
-
alex.applications.PublicTransportInfoEN.data.preprocessing.compatibility_script_manual.
main
()[source]¶
-
alex.applications.PublicTransportInfoEN.data.preprocessing.compatibility_script_manual.
read_prev_compatibility
(filename)[source]¶
A script that takes an MTA stops file, selects the important fields and saves them (works mainly with GTFS data). Usage:
./mta_to_csv.py [-m: main_city] [-o: output_file] stops.txt
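The field selection can be sketched like this (the GTFS column names stop_name, stop_lat and stop_lon are assumptions about the input; the script's real output format may differ):

```python
import csv
import io

# Keep only the stop name and coordinates from a GTFS stops.txt, and tag
# every stop with the main city given on the command line.
def mta_to_rows(stops_txt, main_city):
    reader = csv.DictReader(io.StringIO(stops_txt))
    return [(row["stop_name"], main_city, row["stop_lat"], row["stop_lon"])
            for row in reader]
```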
-
alex.applications.PublicTransportInfoEN.data.preprocessing.mta_to_csv.
average_same_stops
(same_stops)[source]¶
-
alex.applications.PublicTransportInfoEN.data.preprocessing.mta_to_csv.
extract_fields
(lines, header, main_city, skip_comments=True)[source]¶
-
alex.applications.PublicTransportInfoEN.data.preprocessing.mta_to_csv.
get_column_index
(header, caption, default)[source]¶
-
alex.applications.PublicTransportInfoEN.data.preprocessing.mta_to_csv.
load_list
(filename, skip_comments=True)[source]¶
-
alex.applications.PublicTransportInfoEN.data.preprocessing.mta_to_csv.
remove_duplicities
(lines)[source]¶
A script that takes MTA stops, splits them on special characters, and treats each resulting item as a street name.
-
alex.applications.PublicTransportInfoEN.data.preprocessing.stops_to_streets_experiment.
average_same_stops
(same_stops)[source]¶
-
alex.applications.PublicTransportInfoEN.data.preprocessing.stops_to_streets_experiment.
extract_stops
(lines)[source]¶
-
alex.applications.PublicTransportInfoEN.data.preprocessing.stops_to_streets_experiment.
get_column_index
(header, caption, default)[source]¶
-
alex.applications.PublicTransportInfoEN.data.preprocessing.stops_to_streets_experiment.
group_by_name
(data)[source]¶
-
alex.applications.PublicTransportInfoEN.data.preprocessing.stops_to_streets_experiment.
load_list
(filename)[source]¶
-
alex.applications.PublicTransportInfoEN.data.preprocessing.stops_to_streets_experiment.
main
()[source]¶
-
alex.applications.PublicTransportInfoEN.data.preprocessing.stops_to_streets_experiment.
remove_duplicities
(lines)[source]¶
A script that takes a US cities file (city state_code) and a state-codes file, and joins them.
Usage:
./us_cities_to_csv.py [-o: output_file] cities.txt state-codes.txt
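The join can be sketched as follows (the exact input layout is an assumption based on the usage line above):

```python
# Join "city STATE_CODE" lines with a state-code dictionary, replacing the
# trailing code with the full state name.
def join_cities_with_states(city_lines, state_codes):
    out = []
    for line in city_lines:
        city, code = line.strip().rsplit(" ", 1)
        out.append((city, state_codes.get(code, code)))
    return out
```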
-
alex.applications.PublicTransportInfoEN.data.preprocessing.us_cities_to_csv.
average_same_city
(same_stops)[source]¶
-
alex.applications.PublicTransportInfoEN.data.preprocessing.us_cities_to_csv.
extract_fields
(lines, header, state_dictionary, skip_comments=True)[source]¶
-
alex.applications.PublicTransportInfoEN.data.preprocessing.us_cities_to_csv.
get_column_index
(header, caption, default)[source]¶
-
alex.applications.PublicTransportInfoEN.data.preprocessing.us_cities_to_csv.
group_by_city_and_state
(data)[source]¶
-
alex.applications.PublicTransportInfoEN.data.preprocessing.us_cities_to_csv.
load_list
(filename, skip_comments=True)[source]¶
-
alex.applications.PublicTransportInfoEN.data.preprocessing.us_cities_to_csv.
load_state_code_dict
(file_state_codes, skip_comments=True)[source]¶
-
alex.applications.PublicTransportInfoEN.data.preprocessing.us_cities_to_csv.
remove_duplicities
(lines)[source]¶
A script that creates an expansion from a preprocessed list of boroughs
For usage, run expand_boroughs_script.py -h
-
alex.applications.PublicTransportInfoEN.data.expand_boroughs_script.
all_to_lower
(site_list)[source]¶
A script that creates an expansion from a preprocessed list of cities
For usage, run expand_cities_script.py -h
A script that creates an expansion from a preprocessed list of states
For usage, run expand_states_script.py -h
A script that creates an expansion from a list of stops
For usage, run expand_stops_script.py -h
-
alex.applications.PublicTransportInfoEN.data.expand_stops_script.
file_check
(filename, message=u'reading file')[source]¶
-
alex.applications.PublicTransportInfoEN.data.expand_stops_script.
get_column_index
(header, caption, default)[source]¶
-
alex.applications.PublicTransportInfoEN.data.expand_stops_script.
handle_compatibility
(file_in, file_out, no_cache=False)[source]¶
-
alex.applications.PublicTransportInfoEN.data.expand_stops_script.
handle_csv
(csv_in, csv_out, no_cache=False)[source]¶
-
alex.applications.PublicTransportInfoEN.data.expand_stops_script.
load_list
(filename, skip_comments=True)[source]¶
-
alex.applications.PublicTransportInfoEN.data.expand_stops_script.
merge
(primary, secondary, surpress_warning=True)[source]¶
-
alex.applications.PublicTransportInfoEN.data.expand_stops_script.
process_places
(places_in, place_out, places_add, no_cache=False)[source]¶
-
alex.applications.PublicTransportInfoEN.data.expand_stops_script.
read_compatibility
(filename)[source]¶
-
alex.applications.PublicTransportInfoEN.data.expand_stops_script.
read_expansions
(stops_expanded_file)[source]¶
-
alex.applications.PublicTransportInfoEN.data.expand_stops_script.
read_first_column
(filename, surpress_warning=True)[source]¶
-
alex.applications.PublicTransportInfoEN.data.expand_stops_script.
read_two_columns
(filename)[source]¶
-
alex.applications.PublicTransportInfoEN.data.ontology.
add_slot_values_from_database
(slot, category, exceptions=set([]))[source]¶
-
alex.applications.PublicTransportInfoEN.data.ontology.
load_compatible_values
(fname, slot1, slot2)[source]¶
A simple script for adding new utterances along with their semantics to bootstrap.sem and bootstrap.trn.
Usage:
./add_to_bootsrap < input.tsv
The script expects input with tab-separated transcriptions + semantics (one utterance per line). It automatically generates the dummy ‘bootstrap_XXXX.wav’ identifiers and separates the transcription and semantics into two files.
-
class
alex.applications.PublicTransportInfoEN.directions.
Directions
(**kwargs)[source]¶ Bases:
alex.applications.PublicTransportInfoEN.directions.Travel
Ancestor class for transit directions, consisting of several routes.
-
class
alex.applications.PublicTransportInfoEN.directions.
DirectionsFinder
[source]¶ Bases:
object
Abstract ancestor for transit direction finders.
-
class
alex.applications.PublicTransportInfoEN.directions.
GoogleDirections
(input_json={}, **kwargs)[source]¶ Bases:
alex.applications.PublicTransportInfoEN.directions.Directions
Traffic directions obtained from Google Maps API.
-
class
alex.applications.PublicTransportInfoEN.directions.
GoogleDirectionsFinder
(cfg)[source]¶ Bases:
alex.applications.PublicTransportInfoEN.directions.DirectionsFinder
,alex.tools.apirequest.APIRequest
Transit direction finder using the Google Maps query engine.
-
class
alex.applications.PublicTransportInfoEN.directions.
GoogleRoute
(input_json)[source]¶ Bases:
alex.applications.PublicTransportInfoEN.directions.Route
-
class
alex.applications.PublicTransportInfoEN.directions.
GoogleRouteLeg
(input_json)[source]¶ Bases:
alex.applications.PublicTransportInfoEN.directions.RouteLeg
-
class
alex.applications.PublicTransportInfoEN.directions.
GoogleRouteLegStep
(input_json)[source]¶ Bases:
alex.applications.PublicTransportInfoEN.directions.RouteStep
-
VEHICLE_TYPE_MAPPING
= {u'FUNICULAR': u'cable_car', u'COMMUTER_TRAIN': u'train', u'INTERCITY_BUS': u'bus', u'METRO_RAIL': u'tram', u'BUS': u'bus', u'SHARE_TAXI': u'bus', u'RAIL': u'train', u'Long distance train': u'train', u'CABLE_CAR': u'cable_car', u'Train': u'train', u'TRAM': u'tram', u'HEAVY_RAIL': u'train', u'OTHER': u'dontcare', u'SUBWAY': u'subway', u'TROLLEYBUS': u'bus', u'FERRY': u'ferry', u'GONDOLA_LIFT': u'ferry', u'MONORAIL': u'monorail', u'HIGH_SPEED_TRAIN': u'train'}¶
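The mapping above normalises Google transit vehicle labels into the system's internal vocabulary. A lookup helper with a fallback might look like this (a sketch using a subset of the mapping; the "dontcare" default is an assumption mirroring the OTHER entry):

```python
VEHICLE_TYPES = {
    "HEAVY_RAIL": "train",
    "COMMUTER_TRAIN": "train",
    "SUBWAY": "subway",
    "TRAM": "tram",
    "BUS": "bus",
    "FERRY": "ferry",
    "OTHER": "dontcare",
}

def normalise_vehicle(gmaps_type):
    # Unknown labels fall back to "dontcare", like the OTHER entry.
    return VEHICLE_TYPES.get(gmaps_type.upper(), "dontcare")
```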
-
-
class
alex.applications.PublicTransportInfoEN.directions.
Route
[source]¶ Bases:
object
Ancestor class for one transit direction route.
-
class
alex.applications.PublicTransportInfoEN.directions.
RouteLeg
[source]¶ Bases:
object
One traffic directions leg.
-
class
alex.applications.PublicTransportInfoEN.directions.
RouteStep
(travel_mode)[source]¶ Bases:
object
One transit directions step – walking or using public transport. Data members: travel_mode – TRANSIT / WALKING
- For TRANSIT steps:
departure_stop, departure_time, arrival_stop, arrival_time; headsign – direction of the transit line; vehicle – type of the transit vehicle (tram, subway, bus); line_name – name or number of the transit line
- For WALKING steps:
duration – estimated walking duration (seconds)
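The data members listed above can be pictured as the following structure (an illustrative dataclass, not the framework's actual class):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RouteStepSketch:
    travel_mode: str                      # "TRANSIT" or "WALKING"
    # TRANSIT-only fields:
    departure_stop: Optional[str] = None
    departure_time: Optional[str] = None
    arrival_stop: Optional[str] = None
    arrival_time: Optional[str] = None
    headsign: Optional[str] = None        # direction of the transit line
    vehicle: Optional[str] = None         # tram / subway / bus
    line_name: Optional[str] = None       # name or number of the line
    # WALKING-only field:
    duration: Optional[int] = None        # estimated walking time in seconds
```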
-
MODE_TRANSIT
= u'TRANSIT'¶
-
MODE_WALKING
= u'WALKING'¶
-
exception
alex.applications.PublicTransportInfoEN.exceptions.
PTIENHDCPolicyException
[source]¶ Bases:
alex.components.dm.exceptions.DialoguePolicyException
-
class
alex.applications.PublicTransportInfoEN.hdc_policy.
PTIENHDCPolicy
(cfg, ontology)[source]¶ Bases:
alex.components.dm.base.DialoguePolicy
The handcrafted policy for the PTI-EN system.
-
DEFAULT_AMPM_TIMES
= {u'night': u'00:00', u'evening': u'18:00', u'pm': u'15:00', u'am': u'10:00', u'morning': u'06:00'}¶
-
DESTIN
= u'FINAL_DEST'¶
-
ORIGIN
= u'ORIGIN'¶
-
backoff_action
(ds)[source]¶ Generate a random backoff dialogue act in case we don’t know what to do.
Parameters: ds – The current dialogue state Return type: DialogueAct
-
check_city_state_conflict
(in_city, in_state)[source]¶ Check for conflicts in the given city and state. Return an apology() DA if the state and city are incompatible.
Parameters: - in_city – city slot value
- in_state – state slot value
Return type: DialogueAct Returns: apology dialogue act in case of conflict, or None
-
check_directions_conflict
(wp)[source]¶ Check for conflicts in the given waypoints. Return an apology() DA if the origin and the destination are the same, or if a city is not compatible with the corresponding stop.
Parameters: wp – wayponts of the user’s connection query Return type: DialogueAct Returns: apology dialogue act in case of conflict, or None
-
confirm_info
(tobe_confirmed_slots)[source]¶ Return a DA confirming only one slot from the slots to be confirmed: the slot with the most probable value among all slots to be confirmed.
Parameters: tobe_confirmed_slots – A dictionary with keys for all slots that should be confirmed, along with their values Return type: DialogueAct
-
filter_iconfirms
(da)[source]¶ Filter implicit confirms if the same information is uttered in an inform dialogue act item. Also filter implicit confirms for stop names equaling city names. Also check if the stop and city names are equal!
Parameters: da – unfiltered dialogue act Returns: filtered dialogue act
-
gather_connection_info
(ds, accepted_slots)[source]¶ Return a DA requesting further information needed to search for traffic directions and a dictionary containing the known information. Infers city names based on stop names and vice versa.
If the request DA is empty, the search for directions may be commenced immediately.
Parameters: ds – The current dialogue state Return type: DialogueAct, dict
-
gather_time_info
(ds, accepted_slots)[source]¶ Ensures that when in_city is specified, the in_state slot is properly filled. If needed, a Request DA is formed for the missing in_state slot.
Returns the Request DA and in_state. If the request DA is empty, the search for current_time may be commenced immediately.
Parameters: ds – The current dialogue state,
-
gather_weather_info
(ds, accepted_slots)[source]¶ Ensures that the in_city and in_state slots are properly filled. If needed, a Request DA is formed for the missing slots.
Returns the Request DA and a WeatherPoint (information about the place). If the request DA is empty, the search for weather may be commenced immediately.
Parameters: ds – The current dialogue state,
-
get_accepted_mpv
(ds, slot_name, accepted_slots)[source]¶ Return a slot’s ‘mpv()’ (most probable value) if the slot is accepted, and return ‘none’ otherwise. Also, convert a mpv of ‘*’ to ‘none’ since we don’t know how to interpret it.
Parameters: - ds – Dialogue state
- slot_name – The name of the slot to query
- accepted_slots – The currently accepted slots of the dialogue state
Return type: string
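The accept-or-'none' logic reads roughly like this (the dialogue state is mocked here as a plain dict of slot -> most probable value; not the framework's actual interface):

```python
def accepted_mpv(mpvs, slot_name, accepted_slots):
    # Only accepted slots expose their value; a wildcard '*' is treated as
    # unknown because we do not know how to interpret it.
    if slot_name not in accepted_slots:
        return "none"
    value = mpvs.get(slot_name, "none")
    return "none" if value == "*" else value
```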
-
get_an_alternative
(ds)[source]¶ Return an alternative route, if there is one, or ask for origin stop if there has been no route searching so far.
Parameters: ds – The current dialogue state Return type: DialogueAct
-
get_confirmed_info
(confirmed_slots, ds, accepted_slots)[source]¶ Return a DA containing information about all slots being confirmed by the user (confirm/deny).
Update the current dialogue state regarding the information provided.
WARNING This confirms only against values in the dialogue state, however, it should (also in some cases) confirm against the results obtained from database, e.g. departure_time slot.
Parameters: - ds – The current dialogue state
- confirmed_slots – A dictionary with keys for all slots being confirmed, along with their values
Return type:
-
get_connection_res_da
(ds, ludait, slots_being_requested, slots_being_confirmed, accepted_slots, changed_slots, state_changed)[source]¶ Handle the public transport connection dialogue topic.
Parameters: ds – The current dialogue state Return type: DialogueAct
-
get_current_time_res_da
(ds, accepted_slots, state_changed)[source]¶ Generates a dialogue act informing about the current time. Return type: DialogueAct
-
get_da
(dialogue_state)[source]¶ The main policy decisions are made here. For each action, some set of conditions must be met; these conditions depend on the action.
Parameters: dialogue_state – the belief state provided by the tracker Returns: a dialogue act - the system action
-
get_default_stop_for_city
(city)[source]¶ Return a ‘default’ stop based on the city name (main bus/train station).
Parameters: city – city name (unicode) Return type: unicode
-
get_directions
(ds, route_type=u'true', check_conflict=False)[source]¶ Retrieve Google directions, save them to dialogue state and return corresponding DAs.
Responsible for the interpretation of AM/PM time expressions.
Parameters: - ds – The current dialogue state
- route_type – a label for the found route (to be passed on to say_directions())
- check_conflict – If true, will check if the origin and destination stops are different and issue a warning DA if not.
Return type:
-
get_iconfirm_info
(changed_slots)[source]¶ Return a DA containing all needed implicit confirms.
Implicitly confirm all slots provided but not yet confirmed.
This also includes slots changed during the conversation.
Parameters: changed_slots – A dictionary with keys for all slots that have not been implicitly confirmed, along with their values Return type: DialogueAct
-
get_requested_alternative
(ds, slots_being_requested, accepted_slots)[source]¶ Return the requested route (or inform about not finding one).
Parameters: ds – The current dialogue state Return type: DialogueAct
-
get_requested_info
(requested_slots, ds, accepted_slots)[source]¶ Return a DA containing information about all requested slots.
Parameters: - ds – The current dialogue state
- requested_slots – A dictionary with keys for all requested slots and the correct return values.
Return type:
-
get_weather
(ds, ref_point=None)[source]¶ Retrieve weather information according to the current dialogue state. Infers state names based on city names and vice versa.
Parameters: ds – The current dialogue state Return type: DialogueAct
-
get_weather_res_da
(ds, ludait, slots_being_requested, slots_being_confirmed, accepted_slots, changed_slots, state_changed)[source]¶ Handle the dialogue about weather.
Parameters: - ds – The current dialogue state
- slots_being_requested – The slots currently requested by the user
Return type:
-
interpret_time
(time_abs, time_ampm, time_rel, date_rel, lta_time)[source]¶ Interpret time, given current dialogue state most probable values for relative and absolute time and date, plus the corresponding last-talked-about value.
Returns: the inferred time value + flag indicating the inferred time type (‘abs’ or ‘rel’) Return type: tuple(datetime, string)
-
process_directions_for_output
(dialogue_state, route_type)[source]¶ Return DAs for the directions in the current dialogue state. If the directions are not valid (nothing found), delete their object from the dialogue state and return apology DAs.
Parameters: - dialogue_state – the current dialogue state
- route_type – the route type requested by the user (“last”, “next” etc.)
Return type:
-
req_arrival_time
(dialogue_state)[source]¶ Return a DA informing about the arrival time at the destination stop of the last recommended connection.
-
req_arrival_time_rel
(dialogue_state)[source]¶ Return a DA informing about the relative arrival time at the destination stop of the last recommended connection.
-
req_departure_time
(dialogue_state)[source]¶ Generates a dialogue act informing about the departure time from the origin stop of the last recommended connection.
Return type: DialogueAct
-
req_departure_time_rel
(dialogue_state)[source]¶ Return a DA informing the user about the relative time until the last recommended connection departs.
-
req_distance
(dialogue_state)[source]¶ Return a DA informing the user about the distance and number of stops in the last recommended connection.
-
req_duration
(dialogue_state)[source]¶ Return a DA informing about journey time to the destination stop of the last recommended connection.
-
req_from_stop
(ds)[source]¶ Generates a dialogue act informing about the origin stop of the last recommended connection.
TODO: this gives too much information. It might be worth splitting this into more dialogue acts and letting the user ask for the individual pieces of information; the good thing is that it would lead to longer dialogues.
Return type: DialogueAct
-
req_num_transfers
(dialogue_state)[source]¶ Return a DA informing the user about the number of transfers in the last recommended connection.
-
req_time_transfers
(dialogue_state)[source]¶ Return a DA informing the user about the transfer places and the time needed for the transfer in the last recommended connection.
-
req_to_stop
(ds)[source]¶ Return a DA informing about the destination stop of the last recommended connection.
-
reset_on_change
(ds, changed_slots)[source]¶ Resets slots which depend on the changed slots.
Parameters: - ds – dialogue state
- changed_slots – slots changed in the last turn
-
select_info
(tobe_selected_slots)[source]¶ Return a DA containing a select act for the two most probable values of a single slot from the slots to be used for the select DAI.
Parameters: tobe_selected_slots – A dictionary with keys for all slots for which the two most probable values should be selected Return type: DialogueAct
-
-
class
alex.applications.PublicTransportInfoEN.hdc_slu.
DAIBuilder
(utterance, abutterance_lenghts=None)[source]¶ Bases:
object
Builds DialogueActItems with proper alignment to corresponding utterance words. When words are successfully matched using DAIBuilder, their indices in the utterance are added to alignment set of the DAI as a side-effect.
-
build
(act_type=None, slot=None, value=None)[source]¶ Produce DialogueActItem based on arguments and alignment from this DAIBuilder state.
-
ending_phrases_in
(phrases)[source]¶ Returns True if the utterance ends with one of the phrases
Parameters: phrases – a list of phrases to search for Return type: bool
-
-
class
alex.applications.PublicTransportInfoEN.hdc_slu.
PTIENHDCSLU
(preprocessing, cfg)[source]¶ Bases:
alex.components.slu.base.SLUInterface
-
abstract_utterance
(utterance)[source]¶ Return a list of possible abstractions of the utterance.
Parameters: utterance – an Utterance instance Returns: a list of abstracted utterance, form, value, category label tuples
-
handle_false_abstractions
(abutterance)[source]¶ Revert false positives of the abstraction
Parameters: abutterance – the abstracted utterance Returns: the abstracted utterance without false positive abstractions
-
parse_1_best
(obs, verbose=False, *args, **kwargs)[source]¶ Parse an utterance into a dialogue act.
Return type: DialogueActConfusionNetwork
-
parse_ampm
(abutterance, cn)[source]¶ Detects the ampm in the input abstract utterance.
Parameters: - abutterance – the input abstract utterance.
- cn – The output dialogue act item confusion network.
-
parse_borough
(abutterance, cn)[source]¶ Detects stops in the input abstract utterance.
Parameters: - abutterance – the input abstract utterance.
- cn – The output dialogue act item confusion network.
-
parse_city
(abutterance, cn)[source]¶ Detects stops in the input abstract utterance.
Parameters: - abutterance – the input abstract utterance.
- cn – The output dialogue act item confusion network.
-
parse_date_rel
(abutterance, cn)[source]¶ Detects the relative date in the input abstract utterance.
Parameters: - abutterance – the input abstract utterance.
- cn – The output dialogue act item confusion network.
-
parse_meta
(utterance, abutt_lenghts, cn)[source]¶ Detects all dialogue acts which do not generalise their slot values using the CLDB.
- NOTE: Use DAIBuilder (‘dai’ variable) to match words and build DialogueActItem,
- so that the DAI is aligned to the corresponding words. If matched words are not supposed to be aligned, use the PTICSHDCSLU matching methods instead. Make sure to list negative conditions first, so that the following positive conditions are not added to the alignment when they should not be. E.g.: (not any_phrase_in(u, [‘dobrý den’, ‘dobrý večer’]) and dai.any_word_in(“dobrý”))
Parameters: - utterance – the input utterance
- cn – The output dialogue act item confusion network.
Returns: None
-
parse_non_speech_events
(utterance, cn)[source]¶ Processes non-speech events in the input utterance.
Parameters: - utterance – the input utterance
- cn – The output dialogue act item confusion network.
Returns: None
-
parse_number
(abutterance)[source]¶ Detect a number in the input abstract utterance.
Number words that form a time expression are collapsed into a single TIME category word. Recognized time expressions (where FRAC, HOUR and MIN stand for fraction, hour and minute numbers respectively):
- FRAC [na] HOUR
- FRAC hodin*
- HOUR a FRAC hodin*
- HOUR hodin* a MIN minut*
- HOUR hodin* MIN
- HOUR hodin*
- HOUR [0]MIN
- MIN minut*
Words of the NUMBER category are assumed to be in a format parsable to int or float.
Parameters: abutterance (Utterance) – the input abstract utterance.
-
parse_state
(abutterance, cn)[source]¶ Detects state in the input abstract utterance.
Parameters: - abutterance – the input abstract utterance.
- cn – The output dialogue act item confusion network.
-
parse_stop
(abutterance, cn)[source]¶ Detects stops in the input abstract utterance.
Parameters: - abutterance – the input abstract utterance.
- cn – The output dialogue act item confusion network.
-
parse_street
(abutterance, cn)[source]¶ Detects street in the input abstract utterance.
Parameters: - abutterance – the input abstract utterance.
- cn – The output dialogue act item confusion network.
-
parse_task
(abutterance, cn)[source]¶ Detects the task in the input abstract utterance.
Parameters: - abutterance –
- cn – The output dialogue act item confusion network.
-
parse_time
(abutterance, cn)[source]¶ Detects the time in the input abstract utterance.
Parameters: - abutterance – the input abstract utterance.
- cn – The output dialogue act item confusion network.
-
parse_vehicle
(abutterance, cn)[source]¶ Detects the vehicle (transport type) in the input abstract utterance.
Parameters: - abutterance – the input abstract utterance.
- cn – The output dialogue act item confusion network.
-
parse_waypoint
(abutterance, cn, wp_id, wp_slot_suffix, phr_wp_types, phr_in=None)[source]¶ Detects stops or cities in the input abstract utterance (called through parse_city or parse_stop).
Parameters: - abutterance – the input abstract utterance.
- cn – The output dialogue act item confusion network.
- wp_id – waypoint slot category label (e.g. “STOP=”, “CITY=”)
- wp_slot_suffix – waypoint slot suffix (e.g. “stop”, “city”)
- phr_wp_types – set of phrases for each waypoint type
- phr_in – phrases for ‘in’ waypoint type
-
-
alex.applications.PublicTransportInfoEN.hdc_slu.
ending_phrases_in
(utterance, phrases)[source]¶ Returns True if the utterance ends with one of the phrases
Parameters: - utterance – The utterance to search in
- phrases – a list of phrases to search for
Return type: bool
-
alex.applications.PublicTransportInfoEN.hdc_slu.
first_phrase_span
(utterance, phrases)[source]¶ Returns the span (start, end+1) of the first phrase from the given list that is found in the utterance. Returns (-1, -1) if no phrase is found.
Parameters: - utterance – The utterance to search in
- phrases – a list of phrases to be tried (in the given order)
Return type: tuple
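A minimal standalone sketch of this behaviour, assuming the utterance and each phrase are given as lists of words (scanning positions left to right, so the earliest match wins, is an assumption about the tie-breaking order):

```python
# Sketch of first_phrase_span: find the first occurrence of any of the
# given phrases and return its (start, end + 1) word span.
def first_phrase_span(utterance, phrases):
    """Return (start, end + 1) of the first phrase found, else (-1, -1)."""
    for start in range(len(utterance)):
        for phrase in phrases:
            if utterance[start:start + len(phrase)] == phrase:
                return start, start + len(phrase)
    return -1, -1

utt = "i would like to go to baker street".split()
first_phrase_span(utt, [["baker", "street"], ["go"]])  # -> (4, 5): "go" occurs first
```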
-
alex.applications.PublicTransportInfoEN.hdc_slu.
last_phrase_pos
(utterance, words)[source]¶ Returns the last position of a given phrase in the given utterance, or -1 if not found.
Return type: int
-
alex.applications.PublicTransportInfoEN.hdc_slu.
last_phrase_span
(utterance, phrases)[source]¶ Returns the span (start, end+1) of the last phrase from the given list that is found in the utterance. Returns (-1, -1) if no phrase is found.
Parameters: - utterance – The utterance to search in
- phrases – a list of phrases to be tried (in the given order)
Return type: tuple
-
class
alex.applications.PublicTransportInfoEN.preprocessing.
PTIENNLGPreprocessing
(ontology)[source]¶ Bases:
alex.components.nlg.template.TemplateNLGPreprocessing
Template NLG preprocessing routines for English public transport information.
This serves for spelling out relative and absolute time expressions.
-
preprocess
(template, svs_dict)[source]¶ Preprocess values to be filled into an NLG template. Spells out temperature and time expressions and translates some of the values to English.
Parameters: svs_dict – Slot-value dictionary Returns: The same dictionary, with modified values
-
spell_temperature
(value, interval)[source]¶ Convert a temperature expression into words (assuming nominative).
Parameters: - value – Temperature value (whole number in degrees as string), e.g. ‘1’ or ‘-10’.
- interval – Boolean indicating whether to treat this as a start of an interval, i.e. omit the degrees word.
Returns: temperature expression as string
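A hedged sketch of spelling out whole-degree temperatures; the word list and the interval behaviour below are assumptions for illustration, not the actual Alex implementation:

```python
# Sketch: spell out a signed whole-degree temperature string such as '-10'.
# Handles 0-10 only; the real code covers the full number range.
_UNITS = ["zero", "one", "two", "three", "four", "five", "six", "seven",
          "eight", "nine", "ten"]

def spell_temperature(value, interval=False):
    """Spell out a temperature; omit 'degrees' when the value starts an
    interval (e.g. 'five to ten degrees')."""
    num = int(value)
    words = []
    if num < 0:
        words.append("minus")
        num = -num
    words.append(_UNITS[num])
    if not interval:
        words.append("degrees")
    return " ".join(words)

spell_temperature("-10")               # -> "minus ten degrees"
spell_temperature("5", interval=True)  # -> "five"
```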
-
-
class
alex.applications.PublicTransportInfoEN.preprocessing.
PTIENSLUPreprocessing
(*args, **kwargs)[source]¶ Bases:
alex.components.slu.base.SLUPreprocessing
Extends SLUPreprocessing with additional transformations.
-
alex.applications.PublicTransportInfoEN.site_preprocessing.
expand
(element, spell_numbers=True)[source]¶
-
class
alex.applications.PublicTransportInfoEN.time_zone.
GoogleTimeFinder
(cfg)[source]¶
-
class
alex.applications.utils.weather.
OpenWeatherMapWeather
(input_json, condition_transl, date=None, daily=False, celsius=True)[source]¶
-
class
alex.applications.utils.weather.
OpenWeatherMapWeatherFinder
(cfg)[source]¶ Bases:
alex.applications.utils.weather.WeatherFinder
,alex.tools.apirequest.APIRequest
Weather service using OpenWeatherMap (http://openweathermap.org)
Submodules¶
alex.applications.ahub module¶
alex.applications.autopath module¶
self cloning, automatic path configuration
copy this into any subdirectory of pypy from which scripts need to be run, typically all of the test subdirs. The idea is that any such script simply issues
import autopath
and this will make sure that the parent directory containing “pypy” is in sys.path.
If you modify the master “autopath.py” version (in pypy/tool/autopath.py) you can directly run it which will copy itself on all autopath.py files it finds under the pypy root directory.
This module always provides these attributes:
pypydir pypy root directory path this_dir directory where this autopath.py resides
alex.applications.exceptions module¶
-
exception
alex.applications.exceptions.
HubException
[source]¶ Bases:
alex.AlexException
alex.applications.shub module¶
-
class
alex.applications.shub.
SemHub
(cfg)[source]¶ Bases:
alex.components.hub.hub.Hub
SemHub builds a text based testing environment for the dialogue manager components.
It reads dialogue acts from the standard input and passes them to the selected dialogue manager. The output is in the form of dialogue acts.
-
hub_type
= u'SHub'¶
-
input_da_nblist
()[source]¶ Reads an N-best list of dialogue acts from the input.
Return type: confusion network
-
alex.applications.thub module¶
alex.applications.vhub module¶
alex.applications.webhub module¶
Module contents¶
alex.components package¶
Subpackages¶
-
exception
alex.components.asr.exceptions.
ASRException
[source]¶ Bases:
alex.AlexException
-
class
alex.components.asr.test_utterance.
TestUttCNFeats
(methodName='runTest')[source]¶ Bases:
unittest.case.TestCase
Basic test for utterance confnet features.
-
class
alex.components.asr.test_utterance.
TestUtterance
(methodName='runTest')[source]¶ Bases:
unittest.case.TestCase
Tests correct working of the Utterance class.
-
class
alex.components.asr.utterance.
ASRHypothesis
[source]¶ Bases:
alex.ml.hypothesis.Hypothesis
This is the base class for all forms of probabilistic ASR hypotheses representations.
-
class
alex.components.asr.utterance.
AbstractedUtterance
(surface)[source]¶ Bases:
alex.components.asr.utterance.Utterance
,alex.ml.features.Abstracted
-
classmethod
from_utterance
(utterance)[source]¶ Constructs a new AbstractedUtterance from an existing Utterance.
-
other_val
= (u'[OTHER]',)¶
-
phrase2category_label
(phrase, catlab)[source]¶ Replaces the phrase given by `phrase’ by a new phrase, given by `catlab’. Assumes `catlab’ is an abstraction for `phrase’.
-
class
alex.components.asr.utterance.
Utterance
(surface)[source]¶ Bases:
object
-
find
(phrase)[source]¶ Returns the word index of the start of the first occurrence of `phrase’ within this utterance. If none is found, returns -1.
- Arguments:
- phrase – a list of words constituting the phrase sought
-
index
(phrase)[source]¶ Returns the word index of the start of the first occurrence of `phrase’ within this utterance. If none is found, ValueError is raised.
- Arguments:
- phrase – a list of words constituting the phrase sought
-
iter_with_boundaries
()[source]¶ Iterates the sequence [SENTENCE_START, word1, ..., wordlast, SENTENCE_END].
-
lower
()[source]¶ Lowercases words of this utterance.
BEWARE, this method is destructive, it lowercases self.
-
replace
(orig, replacement, return_startidx=False)[source]¶ Analogous to the `str.replace’ method. If the original phrase is not found in this utterance, this instance is returned. If it is found, only the first match is replaced.
- Arguments:
orig – the phrase to replace, as a sequence of words
replacement – the replacement in the same form
return_startidx – if set to True, the tuple (replaced, orig_pos) is returned
-
replace2
(start, end, replacement)[source]¶ Replace the words from start to end with the replacement.
Parameters: - start – the start position of replaced word sequence
- end – the end position of replaced word sequence
- replacement – a replacement
Returns: return a new Utterance instance with the word sequence replaced with the replacement
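The replace2 semantics can be sketched with plain list slicing, assuming the utterance's words are stored as a list (the real method returns a new Utterance instance rather than a list):

```python
# Sketch: replace the word sequence words[start:end] with the replacement,
# returning a new list and leaving the original untouched.
def replace2(words, start, end, replacement):
    return words[:start] + list(replacement) + words[end:]

replace2("go to baker street now".split(), 2, 4, ["STOP"])
# -> ['go', 'to', 'STOP', 'now']
```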
-
replace_all
(orig, replacement)[source]¶ Replace all occurrences of the given words with the replacement. Only replaces at word boundaries.
Parameters: - orig – the original string to be replaced (as string or list of words)
- replacement – the replacement (as string or list of words)
Return type:
-
utterance
¶
-
-
class
alex.components.asr.utterance.
UtteranceConfusionNetwork
(rep=None)[source]¶ Bases:
alex.components.asr.utterance.ASRHypothesis
,alex.ml.features.Abstracted
Word confusion network
- Attributes:
- cn: a list of alternatives of the following signature
- [word_index-> [ alternative ]]
XXX Are the alternatives always sorted wrt their probability in decreasing order?
TODO Define a lightweight class SimpleHypothesis as a tuple (probability, fact) with easy-to-read indexing. namedtuple might be the best choice.
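One possible shape for the SimpleHypothesis suggested by the TODO above (a sketch, not code from the library): a namedtuple keeps tuple semantics while giving readable field access.

```python
from collections import namedtuple

# A lightweight (probability, fact) hypothesis, as the TODO suggests.
SimpleHypothesis = namedtuple("SimpleHypothesis", ["prob", "fact"])

# cn: word_index -> list of alternatives, here already sorted by probability
cn = [
    [SimpleHypothesis(0.7, "to"), SimpleHypothesis(0.3, "two")],
    [SimpleHypothesis(0.9, "baker"), SimpleHypothesis(0.1, "bake")],
]
# Best path through the confnet: pick the most probable alternative per slot.
best = " ".join(max(alts, key=lambda h: h.prob).fact for alts in cn)
# best == "to baker"
```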
-
class
Index
(is_long_link, word_idx, alt_idx, link_widx)¶ Bases:
tuple
unique index into the confnet
- Attributes:
- is_long_link – indexing to a long link?
- word_idx – first index, either to self.cn or self._long_links
- alt_idx – second index, ditto
- link_widx – if is_long_link, this indexes the word within the phrase of the long link
-
alt_idx
¶ Alias for field number 2
-
is_long_link
¶ Alias for field number 0
-
link_widx
¶ Alias for field number 3
-
word_idx
¶ Alias for field number 1
-
class
UtteranceConfusionNetwork.
LongLink
(end, orig_probs, hyp, normalise=False)[source]¶ Bases:
object
-
attrs
= (u'end', u'orig_probs', u'hyp', u'normalise')¶ Represents a long link in a word confusion network.
- Attributes:
- end – end index of the link (exclusive)
- orig_probs – list of probabilities associated with the ordinary words this link corresponds to
- hyp – a (probability, phrase) tuple, the label of this link; `phrase’ itself is a sequence of words (list of strings)
- normalise – boolean; whether this link’s probability should be taken into account when normalising probabilities for alternatives in the confnet
-
-
UtteranceConfusionNetwork.
add
(hyps)[source]¶ Adds a new arc to the confnet with alternatives as specified.
- Arguments:
- hyps: an iterable of simple hypotheses – (probability, word) tuples
-
UtteranceConfusionNetwork.
cn
¶
-
UtteranceConfusionNetwork.
get_next_worse_candidates
(hyp_index)[source]¶ Returns such hypotheses that will have lower probability. It assumes that the confusion network is sorted.
-
UtteranceConfusionNetwork.
get_phrase_idxs
(phrase, start=0, end=None, start_in_midlinks=True, immediate=False)[source]¶ Returns indices to words constituting the given phrase within this confnet. It looks only for the first occurrence of the phrase in the interval specified.
- Arguments:
- phrase: the phrase to look for, specified as a list of words
- start: the index where to start searching
- end: the index after which to stop searching
- start_in_midlinks: whether a phrase starting in the middle of a long link should be considered too
- immediate: whether the phrase has to start immediately at the start index (intervening empty words are allowed)
- Returns:
- an empty list in case that phrase was not found
- a list of indices to words (UtteranceConfusionNetwork.Index) that constitute that phrase within this confnet
-
UtteranceConfusionNetwork.
get_prob
(hyp_index)[source]¶ Returns a probability of the given hypothesis.
-
UtteranceConfusionNetwork.
get_utterance_nblist
(n=10, prune_prob=0.005)[source]¶ Parses the confusion network and generates n best hypotheses.
The result is a list of utterance hypotheses, each with an assigned probability. The list also includes the utterance “_other_” for the case that the correct utterance is not in the list.
Generation of hypotheses stops when the probability of the hypotheses becomes smaller than prune_prob.
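The pruned n-best extraction can be sketched on a tiny confusion network represented as a list of [(prob, word), ...] alternatives; the real method additionally emits the “_other_” hypothesis for the leftover probability mass:

```python
# Sketch: expand the confnet left to right, multiplying probabilities,
# pruning paths below prune_prob and keeping the n best at each step.
def utterance_nblist(cn, n=10, prune_prob=0.005):
    hyps = [(1.0, [])]
    for alternatives in cn:
        hyps = [(p * ap, words + [w])
                for p, words in hyps
                for ap, w in alternatives
                if p * ap >= prune_prob]     # prune low-probability paths
        hyps.sort(reverse=True)
        hyps = hyps[:n]                      # keep the n best
    return [(p, " ".join(words)) for p, words in hyps]

cn = [[(0.7, "to"), (0.3, "two")], [(0.9, "baker"), (0.1, "bake")]]
utterance_nblist(cn, n=2)
# -> [(~0.63, 'to baker'), (~0.27, 'two baker')]
```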
-
UtteranceConfusionNetwork.
iter_ngrams
(n, with_boundaries=False, start=None)[source]¶ Iterates n-gram hypotheses of the length specified. This is the interface method. It is aware of multi-word phrases (“long links”) that were substituted into the confnet.
- Arguments:
- n: size of the n-grams
- with_boundaries: whether to include special sentence boundary marks
- start: at which word index the n-grams have to start (exactly)
-
UtteranceConfusionNetwork.
iter_ngrams_fromto
(from_=None, to=None)[source]¶ Iterates n-gram hypotheses between the indices `from_‘ and `to‘. This method does not consider phrases longer than 1 that were substituted into the confnet.
-
UtteranceConfusionNetwork.
iter_ngrams_unaware
(n, with_boundaries=False)[source]¶ Iterates n-gram hypotheses of the length specified. This is the interface method, and uses `iter_ngrams_fromto’ internally. This method does not consider phrases longer than 1 that were substituted into the confnet.
- Arguments:
- n: size of the n-grams
- with_boundaries: whether to include special sentence boundary marks
-
UtteranceConfusionNetwork.
lower
()[source]¶ Lowercases words of this confnet.
BEWARE, this method is destructive, it lowercases self.
-
UtteranceConfusionNetwork.
merge
()[source]¶ Adds up probabilities for the same hypotheses.
TODO: not implemented yet
-
UtteranceConfusionNetwork.
normalise
(end=None)[source]¶ Makes sure that all probabilities add up to one. There should be no need of calling this from outside, since this invariant is ensured between calls to this class’ methods.
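The invariant normalise() maintains can be shown on a single slot of alternatives (a sketch over plain (prob, word) pairs, not the actual confnet structure):

```python
# Sketch: rescale one slot's alternatives so their probabilities sum to one.
def normalise_alternatives(alternatives):
    total = sum(p for p, word in alternatives)
    return [(p / total, word) for p, word in alternatives]

normalise_alternatives([(0.5, "to"), (0.25, "two")])
# -> [(2/3, 'to'), (1/3, 'two')]
```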
-
UtteranceConfusionNetwork.
other_val
= (u'[OTHER]',)¶
-
UtteranceConfusionNetwork.
phrase2category_label
(phrase, catlab)[source]¶ Replaces the phrase given by `phrase’ by a new phrase, given by `catlab’. Assumes `catlab’ is an abstraction for `phrase’.
-
UtteranceConfusionNetwork.
repr_escer
= <alex.utils.text.Escaper object>¶
-
UtteranceConfusionNetwork.
repr_spec_chars
= u'():,;|[]"\\'¶
-
UtteranceConfusionNetwork.
sort
()[source]¶ Sort the alternatives for each word according to their probability.
-
UtteranceConfusionNetwork.
str_escer
= <alex.utils.text.Escaper object>¶
-
class
alex.components.asr.utterance.
UtteranceConfusionNetworkFeatures
(type=u'ngram', size=3, confnet=None)[source]¶ Bases:
alex.ml.features.Features
Represents features extracted from an utterance hypothesis in the form of a confusion network. These are simply a probabilistic generalisation of simple utterance features. Only n-gram (incl. skip n-gram) features are currently implemented.
-
class
alex.components.asr.utterance.
UtteranceFeatures
(type=u'ngram', size=3, utterance=None)[source]¶ Bases:
alex.ml.features.Features
Represents the vector of features for an utterance.
The class also provides methods for manipulation of the feature vector, including extracting features from an utterance.
Currently, only n-gram (including skip n-grams) features are implemented.
- Attributes:
- type: type of features (‘ngram’)
- size: size of features (an integer)
- features: mapping { feature : value of feature (# occs) }
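The n-gram counting behind these features can be sketched as follows (skip n-grams, which the class also supports, are omitted here):

```python
from collections import Counter

# Sketch: map every n-gram up to the given size to its occurrence count,
# i.e. the { feature : # occs } mapping described above.
def ngram_features(words, size=3):
    feats = Counter()
    for n in range(1, size + 1):
        for i in range(len(words) - n + 1):
            feats[tuple(words[i:i + n])] += 1
    return feats

feats = ngram_features("to baker street to baker".split(), size=2)
# feats[('to', 'baker')] == 2, feats[('street',)] == 1
```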
-
class
alex.components.asr.utterance.
UtteranceHyp
(prob=None, utterance=None)[source]¶ Bases:
alex.components.asr.utterance.ASRHypothesis
Provide an interface for 1-best hypothesis from the ASR component.
-
class
alex.components.asr.utterance.
UtteranceNBList
(rep=None)[source]¶ Bases:
alex.components.asr.utterance.ASRHypothesis
,alex.ml.hypothesis.NBList
Provides functionality of n-best lists for utterances.
When updating the n-best list, one should do the following.
- add utterances or parse a confusion network
- merge and normalise, in either order
- Attributes:
- n_best: the list containing pairs [prob, utterance], sorted from the most probable to the least probable ones
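The add / merge / normalise workflow described above can be sketched on plain (prob, utterance) pairs (a stand-in for the real class, not its API):

```python
# Sketch of the n-best list update: merge sums probabilities of duplicate
# utterances; normalise rescales them to sum to one.
def merge(n_best):
    totals = {}
    for prob, utt in n_best:
        totals[utt] = totals.get(utt, 0.0) + prob
    return sorted(((p, u) for u, p in totals.items()), reverse=True)

def normalise(n_best):
    total = sum(p for p, _ in n_best)
    return [(p / total, u) for p, u in n_best]

nbl = [(0.4, "to baker street"), (0.2, "to bake street"),
       (0.2, "to baker street")]
normalise(merge(nbl))
# -> [(0.75, 'to baker street'), (0.25, 'to bake street')]
```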
-
get_best_utterance
()[source]¶ Returns the most probable utterance.
DEPRECATED. Use get_best instead.
-
class
alex.components.asr.utterance.
UtteranceNBListFeatures
(type=u'ngram', size=3, utt_nblist=None)[source]¶ Bases:
alex.ml.features.Features
-
alex.components.asr.utterance.
load_utt_confnets
(fname, limit=None, encoding=u'UTF-8')[source]¶ Loads a dictionary of utterance confusion networks from a given file.
The file is assumed to contain lines of the following form:
[whitespace..]<key>[whitespace..]=>[whitespace..]<utt_cn>[whitespace..]
or just (without keys):
[whitespace..]<utt_cn>[whitespace..]
where <utt_cn> is obtained as repr() of an UtteranceConfusionNetwork object.
- Arguments:
- fname – path towards the file to read the utterance confusion networks from
- limit – limit on the number of confusion networks to read
- encoding – the file encoding
Returns a dictionary with confnets (instances of UtteranceConfusionNetwork) as values.
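The "key => value" line format described above can be parsed as in this standalone sketch (the real loader builds UtteranceConfusionNetwork objects from the value part; here values are kept as strings):

```python
# Sketch: parse lines of "key => value"; keyless lines get a synthetic
# key so they can still be stored in the dictionary.
def parse_keyed_lines(lines):
    entries = {}
    for lineno, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue                        # skip blank lines
        if "=>" in line:
            key, _, value = line.partition("=>")
            entries[key.strip()] = value.strip()
        else:
            entries[str(lineno)] = line
    return entries

parse_keyed_lines(["a.wav => to baker street", "  b.wav => hello  "])
# -> {'a.wav': 'to baker street', 'b.wav': 'hello'}
```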
-
alex.components.asr.utterance.
load_utt_nblists
(fname, limit=None, n=40, encoding=u'UTF-8')[source]¶ Loads a dictionary of utterance n-best lists from a file with confnets.
The n-best lists are obtained simply from the confnets.
The file is assumed to contain lines of the following form:
[whitespace..]<key>[whitespace..]=>[whitespace..]<utt_cn>[whitespace..]
or just (without keys):
[whitespace..]<utt_cn>[whitespace..]
where <utt_cn> is obtained as repr() of an UtteranceConfusionNetwork object.
- Arguments:
- fname – path towards the file to read the utterance confusion networks from
- limit – limit on the number of n-best lists to read
- n – depth of the n-best lists
- encoding – the file encoding
Returns a dictionary with n-best lists (instances of UtteranceNBList) as values.
-
alex.components.asr.utterance.
load_utterances
(fname, limit=None, encoding=u'UTF-8')[source]¶ Loads a dictionary of utterances from a given file.
The file is assumed to contain lines of the following form:
[whitespace..]<key>[whitespace..]=>[whitespace..]<utterance>[whitespace..]
or just (without keys):
[whitespace..]<utterance>[whitespace..]
- Arguments:
- fname – path towards the file to read the utterances from
- limit – limit on the number of utterances to read
- encoding – the file encoding
Returns a dictionary with utterances (instances of Utterance) as values.
-
alex.components.asr.utterance.
save_utterances
(file_name, utt, encoding=u'UTF-8')[source]¶ Saves a dictionary of utterances, keyed by wave file name, into a file.
Parameters: - file_name – name of the target file
- utt – a dictionary with the utterances where the keys are the names of the corresponding wave files
Returns: None
-
class
alex.components.dm.base.
DialogueManager
(cfg)[source]¶ Bases:
object
This is a base class for a dialogue manager. The purpose of a dialogue manager is to accept input in the form of dialogue acts and to respond, again in the form of dialogue acts.
The dialogue manager should be able to accept multiple inputs without producing any output and be able to produce multiple outputs without any input.
-
class
alex.components.dm.base.
DialoguePolicy
(cfg, ontology)[source]¶ Bases:
object
This is a base class policy.
-
class
alex.components.dm.base.
DialogueState
(cfg, ontology)[source]¶ Bases:
object
This is a trivial implementation of a dialogue state and its update.
It uses only the best dialogue act from the input and based on this it updates its state.
-
get_slots_being_confirmed
()[source]¶ Returns all slots which are currently being confirmed by the user along with the value being confirmed.
-
get_slots_being_noninformed
()[source]¶ Returns all slots provided by the user that the system has not yet informed about, along with the value of the slot.
-
get_slots_being_requested
()[source]¶ Returns all slots which are currently being requested by the user along with the correct value.
-
restart
()[source]¶ Reinitialises the dialogue state so that the dialogue manager can start from scratch.
Nevertheless, remember the turn history.
-
update
(user_da, system_da)[source]¶ Interface for the dialogue act update.
It can process dialogue act, dialogue act N best lists, or dialogue act confusion networks.
Parameters: - user_da (DialogueAct, DialogueActNBList or DialogueActConfusionNetwork) – Dialogue act to process.
- system_da – Last system dialogue act.
-
-
class
alex.components.dm.base.
DiscreteValue
(values, name='', desc='')[source]¶ Bases:
object
-
explain
(full=False, linear_prob=False)[source]¶ This function prints the values and their probabilities for this node.
-
-
class
alex.components.dm.dddstate.
D3DiscreteValue
(values={}, name=u'', desc=u'')[source]¶ Bases:
alex.components.dm.base.DiscreteValue
This is a simple implementation of a probabilistic slot. It serves for the case of simple MDP approach or UFAL DSTC 1.0-like dialogue state deterministic update.
-
distribute
(value, dist_prob)[source]¶ This function distributes a portion of the probability mass assigned to value to the other values, with weight dist_prob.
-
explain
(full=False, linear_prob=True)[source]¶ This function prints the values and their probabilities for this node.
-
set
(value, prob=None)[source]¶ This function sets a probability of a specific value.
WARNING This can lead to un-normalised probabilities.
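The warning above can be illustrated on a plain dict-based slot distribution (a sketch, not the D3DiscreteValue API): setting one probability directly breaks the sum-to-one invariant, so a renormalisation step follows.

```python
# Sketch: set a value's probability, then renormalise so the slot's
# distribution sums to one again.
def set_prob(dist, value, prob):
    dist = dict(dist)                        # do not mutate the caller's dict
    dist[value] = prob
    total = sum(dist.values())
    return {v: p / total for v, p in dist.items()}

set_prob({"none": 0.5, "chinese": 0.5}, "italian", 1.0)
# -> {'none': 0.25, 'chinese': 0.25, 'italian': 0.5}
```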
-
test
(test_value=None, test_prob=None, neg_val=False, neg_prob=False)[source]¶ Test the most probable value of the slot whether:
- the most probable value is equal to test_value and
- its probability is larger than test_prob
Each of the above tests can be negated when neg_* is set True.
Parameters: - test_value –
- test_prob –
- neg_val –
- neg_prob –
Returns:
-
-
class
alex.components.dm.dddstate.
DeterministicDiscriminativeDialogueState
(cfg, ontology)[source]¶ Bases:
alex.components.dm.base.DialogueState
This is a trivial implementation of a dialogue state and its update.
It uses only the best dialogue act from the input. Based on this it updates its state.
-
get_accepted_slots
(acc_prob)[source]¶ Returns all slots which have a probability of a non-“none” value larger than some threshold.
-
get_changed_slots
(cha_prob)[source]¶ Returns all slots that have changed since the previous turn. Because a change is determined by a change in the probability of a particular value, there may be very small changes. Therefore, this only reports changes for values with a probability larger than the given threshold.
Parameters: cha_prob – minimum current probability of the most probable hypothesis to be reported Return type: dict
-
get_slots_being_confirmed
(conf_prob=0.8)[source]¶ Return all slots which are currently being confirmed by the user along with the value being confirmed.
-
get_slots_being_noninformed
(noninf_prob=0.8)[source]¶ Return all slots provided by the user that the system has not yet informed about, along with the value of the slot.
This will not detect a change in a goal. For example:
U: I want a Chinese restaurant. S: Ok, you want a Chinese restaurant. What price range do you have in mind? U: Well, I would rather want an Italian restaurant. S: Ok, no problem. You want an Italian restaurant. What price range do you have in mind?
Because the system informed about the food type and stored “system-informed”, we will not notice that we confirmed a different food type.
-
get_slots_being_requested
(req_prob=0.8)[source]¶ Return all slots which are currently being requested by the user along with the correct value.
-
get_slots_tobe_confirmed
(min_prob, max_prob)[source]¶ Returns all slots which have a probability of a non-“none” value larger than some threshold but not large enough to be considered accepted.
-
get_slots_tobe_selected
(sel_prob)[source]¶ Returns all slots whose two most probable non-“none” values have probabilities larger than some threshold.
-
has_state_changed
(cha_prob)[source]¶ Returns a boolean indicating whether the dialogue state changed significantly since the last turn. True is returned if at least one slot has at least one value whose probability has changed at least by the given threshold since last time.
Parameters: cha_prob – minimum probability change to be reported Return type: Boolean
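The check described above can be sketched over plain dict-based slot distributions (a stand-in for the real dialogue state, under that assumption):

```python
# Sketch: the state has changed if any slot has a value whose probability
# moved by at least cha_prob since the previous turn.
def has_state_changed(prev_slots, cur_slots, cha_prob):
    for slot, dist in cur_slots.items():
        prev = prev_slots.get(slot, {})
        for value, p in dist.items():
            if abs(p - prev.get(value, 0.0)) >= cha_prob:
                return True
    return False

prev = {"food": {"none": 0.9, "chinese": 0.1}}
cur = {"food": {"none": 0.2, "chinese": 0.8}}
has_state_changed(prev, cur, 0.5)   # -> True: "none" moved by 0.7
```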
-
restart
()[source]¶ Reinitialise the dialogue state so that the dialogue manager can start from scratch.
Nevertheless, remember the turn history.
-
slots
= None¶
-
update
(user_da, system_da)[source]¶ Interface for the dialogue act update.
It can process dialogue act, dialogue act N best lists, or dialogue act confusion networks.
Parameters: - user_da (DialogueAct, DialogueActNBList or DialogueActConfusionNetwork) – Dialogue act to process.
- system_da – Last system dialogue act.
-
-
class
alex.components.dm.dstc_tracker.
DSTCState
(slots)[source]¶ Bases:
object
Represents state of the tracker.
-
class
alex.components.dm.dstc_tracker.
DSTCTracker
(slots, default_space_size=defaultdict(<function <lambda> at 0x7fbf78f57668>, {}))[source]¶ Bases:
alex.components.dm.tracker.StateTracker
Represents simple deterministic DSTC state tracker.
-
state_class
¶ alias of
DSTCState
-
This is an example implementation of a dummy yet funny dialogue policy.
-
class
alex.components.dm.dummypolicy.
DummyDialoguePolicy
(cfg, ontology)[source]¶ Bases:
alex.components.dm.base.DialoguePolicy
This is a trivial policy just to demonstrate basic functionality of a proper DM.
-
exception
alex.components.dm.exceptions.
DMException
[source]¶ Bases:
alex.AlexException
-
exception
alex.components.dm.exceptions.
DialogueManagerException
[source]¶ Bases:
alex.AlexException
-
exception
alex.components.dm.exceptions.
DialoguePolicyException
[source]¶ Bases:
alex.AlexException
-
exception
alex.components.dm.exceptions.
DialogueStateException
[source]¶ Bases:
alex.AlexException
-
exception
alex.components.dm.exceptions.
DummyDialoguePolicyException
[source]¶ Bases:
alex.components.dm.exceptions.DialoguePolicyException
-
class
alex.components.dm.ontology.
Ontology
(file_name=None)[source]¶ Bases:
object
Represents an ontology for a dialogue domain.
-
get_compatible_vals
(slot_pair, value)[source]¶ Given a slot pair (key to ‘compatible_values’ in ontology data), this returns the set of compatible values for the given key. If there is no information about the given pair, None is returned.
Parameters: - slot_pair – key to ‘compatible_values’ in ontology data
- value – the subkey to check compatible values for
Return type:
-
get_default_value
(slot)[source]¶ Given a slot name, get its default value (if set in the ontology). Returns None if the default value is not set for the given slot.
Parameters: slot – the name of the desired slot Return type: unicode
-
is_compatible
(slot_pair, val1, val2)[source]¶ Given a slot pair and a pair of values, this tests whether the values are compatible. If there is no information about the slot pair or the first value, returns False. If the second value is None, returns always True (i.e. None is compatible with anything).
Parameters: - slot_pair – key to ‘compatible_values’ in ontology data
- val1 – value of the 1st slot
- val2 – value of the 2nd slot
Return type: Boolean
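The is_compatible semantics can be sketched over a nested dict of compatible values (the slot-pair key and the example values are illustrative assumptions):

```python
# Sketch: 'compatible' maps a slot-pair key to { val1 : set of compatible
# val2 values }, mirroring the 'compatible_values' ontology data.
compatible = {"city_stop": {"London": {"Baker Street", "Marylebone"}}}

def is_compatible(slot_pair, val1, val2):
    if val2 is None:
        return True                 # None is compatible with anything
    # Unknown slot pairs or first values yield an empty set, hence False.
    return val2 in compatible.get(slot_pair, {}).get(val1, set())

is_compatible("city_stop", "London", "Marylebone")   # -> True
is_compatible("city_stop", "London", "Oxford")       # -> False
```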
-
last_talked_about
(*args, **kwds)[source]¶ Returns a list of slots and values that should be used for tracking what was talked about recently, given the input dialogue acts.
Parameters: - da_type – the source dialogue act type
- name – the source slot name
- value – the source slot value
Returns: returns a list of target slot names and values used for tracking
-
-
class
alex.components.dm.pstate.
PDDiscrete
(initial=None)[source]¶ Bases:
alex.components.dm.pstate.PDDiscreteBase
Discrete probability distribution.
-
NULL
= None¶
-
OTHER
= '<other>'¶
-
meta_slots
= set([None, '<other>'])¶
-
-
class
alex.components.hub.asr.
ASR
(cfg, commands, audio_in, asr_hypotheses_out, close_event)[source]¶ Bases:
multiprocessing.process.Process
ASR recognizes input audio and returns an N-best list hypothesis or a confusion network.
Recognition starts with the “speech_start()” command in the input audio stream and ends with the “speech_end()” command.
When the “speech_end()” command is received, the component asks the responsible ASR module to return its hypotheses and sends them to the output.
This component is a wrapper around multiple recognition engines which handles inter-process communication.
- Attributes:
- asr – the ASR object itself
-
process_pending_commands
()[source]¶ Process all pending commands.
- Available commands:
stop() - stop processing and exit the process
flush() - flush input buffers (currently only the input connection is flushed)
Returns True iff the process should terminate.
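The command-processing contract above can be sketched as follows: drain all pending commands from the connection, flush buffers on flush(), and report whether the process should terminate. The command strings follow the docstring; the buffer argument and function shape are illustrative, not the actual component code:

```python
from multiprocessing import Pipe

def process_pending_commands(commands, input_buffer):
    """Drain pending commands; return True iff the process should terminate."""
    terminate = False
    while commands.poll():
        cmd = commands.recv()
        if cmd == 'stop()':
            terminate = True
        elif cmd == 'flush()':
            # currently only the input side is flushed
            del input_buffer[:]
    return terminate
```

Note that all pending commands are consumed even when stop() arrives early, so a trailing flush() is not lost.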
-
class
alex.components.hub.dm.
DM
(cfg, commands, slu_hypotheses_in, dialogue_act_out, close_event)[source]¶ Bases:
multiprocessing.process.Process
DM accepts an N-best list hypothesis or a confusion network generated by an SLU component. The result of this component is an output dialogue act.
When the component receives an SLU hypothesis, it immediately responds with a dialogue act.
This component is a wrapper around multiple dialogue managers which handles multiprocessing communication.
-
epilogue
()[source]¶ Gives the user the last piece of information before hanging up.
Returns: the name of the activity, or None
-
-
exception
alex.components.hub.exceptions.
VoipIOException
[source]¶ Bases:
alex.AlexException
-
class
alex.components.hub.messages.
Message
(source, target)[source]¶ Bases:
alex.utils.mproc.InstanceID
Abstract class which implements basic functionality for messages passed between components in Alex.
-
class
alex.components.hub.nlg.
NLG
(cfg, commands, dialogue_act_in, text_out, close_event)[source]¶ Bases:
multiprocessing.process.Process
The NLG component receives a dialogue act generated by the dialogue manager and converts the act into text.
This component is a wrapper around multiple NLG components which handles multiprocessing communication.
-
class
alex.components.hub.slu.
SLU
(cfg, commands, asr_hypotheses_in, slu_hypotheses_out, close_event)[source]¶ Bases:
multiprocessing.process.Process
The SLU component receives ASR hypotheses and converts them into hypotheses about the meaning of the input in the form of dialogue acts.
This component is a wrapper around multiple SLU components which handles inter-process communication.
-
class
alex.components.nlg.tectotpl.block.a2w.cs.concatenatetokens.
ConcatenateTokens
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
Detokenize the sentence, spreading whitespace correctly.
-
class
alex.components.nlg.tectotpl.block.a2w.cs.removerepeatedtokens.
RemoveRepeatedTokens
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
Remove two identical neighboring tokens.
-
class
alex.components.nlg.tectotpl.block.read.tectotemplates.
TectoTemplates
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
Reader for partial t-tree dialog system templates, where treelets can be intermixed with linear text.
Example template:
Vlak přijede v [[7|adj:attr] hodina|n:4|gender:fem].
All linear text is inserted into t-lemmas of atomic nodes, while treelets have their formeme and grammateme values filled in.
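The mixed template format above can be illustrated with a small scanner that splits a template into linear text and bracketed treelet specifications (brackets may nest, as in the example). This is only an illustration of the input format, not the actual reader implementation:

```python
def split_template(template):
    """Split a template into ('text', ...) and ('treelet', ...) parts."""
    parts, depth, buf = [], 0, ''
    for ch in template:
        if ch == '[':
            if depth == 0 and buf:
                parts.append(('text', buf))
                buf = ''
            depth += 1
            if depth > 1:       # keep nested brackets verbatim
                buf += ch
        elif ch == ']':
            depth -= 1
            if depth == 0:
                parts.append(('treelet', buf))
                buf = ''
            else:
                buf += ch
        else:
            buf += ch
    if buf:
        parts.append(('text', buf))
    return parts
```

Run on the example template, the inner `[7|adj:attr]` stays inside the outer treelet specification, while the surrounding words remain linear text.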
-
class
alex.components.nlg.tectotpl.block.t2a.cs.addappositionpunct.
AddAppositionPunct
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
Separates Czech appositions, such as in ‘John, my best friend, ...’, with commas.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
class
alex.components.nlg.tectotpl.block.t2a.cs.addauxverbcompoundfuture.
AddAuxVerbCompoundFuture
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
Add compound future auxiliary ‘bude’.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
class
alex.components.nlg.tectotpl.block.t2a.cs.addauxverbcompoundpassive.
AddAuxVerbCompoundPassive
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
Add compound passive auxiliary ‘být’.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
class
alex.components.nlg.tectotpl.block.t2a.cs.addauxverbcompoundpast.
AddAuxVerbCompoundPast
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
Add compound past tense auxiliary of the 1st and 2nd person ‘jsem/jsi/jsme/jste’.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
AUX_PAST_FORMS
= {(u'P', u'2'): u'jste', (u'S', u'1'): u'jsem', (u'S', u'2'): u'jsi', (u'.', u'2'): u'jsi', (u'P', u'1'): u'jsme', (u'.', u'1'): u'jsem'}¶
-
class
alex.components.nlg.tectotpl.block.t2a.cs.addauxverbconditional.
AddAuxVerbConditional
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
Add conditional auxiliary ‘by’/’bych’.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
class
alex.components.nlg.tectotpl.block.t2a.cs.addauxverbmodal.
AddAuxVerbModal
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
Add modal verbs.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
DEONTMOD_2_MODAL
= {u'vol': u'cht\xedt', u'hrt': u'm\xedt', u'perm': u'moci', u'fac': u'moci', u'deb': u'muset', u'poss': u'moci'}¶
-
class
alex.components.nlg.tectotpl.block.t2a.cs.addclausalexpletives.
AddClausalExpletives
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.block.t2a.addauxwords.AddAuxWords
Add clausal expletive pronoun ‘to’ (+preposition) to subordinate clauses with ‘že’, if the parent verb requires it.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
class
alex.components.nlg.tectotpl.block.t2a.cs.addclausalpunct.
AddClausalPunct
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
An abstract ancestor for blocks working with clausal punctuation.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
class
alex.components.nlg.tectotpl.block.t2a.cs.addcoordpunct.
AddCoordPunct
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
Add a comma to coordinated lists of 3 or more elements, as well as before some Czech coordination conjunctions (‘ale’, ‘ani’).
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
class
alex.components.nlg.tectotpl.block.t2a.cs.addparentheses.
AddParentheses
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
Add ‘(‘ / ‘)’ nodes to nodes which have the wild/is_parenthesis attribute set.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
add_parenthesis_node
(anode, lemma, clause_num)[source]¶ Add a parenthesis node as a child of the specified a-node; with the given lemma and clause number set.
-
continued_paren_left
(anode)[source]¶ Return True if this node is continuing a parenthesis from the left.
-
class
alex.components.nlg.tectotpl.block.t2a.cs.addprepositions.
AddPrepositions
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.block.t2a.addauxwords.AddAuxWords
Add prepositional a-nodes according to formemes.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
class
alex.components.nlg.tectotpl.block.t2a.cs.addreflexiveparticles.
AddReflexiveParticles
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
Add reflexive particles to reflexiva tantum and reflexive passive verbs.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
class
alex.components.nlg.tectotpl.block.t2a.cs.addsentfinalpunct.
AddSentFinalPunct
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.block.t2a.cs.addclausalpunct.AddClausalPunct
Add final sentence punctuation (‘?’, ‘.’).
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
class
alex.components.nlg.tectotpl.block.t2a.cs.addsubconjs.
AddSubconjs
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.block.t2a.addauxwords.AddAuxWords
Add subordinate conjunction a-nodes according to formemes.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
class
alex.components.nlg.tectotpl.block.t2a.cs.addsubordclausepunct.
AddSubordClausePunct
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.block.t2a.cs.addclausalpunct.AddClausalPunct
Add commas separating subordinate clauses.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
are_in_coord_clauses
(aleft, aright)[source]¶ Check if the given nodes are in two coordinated clauses.
-
get_clause_parent
(anode)[source]¶ Return the parent of the clause the given node belongs to; the result may be the root of the tree.
-
class
alex.components.nlg.tectotpl.block.t2a.cs.capitalizesentstart.
CapitalizeSentStart
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
Capitalize the first word in the sentence (skip punctuation etc.).
-
OPEN_PUNCT
= u'^[({[\u201a\u201e\xab\u2039|*"\\\']+$'¶
-
-
class
alex.components.nlg.tectotpl.block.t2a.cs.deletesuperfluousauxs.
DeleteSuperfluousAuxs
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
Delete repeated prepositions and conjunctions in coordinations.
-
BASE_DIST_LIMIT
= 8¶
-
DIST_LIMIT
= {u'mezi': 50, u'pro': 8, u'proto\u017ee': 5, u'v': 5}¶
-
-
class
alex.components.nlg.tectotpl.block.t2a.cs.dropsubjpersprons.
DropSubjPersProns
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
Remove the Czech pro-drop subject personal pronouns (or demonstrative “to”) from the a-tree.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
class
alex.components.nlg.tectotpl.block.t2a.cs.generatepossessiveadjectives.
GeneratePossessiveAdjectives
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
According to formemes, this changes the lemma of the surface possessive adjectives from the original (deep) lemma which was identical to the noun from which the adjective is derived, e.g. changes the a-node lemma from ‘Čapek’ to ‘Čapkův’ if the corresponding t-node has the ‘adj:poss’ formeme.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
class
alex.components.nlg.tectotpl.block.t2a.cs.generatewordforms.
GenerateWordForms
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
Inflect word forms according to filled-in tags.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
BACK_REGEX
= <_sre.SRE_Pattern object>¶
-
class
alex.components.nlg.tectotpl.block.t2a.cs.imposeattragr.
ImposeAttrAgr
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.block.t2a.imposeagreement.ImposeAgreement
Impose case, gender and number agreement of attributes with their governing nouns.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
class
alex.components.nlg.tectotpl.block.t2a.cs.imposecomplagr.
ImposeComplAgr
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.block.t2a.imposeagreement.ImposeAgreement
Impose agreement of adjectival verb complements with the subject.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
class
alex.components.nlg.tectotpl.block.t2a.cs.imposepronzagr.
ImposePronZAgr
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.block.t2a.imposeagreement.ImposeAgreement
In phrases such as ‘každý z ...’,’žádná z ...’, impose agreement in gender.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
PRONOUNS
= u'^(jeden|ka\u017ed\xfd|\u017e\xe1dn\xfd|oba|v\u0161echen|(n\u011b|lec)kter\xfd|(jak|kter)\xfdkoliv?|libovoln\xfd)$'¶
-
class
alex.components.nlg.tectotpl.block.t2a.cs.imposerelpronagr.
ImposeRelPronAgr
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.block.t2a.imposeagreement.ImposeAgreement
Impose gender and number agreement of relative pronouns with their antecedent.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
class
alex.components.nlg.tectotpl.block.t2a.cs.imposesubjpredagr.
ImposeSubjPredAgr
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.block.t2a.imposeagreement.ImposeAgreement
Impose gender and number agreement of the predicate with its subject.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
class
alex.components.nlg.tectotpl.block.t2a.cs.initmorphcat.
InitMorphcat
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
According to t-layer grammatemes, this initializes the morphcat structure at the a-layer, which serves as the basis for later POS tag limiting during word form generation.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
DEGREE
= {u'comp': u'2', u'pos': u'1', u'acomp': u'2', u'sup': u'3', None: u'.', u'nr': u'.'}¶
-
GENDER
= {u'anim': u'M', u'fem': u'F', u'inan': u'I', u'inher': u'.', u'neut': u'N', None: u'.', u'nr': u'.'}¶
-
NEGATION
= {None: u'A', u'neg0': u'A', u'neg1': u'N'}¶
-
NUMBER
= {None: u'.', u'nr': u'.', u'sg': u'S', u'pl': u'P', u'inher': u'.'}¶
-
PERSON
= {u'1': u'1', None: u'.', u'3': u'3', u'2': u'2', u'inher': u'.'}¶
-
VOICE
= {u'pas': u'P', u'deagent': u'A', u'passive': u'P', u'act': u'A', u'active': u'A', None: u'.'}¶
-
class
alex.components.nlg.tectotpl.block.t2a.cs.marksubject.
MarkSubject
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
Marks the subject of each clause with the Afun ‘Sb’.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
class
alex.components.nlg.tectotpl.block.t2a.cs.markverbalcategories.
MarkVerbalCategories
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
Finishes marking synthetic verbal categories: tense, finiteness, mood.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
class
alex.components.nlg.tectotpl.block.t2a.cs.movecliticstowackernagel.
MoveCliticsToWackernagel
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
Move clitics (e.g. ‘se’, ‘to’ etc.) to the second (Wackernagel) position in the clause.
-
clitic_order
(clitic)[source]¶ Return the position of the given clitic in the natural Czech order of multiple clitics in the same clause.
-
find_eo1st_pos
(clause_root, clause_1st)[source]¶ Find the last word before the Wackernagel position.
-
handle_pronoun_je
(anode)[source]¶ If the given node is a personal pronoun with the form ‘je’, move it before its parent’s subtree and return True. Return False otherwise.
-
is_coord_taking_1st_pos
(clause_root)[source]¶ Return True if the clause root is a coordination member and the coordinating conjunction or shared subjunction is taking up the 1st position. E.g. ‘Běžel, aby se zahřál a dostal se dřív domů.’
-
-
class
alex.components.nlg.tectotpl.block.t2a.cs.projectclausenumber.
ProjectClauseNumber
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
Project clause numbering from t-nodes to a-nodes.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
class
alex.components.nlg.tectotpl.block.t2a.cs.reversenumbernoundependency.
ReverseNumberNounDependency
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
This block reverses the dependency of incongruent Czech numerals (5 and higher), hanging their parents under them in the a-tree.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
class
alex.components.nlg.tectotpl.block.t2a.cs.vocalizeprepos.
VocalizePrepos
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
This block replaces the forms of prepositions ‘k’, ‘v’, ‘z’, ‘s’ with their vocalized variants ‘ke’/’ku’, ‘ve’, ‘ze’, ‘se’ according to the following word.
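A highly simplified sketch of the vocalization rule described above: the prepositions get an epenthetic -e before words starting with a similar consonant. The actual block uses a much richer rule set (including consonant clusters and the ‘ku’ variant); the table below is an assumption for illustration only:

```python
# Simplified, illustrative trigger table -- not the real rule set.
VOCALIZE_BEFORE = {
    'k': ('k', 'g'),
    'v': ('v', 'f'),
    'z': ('z', 's'),
    's': ('s', 'z'),
}

def vocalize(prep, next_word):
    """Return the (possibly) vocalized form of a preposition."""
    first = next_word[:1].lower()
    if prep in VOCALIZE_BEFORE and first in VOCALIZE_BEFORE[prep]:
        return prep + 'e'
    return prep
```

So ‘v vlaku’ becomes ‘ve vlaku’ and ‘z zahrady’ becomes ‘ze zahrady’, while ‘k domu’ stays unchanged.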
-
class
alex.components.nlg.tectotpl.block.t2a.addauxwords.
AddAuxWords
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
Add auxiliary a-nodes according to formemes.
This is a base class for all steps adding auxiliary nodes.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
get_anode
(tnode)[source]¶ Return the a-node corresponding to the given t-node. Defaults to lexical a-node.
-
get_aux_forms
(tnode)[source]¶ This should return a list of new forms for the auxiliaries, or None if none should be added
-
new_aux_node
(aparent, form)[source]¶ Create an auxiliary node with the given surface form and parent.
-
class
alex.components.nlg.tectotpl.block.t2a.copyttree.
CopyTTree
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
This block creates an a-tree based on a t-tree in the same zone.
- Arguments:
- language: the language of the target zone
- selector: the selector of the target zone
-
class
alex.components.nlg.tectotpl.block.t2a.imposeagreement.
ImposeAgreement
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
A common ancestor for blocks that impose a grammatical agreement of some kind: they should override the should_agree(tnode), process_excepts(tnode), and impose(tnode) methods.
- Arguments:
- language: the language of the target tree
- selector: the selector of the target tree
-
class
alex.components.nlg.tectotpl.block.util.copytree.
CopyTree
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
This block is able to copy a tree on the same layer from a different zone.
- Arguments:
- language: the language of the TARGET zone
- selector: the selector of the TARGET zone
- source_language: the language of the SOURCE zone (defaults to same as target)
- source_selector: the selector of the SOURCE zone (defaults to same as target)
- layer: the layer to which this conversion should be applied
TODO: apply to more layers at once
-
class
alex.components.nlg.tectotpl.block.util.eval.
Eval
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
This block executes arbitrary Python code for each document/bundle or each zone/tree/node matching the current language and selector.
- Arguments:
- document, bundle, zone, atree, anode, ttree, tnode, ntree, nnode, ptree, pnode: code to execute for each <name of the argument>
Arguments may be combined, but at least one of them must be set. If only X<tree/node> arguments are set, language and selector are required.
-
process_bundle
(bundle)[source]¶ Process a bundle (execute code from the ‘bundle’ argument and dive deeper)
-
process_document
(doc)[source]¶ Process a document (execute code from the ‘document’ argument and dive deeper)
-
process_zone
(zone)[source]¶ Process a zone (according to language and selector; execute code from the zone or X<tree|node> arguments)
-
valid_args
= [u'document', u'doc', u'bundle', u'zone', u'atree', u'anode', u'ttree', u'tnode', u'ntree', u'nnode', u'ptree', u'pnode']¶
-
class
alex.components.nlg.tectotpl.block.write.basewriter.
BaseWriter
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.core.block.Block
Base block for output writing.
-
class
alex.components.nlg.tectotpl.block.write.yaml.
YAML
(scenario, args)[source]¶ Bases:
alex.components.nlg.tectotpl.block.write.basewriter.BaseWriter
-
default_extension
= u'.yaml'¶
-
-
class
alex.components.nlg.tectotpl.core.block.
Block
(scenario, args)[source]¶ Bases:
object
A common ancestor to all Treex processing blocks.
-
process_bundle
(bundle)[source]¶ Process a bundle. Default behavior is to process the zone according to the current language and selector.
-
-
class
alex.components.nlg.tectotpl.core.document.
Bundle
(document, data=None, b_ord=None)[source]¶ Bases:
object
Represents a bundle, i.e. a list of zones pertaining to the same sentence (in different variations).
-
create_zone
(language, selector)[source]¶ Creates a zone at the given language and selector. Will overwrite any existing zones.
-
document
¶ The document this bundle belongs to.
-
get_or_create_zone
(language, selector)[source]¶ Returns the zone for a language and selector; if it does not exist, creates an empty zone.
-
get_zone
(language, selector)[source]¶ Returns the corresponding zone for a language and selector; raises an exception if the zone does not exist.
-
has_zone
(language, selector)[source]¶ Returns True if the bundle has a zone for the given language and selector.
-
ord
¶ The order of this bundle in the document, as given by constructor
-
-
class
alex.components.nlg.tectotpl.core.document.
Document
(filename=None, data=None)[source]¶ Bases:
object
This represents a Treex document, i.e. a sequence of bundles. It contains an index of node IDs.
-
index_backref
(attr_name, source_id, target_ids)[source]¶ Keep track of a backward reference (source, target node IDs are in the direction of the original reference)
-
index_node
(node)[source]¶ Index a node by its id. Also index the node’s references in the backwards reference index.
-
-
class
alex.components.nlg.tectotpl.core.document.
Zone
(data=None, language=None, selector=None, bundle=None)[source]¶ Bases:
object
Represents a zone, i.e. a sentence and corresponding trees.
-
atree
¶ Direct access to a-tree (will raise an exception if the tree does not exist).
-
bundle
¶ The bundle in which this zone is located
-
create_tree
(layer, data=None)[source]¶ Create a tree on the given layer, filling it with the given data (if applicable).
-
document
¶ The document in which this zone is located
-
get_tree
(layer)[source]¶ Return a tree this node has on the given layer or raise an exception if the tree does not exist.
-
language_and_selector
¶ Return string concatenation of the zone’s language and selector.
-
ntree
¶ Direct access to n-tree (will raise an exception if the tree does not exist).
-
ptree
¶ Direct access to p-tree (will raise an exception if the tree does not exist).
-
ttree
¶ Direct access to t-tree (will raise an exception if the tree does not exist).
-
-
exception
alex.components.nlg.tectotpl.core.exception.
DataException
(path)[source]¶ Bases:
alex.components.nlg.tectotpl.core.exception.TreexException
Data file not found exception
-
exception
alex.components.nlg.tectotpl.core.exception.
LoadingException
(text)[source]¶ Bases:
alex.components.nlg.tectotpl.core.exception.TreexException
Block loading exception
-
exception
alex.components.nlg.tectotpl.core.exception.
RuntimeException
(text)[source]¶ Bases:
alex.components.nlg.tectotpl.core.exception.TreexException
Block runtime exception
-
exception
alex.components.nlg.tectotpl.core.exception.
ScenarioException
(text)[source]¶ Bases:
alex.components.nlg.tectotpl.core.exception.TreexException
Scenario-related exception.
-
class
alex.components.nlg.tectotpl.core.node.
A
(data=None, parent=None, zone=None)[source]¶ Bases:
alex.components.nlg.tectotpl.core.node.Node
,alex.components.nlg.tectotpl.core.node.Ordered
,alex.components.nlg.tectotpl.core.node.EffectiveRelations
,alex.components.nlg.tectotpl.core.node.InClause
Representing an a-node
-
attrib
= [(u'form', <type 'unicode'>), (u'lemma', <type 'unicode'>), (u'tag', <type 'unicode'>), (u'afun', <type 'unicode'>), (u'no_space_after', <type 'bool'>), (u'morphcat', <type 'dict'>), (u'is_parenthesis_root', <type 'bool'>), (u'edge_to_collapse', <type 'bool'>), (u'is_auxiliary', <type 'bool'>), (u'p_terminal.rf', <type 'unicode'>)]¶
-
morphcat_case
¶
-
morphcat_gender
¶
-
morphcat_grade
¶
-
morphcat_members
= [u'pos', u'subpos', u'gender', u'number', u'case', u'person', u'tense', u'negation', u'voice', u'grade', u'mood', u'possnumber', u'possgender']¶
-
morphcat_mood
¶
-
morphcat_negation
¶
-
morphcat_number
¶
-
morphcat_person
¶
-
morphcat_pos
¶
-
morphcat_possgender
¶
-
morphcat_possnumber
¶
-
morphcat_subpos
¶
-
morphcat_tense
¶
-
morphcat_voice
¶
-
ref_attrib
= [u'p_terminal.rf']¶
-
-
class
alex.components.nlg.tectotpl.core.node.
EffectiveRelations
[source]¶ Bases:
object
Representing a node with effective relations
-
attrib
= [(u'is_member', <type 'bool'>)]¶
-
get_coap_members
()[source]¶ Return the members of the coordination, if the node is a coap root. Otherwise return the node itself.
-
get_echildren
(or_topological=False, add_self=False, ordered=False, preceding_only=False, following_only=False)[source]¶ Return the effective children of the current node.
-
get_eparents
(or_topological=False, add_self=False, ordered=False, preceding_only=False, following_only=False)[source]¶ Return the effective parents of the current node.
-
is_coap_root
()[source]¶ Test whether the node is a coordination/apposition root. Must be implemented in descendants.
-
ref_attrib
= []¶
-
-
class
alex.components.nlg.tectotpl.core.node.
InClause
[source]¶ Bases:
object
Represents nodes that are organized in clauses
-
attrib
= [(u'clause_number', <type 'int'>), (u'is_clause_head', <type 'bool'>)]¶
-
ref_attrib
= []¶
-
-
class
alex.components.nlg.tectotpl.core.node.
N
(data=None, parent=None, zone=None)[source]¶ Bases:
alex.components.nlg.tectotpl.core.node.Node
Representing an n-node
-
attrib
= [(u'ne_type', <type 'unicode'>), (u'normalized_name', <type 'unicode'>), (u'a.rf', <type 'list'>)]¶
-
ref_attrib
= [u'a.rf']¶
-
-
class
alex.components.nlg.tectotpl.core.node.
Node
(data=None, parent=None, zone=None)[source]¶ Bases:
object
Representing a node in a tree (recursively)
-
attrib
= [(u'alignment', <type 'list'>), (u'wild', <type 'dict'>)]¶
-
document
¶ The document this node is a member of.
-
get_attr
(name)[source]¶ Return the value of the given attribute. Allows for dictionary nesting, e.g. ‘morphcat/gender’
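The nested-attribute convention described above (‘morphcat/gender’ drills into the ‘morphcat’ dictionary) can be sketched as a standalone helper; the data layout here is illustrative:

```python
def get_attr(data, name):
    """Follow '/'-separated keys through nested dictionaries."""
    for part in name.split('/'):
        if not isinstance(data, dict):
            return None
        data = data.get(part)
    return data
```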
-
get_attr_list
(include_types=False, safe=False)[source]¶ Get attributes of the current class (gathering all attributes of base classes)
-
get_children
(add_self=False, ordered=False, preceding_only=False, following_only=False)[source]¶ Return all children of the node
-
get_deref_attr
(name)[source]¶ This assumes the given attribute holds node id(s) and returns the corresponding node(s)
-
get_descendants
(add_self=False, ordered=False, preceding_only=False, following_only=False)[source]¶ Return all topological descendants of this node.
-
get_ref_attr_list
(split_nested=False)[source]¶ Return a list of the attributes of the current class that contain references (splitting nested ones, if needed)
-
get_referenced_ids
()[source]¶ Return all ids referenced by this node, keyed under their reference types in a hash.
-
id
¶ The unique id of the node within the document.
-
is_root
¶ Return True if this node is a root
-
parent
¶ The parent of the current node. None for roots.
-
ref_attrib
= []¶
-
remove_reference
(ref_type, refd_id)[source]¶ Remove the reference of the given type to the given node.
-
root
¶ The root of the tree this node is in.
-
set_attr
(name, value)[source]¶ Set the value of the given attribute. Allows for dictionary nesting, e.g. ‘morphcat/gender’
-
set_deref_attr
(name, value)[source]¶ This assumes the value is a node/list of nodes and sets its id/their ids as the value of the given attribute.
-
zone
¶ The zone this node belongs to.
-
-
class
alex.components.nlg.tectotpl.core.node.
Ordered
[source]¶ Bases:
object
Represents an ordered node (has an attribute called ord) and defines sorting.
-
attrib
= [(u'ord', <type 'int'>)]¶
-
is_first_node
()[source]¶ Return True if this node is the first node in the tree, i.e. has no previous nodes.
-
is_last_node
()[source]¶ Return True if this node is the last node in the tree, i.e. has no following nodes.
-
is_right_child
¶ Return True if this node has a greater ord than its parent. Returns None for a root.
-
ref_attrib
= []¶
-
shift_after_node
(other, without_children=False)[source]¶ Shift one node after another in the ordering.
-
shift_after_subtree
(other, without_children=False)[source]¶ Shift one node after the whole subtree of another node in the ordering.
-
-
class
alex.components.nlg.tectotpl.core.node.
P
(data=None, parent=None, zone=None)[source]¶ Bases:
alex.components.nlg.tectotpl.core.node.Node
Representing a p-node
-
attrib
= [(u'is_head', <type 'bool'>), (u'index', <type 'unicode'>), (u'coindex', <type 'unicode'>), (u'edgelabel', <type 'unicode'>), (u'form', <type 'unicode'>), (u'lemma', <type 'unicode'>), (u'tag', <type 'unicode'>), (u'phrase', <type 'unicode'>), (u'functions', <type 'unicode'>)]¶
-
ref_attrib
= []¶
-
-
class
alex.components.nlg.tectotpl.core.node.
T
(data=None, parent=None, zone=None)[source]¶ Bases:
alex.components.nlg.tectotpl.core.node.Node
,alex.components.nlg.tectotpl.core.node.Ordered
,alex.components.nlg.tectotpl.core.node.EffectiveRelations
,alex.components.nlg.tectotpl.core.node.InClause
Representing a t-node
-
anodes
¶ Return all anodes of a t-node
-
attrib
= [(u'functor', <type 'unicode'>), (u'formeme', <type 'unicode'>), (u't_lemma', <type 'unicode'>), (u'nodetype', <type 'unicode'>), (u'subfunctor', <type 'unicode'>), (u'tfa', <type 'unicode'>), (u'is_dsp_root', <type 'bool'>), (u'gram', <type 'dict'>), (u'a', <type 'dict'>), (u'compl.rf', <type 'list'>), (u'coref_gram.rf', <type 'list'>), (u'coref_text.rf', <type 'list'>), (u'sentmod', <type 'unicode'>), (u'is_parenthesis', <type 'bool'>), (u'is_passive', <type 'bool'>), (u'is_generated', <type 'bool'>), (u'is_relclause_head', <type 'bool'>), (u'is_name_of_person', <type 'bool'>), (u'voice', <type 'unicode'>), (u'mlayer_pos', <type 'unicode'>), (u't_lemma_origin', <type 'unicode'>), (u'formeme_origin', <type 'unicode'>), (u'is_infin', <type 'bool'>), (u'is_reflexive', <type 'bool'>)]¶
-
aux_anodes
¶
-
compl_nodes
¶
-
coref_gram_nodes
¶
-
coref_text_nodes
¶
-
gram_aspect
¶
-
gram_degcmp
¶
-
gram_deontmod
¶
-
gram_diathesis
¶
-
gram_dispmod
¶
-
gram_gender
¶
-
gram_indeftype
¶
-
gram_iterativeness
¶
-
gram_negation
¶
-
gram_number
¶
-
gram_numertype
¶
-
gram_person
¶
-
gram_politeness
¶
-
gram_resultative
¶
-
gram_sempos
¶
-
gram_tense
¶
-
gram_verbmod
¶
-
lex_anode
¶
-
ref_attrib
= [u'a/lex.rf', u'a/aux.rf', u'compl.rf', u'coref_gram.rf', u'coref_text.rf']¶
-
-
class
alex.components.nlg.tectotpl.core.run.
Scenario
(config)[source]¶ Bases:
object
This represents a scenario, i.e. a sequence of blocks to be run on the data
-
alex.components.nlg.tectotpl.core.util.
as_list
(value)[source]¶ Cast anything to a list (copy a list, convert a tuple, or wrap an atomic item in a single-element list).
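The casting behavior described above, as a standalone sketch:

```python
def as_list(value):
    """Cast anything to a list."""
    if isinstance(value, (list, tuple)):
        return list(value)   # copy a list, convert a tuple
    return [value]           # wrap an atomic item
```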
-
class
alex.components.nlg.tectotpl.tool.lexicon.cs.
Lexicon
[source]¶ Bases:
object
-
get_possessive_adj_for
(noun_lemma)[source]¶ Given a noun lemma, this returns a possessive adjective if it’s in the database.
-
has_expletive
(lemma)[source]¶ Return an expletive for a ‘že’-clause that this verb governs, or False. Lemmas must include reflexive particles for reflexiva tantum.
-
has_synthetic_future
(verb_lemma)[source]¶ Returns True if the verb builds a synthetic future tense form with the prefix ‘po-‘/’pů-‘.
-
inflect_conditional
(lemma, number, person)[source]¶ Return inflected form of a conditional particle/conjunction
-
is_coord_conj
(lemma)[source]¶ Return ‘Y’ or ‘N’ if the given lemma is a coordinating conjunction, depending on whether a comma should be written directly in front of it.
-
is_incongruent_numeral
(numeral)[source]¶ Return True if the given lemma belongs to a Czech numeral that takes a genitive attribute instead of being an attribute itself
-
is_named_entity_label
(lemma)[source]¶ Return ‘I’/’C’ if the given lemma is a named entity label (used as congruent/incongruent attribute).
Data set representation with ARFF input possibility.
- class alex.components.nlg.tectotpl.tool.ml.dataset.Attribute(name, type_spec)
  Bases: object
  This represents an attribute of the data set.
  - get_arff_type()
    Return the ARFF type of the given attribute (numeric, string, or a list of values for nominal attributes).
  - num_values
    Return the number of distinct values found in this attribute. Returns -1 for numeric attributes, where the number of values is not known.
  - numeric_value(value)
    Return a numeric representation of the given value. Raises a ValueError if the given value does not conform to the attribute type.
  - soft_numeric_value(value, add_values)
    Same as numeric_value(), but does not raise exceptions for unknown numeric/string values; it either adds the value to the list or returns NaN (depending on the add_values setting).
- class alex.components.nlg.tectotpl.tool.ml.dataset.DataSet
  Bases: object
  ARFF relation data representation.
  - DENSE_FIELD = u'([^"\\\'][^,]*|\\\'[^\\\']*(\\\\\\\'[^\\\']*)*(?<!\\\\)\\\'|"[^"]*(\\\\"[^"]*)*(?<!\\\\)"),'
  - SPARSE_FIELD = u'([0-9]+)\\s+([^"\\\'\\s][^,]*|\\\'[^\\\']*(\\\\\\\'[^\\\']*)*\\\'|"[^"]*(\\\\"[^"]*)*"),'
  - SPEC_CHARS = u'[\\n\\r\\\'"\\\\\\t%]'
  - add_attrib(attrib, values=None)
    Add a new attribute to the data set, with pre-filled values (or missing values, if not set).
  - append(other)
    Append instances from one data set to another. Their attributes must be compatible (of the same types).
  - as_bunch(target, mask_attrib=[], select_attrib=[])
    Return the data as a scikit-learn Bunch object. The target parameter specifies the class attribute.
  - as_dict(mask_attrib=[], select_attrib=[])
    Return the data as a list of dictionaries, which is useful as an input to DictVectorizer. Attributes (names or indexes) listed in mask_attrib are not added to the dictionaries, and neither are missing values. If mask_attrib is not set but select_attrib is, only the attributes listed in select_attrib are added.
  - attrib_as_vect(attrib, dtype=None)
    Return the specified attribute (by index or name) as a list of values. With the default dtype, the type of the returned values depends on the attribute type (strings for nominal or string attributes, floats for numeric ones); set dtype to int or float to override this.
  - attrib_index(attrib_name)
    Given an attribute name, return its index. Given a number, return that number unchanged. Return -1 on failure.
  - delete_attrib(attribs)
    Given a list of attributes, delete them from the data set. Accepts a list of names or indexes, a single name, or a single index.
  - filter(filter_func, keep_copy=True)
    Filter the data set using a filtering function and return the filtered data set. The filtering function must take two arguments (the current instance index and the instance itself as an attribute-value dictionary) and return a boolean. If keep_copy is set to False, the filtered instances are removed from the original data set.
  - get_headers()
    Return a copy of the headers of this data set (the attribute list, relation name, and sparse/dense setting).
  - instance(index, dtype=u'dict', do_copy=True)
    Return the given instance as a dictionary (or a list, if specified). If do_copy is set to False, no copy of the list is made for dense instances (other types must be copied anyway).
  - is_empty
    Return True if the data structures are empty.
  - load_from_arff(filename, encoding=u'UTF-8')
    Load an ARFF file/stream, filling the data structures.
  - load_from_dict(data, attrib_types={})
    Fill in values from a list of dictionaries (instances). Attributes are assumed to be of string type unless specified otherwise in the attrib_types argument. Currently only capable of creating dense data sets.
  - load_from_vect(attrib, vect)
    Fill in values from a vector of values and an attribute (allows adding values for nominal attributes).
  - match_headers(other, add_values=False)
    Force this data set to have the same headers as the other data set. This handles differing values of nominal/numeric attributes (numeric values stay the same; values unknown in the other data set are set to NaN). In other cases, such as a different number or type of attributes, an exception is thrown.
  - merge(other)
    Merge two DataSet objects. The attribute lists are concatenated. The two data sets must have the same number of instances and must be either both sparse or both dense. Instance weights are left unchanged (taken from this data set).
  - rename_attrib(old_name, new_name)
    Rename an attribute of this data set (found by its original name or by index).
  - separate_attrib(attribs)
    Given a list of attributes, delete them from the data set and return them as a new, separate data set. Accepts a list of names or indexes, a single name, or a single index.
  - split(split_func, keep_copy=True)
    Split the data set using a splitting function and return a dictionary whose keys are the distinct return values of the splitting function and whose values are data sets containing the instances that yield them. The splitting function takes two arguments (the current instance index and the instance itself as an attribute-value dictionary); its return value determines the split. If keep_copy is set to False, ALL instances are removed from the original data set.
  - subset(*args, **kwargs)
    Return a data set representing a subset of this data set's values. Args can be a slice, or [start,] stop [, stride] to create a slice; no arguments yield a complete copy of the original. Kwargs may contain just one value: if copy is set to False, the sliced values are removed from the original data set.
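The filter/split semantics described above can be sketched on plain lists of attribute-value dictionaries (a simplified stand-in for the DataSet class; the real implementation also maintains headers, sparsity and instance weights):

```python
def filter_data(instances, filter_func):
    """Keep instances for which filter_func(index, instance) returns True."""
    return [inst for idx, inst in enumerate(instances) if filter_func(idx, inst)]

def split_data(instances, split_func):
    """Group instances by the return value of split_func(index, instance)."""
    out = {}
    for idx, inst in enumerate(instances):
        out.setdefault(split_func(idx, inst), []).append(inst)
    return out
```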
- class alex.components.nlg.tectotpl.tool.ml.model.AbstractModel(config)
  Bases: object
  Abstract ancestor of the different model classes.
  - check_classification_input(instances)
    Check the classification input data format; convert it to a list if needed.
  - evaluate(test_file, encoding=u'UTF-8', classif_file=None)
    Evaluate on the given test data file and return the accuracy. If classif_file is set, save the classification results to this file.
  - get_classes(data, dtype=int)
    Return a vector of class values from the given DataSet. If dtype is int, integer values are returned; if dtype is None, string values are returned.
  - static load_from_file(model_file)
    Load the model from a pickle file or stream (supports GZip compression).
- class alex.components.nlg.tectotpl.tool.ml.model.Model(config)
  Bases: alex.components.nlg.tectotpl.tool.ml.model.AbstractModel
  - PREDICTED = u'PREDICTED'
  - construct_classifier(cfg)
    Given the configuration, construct the classifier (based on the 'classifier' or 'classifier_class'/'classifier_params' settings). Defaults to DummyClassifier.
  - static create_training_job(config, work_dir, train_file, name=None, memory=8, encoding=u'UTF-8')
    Submit a training process to the cluster which will save the model to a pickle. Return the submitted job and the future location of the model pickle. train_file cannot be a stream; it must be an actual file.
- class alex.components.nlg.tectotpl.tool.ml.model.SplitModel(config)
  Bases: alex.components.nlg.tectotpl.tool.ml.model.AbstractModel
  A model that is actually composed of several Model instances.
- class alex.components.nlg.tectotpl.tool.cluster.Job(code=None, header=u'#!/usr/bin/env python\n# coding=utf8\nfrom __future__ import unicode_literals\n', name=None, work_dir=None, dependencies=None)
  Bases: object
  This represents a piece of code as a job on the cluster; it holds information about the job and is able to retrieve job metadata. The most important method is submit(), which submits the given piece of code to the cluster.
  Important attributes (some may be set in the constructor or at job submission, but all may be set between construction and launch):
  - name – the job name on the cluster (and the name of the created Python script; a default is generated if not set)
  - code – the Python code to be run (needs to have imports and sys.path set properly)
  - header – the header of the created Python script (may contain imports etc.)
  - memory – the amount of memory to reserve for this job on the cluster
  - cores – the number of cores needed for this job
  - work_dir – the working directory where the job script will be created and run (created on launch)
  - dependencies – a list of Jobs this job depends on (they must be submitted before this job is submitted)
  In addition, the following values may be queried for each job at runtime or later:
  - submitted – True if the job has been submitted to the cluster
  - state – the current job state ('qw' = queued, 'r' = running, 'f' = finished; only if the job was submitted)
  - host – the machine where the job is running (short name)
  - jobid – the numeric id of the job in the cluster (NB: its type is string!)
  - report – the job report obtained via the qacct command (a dictionary, available only after the job has finished)
  - exit_status – the numeric job exit status (if the job is finished)
  - DEFAULT_CORES = 1
  - DEFAULT_HEADER = u'#!/usr/bin/env python\n# coding=utf8\nfrom __future__ import unicode_literals\n'
  - DEFAULT_MEMORY = 4
  - DIR_PREFIX = u'_clrun-'
  - FINISH = u'f'
  - JOBNAME_LEGAL_CHARS = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
  - NAME_PREFIX = u'pyjob_'
  - QSUB_MEMORY_CMD = u'-hard -l mem_free={0} -l act_mem_free={0} -l h_vmem={0}'
  - QSUB_MULTICORE_CMD = u'-pe smp {0}'
  - TIME_POLL_DELAY = 60
  - TIME_QUERY_DELAY = 1
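The QSUB_* constants above are format strings for the qsub command line; assembling them might look like this (a sketch of the idea only; the actual submit() code builds the full command differently in detail):

```python
QSUB_MEMORY_CMD = u'-hard -l mem_free={0} -l act_mem_free={0} -l h_vmem={0}'
QSUB_MULTICORE_CMD = u'-pe smp {0}'

def qsub_resource_args(memory_gb, cores):
    """Build the resource part of a qsub command from memory (GB) and cores."""
    args = QSUB_MEMORY_CMD.format('{0}G'.format(memory_gb))
    if cores > 1:  # multi-core jobs request a parallel environment
        args += ' ' + QSUB_MULTICORE_CMD.format(cores)
    return args
```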
  - exit_status
    Retrieve the exit status of the job via the qacct report. Throws an exception if the job is still running and the exit status is not known.
  - host
    Retrieve information about the host this job is/was running on.
  - jobid
    Return the job id.
  - name
    Return the job name.
  - report
    Access the qacct report. Please note that running the qacct command takes a few seconds, so the first access to the report is rather slow.
  - state
    Retrieve the current job state. This will also retrieve the host this job is running on and store it in the __host variable, if applicable.
A collection of helper functions for generating Czech.
- class alex.components.nlg.tools.cs.CzechTemplateNLGPostprocessing
  Bases: alex.components.nlg.template.TemplateNLGPostprocessing
  Postprocessing of filled-in NLG templates for Czech. Currently, this class only handles preposition vocalization.
A collection of helper functions for generating English.
- alex.components.nlg.tools.en.every_word_for_number(number, ordinary=False, use_coupling=False)
  Return the word for a number from 1 to 100.
  Parameters:
  - ordinary – if set to True, return the ordinal of the number ("fifth" rather than "five", etc.)
  - use_coupling – if set to True, return numbers greater than 100 with "and" between the hundreds and tens ("two hundred and seventeen" rather than "two hundred seventeen")
Self-cloning, automatic path configuration.
Copy this into any subdirectory of pypy from which scripts need to be run, typically all of the test subdirs. The idea is that any such script simply issues
import autopath
and this will make sure that the parent directory containing "pypy" is in sys.path.
If you modify the master "autopath.py" version (in pypy/tool/autopath.py), you can run it directly; it will then copy itself over all autopath.py files it finds under the pypy root directory.
This module always provides these attributes:
- pypydir – the pypy root directory path
- this_dir – the directory where this autopath.py resides
- exception alex.components.nlg.exceptions.NLGException
  Bases: alex.AlexException
- class alex.components.nlg.template.AbstractTemplateNLG(cfg)
  Bases: object
  Base abstract class for template-filling generators, providing the routines for template loading and selection. The generation (i.e. template filling) is left to the derived classes.
  It implements the following backoff strategies:
  1. match the input dialogue act exactly against the templates;
  2. if no exact match is found, try to find a generic (slot-independent) template;
  3. if no generic template is found, try to compose the output from templates for the individual dialogue act items.
  - backoff(da)
    Provide an alternative NLG template for dialogue output that is not covered by the templates. This serves as a backoff solution and should be implemented in derived classes.
  - compose_utterance_greedy(da)
    Compose an utterance from templates by iteratively looking for the longest matching sub-utterance (up to self.compose_greedy_lookahead) at the current position in the DA. Returns the composed utterance.
  - compose_utterance_single(da)
    Compose an utterance from templates for single dialogue act items. Returns the composed utterance.
  - fill_in_template(tpl, svs)
    Fill the given slot values of a dialogue act into the given template. This should be implemented in derived classes.
  - generate(da)
    Generate the natural text output for the given dialogue act. First, try to find an exact match with no variables to fill in; then try to find a relaxed match of a more generic template and fill in the actual values of the variables.
  - get_generic_da(da)
    Given a dialogue act and a list of slots and values, substitute the generic values (starting with { and ending with }) with an empty string.
  - get_generic_da_given_svs(da, svs)
    Given a dialogue act and a list of slots and values, substitute the matching slots and values with an empty string.
  - load_templates(file_name)
    Load templates from an external file, which is assumed to be Python source defining the variable 'templates' as a dictionary with stringified dialogue acts as keys and (lists of) templates as values.
  - match_and_fill_generic(da, svs)
    Match a generic template and fill in the proper values for the slots that were substituted by a generic value. Returns the output text with the proper values filled in if a generic template can be found; throws a TemplateNLGException otherwise.
  - match_generic_templates(da, svs)
    Find a matching template for a dialogue act using substitutions for slot values. Returns a matching template and a dialogue act in which the values of some slots are substituted with a generic value.
  - random_select(tpl)
    Randomly select among alternative templates for generation. The selection process is modeled by an embedded list structure (a tree-like structure): at the first level, the algorithm selects one of N items; at the second level, for every item it selects one of M, and joins them together. This continues towards the leaves, which must be non-list objects.
    The following examples illustrate the random selection options:

    { 'hello()': u"Hello", }

    This will always return the string "Hello".

    { 'hello()': (u"Hello",
                  u"Hi",
                 ),
    }

    This will return one of the strings "Hello" or "Hi".

    { 'hello()': ([ (u"Hello.",
                     u"Hi.",
                    ),
                    (u"How are you doing?",
                     u"Welcome.",
                    ),
                    u"Speak!",
                  ],
                  u"Hi my friend.",
                 ),
    }

    This will return one of the following strings: "Hello. How are you doing? Speak!", "Hi. How are you doing? Speak!", "Hello. Welcome. Speak!", "Hi. Welcome. Speak!", or "Hi my friend."
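The selection rule illustrated by these examples can be sketched as follows (tuples mean "pick one alternative", lists mean "process every item and join the results"; this is an illustrative re-implementation, not the library code):

```python
import random

def random_select(tpl):
    """Recursively select one alternative from an embedded list structure."""
    if isinstance(tpl, tuple):  # a tuple offers alternatives: pick one
        return random_select(random.choice(tpl))
    if isinstance(tpl, list):  # a list is a sequence: select within each item, then join
        return ' '.join(random_select(item) for item in tpl)
    return tpl  # a leaf (plain string) is returned as-is
```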
- class alex.components.nlg.template.TectoTemplateNLG(cfg)
  Bases: alex.components.nlg.template.AbstractTemplateNLG
  Template generation using tecto-trees and NLG rules.
- class alex.components.nlg.template.TemplateNLG(cfg)
  Bases: alex.components.nlg.template.AbstractTemplateNLG
  A simple text-replacement template NLG implementation with the ability to resort to a back-off system if no appropriate template is found.
- class alex.components.nlg.template.TemplateNLGPostprocessing
  Bases: object
  Base class for template NLG postprocessing; handles postprocessing of the text resulting from filling in a template. This base class provides no functionality; it just defines an interface for derived language-specific and/or domain-specific classes.
- class alex.components.nlg.template.TemplateNLGPreprocessing(ontology)
  Bases: object
  Base class for template NLG preprocessing; handles preprocessing of the values to be filled into a template. This base class provides no functionality; it just defines an interface for derived language-specific and/or domain-specific classes.
- class alex.components.slu.base.CategoryLabelDatabase(file_name=None)
  Bases: object
  Provides a convenient interface to a database of slot-value pairs, aka category labels.
  Attributes:
  - synonym_value_category: a list of (form, value, category label) tuples
  In an utterance:
  - there can be multiple surface forms
  - surface forms can overlap
  - a surface form can map to multiple category labels
  When detecting surface forms / category labels in an utterance:
  - find all surface forms / category labels present and, for each one found, generate a new (so-called abstracted) utterance in which the original surface form is replaced by its category label
  - instead of testing all surface forms from the CLDB against the utterance from the longest to the shortest, test all substrings of the utterance from the longest to the shortest
  - form_upnames_vals
    A list of tuples (form, upnames_vals) from the database, where upnames_vals is a dictionary {name.upper(): all values for this (form, name)}.
  - form_val_upname
    A list of tuples (form, value, name.upper()) from the database.
  - gen_form_value_cl_list()
    Generate a list of (form, value, category label) tuples from the database, ordered so that the tuples with the longest surface forms come first. Returns: None
- class alex.components.slu.base.SLUInterface(preprocessing, cfg, *args, **kwargs)
  Bases: object
  Defines a prototypical interface that each SLU parser should provide. It should be able to parse:
  - an utterance hypothesis (an instance of UtteranceHyp); output: an instance of SLUHypothesis
  - an n-best list of utterances (an instance of UtteranceNBList); output: an instance of SLUHypothesis
  - a confusion network (an instance of UtteranceConfusionNetwork); output: an instance of SLUHypothesis
  - parse_confnet(obs, n=40, *args, **kwargs)
    Parse an observation featuring a word confusion network, using the parse_nblist method.
    Arguments:
    - obs – a dictionary of observations {observation type: observed value}, where the observation type is one of the values for `obs_type' used in `ft_props', and the observed value is the corresponding observed value for the input
    - n – depth of the n-best list generated from the confusion network
    - args – further positional arguments to be passed to the `parse_1_best' method call
    - kwargs – further keyword arguments to be passed to the `parse_1_best' method call
  - parse_nblist(obs, *args, **kwargs)
    Parse an observation featuring an utterance n-best list, using the parse_1_best method.
    Arguments:
    - obs – a dictionary of observations {observation type: observed value}, where the observation type is one of the values for `obs_type' used in `ft_props', and the observed value is the corresponding observed value for the input
    - args – further positional arguments to be passed to the `parse_1_best' method call
    - kwargs – further keyword arguments to be passed to the `parse_1_best' method call
- class alex.components.slu.base.SLUPreprocessing(cldb, text_normalization=None)
  Bases: object
  Implements preprocessing of utterances, or of utterances and dialogue acts. The main purpose is to replace all values in the database with their category labels (slot names) to reduce the complexity of the input utterances.
  In addition, it implements text normalisation for SLU input, e.g. removing filler words such as UHM, UM, etc., and converting "I'm" into "I am", etc. Some normalisation is hard-coded; however, it can be extended by providing normalisation patterns.
  - normalise_confnet(confnet)
    Normalise the confusion network (the output of an ASR). E.g., remove filler words such as UHM, UM, etc., and convert "I'm" into "I am", etc.
  - normalise_nblist(nblist)
    Normalise the n-best list (the output of an ASR).
  - normalise_utterance(utterance)
    Normalise the utterance (the output of an ASR). E.g., remove filler words such as UHM, UM, etc., and convert "I'm" into "I am", etc.
  - text_normalization_mapping = [(['erm'], []), (['uhm'], []), (['um'], []), (["i'm"], ['i', 'am']), (['(sil)'], []), (['(%hesitation)'], []), (['(hesitation)'], [])]
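Each entry in the mapping replaces a token subsequence with another (possibly empty) one; applied to a tokenised utterance, the rule could be sketched like this (a simplified stand-in for normalise_utterance, using a subset of the mapping above):

```python
TEXT_NORM = [(['uhm'], []), (['um'], []), (["i'm"], ['i', 'am'])]

def normalise_tokens(tokens, mapping=TEXT_NORM):
    """Replace each pattern token sequence with its (possibly empty) substitute."""
    out = list(tokens)
    for pattern, repl in mapping:
        i = 0
        while i <= len(out) - len(pattern):
            if out[i:i + len(pattern)] == pattern:
                out[i:i + len(pattern)] = repl
                i += len(repl)  # continue after the inserted tokens
            else:
                i += 1
    return out
```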
- class alex.components.slu.cued_da.CUEDDialogueAct(da_str=None)
  Bases: alex.components.slu.da.DialogueAct
  A CUED-style dialogue act.
- class alex.components.slu.da.DialogueAct(da_str=None)
  Bases: object
  Represents a dialogue act (DA), i.e., a set of dialogue act items (DAIs).
  The DAIs are stored in the `dais' attribute, sorted w.r.t. their string representation. This class is not responsible for discarding a DAI which is repeated several times, so you can obtain a DA that looks like this:
  inform(food="chinese")&inform(food="chinese")
  Attributes:
  - dais: a list of DAIs that constitute this dialogue act
  - has_dat(dat)
    Check whether any of the dialogue act items has the specified dialogue act type.
  - has_only_dat(dat)
    Check whether all the dialogue act items have the specified dialogue act type.
  - merge(da)
    Merge another DialogueAct into self. This is done by concatenating the lists of DAIs, then sorting and merging this DA's DAIs. If sorting is not desired, use `extend' instead.
  - merge_same_dais()
    Merge identical DAIs: if they are equal in their extension but differ in original values, merge the original values together and keep a single DAI. This method causes the list of DAIs to be sorted.
- class alex.components.slu.da.DialogueActConfusionNetwork
  Bases: alex.components.slu.da.SLUHypothesis, alex.ml.hypothesis.ConfusionNetwork
  Dialogue act item confusion network. This is a very simple implementation in which all dialogue act items are assumed to be independent; the network therefore stores only posteriors for dialogue act items.
  This can be efficiently stored as a list of DAIs, each associated with its probability. The alternative for each DAI is that it is absent from the DA, which can be represented as the null() dialogue act with probability 1 - p(DAI).
  If there is more than one null() DA in the output DA, they are collapsed into a single null() DA, since they mean the same thing. Please note that in the confusion network, the null() dialogue acts are not modelled explicitly.
  - get_best_da_hyp(use_log=False, threshold=None, thresholds=None)
    Return the best dialogue act hypothesis.
    Arguments:
    - use_log: whether to express probabilities on the log scale (otherwise, they vanish easily in a moderately long confnet)
    - threshold: a threshold on probabilities; items with probability exceeding the threshold will be present in the output (default: 0.5)
    - thresholds: per-item thresholds on probabilities; items with probability exceeding their threshold will be present in the output. This is a mapping {dai: threshold} and, if supplied, overrides the `threshold' setting; if not supplied, it is ignored.
  - get_best_nonnull_da()
    Return the best dialogue act (the one with the highest probability), ignoring the best null() dialogue act item. Instead of returning the null() act, it returns the most probable DAI with a defined slot name.
  - get_da_nblist(n=10, prune_prob=0.005)
    Parse the dialogue act item confusion network and generate n-best hypotheses. The result is a list of dialogue act hypotheses, each with an assigned probability. The list also includes a dialogue act, other(), for the case that the correct dialogue act is not in the list. Generation of hypotheses stops when the probability of a hypothesis falls below prune_prob.
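Under the independence assumption, the thresholding described for get_best_da_hyp reduces to keeping every DAI whose posterior exceeds its threshold. A sketch (the function name and the plain list-of-pairs input are illustrative; the default of 0.5 follows the description above):

```python
def best_da_hyp(dai_probs, threshold=None, thresholds=None):
    """Keep DAIs whose posterior exceeds their per-item or global threshold."""
    default = 0.5 if threshold is None else threshold
    best = []
    for dai, prob in dai_probs:
        limit = thresholds.get(dai, default) if thresholds else default
        if prob > limit:
            best.append(dai)
    return best
```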
- class alex.components.slu.da.DialogueActHyp(prob=None, da=None)
  Bases: alex.components.slu.da.SLUHypothesis
  Provides 1-best hypothesis functionality for dialogue acts.
- class alex.components.slu.da.DialogueActItem(dialogue_act_type=None, name=None, value=None, dai=None, attrs=None, alignment=None)
  Bases: alex.ml.features.Abstracted
  Represents a dialogue act item (DAI), a component of a dialogue act. Each dialogue act item is composed of:
  - a dialogue act type, e.g. inform, confirm, request, select, hello
  - a slot name and value pair, e.g. area, pricerange, or food for the name, and centre, cheap, or Italian for the value
  Attributes:
  - dat: dialogue act type (a string)
  - name: slot name (a string or None)
  - value: slot value (a string or None)
  - add_unnorm_value(newval)
    Register `newval' as another alternative unnormalised value for the value of this DAI's slot.
  - alignment
  - category_label2value(catlabs=None)
    Substitute the original value back for the category label as the value of this DAI.
    Arguments:
    - catlabs: an optional mapping of category labels to (slot value, surface form) tuples, as obtained from alex.components.slu:SLUPreprocessing
    If this object does not remember its original value, it is taken from the provided mapping.
  - dat
  - extension()
    Return an extension of self, i.e., a new DialogueActItem without hidden fields such as the original value/category label.
  - merge_unnorm_values(other)
    Merge the unnormalised values of `other' into the unnormalised values of `self'.
  - name
  - normalised2value()
    Substitute an unnormalised value back for the normalised one as the value of this DAI. Returns True iff a substitution took place; returns False if no more unnormalised values are remembered as a source for the normalised value.
  - orig_values
  - splitter = u':'
  - unnorm_values
  - value
- class alex.components.slu.da.DialogueActNBList
  Bases: alex.components.slu.da.SLUHypothesis, alex.ml.hypothesis.NBList
  Provides n-best list functionality for dialogue acts. When updating the n-best list, one should:
  1. add DAs or parse a confusion network,
  2. merge and normalise, in either order.
  Attributes:
  - n_best: the list containing [prob, DA] pairs, sorted from the most probable to the least probable
  - merge()
    Add up the probabilities of identical hypotheses, taking care to keep track of the original, unnormalised DAI values. Returns self.
- class alex.components.slu.da.SLUHypothesis
  Bases: alex.ml.hypothesis.Hypothesis
  This is the base class for all forms of probabilistic SLU hypothesis representations.
- alex.components.slu.da.load_das(das_fname, limit=None, encoding=u'UTF-8')
  Load a dictionary of DAs from a given file. The file is assumed to contain lines of the following form:
  [[:space:]..]<key>[[:space:]..]=>[[:space:]..]<DA>[[:space:]..]
  or just (without keys):
  [[:space:]..]<DA>[[:space:]..]
  Arguments:
  - das_fname – path to the file to read the DAs from
  - limit – limit on the number of DAs to read
  - encoding – the file encoding
  Returns a dictionary with DAs (instances of DialogueAct) as values.
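A minimal reader for the two line formats described above might look like this (an illustrative sketch; the real load_das builds DialogueAct instances and handles file encodings and the limit argument):

```python
def parse_das_lines(lines):
    """Parse '<key> => <DA>' lines; bare '<DA>' lines are keyed by line number."""
    das = {}
    for i, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue
        if '=>' in line:
            key, da = line.split('=>', 1)
            das[key.strip()] = da.strip()
        else:
            das[str(i)] = line  # no key given: use the line number
    return das
```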
- alex.components.slu.da.merge_slu_confnets(confnet_hyps)
  Merge multiple dialogue act confusion networks.
This is a rewrite of the DAILogRegClassifier from dailrclassifier_old.py. The underlying approach is the same; however, the way the features are computed has changed significantly.
- class alex.components.slu.dailrclassifier.DAILogRegClassifier(cldb, preprocessing, features_size=4, *args, **kwargs)
  Bases: alex.components.slu.base.SLUInterface
  Implements learning of dialogue act item classifiers based on logistic regression.
  The parser is based on a set of classifiers, one for each dialogue act item. When parsing the input utterance, the parser classifies whether each dialogue act item is present; the output dialogue act is then composed of all detected dialogue act items.
  A dialogue act is defined as a composition of dialogue act items. E.g.
  confirm(drinks="wine")&inform(name="kings shilling") <=> 'does kings serve wine'
  where confirm(drinks="wine") and inform(name="kings shilling") are two dialogue act items.
  This parser uses logistic regression as the classifier for the dialogue act items.
-
abstract_utterance
(utterance)[source]¶ Return a list of possible abstractions of the utterance.
Parameters: utterance – an Utterance instance Returns: a list of abstracted utterance, form, value, category label tuples
-
gen_classifiers_data
(min_pos_feature_count=5, min_neg_feature_count=5, verbose=False, verbose2=False)[source]¶
-
get_abstract_utterance
(utterance, fvc)[source]¶ Return an utterance with the form inn fvc abstracted to its category label
Parameters: - utterance – an Utterance instance
- fvc – a form, value, category label tuple
Returns: return the abstracted utterance
-
get_abstract_utterance2
(utterance)[source]¶ Return an utterance with the form un fvc abstracted to its category label
Parameters: utterance – an Utterance instance Returns: return the abstracted utterance
-
get_features
(obs, fvc, fvcs)[source]¶ Generate utterance features for a specific utterance given by utt_idx.
Parameters: - obs – the utterance being processed in multiple formats
- fvc – a form, value category tuple describing how the utterance should be abstracted
Returns: a set of features from the utterance
-
get_features_in_utterance
(utterance, fvc, fvcs)[source]¶ Returns features extracted from the utterance observation. At this moment, the function extracts N-grams of size self.feature_size. These N-grams are extracted from:
- the original utterance,
- the abstracted utterance for the given FVC
- the abstracted where all other FVCs are abstracted as well
Parameters: - utterance –
- fvc –
Returns: the UtteranceFeatures instance
-
get_fvc
(*args, **kwds)[source]¶ This function returns the form, value, category label tuple for any of the following classses
- Utterance
- UttranceNBList
- UtteranceConfusionNetwork
Parameters: obs – the utterance being processed in multiple formats Returns: a list of form, value, and category label tuples found in the input sentence
-
get_fvc_in_confnet
(confnet)[source]¶ Return a list of all form, value, category label tuples in the confusion network.
Parameters: nblist – an UtteranceConfusionNetwork instance Returns: a list of form, value, and category label tuples found in the input sentence
-
get_fvc_in_nblist
(nblist)[source]¶ Return a list of all form, value, category label tuples in the nblist.
Parameters: nblist – an UtteranceNBList instance Returns: a list of form, value, and category label tuples found in the input sentence
-
get_fvc_in_utterance
(utterance)[source]¶ Return a list of all form, value, category label tuples in the utterance. This is useful to find/guess what category label level classifiers will be necessary to instantiate.
Parameters: utterance – an Utterance instance Returns: a list of form, value, and category label tuples found in the input sentence
-
parse_1_best
(obs={}, ret_cl_map=False, verbose=False, *args, **kwargs)[source]¶ Parse
utterance
and generate the best interpretation in the form of a dialogue act (an instance of DialogueAct). The result is a dialogue act confusion network.
-
parse_confnet
(obs, verbose=False, *args, **kwargs)[source]¶ Parses the word confusion network by generating an n-best list and parsing this n-best list.
-
-
class
alex.components.slu.dailrclassifier.
Features
[source]¶ Bases:
object
This is a simple feature object. It is a light version of the unnecessarily complicated alex.ml.features.Features class.
-
merge
(features, weight=1.0, prefix=None)[source]¶ Merges the passed feature dictionary with its own features. A weight factor can be applied to the added features, or they can be added as binary features. If a prefix is provided, the features are added under prefixed feature names.
Parameters: - features – a dictionary-like object with features as keys and values
- weight – a weight of the added features with respect to the already existing features. If None, each feature is added as a binary feature.
- prefix – a prefix for the names of the added features. This is useful when one wants to distinguish between similarly generated features.
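The weighted-merge semantics described above can be sketched over a plain dict; merge_features is a hypothetical stand-in, not the actual Features.merge implementation:

```python
def merge_features(own, other, weight=1.0, prefix=None):
    # Merge `other` into a copy of `own`: weight the incoming values,
    # or add them as binary (1.0) features when weight is None.
    merged = dict(own)
    for name, value in other.items():
        key = name if prefix is None else '%s_%s' % (prefix, name)
        contribution = 1.0 if weight is None else weight * value
        merged[key] = merged.get(key, 0.0) + contribution
    return merged
```

The prefix keeps features produced by different generators from colliding under the same name.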
-
-
class
alex.components.slu.dailrclassifier.
UtteranceFeatures
(type=u'ngram', size=3, utterance=None)[source]¶ Bases:
alex.components.slu.dailrclassifier.Features
This is a simple feature object. It is a light version of the alex.components.asr.utterance.UtteranceFeatures class.
-
exception
alex.components.slu.exceptions.
DialogueActConfusionNetworkException
[source]¶ Bases:
alex.components.slu.exceptions.SLUException
,alex.ml.hypothesis.ConfusionNetworkException
-
exception
alex.components.slu.exceptions.
SLUException
[source]¶ Bases:
alex.AlexException
-
class
alex.components.slu.templateclassifier.
TemplateClassifier
(config)[source]¶ Bases:
object
This parser is based on matching examples of utterances with known semantics against the input utterance. The semantics of the example utterance closest to the input utterance is returned as the output semantics.
“Hi” => hello()
“I can you give me a phone number” => request(phone)
“I would like to have a phone number please” => request(phone)
The first match is reported as the resulting dialogue act.
-
class
alex.components.slu.test_da.
TestDA
(methodName='runTest')[source]¶ Bases:
unittest.case.TestCase
Self-cloning, automatic path configuration.
Copy this into any subdirectory of pypy from which scripts need to be run, typically all of the test subdirs. The idea is that any such script simply issues
import autopath
and this will make sure that the parent directory containing “pypy” is in sys.path.
If you modify the master “autopath.py” version (in pypy/tool/autopath.py), you can run it directly; it will then copy itself over all autopath.py files it finds under the pypy root directory.
This module always provides these attributes:
pypydir – pypy root directory path
this_dir – directory where this autopath.py resides
-
exception
alex.components.tts.exceptions.
TTSException
[source]¶ Bases:
alex.AlexException
Module contents¶
alex.corpustools package¶
Submodules¶
alex.corpustools.asr_decode module¶
alex.corpustools.asrscore module¶
alex.corpustools.autopath module¶
alex.corpustools.cued-audio2ufal-audio module¶
alex.corpustools.cued-call-logs-sem2ufal-call-logs-sem module¶
alex.corpustools.cued-sem2ufal-sem module¶
alex.corpustools.cued module¶
This module is meant to collect functionality for handling call logs – both working with the call log files in the filesystem, and parsing them.
-
alex.corpustools.cued.
find_logs
(infname, ignore_list_file=None, verbose=False)[source]¶ Finds CUED logs below the paths specified and returns their filenames. The logs are determined as files matching one of the following patterns:
user-transcription.norm.xml
user-transcription.xml
user-transcription-all.xml
If multiple patterns are matched by files in the same directory, only the first match is taken.
- Arguments:
- infname – either a directory, or a file. In the first case, logs are looked for below that directory. In the latter case, the file is read line by line, each line specifying a directory or a glob determining the log to include.
- ignore_list_file – a file of absolute paths or globs (can be mixed) specifying logs that should be excluded from the results
- verbose – print lots of output?
Returns a set of paths to files satisfying the criteria.
-
alex.corpustools.cued.
find_wavs
(infname, ignore_list_file=None)[source]¶ Finds wavs below the paths specified and returns their filenames.
- Arguments:
- infname – either a directory, or a file. In the first case, wavs are looked for below that directory. In the latter case, the file is read line by line, each line specifying a directory or a glob determining the wav to include.
- ignore_list_file – a file of absolute paths or globs (can be mixed) specifying wavs that should be excluded from the results
Returns a set of paths to files satisfying the criteria.
-
alex.corpustools.cued.
find_with_ignorelist
(infname, pat, ignore_list_file=None, find_kwargs={})[source]¶ Finds specific files below the paths specified and returns their filenames.
- Arguments:
- pat – globbing pattern specifying the files to look for
- infname – either a directory, or a file. In the first case, files are looked for below that directory. In the latter case, the file is read line by line, each line specifying a directory or a glob determining the files to include.
- ignore_list_file – a file of absolute paths or globs (can be mixed) specifying files that should be excluded from the results
- find_kwargs – if provided, this dictionary is used as additional keyword arguments for the function `utils.fs.find' for finding positive examples of files (not the ignored ones)
Returns a set of paths to files satisfying the criteria.
alex.corpustools.cued2utt_da_pairs module¶
-
class
alex.corpustools.cued2utt_da_pairs.
TurnRecord
(transcription, cued_da, cued_dahyp, asrhyp, audio)¶ Bases:
tuple
-
asrhyp
¶ Alias for field number 3
-
audio
¶ Alias for field number 4
-
cued_da
¶ Alias for field number 1
-
cued_dahyp
¶ Alias for field number 2
-
transcription
¶ Alias for field number 0
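The field aliases above indicate that TurnRecord behaves like a namedtuple; the following sketch reproduces that behaviour (the real class may be defined differently):

```python
from collections import namedtuple

# Field order matches the alias numbers listed above.
TurnRecord = namedtuple(
    'TurnRecord', ['transcription', 'cued_da', 'cued_dahyp', 'asrhyp', 'audio'])

rec = TurnRecord('dobry den', 'hello()', 'hello()', 'dobry den', 'call-0001.wav')
```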
-
-
alex.corpustools.cued2utt_da_pairs.
extract_trns_sems
(infname, verbose, fields=None, ignore_list_file=None, do_exclude=True, normalise=True, known_words=None)[source]¶ Extracts transcriptions and their semantic annotation from a directory containing CUED call log files.
- Arguments:
- infname – either a directory, or a file. In the first case, logs are looked for below that directory. In the latter case, the file is read line by line, each line specifying a directory or a glob determining the call log to include.
- verbose – print lots of output?
- fields – names of fields that should be required for the output. Field names are strings corresponding to the element names in the transcription XML format. (default: all five of them)
- ignore_list_file – a file of absolute paths or globs (can be mixed) specifying logs that should be skipped
- normalise – whether to do normalisation on transcriptions
- do_exclude – whether to exclude transcriptions not considered suitable
- known_words – a collection of words. If provided, transcriptions that contain other words are excluded. If not provided, transcriptions that contain any of _excluded_characters are excluded. What “excluded” means depends on whether the transcriptions are required by being specified in `fields'.
Returns a list of TurnRecords.
-
alex.corpustools.cued2utt_da_pairs.
extract_trns_sems_from_file
(fname, verbose, fields=None, normalise=True, do_exclude=True, known_words=None, robust=False)[source]¶ Extracts transcriptions and their semantic annotation from a CUED call log file.
- Arguments:
- fname – path towards the call log file
- verbose – print lots of output?
- fields – names of fields that should be required for the output. Field names are strings corresponding to the element names in the transcription XML format. (default: all five of them)
- normalise – whether to do normalisation on transcriptions
- do_exclude – whether to exclude transcriptions not considered suitable
- known_words – a collection of words. If provided, transcriptions that contain other words are excluded. If not provided, transcriptions that contain any of _excluded_characters are excluded. What “excluded” means depends on whether the transcriptions are required by being specified in `fields'.
- robust – whether to assign recordings to turns robustly or trust where they are in the log. This could be useful for older CUED logs where the elements sometimes escape to another <turn> than the one they belong to. However, in cases where `robust' leads to finding the correct recording for the user turn, the log is damaged at other places too, and the resulting turn record would be misleading. Therefore, we recommend leaving robust=False.
Returns a list of TurnRecords.
alex.corpustools.cued2wavaskey module¶
Finds CUED XML files describing calls in the directory specified, extracts a couple of fields from them for each turn (transcription, ASR 1-best, semantics transcription, SLU 1-best) and outputs them to separate files in the following format:
{wav_filename} => {field}
An example ignore list file could contain the following three lines:
/some-path/call-logs/log_dir/some_id.wav
some_id.wav
jurcic-??[13579]*.wav
The first one is an example of an ignored path. On UNIX, it has to start with a slash. On other platforms, an analogous convention has to be used.
The second one is an example of a literal glob.
The last one is an example of a more advanced glob. It says basically that all odd dialogue turns should be ignored.
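The glob semantics above can be checked with the standard fnmatch module; the filenames used here are made-up examples:

```python
import fnmatch

# '??' matches two characters and '[13579]' requires an odd digit in the
# third position of the recording number, so odd turns match, even ones do not.
pattern = 'jurcic-??[13579]*.wav'
matches_odd = fnmatch.fnmatch('jurcic-003-utt1.wav', pattern)
matches_even = fnmatch.fnmatch('jurcic-004-utt1.wav', pattern)
```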
alex.corpustools.cuedda module¶
alex.corpustools.fisherptwo2ufal-audio module¶
alex.corpustools.grammar_weighted module¶
alex.corpustools.librispeech2ufal-audio module¶
alex.corpustools.lm module¶
alex.corpustools.malach-en2ufal-audio module¶
alex.corpustools.merge_uttcns module¶
alex.corpustools.num_time_stats module¶
Traverses the filesystem below a specified directory, looking for call log directories. Writes a file containing statistics about each phone number (extracted from the call log dirs’ names):
- number of calls
- total size of recorded wav files
- last expected date the caller would call
- last date the caller actually called
- the phone number
Call with -h to obtain the help for command line arguments.
2012-12-11 Matěj Korvas
alex.corpustools.recording_splitter module¶
alex.corpustools.semscore module¶
-
alex.corpustools.semscore.
score
(fn_refsem, fn_testsem, item_level=False, detailed_error_output=False, outfile=<open file '<stdout>', mode 'w'>)[source]¶
-
alex.corpustools.semscore.
score_da
(ref_da, test_da, daid)[source]¶ Computed according to http://en.wikipedia.org/wiki/Precision_and_recall
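Following the referenced precision/recall definitions, item-level scoring of two dialogue acts can be sketched as below; precision_recall is a hypothetical helper, not the actual score_da implementation:

```python
def precision_recall(ref_items, test_items):
    # Standard precision/recall/F1 over two sets of dialogue-act items.
    ref, test = set(ref_items), set(test_items)
    tp = len(ref & test)                      # items both sides agree on
    precision = tp / len(test) if test else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```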
alex.corpustools.split-asr-data module¶
alex.corpustools.srilm_ppl_filter module¶
alex.corpustools.text_norm_cs module¶
This module provides tools for CZECH normalisation of transcriptions, mainly for those obtained from human transcribers.
alex.corpustools.text_norm_en module¶
This module provides tools for ENGLISH normalisation of transcriptions, mainly for those obtained from human transcribers.
alex.corpustools.text_norm_es module¶
This module provides tools for SPANISH normalisation of transcriptions, mainly for those obtained from human transcribers.
alex.corpustools.ufal-call-logs-audio2ufal-audio module¶
alex.corpustools.ufal-transcriber2ufal-audio module¶
alex.corpustools.ufaldatabase module¶
alex.corpustools.vad-mlf-from-ufal-audio module¶
alex.corpustools.voxforge2ufal-audio module¶
alex.corpustools.wavaskey module¶
-
alex.corpustools.wavaskey.
load_wavaskey
(fname, constructor, limit=None, encoding=u'UTF-8')[source]¶ Loads a dictionary of objects stored in the “wav as key” format.
The input file is assumed to contain lines of the following form:
[[:space:]..]<key>[[:space:]..]=>[[:space:]..]<obj_str>[[:space:]..]
or just (without keys):
[[:space:]..]<obj_str>[[:space:]..]
where <obj_str> is to be given as the only argument to the `constructor' when constructing the objects stored in the file.
- Arguments:
- fname – path towards the file to read the objects from
- constructor – function that will be called on each string stored in the file and whose result will become a value of the returned dictionary
- limit – limit on the number of objects to read
- encoding – the file encoding
Returns a dictionary with objects constructed by `constructor’ as values.
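A minimal sketch of parsing the “wav as key” format described above; parse_wavaskey_lines is a hypothetical helper, not the real loader:

```python
def parse_wavaskey_lines(lines, constructor, limit=None):
    # Parse '<key> => <obj_str>' lines into a dict; without '=>', the
    # whole line serves as both key and object string.
    out = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, sep, obj_str = line.partition('=>')
        key = key.strip()
        obj_str = obj_str.strip() if sep else key
        out[key] = constructor(obj_str)
        if limit is not None and len(out) >= limit:
            break
    return out
```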
-
alex.corpustools.wavaskey.
save_wavaskey
(fname, in_dict, encoding=u'UTF-8', trans=<function <lambda>>)[source]¶ Saves a dictionary of objects in the “wav as key” format into a file.
Parameters: - fname – name of the target file
- in_dict – a dictionary with the objects, where the keys are the names of the corresponding wave files
- trans – a function which can transform a saved object
Returns: None
Module contents¶
alex.ml package¶
Subpackages¶
This module implements a factor class, which can be used to do computations with probability distributions.
-
class
alex.ml.bn.factor.
Factor
(variables, variable_values, prob_table, logarithmetic=True)[source]¶ Bases:
object
Basic factor.
-
marginalize
(keep)[source]¶ Marginalize all but specified variables.
Marginalizing means summing out values which are not in keep. The result is a new factor, which contains only variables from keep.
Example:
>>> f = Factor(['A', 'B'],
...            {'A': ['a1', 'a2'], 'B': ['b1', 'b2']},
...            {
...                ('a1', 'b1'): 0.8,
...                ('a2', 'b1'): 0.2,
...                ('a1', 'b2'): 0.3,
...                ('a2', 'b2'): 0.7
...            })
>>> result = f.marginalize(['A'])
>>> print result.pretty_print(width=30)
------------------------------
    A         Value
------------------------------
   a1           1.1
   a2           0.9
------------------------------
Parameters: keep (list of str) – Variables which should be left in marginalized factor. Returns: Marginalized factor. Return type: Factor
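The summing-out operation can be sketched over a plain-dict probability table in the linear domain (the real Factor works in log arithmetic by default); marginalize here is an illustrative stand-in:

```python
def marginalize(prob_table, variables, keep):
    # Sum out every variable not in `keep`, producing a smaller table
    # indexed only by the kept variables.
    keep_idx = [variables.index(v) for v in keep]
    result = {}
    for assignment, value in prob_table.items():
        key = tuple(assignment[i] for i in keep_idx)
        result[key] = result.get(key, 0.0) + value
    return result
```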
-
most_probable
(n=None)[source]¶ Return a list of most probable assignments from the table.
Returns a sorted list of assignment and their values according to their probability. The size of the list can be changed by specifying n.
Parameters: n (int) – The number of most probable elements, which should be returned. Returns: A list of tuples (assignment, value) in descending order. Return type: list of (tuple, float)
-
normalize
(parents=None)[source]¶ Normalize a factor table.
The table is normalized so that all elements sum to one. The parents argument is a list of names of parents. If it is specified, then only those rows of the table which share the same parent values are normalized together.
Example:
>>> f = Factor(['A', 'B'],
...            {'A': ['a1', 'a2'], 'B': ['b1', 'b2']},
...            {
...                ('a1', 'b1'): 3,
...                ('a1', 'b2'): 1,
...                ('a2', 'b1'): 1,
...                ('a2', 'b2'): 1,
...            })
>>> f.normalize(parents=['B'])
>>> print f.pretty_print(width=30)
------------------------------
    A      B      Value
------------------------------
   a1     b1      0.75
   a1     b2      0.5
   a2     b1      0.25
   a2     b2      0.5
------------------------------
Parameters: parents (list) – Parents of the factor.
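The per-parent normalisation can be sketched in the linear domain over a plain-dict table; normalize_by_parents is a hypothetical helper mirroring the doctest above:

```python
def normalize_by_parents(prob_table, variables, parents):
    # Rows sharing the same parent values are normalized together so
    # that each such group sums to one.
    idx = [variables.index(p) for p in parents]
    sums = {}
    for assignment, value in prob_table.items():
        key = tuple(assignment[i] for i in idx)
        sums[key] = sums.get(key, 0.0) + value
    return {a: v / sums[tuple(a[i] for i in idx)]
            for a, v in prob_table.items()}
```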
-
observed
(assignment_dict)[source]¶ Set observation.
Example:
>>> f = Factor(
...     ['X'],
...     {
...         'X': ['x0', 'x1'],
...     },
...     {
...         ('x0',): 0.5,
...         ('x1',): 0.5,
...     })
>>> print f.pretty_print(width=30, precision=3)
------------------------------
    X      Value
------------------------------
   x0      0.5
   x1      0.5
------------------------------
>>> f.observed({('x0',): 0.8, ('x1',): 0.2})
>>> print f.pretty_print(width=30, precision=3)
------------------------------
    X      Value
------------------------------
   x0      0.8
   x1      0.2
------------------------------
Parameters: assignment_dict (dict or None) – Observed values for different assignments of values or None.
-
pretty_print
(width=79, precision=10)[source]¶ Create a readable representation of the factor.
Creates a table with a column for each variable and value. Every row represents one assignment and its corresponding value. The default width of the table is 79 chars, to fit a terminal window.
Parameters: - width (int) – Width of the table.
- precision (int) – Precision of values.
Returns: Pretty printed factor table.
Return type: str
-
-
alex.ml.bn.factor.
from_log
(n)[source]¶ Convert number from log arithmetic.
Parameters: n (number or array like) – Number to be converted from log arithmetic. Returns: Number in decimal scale. Return type: number or array like
-
alex.ml.bn.factor.
to_log
(n, out=None)[source]¶ Convert number to log arithmetic.
We want to be able to represent zero, therefore every number smaller than epsilon is considered a zero.
Parameters: - n (number or array like) – Number to be converted.
- out (ndarray) – Output array.
Returns: Number in log arithmetic.
Return type: number or array like
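The zero-aware log conversion described above can be sketched as follows; the epsilon value is an assumption for illustration, and the real implementation likely operates on numpy arrays:

```python
import math

EPS = 1e-300  # hypothetical zero threshold; the real epsilon may differ

def to_log(n):
    # Numbers below epsilon are treated as exact zero, represented by -inf.
    return float('-inf') if n < EPS else math.log(n)

def from_log(n):
    # Inverse conversion; -inf maps back to exact zero.
    return 0.0 if n == float('-inf') else math.exp(n)
```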
Belief propagation algorithms for factor graph.
-
class
alex.ml.bn.lbp.
LBP
(strategy='sequential', **kwargs)[source]¶ Bases:
alex.ml.bn.lbp.BP
Loopy Belief Propagation.
LBP is an approximate inference algorithm for factor graphs. LBP works with generic factor graphs; on trees it performs exact inference and is equivalent to the sum-product algorithm.
It is possible to specify which strategy should be used for choosing the next node to update. The sequential strategy will update nodes in the exact order in which they were added. The tree strategy will assume the graph is a tree (without checking) and will do one pass of the sum-product algorithm.
-
exception
alex.ml.bn.lbp.
LBPError
[source]¶ Bases:
alex.ml.bn.lbp.BPError
Node representations for factor graph.
-
class
alex.ml.bn.node.
DirichletFactorNode
(name, aliases=None)[source]¶ Bases:
alex.ml.bn.node.FactorNode
Node containing a Dirichlet factor.
-
class
alex.ml.bn.node.
DirichletParameterNode
(name, alpha, aliases=None)[source]¶ Bases:
alex.ml.bn.node.VariableNode
Node containing parameter.
-
class
alex.ml.bn.node.
DiscreteFactorNode
(name, factor)[source]¶ Bases:
alex.ml.bn.node.FactorNode
Node containing factor.
-
class
alex.ml.bn.node.
DiscreteVariableNode
(name, values, logarithmetic=True)[source]¶ Bases:
alex.ml.bn.node.VariableNode
Node containing variable.
-
class
alex.ml.bn.node.
FactorNode
(name, aliases=None)[source]¶ Bases:
alex.ml.bn.node.Node
-
exception
alex.ml.bn.node.
IncompatibleNeighborError
[source]¶ Bases:
alex.ml.bn.node.NodeError
-
class
alex.ml.bn.node.
Node
(name, aliases=None)[source]¶ Bases:
object
Abstract class for nodes in factor graph.
-
class
alex.ml.bn.node.
VariableNode
(name, aliases=None)[source]¶ Bases:
alex.ml.bn.node.Node
-
class
alex.ml.ep.node.
ConstChangeGoal
(name, desc, card, parameters, parents=None)[source]¶ Bases:
alex.ml.ep.node.GroupingGoal
ConstChangeGoal implements all the functionality included in GroupingGoal; however, it assumes that there are only two transition probabilities: one for transitions between the same values and one for transitions between different values.
-
class
alex.ml.ep.node.
Goal
(name, desc, card, parameters, parents=None)[source]¶ Bases:
alex.ml.ep.node.Node
Goal can contain only the same values as the observations.
As a consequence, it can contain values of its previous node.
-
class
alex.ml.ep.node.
GroupingGoal
(name, desc, card, parameters, parents=None)[source]¶ Bases:
alex.ml.ep.node.GroupingNode
,alex.ml.ep.node.Goal
GroupingGoal implements all the functionality included in Goal; however, it only updates the values for which some evidence was observed.
-
class
alex.ml.ep.node.
GroupingNode
(name, desc, card)[source]¶ Bases:
alex.ml.ep.node.Node
-
class
alex.ml.ep.node.
Node
(name, desc, card)[source]¶ Bases:
object
A base class for all nodes in a belief state.
-
getMostProbableValue
()[source]¶ The function returns the most probable value and its probability in a tuple.
-
-
class
alex.ml.gmm.gmm.
GMM
(n_features=1, n_components=1, thresh=0.001, min_covar=0.001, n_iter=1)[source]¶ This is a GMM model of the input data. It is memory-efficient, so it can process very large array-like input objects.
Mixture components are added incrementally by splitting the heaviest component in two and perturbing the original mean.
-
class
alex.ml.gmm.
GMM
(n_features=1, n_components=1, thresh=0.001, min_covar=0.001, n_iter=1)[source]¶ This is a GMM model of the input data. It is memory-efficient, so it can process very large array-like input objects.
Mixture components are added incrementally by splitting the heaviest component in two and perturbing the original mean.
-
class
alex.ml.lbp.node.
DiscreteFactor
(name, desc, prob_table)[source]¶ Bases:
alex.ml.lbp.node.Factor
This is a base class for discrete factor nodes in the Bayesian Network.
It can work with a full conditional probability table defined by the provided prob_table function.
The variables must be attached in the same order as the parameters of the prob_table function.
-
class
alex.ml.lbp.node.
DiscreteNode
(name, desc, card, observed=False)[source]¶ Bases:
alex.ml.lbp.node.VariableNode
This is a class for all nodes with discrete/enumerable values.
The probabilities are stored in log format.
-
explain
(full=False, linear_prob=False)[source]¶ This function prints the values and their probabilities for this node.
-
get_most_probable_value
()[source]¶ The function returns the most probable value and its probability in a tuple.
-
get_output_message
(factor)[source]¶ Returns output messages from this node to the given factor.
This is done by subtracting the input log message from the given factor node from the current estimate log probabilities in this node.
-
-
class
alex.ml.lbp.node.
Factor
(name, desc)[source]¶ Bases:
alex.ml.lbp.node.GenericNode
This is a base class for all factor nodes in the Bayesian Network.
Submodules¶
alex.ml.exceptions module¶
-
exception
alex.ml.exceptions.
FFNNException
[source]¶ Bases:
alex.AlexException
-
exception
alex.ml.exceptions.
NBListException
[source]¶ Bases:
alex.AlexException
alex.ml.features module¶
This module contains generic code for working with feature vectors (or, in general, collections of features).
-
class
alex.ml.features.
Abstracted
[source]¶ Bases:
object
-
instantiate
(type_, value, do_abstract=False)[source]¶ Example: Let self represent
da1(a1=T1:v1)&da2(a2=T2:v2)&da3(a3=T1:v3).
Calling self.instantiate("T1", "v1") results in
da1(a1=T1)&da2(a2=v2)&da3(a3=v3) ...if do_abstract == False
da1(a1=T1)&da2(a2=v2)&da3(a3=T1_other) ...if do_abstract == True
Calling self.instantiate("T1", "x1") results in
da1(a1=x1)&da2(a2=v2)&da3(a3=v3) ...if do_abstract == False
da1(a1=T1_other)&da2(a2=v2)&da3(a3=T1_other) ...if do_abstract == True
-
iter_typeval
()[source]¶ Iterates the abstracted items in self, yielding combined representations of the type and value of each such token. An abstract method of this class.
-
other_val
= '[OTHER]'¶
-
splitter
= '='¶
-
-
class
alex.ml.features.
Features
(*args, **kwargs)[source]¶ Bases:
object
A mostly abstract class representing features of an object.
- Attributes:
- features – mapping of the features to their values
- set – set of the features
-
get_feature_coords_vals
(feature_idxs)[source]¶ Builds the feature vector based on the provided mapping of features onto their indices. Returns the vector as two lists, one of feature coordinates, one of feature values.
- Arguments:
- feature_idxs: a mapping { feature : feature index }
-
get_feature_vector
(feature_idxs)[source]¶ Builds the feature vector based on the provided mapping of features onto their indices.
- Arguments:
- feature_idxs: a mapping { feature : feature index }
-
classmethod
join
(feature_sets, distinguish=True)[source]¶ Joins a number of sets of features, keeping them distinct.
- Arguments:
- distinguish – whether to treat the feature sets as of different types (distinguish=True) or just merge features from them by adding their values (distinguish=False). Default is True.
Returns a new instance of JoinedFeatures.
-
class
alex.ml.features.
JoinedFeatures
(feature_sets)[source]¶ Bases:
alex.ml.features.Features
JoinedFeatures are indexed by tuples (feature_sets_index, feature) where feature_sets_index selects the required set of features. Sets of features are numbered with the same indices as they had in the list used to initialise JoinedFeatures.
- Attributes:
- features – mapping { (feature_set_index, feature) : value of feature }
- set – set of the (feature_set_index, feature) tuples
- generic – mapping { (feature_set_index, abstracted_feature) : generic_feature }
- instantiable – mapping { feature : generic part of feature } for features from self.features.keys() that are abstracted
alex.ml.ffnn module¶
-
class
alex.ml.ffnn.
FFNN
[source]¶ Bases:
object
Implements a simple feed-forward neural network with:
- input layer – linear activation function
- hidden layers – tanh activation function
- output layer – softmax activation function
-
add_layer
(w, b)[source]¶ Add next layer into the network.
Parameters: - w – next layer weights
- b – next layer biases
Returns: none
-
load
(file_name)[source]¶ Loads a saved NN.
Parameters: file_name – file name of the saved NN Returns: None
-
predict
(input)[source]¶ Returns the output of the last layer.
As it is the output of a layer with a softmax activation function, the output is a vector of probabilities of the classes being predicted.
Parameters: input – input vector for the first NN layer. Returns: the output of the last activation layer
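The forward pass for the architecture above can be sketched as follows; ffnn_predict and its (weights, biases) layer representation are illustrative assumptions, not the actual FFNN API:

```python
import math

def ffnn_predict(layers, x):
    # Forward pass: linear input, tanh hidden layers, softmax output.
    # `layers` is a list of (weights, biases) pairs per layer.
    for i, (w, b) in enumerate(layers):
        z = [sum(wij * xj for wij, xj in zip(row, x)) + bi
             for row, bi in zip(w, b)]
        if i < len(layers) - 1:
            x = [math.tanh(v) for v in z]          # hidden layer
        else:
            m = max(z)                             # numerically stable softmax
            exps = [math.exp(v - m) for v in z]
            x = [e / sum(exps) for e in exps]
    return x
```

The softmax output sums to one, so it can be read directly as class probabilities.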
-
alex.ml.hypothesis module¶
This module collects classes representing the uncertainty about the actual value of a base type instance.
-
class
alex.ml.hypothesis.
ConfusionNetwork
[source]¶ Bases:
alex.ml.hypothesis.Hypothesis
Confusion network. In this representation, each fact breaks down into a sequence of elementary acts.
-
add_merge
(p, fact, combine=u'max')[source]¶ Add a fact and if it exists merge it according to the given combine strategy.
-
classmethod
from_fact
(fact)[source]¶ Constructs a deterministic confusion network that asserts the given `fact’. Note that `fact’ has to be an iterable of elementary acts.
-
merge
(conf_net, combine=u'max')[source]¶ Merges facts in the current and the given confusion networks.
- Arguments:
- combine – can be one of {‘new’, ‘max’, ‘add’, ‘arit’, ‘harm’}, and determines how two probabilities should be merged (default: ‘max’)
XXX As of now, we know that different values for the same slot are contradictory (and in general, the set of contradicting attr-value pairs could be larger). We should therefore consider them alternatives to each other.
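Plausible interpretations of the combine strategies named above, reading ‘arit’ and ‘harm’ as arithmetic and harmonic means; the exact formulas in alex may differ:

```python
def combine_probs(p_old, p_new, combine='max'):
    # Merge two probabilities for the same fact under the named strategy.
    if combine == 'new':
        return p_new
    if combine == 'max':
        return max(p_old, p_new)
    if combine == 'add':
        return p_old + p_new
    if combine == 'arit':
        return (p_old + p_new) / 2.0
    if combine == 'harm':
        return 0.0 if p_old + p_new == 0 else 2.0 * p_old * p_new / (p_old + p_new)
    raise ValueError('unknown combine strategy: %s' % combine)
```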
-
-
class
alex.ml.hypothesis.
Hypothesis
[source]¶ Bases:
object
This is the base class for all forms of probabilistic hypotheses representations.
-
class
alex.ml.hypothesis.
NBList
[source]¶ Bases:
alex.ml.hypothesis.Hypothesis
This class represents the uncertainty using an n-best list.
When updating an N-best list, one should do the following.
- add utterances or parse a confusion network
- merge and normalise, in either order
-
add
(probability, fact)[source]¶ Finds the last hypothesis with a lower probability and inserts the new item before that one. Optimised for adding objects from the highest probability ones to the lowest probability ones.
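The insertion logic described above can be sketched over a plain list of (probability, fact) pairs kept in descending order; nblist_add is a hypothetical stand-in for NBList.add:

```python
def nblist_add(hyps, probability, fact):
    # Scan from the end: appends of low-probability items (the common
    # case the docstring optimises for) terminate immediately.
    i = len(hyps)
    while i > 0 and hyps[i - 1][0] < probability:
        i -= 1
    hyps.insert(i, (probability, fact))
```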
alex.ml.logarithmetic module¶
-
alex.ml.logarithmetic.
add
(a, b)[source]¶ Computes pairwise addition of two vectors in the log domain.
This is equivalent to [a1+b1, a2+b2, ...] in the linear domain.
-
alex.ml.logarithmetic.
devide
(a, b)[source]¶ Computes pairwise division between vectors a and b in the log domain.
This is equivalent to [a1/b1, a2/b2, ...] in the linear domain.
-
alex.ml.logarithmetic.
dot
(a, b)[source]¶ Computes dot product in the log domain.
This is equivalent to a1*b1+a2*b2+... in the linear domain.
-
alex.ml.logarithmetic.
linear_to_log
(a)[source]¶ Converts a vector from the linear domain to the log domain.
-
alex.ml.logarithmetic.
log_to_linear
(a)[source]¶ Converts a vector from the log domain to the linear domain.
-
alex.ml.logarithmetic.
multiply
(a, b)[source]¶ Computes pairwise multiplication between vectors a and b in the log domain.
This is equivalent to [a1*b1, a2*b2, ...] in the linear domain.
-
alex.ml.logarithmetic.
normalise
(a)[source]¶ Normalises the input probability vector to sum to one in the log domain.
This is equivalent to a/sum(a) in the linear domain.
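Log-domain normalisation rests on a stable log-sum-exp; a minimal sketch (log_add and log_normalise are hypothetical helpers, not the module's actual functions):

```python
import math

NEG_INF = float('-inf')

def log_add(x, y):
    # Stable log(exp(x) + exp(y)); -inf represents exact zero.
    if x == NEG_INF:
        return y
    if y == NEG_INF:
        return x
    m = max(x, y)
    return m + math.log(math.exp(x - m) + math.exp(y - m))

def log_normalise(log_a):
    # Equivalent to a / sum(a) in the linear domain.
    total = NEG_INF
    for v in log_a:
        total = log_add(total, v)
    return [v - total for v in log_a]
```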
alex.ml.test_hypothesis module¶
alex.ml.tffnn module¶
Module contents¶
alex.tests package¶
Submodules¶
alex.tests.autopath module¶
alex.tests.test_asr_google module¶
alex.tests.test_mproc module¶
alex.tests.test_numpy_with_optimised_ATLAS module¶
alex.tests.test_pyaudio module¶
alex.tests.test_tts_flite_en module¶
alex.tests.test_tts_google_cs module¶
alex.tests.test_tts_google_en module¶
alex.tests.test_tts_voice_rss_en module¶
Module contents¶
alex.tools package¶
Subpackages¶
Submodules¶
alex.tools.apirequest module¶
-
class
alex.tools.apirequest.
APIRequest
(cfg, fname_prefix, log_elem_name)[source]¶ Bases:
object
Handles functions related to web API requests (logging).
alex.tools.autopath module¶
Module contents¶
alex.utils package¶
Submodules¶
alex.utils.analytics module¶
alex.utils.audio module¶
alex.utils.audio_play module¶
alex.utils.autopath module¶
Self-cloning, automatic path configuration.
Copy this into any subdirectory of pypy from which scripts need to be run, typically all of the test subdirs. The idea is that any such script simply issues
    import autopath
and this will make sure that the parent directory containing "pypy" is in sys.path.
If you modify the master "autopath.py" version (in pypy/tool/autopath.py), you can run it directly; it will then copy itself over all autopath.py files it finds under the pypy root directory.
This module always provides these attributes:
    pypydir: pypy root directory path
    this_dir: directory where this autopath.py resides
alex.utils.cache module¶
alex.utils.cache.lfu_cache(maxsize=100)[source]¶
    Least-frequently-used cache decorator.
    Arguments to the cached function must be hashable. Cache performance statistics are stored in f.hits and f.misses. Clear the cache with f.clear(). http://en.wikipedia.org/wiki/Least_Frequently_Used
alex.utils.cache.lru_cache(maxsize=100)[source]¶
    Least-recently-used cache decorator.
    Arguments to the cached function must be hashable. Cache performance statistics are stored in f.hits and f.misses. Clear the cache with f.clear(). http://en.wikipedia.org/wiki/Cache_algorithms#Least_Recently_Used
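As a sketch of how such a decorator can track hits and misses, here is a minimal least-recently-used cache in plain Python. This is an illustration of the idea only; the actual alex.utils.cache implementation may differ in its details.

```python
import functools

def lru_cache(maxsize=100):
    """Sketch of an LRU cache decorator that exposes f.hits, f.misses
    and f.clear(), in the spirit of alex.utils.cache.lru_cache."""
    def decorator(func):
        cache = {}   # maps argument tuples to results
        order = []   # keys ordered by last use, oldest first

        @functools.wraps(func)
        def wrapper(*args):
            if args in cache:
                wrapper.hits += 1
                order.remove(args)      # move key to the most-recent position
                order.append(args)
                return cache[args]
            wrapper.misses += 1
            result = func(*args)
            cache[args] = result
            order.append(args)
            if len(order) > maxsize:    # evict the least recently used key
                oldest = order.pop(0)
                del cache[oldest]
            return result

        def clear():
            cache.clear()
            del order[:]
            wrapper.hits = wrapper.misses = 0

        wrapper.hits = wrapper.misses = 0
        wrapper.clear = clear
        return wrapper
    return decorator
```

The lfu variant differs only in the eviction policy: it would evict the key with the lowest use count instead of the least recently used one.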
alex.utils.caminfodb module¶
alex.utils.config module¶
class alex.utils.config.Config(file_name=None, project_root=False, config={})[source]¶
    Bases: object
    Config handles the configuration data necessary for all the components in Alex. It is implemented using a dictionary so that any component can use arbitrarily structured configuration data.
    Before the configuration file is loaded, it is transformed as follows:
    - '{cfg_abs_path}' as a string anywhere in the file is replaced by the absolute path of the configuration file. This can be used to make the configuration file independent of the location of the programs that use it.
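The '{cfg_abs_path}' substitution described above can be sketched as a plain string replacement. Note that expand_cfg_abs_path is a hypothetical helper name used only for illustration; it is not part of the Alex API.

```python
import os

def expand_cfg_abs_path(cfg_text, cfg_path):
    """Sketch of the pre-load transformation: replace the literal
    '{cfg_abs_path}' placeholder with the absolute directory of the
    config file, so paths in the config stay location-independent."""
    cfg_dir = os.path.dirname(os.path.abspath(cfg_path))
    return cfg_text.replace('{cfg_abs_path}', cfg_dir)
```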
    DEFAULT_CFG_PPATH = u'resources/default.cfg'¶

    config_replace(p, s, d=None)[source]¶
        Replaces the pattern p with the string s in the whole config (recursively), or in the part of the config given in d.
    classmethod load_configs(config_flist=[], use_default=True, log=True, *init_args, **init_kwargs)[source]¶
        Loads and merges configs from the paths listed in `config_flist'. Use this method instead of loading configs directly, as it takes care not only of merging them but also of processing some options in a special way.
        Arguments:
        - config_flist – list of paths to config files to load and merge; order matters (default: [])
        - use_default – whether to insert the default config ($ALEX/resources/default.cfg) at the beginning of `config_flist' (default: True)
        - log – whether to log the resulting config using the system logger (default: True)
        - init_args – additional positional arguments passed to the constructor of each config
        - init_kwargs – additional keyword arguments passed to the constructor of each config
    merge(other)[source]¶
        Merges other's config into self's config and saves the result as self's new config.
        Keyword arguments:
        - other – a Config object whose configuration dictionary to merge into self's one

    unfold_lists(pattern, unfold_id_key=None, part=[])[source]¶
        Unfolds lists under keys matching the given pattern into several config objects, each containing one item. If pattern is None, all lists are expanded.
        Stores a string representation of the individual unfolded values under unfold_id_key if this parameter is set.
        Only expands the part of the whole config hash (given by a list of keys forming the path to this part) if the part parameter is set.
alex.utils.config.callback_download_progress(blocks, block_size, total_size)[source]¶
    Callback function for urlretrieve that is called once when the connection is created and then once for each block transferred.
    Parameters:
    - blocks – number of blocks transferred so far
    - block_size – in bytes
    - total_size – in bytes; can be -1 if the server does not return it
alex.utils.config.load_as_module(path, force=False, encoding=u'UTF-8', text_transforms=[])[source]¶
    Loads the file pointed to by `path' as a Python module with minimal impact on the global program environment. The file name should end in '.py'.
    Arguments:
    - path – path to the file
    - force – whether to load the file even if its name does not end in '.py'
    - encoding – character encoding of the file
    - text_transforms – collection of functions to be run on the original file text
    Returns the loaded module object.
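A minimal sketch of this behaviour using the Python 3 importlib API. The real implementation also applies text_transforms and handles the force and encoding arguments, which are omitted here; the module_name parameter is an assumption of this sketch.

```python
import importlib.util

def load_as_module(path, module_name="loaded_cfg"):
    """Sketch: load a Python source file as a module object without
    registering it in sys.modules, keeping the impact on the global
    program environment minimal."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)   # run the file's code in the module's namespace
    return module
```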
alex.utils.config.online_update(file_name)[source]¶
    Downloads the file from a default server if it is not available locally. The default server location can be changed in the config file.
    The original file name is transformed into an absolute name using the as_project_path function.
    Parameters: file_name – the name of the file to download from the server
    Returns: the file name of the local copy of the file downloaded from the server
alex.utils.config.set_online_update_server(server_name)[source]¶
    Sets the name of the online update server. This function can be used to change the server name from inside a config file.
    Parameters: server_name – the HTTP(S) path to the server and the location where the desired data reside
    Returns: None
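The download-if-missing behaviour of online_update can be sketched as below. The server URL here is a made-up placeholder; the real function resolves file names via as_project_path and reads the server location from the config rather than from an argument.

```python
import os
import urllib.request

# Hypothetical server URL for illustration only; in Alex the server is
# configured via set_online_update_server() in a config file.
DEFAULT_SERVER = "https://example.com/alex/resources/"

def online_update(file_name, server=DEFAULT_SERVER):
    """Sketch: return the local path to `file_name`, fetching it from
    the server only when no local copy exists yet."""
    if not os.path.exists(file_name):
        # Make sure the target directory exists before downloading.
        os.makedirs(os.path.dirname(file_name) or ".", exist_ok=True)
        urllib.request.urlretrieve(server + file_name, file_name)
    return file_name
```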
alex.utils.cuda module¶
alex.utils.czech_stemmer module¶
Czech stemmer. Copyright © 2010 Luís Gomes <luismsgomes@gmail.com>.
Ported from the Java implementation available at http://members.unine.ch/jacques.savoy/clef/index.html
alex.utils.enums module¶
alex.utils.env module¶
alex.utils.excepthook module¶
Depending on the hook_type, the ExceptionHook class adds various hooks for catching exceptions.

class alex.utils.excepthook.ExceptionHook(hook_type, logger=None)[source]¶
    Bases: object
    Singleton object for registering various hooks for sys.excepthook. For registering a hook, use set_hook.

    apply()[source]¶
        The object can be used to store settings for excepthook:
            a = ExceptionHook('log')   # now it logs
            b = ExceptionHook('ipdb')  # now it uses ipdb
            a.apply()                  # now it logs again

    logger = None¶
alex.utils.exceptions module¶
exception alex.utils.exceptions.ConfigException[source]¶
    Bases: alex.AlexException

exception alex.utils.exceptions.SessionClosedException[source]¶
    Bases: alex.AlexException

exception alex.utils.exceptions.SessionLoggerException[source]¶
    Bases: alex.AlexException
alex.utils.filelock module¶
Context manager for locking on a file. Obtained from http://www.evanfosmark.com/2009/01/cross-platform-file-locking-support-in-python/, licensed under BSD.
This is thought to work safely on NFS too, in contrast to fcntl.flock(). It is also thought to work safely over SMB and elsewhere, in contrast to fcntl.lockf(). For both issues, consult http://oilq.org/fr/node/13344.
Use as simply as:
    with FileLock(filename):
        <critical section for working with the file at `filename'>

class alex.utils.filelock.FileLock(file_name, timeout=10, delay=0.05)[source]¶
    Bases: object
    A file locking mechanism that has context-manager support so you can use it in a with statement. This should be relatively portable as it does not rely on msvcrt or fcntl for the locking.
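A minimal sketch of such a lock: atomically creating a sidecar file with O_CREAT | O_EXCL serves as the lock, so neither fcntl nor msvcrt is needed. This illustrates the technique, not the exact Alex implementation.

```python
import errno
import os
import time

class FileLock(object):
    """Sketch of a portable file lock with context-manager support,
    modelled on alex.utils.filelock.FileLock."""

    def __init__(self, file_name, timeout=10, delay=0.05):
        self.lockfile = file_name + ".lock"
        self.timeout = timeout
        self.delay = delay
        self.fd = None

    def __enter__(self):
        start = time.time()
        while True:
            try:
                # O_CREAT | O_EXCL fails atomically if the lock file exists,
                # which is what makes this safe across processes.
                self.fd = os.open(self.lockfile,
                                  os.O_CREAT | os.O_EXCL | os.O_RDWR)
                return self
            except OSError as e:
                if e.errno != errno.EEXIST:
                    raise
                if time.time() - start >= self.timeout:
                    raise RuntimeError("timeout waiting for %s" % self.lockfile)
                time.sleep(self.delay)

    def __exit__(self, *exc):
        os.close(self.fd)
        os.unlink(self.lockfile)   # releasing the lock = removing the file
```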
alex.utils.fs module¶
Filesystem utility functions.
class alex.utils.fs.GrepFilter(stdin, stdout, breakchar=u'\n')[source]¶
    Bases: multiprocessing.process.Process

    add_listener(regex, callback)[source]¶
        Adds a listener for the output strings.
        Arguments:
        - regex – the compiled regular expression to look for (`regex.search') in any piece of output
        - callback – a callable that is invoked for output where `regex' was found. It will be called like this:
              outputting &= callback(output_unicode_str)
          That means callback should take the unicode string argument containing what would have been output, and return a boolean which is True iff outputting should stop.
        Returns the index of the listener for later reference.
alex.utils.fs.find(dir_, glob_, mindepth=2, maxdepth=6, ignore_globs=[], ignore_paths=None, follow_symlinks=True, prune=False, rx=None, notrx=None)[source]¶
    A simplified version of the GNU `find' utility. Lists files with basename matching `glob_' found in `dir_' at depth between `mindepth' and `maxdepth'.
The `ignore_globs’ argument specifies a glob for basenames of files to be ignored. The `ignore_paths’ argument specifies a collection of real absolute pathnames that are pruned from the search. For efficiency reasons, it should be a set.
In the current implementation, the traversal resolves symlinks before the file name is checked. However, taking symlinks into account can be forbidden altogether by specifying `follow_symlinks=False’. Cycles during the traversal are avoided.
The returned set of files consists of real absolute pathnames of those files.
alex.utils.htk module¶
class alex.utils.htk.MLF(file_name=None, max_files=None)[source]¶
    Reads HTK MLF files.
    Def: a segment is a sequence of frames with the same label.
class alex.utils.htk.MLFFeaturesAlignedArray(filter=None)[source]¶
    Creates an array-like object from multiple MLF files and the corresponding audio data. For each aligned frame it returns a feature vector and its label.
    If filter is set to a particular value, then only frames with a label equal to the filter will be returned. In this case, the label is not returned when iterating through the array.
class alex.utils.htk.MLFMFCCOnlineAlignedArray(windowsize=250000, targetrate=100000, filter=None, usec0=False, usedelta=True, useacc=True, n_last_frames=0, mel_banks_only=False)[source]¶
    Bases: alex.utils.htk.MLFFeaturesAlignedArray
    This is an extension of MLFFeaturesAlignedArray which computes the features on the fly from the input wav files.
    It uses our own implementation of the MFCC computation. As a result, it does not give the same results as the HTK HCopy tool.
    Experience suggests that our MFCC features are worse than the features generated by HCopy.
alex.utils.interface module¶
alex.utils.lattice module¶
alex.utils.mfcc module¶
class alex.utils.mfcc.MFCCFrontEnd(sourcerate=16000, framesize=512, usehamming=True, preemcoef=0.97, numchans=26, ceplifter=22, numceps=12, enormalise=True, zmeansource=True, usepower=True, usec0=True, usecmn=False, usedelta=True, useacc=True, n_last_frames=0, lofreq=125, hifreq=3800, mel_banks_only=False)[source]¶
    This is a CLOSE approximation of the MFCC coefficients computed by HTK.
    The frame size should be a power of 2.
    TODO: CMN is not implemented. It should normalise only the cepstrum, not the delta or acceleration coefficients.
    It was not tested to give exactly the same results as HTK. As a result, it should not be used in conjunction with models trained on speech parametrised with HTK.
    Overall, this implementation of MFCC appears to be worse than the one in HTK. On the VAD task, the HTK features score 90.8% while these features score only 88.7%.
class alex.utils.mfcc.MFCCKaldi(sourcerate=16000, framesize=512, usehamming=True, preemcoef=0.97, numchans=26, ceplifter=22, numceps=12, enormalise=True, zmeansource=True, usepower=True, usec0=True, usecmn=False, usedelta=True, useacc=True, n_last_frames=0, lofreq=125, hifreq=3800, mel_banks_only=False)[source]¶
    TODO: Port the Kaldi MFCC to Python. Use parameters similar to those suggested in the __init__ function.
alex.utils.mproc module¶
Implements useful classes for handling multiprocessing implementation of the Alex system.
class alex.utils.mproc.InstanceID[source]¶
    Bases: object
    This class provides unique ids to all instances of objects inheriting from this class.

    instance_id = <Synchronized wrapper for c_int(0)>¶

    lock = <Lock(owner=None)>¶
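The idea behind InstanceID can be sketched with a process-shared counter guarded by a lock. This is illustrative only; get_instance_id is a hypothetical method name, not necessarily the one used by Alex.

```python
import multiprocessing

class InstanceID(object):
    """Sketch: a process-safe counter that hands out a unique id to
    every instance of any class inheriting from it."""
    # Shared across processes: a synchronized c_int and a lock, matching
    # the attribute types shown in the documentation above.
    instance_id = multiprocessing.Value('i', 0)
    lock = multiprocessing.Lock()

    def get_instance_id(self):
        with InstanceID.lock:
            InstanceID.instance_id.value += 1
            return InstanceID.instance_id.value
```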
class alex.utils.mproc.SystemLogger(output_dir, stdout_log_level='DEBUG', stdout=True, file_log_level='DEBUG')[source]¶
    Bases: object
    This is a multiprocessing-safe logger. It should be used by all components in Alex.

    get_session_dir_name(*args, **kw)[source]¶
        Returns the directory where all the call-related files should be stored.

    get_time_str()[source]¶
        Returns the current time in a dashed ISO-like format.
        It is useful for constructing file and directory names.

    levels = {'INFO': 20, 'CRITICAL': 40, 'EXCEPTION': 50, 'SYSTEM-LOG': 0, 'WARNING': 30, 'ERROR': 60, 'DEBUG': 10}¶

    lock = <RLock(None, 0)>¶

    log(*args, **kw)[source]¶
        Logs the message based on its level and the logging settings. It locks the logging file before writing into it.

    session_end(*args, **kw)[source]¶
        WARNING: Deprecated. Disables logging into the session-specific directory.
        We had better not end a session, because very often messages still arrive after the session_end() method is called. It is therefore better to wait for the session_start() method to set a new destination for the session log.

    session_start(*args, **kw)[source]¶
        Creates a specific directory for logging a specific call.
        NOTE: This is not completely safe. It can be called from several processes.
alex.utils.mproc.async(func)[source]¶
    A function decorator intended to make func run in a separate thread (asynchronously). Returns the created Thread object.
    E.g.:
        @async
        def task1():
            do_something()

        @async
        def task2():
            do_something_too()

        t1 = task1()
        t2 = task2()
        ...
        t1.join()
        t2.join()
alex.utils.mproc.etime(name='Time', min_t=0.3)[source]¶
    This decorator measures the execution time of the decorated function.
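A decorator of this kind can be sketched as follows. This is an illustration; the real etime may report the timing in a different format.

```python
import functools
import time

def etime(name='Time', min_t=0.3):
    """Sketch of an execution-time decorator: print the wall-clock time
    of the wrapped function whenever it exceeds min_t seconds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            elapsed = time.time() - start
            if elapsed >= min_t:   # stay quiet for fast calls
                print("%s (%s): %.3f s" % (name, func.__name__, elapsed))
            return result
        return wrapper
    return decorator
```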
alex.utils.mproc.file_lock(file_name)[source]¶
    Multiprocessing lock using files. Locks on a specific file.

alex.utils.mproc.file_unlock(lock_file)[source]¶
    Multiprocessing lock using files. Unlocks on a specific file.
alex.utils.nose_plugins module¶
alex.utils.parsers module¶
class alex.utils.parsers.CamTxtParser(lower=False)[source]¶
    Bases: object
    Parser for files of the following format:
        <<BOF>>
        [record]
        [record]
        ...
        <<EOF>>
    where [record] has the following format:
        <<[record]>>
        [property name]([property value])
        <</[record]>>
    [property name] and [property value] are arbitrary strings. Any " or ' characters are stripped from the beginning and end of each [property value].

    line_expr = <_sre.SRE_Pattern object>¶
alex.utils.procname module¶
alex.utils.rdb module¶
alex.utils.sessionlogger module¶
class alex.utils.sessionlogger.SessionLogger[source]¶
    Bases: multiprocessing.process.Process
    This is a multiprocessing-safe logger. It should be used by Alex to log information according to the SDC 2010 XML format.
    Dates and times should also include the time zone.
    Times should be in seconds from the beginning of the dialogue.
alex.utils.test_analytics module¶
alex.utils.test_fs module¶
Unit tests for alex.util.fs.
class alex.utils.test_fs.TestFind(methodName='runTest')[source]¶
    Bases: unittest.case.TestCase
alex.utils.test_sessionlogger module¶
alex.utils.test_text module¶
alex.utils.text module¶
class alex.utils.text.Escaper(chars=u'\'"', escaper=u'\\', re_flags=0)[source]¶
    Bases: object
    Creates a customised escaper for strings. The characters that need escaping, as well as the one used for escaping, can be specified.

    ESCAPED = 1¶

    ESCAPER = 0¶

    NORMAL = 2¶

    annotate(esced)[source]¶
        Annotates each character of a text that has been escaped with one of:
            Escaper.ESCAPER – it is the escape character
            Escaper.ESCAPED – it is a character that was escaped
            Escaper.NORMAL – otherwise
        It is expected that only parts of the text may have actually been escaped.
        Returns a list of the annotation values, co-indexed with the characters of the input text.

    static re_literal(char)[source]¶
        Escapes the character so that when it is used in a regexp, it matches itself.
alex.utils.text.escape_special_characters_shell(text, characters=u'\'"')[source]¶
    A simple function that tries to escape quotes. Not guaranteed to produce the correct result! If that is needed, use the new `Escaper' class.
alex.utils.text.min_edit_dist(target, source)[source]¶
    Computes the minimum edit distance from target to source.
alex.utils.text.min_edit_ops(target, source, cost=<function <lambda>>)[source]¶
    Computes the minimum edit operations from target to source.
    Parameters:
    - target – a target sequence
    - source – a source sequence
    - cost – an expression for computing the cost of the edit operations
    Returns: a tuple of (insertions, deletions, substitutions)
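With unit costs, the operation counts can be computed by dynamic programming over the edit lattice. This is a sketch; the real function accepts a custom cost expression, which is omitted here.

```python
def min_edit_ops(target, source):
    """Sketch: return the (insertions, deletions, substitutions) needed
    to turn `source` into `target`, assuming unit costs."""
    n, m = len(target), len(source)
    # dp[i][j] holds the op counts with minimal total cost for
    # target[:i] versus source[:j].
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, 0)            # build target[:i] by insertions
    for j in range(1, m + 1):
        dp[0][j] = (0, j, 0)            # drop source[:j] by deletions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            ins = dp[i - 1][j]
            ins = (ins[0] + 1, ins[1], ins[2])
            dele = dp[i][j - 1]
            dele = (dele[0], dele[1] + 1, dele[2])
            sub = dp[i - 1][j - 1]
            if target[i - 1] != source[j - 1]:
                sub = (sub[0], sub[1], sub[2] + 1)
            # Pick the alternative with the smallest total number of ops.
            dp[i][j] = min((ins, dele, sub), key=sum)
    return dp[n][m]
```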
alex.utils.text.parse_command(command)[source]¶
    Parses a command of the form name(var1="val1", ...) into a dictionary structure.
    E.g. call(destination="1245", opt="X") will be parsed into:
        {"__name__": "call", "destination": "1245", "opt": "X"}
    Returns the parsed command as a dictionary.
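A regex-based sketch of such a parser, assuming double-quoted values without embedded commas; the real implementation may be more permissive.

```python
import re

def parse_command(command):
    """Sketch: turn a string such as call(destination="1245",opt="X")
    into a dict with the command name stored under "__name__"."""
    m = re.match(r'(\w+)\((.*)\)\s*$', command.strip())
    if not m:
        return {"__name__": None}
    parsed = {"__name__": m.group(1)}
    # Collect every name="value" pair from the argument list.
    for pair in re.finditer(r'(\w+)="([^"]*)"', m.group(2)):
        parsed[pair.group(1)] = pair.group(2)
    return parsed
```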
alex.utils.text.split_by(text, splitter, opening_parentheses=u'', closing_parentheses=u'', quotes=u'\'"')[source]¶
    Splits the input text at each occurrence of the splitter, but only where it is not enclosed in parentheses.
    - text – the input text string
    - splitter – a multi-character string which determines the positions at which the text is split
    - opening_parentheses – an iterable of opening parentheses that have to be respected when splitting, e.g. "{(" (default: '')
    - closing_parentheses – an iterable of closing parentheses that have to be respected when splitting, e.g. "})" (default: '')
    - quotes – an iterable of quotes that have to come in pairs, e.g. '"'
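The parenthesis- and quote-aware splitting can be sketched with a single scan that tracks the nesting depth and any open quote (illustrative only; the real function may handle edge cases differently).

```python
def split_by(text, splitter, opening_parentheses='',
             closing_parentheses='', quotes='\'"'):
    """Sketch: split `text` on `splitter`, but only at positions not
    enclosed in parentheses or quotes."""
    parts = []
    depth = 0          # current parenthesis nesting depth
    in_quote = None    # the quote character we are inside, if any
    start = 0
    i = 0
    while i < len(text):
        c = text[i]
        if in_quote:
            if c == in_quote:
                in_quote = None
        elif c in quotes:
            in_quote = c
        elif c in opening_parentheses:
            depth += 1
        elif c in closing_parentheses:
            depth -= 1
        elif depth == 0 and text.startswith(splitter, i):
            parts.append(text[start:i].strip())
            i += len(splitter)
            start = i
            continue
        i += 1
    parts.append(text[start:].strip())
    return parts
```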
alex.utils.ui module¶
alex.utils.various module¶
alex.utils.various.flatten(list_, ltypes=(<type 'list'>, <type 'tuple'>))[source]¶
    Flattens a nested list into a simple list.
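A recursive sketch of the flattening:

```python
def flatten(list_, ltypes=(list, tuple)):
    """Sketch: recursively expand nested lists/tuples into one flat list."""
    flat = []
    for item in list_:
        if isinstance(item, ltypes):
            flat.extend(flatten(item, ltypes))   # descend into nested containers
        else:
            flat.append(item)
    return flat
```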
alex.utils.various.get_text_from_xml_node(node)[source]¶
    Gets the text from all child nodes and concatenates it.
alex.utils.various.group_by(objects, attrs)[source]¶
    Groups `objects' by the values of their attributes `attrs'.
    Returns a dictionary mapping from a tuple of attribute values to a list of objects with those attribute values.
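A sketch using getattr and a dictionary of lists:

```python
def group_by(objects, attrs):
    """Sketch: map a tuple of attribute values to the list of objects
    carrying those values."""
    groups = {}
    for obj in objects:
        # Build the grouping key from the requested attributes.
        key = tuple(getattr(obj, attr) for attr in attrs)
        groups.setdefault(key, []).append(obj)
    return groups
```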
Module contents¶
Submodules¶
alex.autopath module¶
Self-cloning, automatic path configuration.
Copy this into any subdirectory of pypy from which scripts need to be run, typically all of the test subdirs. The idea is that any such script simply issues
    import autopath
and this will make sure that the parent directory containing "pypy" is in sys.path.
If you modify the master "autopath.py" version (in pypy/tool/autopath.py), you can run it directly; it will then copy itself over all autopath.py files it finds under the pypy root directory.
This module always provides these attributes:
    pypydir: pypy root directory path
    this_dir: directory where this autopath.py resides