A Struggling Dilettante's Reading Log

This page is the reading log of Takuya Asai, School of Human Sciences, Waseda University. It mainly consists of translation notes on papers I have read.

Contents (Papers)

Language acquisition research

Accuracy of perceptual and acoustic methods for the detection of inspiratory loci in spontaneous speech

  • Yu-Tsai Wang & Ignatius S. B. Nip & Jordan R. Green & Ray D. Kent & Jane Finley Kent & Cara Ullman
  • Published online: 24 February 2012
  • Psychonomic Society, Inc. 2012
  • Keywords: Accuracy, Breath group, Spontaneous speech, Acoustics
Abstract

The present study investigates the accuracy of perceptually and acoustically determined inspiratory loci in spontaneous speech for the purpose of identifying breath groups. Sixteen participants were asked to talk about simple topics in daily life at a comfortable speaking rate and loudness while connected to a pneumotach and audio microphone. The locations of inspiratory loci were determined on the basis of the aerodynamic signal, which served as a reference for loci identified perceptually and acoustically. Signal detection theory was used to evaluate the accuracy of the methods. The results showed that the greatest accuracy in pause detection was achieved (1) perceptually, on the basis of agreement between at least two of three judges, and (2) acoustically, using a pause duration threshold of 300 ms. In general, the perceptually based method was more accurate than was the acoustically based method. Inconsistencies among perceptually determined, acoustically determined, and aerodynamically determined inspiratory loci for spontaneous speech should be weighed in selecting a method of breath group determination.

Introduction

During speech, breathing patterns are constantly changing to balance the varying demands of an utterance with those of underlying homeostatic respiration. Among primates, humans appear unique in this refined and flexible capability for sound production (MacLarnon & Hewitt, 1999). During speech, the duration of inspiration typically represents only 9%–19% of the full breath cycle (inspiration + expiration; Loudon, Lee, & Holcomb, 1988). The characteristic respiratory pattern for speech (quick inspiration and a gradual and controlled expiration) inevitably imposes a breath-related structure on vocal output. This structure is commonly known as the breath group, a sequence of syllables or words produced on a single breath. Management of breath groups is one aspect of efficient and effective communication, for optimum vocal performance in both healthy and disordered speakers. The location and degree of inspiration must be planned prior to the production of an utterance to ensure that there is adequate aerodynamic power for conveying the linguistic properties of an utterance (Winkworth, Davis, Adams, & Ellis, 1995).

Identification of breath groups is often a fundamental step in the analysis of recorded speech samples, especially for reading passages, dialogs, and orations. Inspirations mark intervals of speech that can be subsequently examined for prosody and related variables. The usefulness of breath groups has been demonstrated in studies of (1) normal speech breathing (Hoit & Hixon, 1987; Mitchell, Hoit, & Watson, 1996), (2) the development of speech in infants (Nathani & Oller, 2001), (3) design of speech technologies such as automatic speech recognition and text-to-speech synthesis (Ainsworth, 1973; Rieger, 2003), and (4) the assessment and treatment of speech disorders (Che, Wang, Lu, & Green, 2011; Huber & Darling, 2011; Yorkston, 1996). Common to these various applications is the need to identify groupings of syllables or words produced on a single breath, which is the inevitable respiratory imprint on spoken communication.

Breath groups have been determined in three ways: perceptually (by listening to the speech output), acoustically (usually by detecting pauses or silences that exceed a criterion threshold), and physiologically (typically by recording chest wall movements or the direction of airflow during speech). The physiologic method may be considered the gold standard; however, it is not always easily incorporated into studies of speech and cannot be used to analyze previously recorded samples (such as archival recordings) that did not employ physiological measures. Although many studies identify and evaluate breath groups perceptually and acoustically, the basic question about breath group studies using perceptual and acoustic methods is how well they correlate with physiologic analysis.

Perceptual determination is an indirect detection based on auditory judgments of speech features associated with the respiratory cycle (Bunton, Kent, & Rosenbek, 2000; Oller & Smith, 1977; Schlenck, Bettrich, & Willmes, 1993; Wang, Kent, Duffy, & Thomas, 2005; Wozniak, Coelho, Duffy, & Liles, 1999). Both the perceptual and acoustic methods can be applied to previously recorded speech samples and can be accomplished with only modest investment in hardware or software. Although most studies have investigated breath groups using either of these indirect methods, the accuracy of these approaches has not been tested; the perceptual method is entirely subjective, based on listeners’ impressions; the acoustic method requires the user to specify a minimum duration for an acceptable pause. Therefore, silent portions in the speech signal that may be pauses but do not exceed this criterion are not investigated (Campbell & Dollaghan, 1995; Green, Beukelman, & Ball, 2004; Walker, Archibald, Cherniak, & Fish, 1992; Yunusova, Weismer, Kent, & Rusche, 2005).

In contrast to the indirect methods, the physiologic determination directly detects inspiratory and expiratory events through either airflow (Wang, Green, Nip, Kent, Kent, & Ullman, 2010) or chest wall movements (Bunton, 2005; Forner & Hixon, 1977; Hammen & Yorkston, 1994; Hixon, Goldman, & Mead, 1973; Hixon, Mead, & Goldman, 1976; Hoit & Hixon, 1987; Hoit, Hixon, Watson, & Morgan, 1990; McFarland, 2001; Mitchell et al., 1996; Winkworth et al., 1995; Winkworth, Davis, Ellis, & Adams, 1994). Physiologic detection requires adequate instrumentation and may impose at least slight encumbrances on participants, such as the need to wear a face mask for oral airflow measures.

The present study is a follow-up to an earlier investigation of breath group detection in a task of passage reading. Because the accuracy of breath group detection may be affected by the speaking task, it is necessary to examine the performance of different methods of detection in at least spontaneous speech and passage reading, which have been primary tasks in the study of speech production. Studies have shown that these two speaking tasks are associated with somewhat different patterns in breath group structure (Wang, Green, Nip, Kent, & Kent, 2010).

Method
Participants and stimuli

Sixteen healthy adults (6 males, 10 females), ranging in age from 20 to 64 years (M = 40, SD = 15), participated in the study. All participants were native speakers of North American English, with no self-reported history of speech, language, or neurological disorders. Participants had normal or corrected hearing and vision. Participants were screened to ensure that they had adequate speech, language, and cognitive skills required to discuss simple topics regarding daily life. In addition to the 16 speakers, three individuals from the University of Wisconsin–Madison judged where inspiratory loci fell in each speaking sample on the basis of auditory-perceptual cues in the audio recording.

Experimental protocol

Participants were seated and were instructed to hold a circumferentially vented mask (Glottal Enterprises MA-1 L) tightly against their faces. Expiratory and inspiratory airflows during the speaking tasks were recorded using a pneumotachograph (airflow) transducer (Biopac SS11LA) that was coupled to the facemask. Previous research has demonstrated that facemasks do not significantly alter breathing patterns (Collyer & Davis, 2006). Although respiratory activity may be affected by the participants’ use of facemasks in combination with the hand and arm muscle forces needed to hold the mask tightly against the face, participants in the present study were talking comfortably. Audio signals were recorded digitally at 48 kHz (16-bit quantization) using a professional microphone (Sennheiser), which was placed approximately 2–4 cm away from the vented mask. Participants were also video-recorded using a Canon XL-1S digital video recorder; however, only the audio signals were used for the analysis of breath group determination.

Participants were asked to talk about the following topics with a comfortable speaking rate and loudness in as much detail as possible: their family, activities in an average day, their favorite activities, what they do for enjoyment, and their plans for their future. The topics were presented on a large screen using an LCD projector. Participants were given time to formulate their responses to the topics before the recording was initiated to obtain reasonably organized and fluent spontaneous speech samples. Each response was required to be composed of at least six breath groups (as monitored by an airflow transducer).

Breath group determination

Aerodynamics  Data from the pneumotachometer and the simultaneous digital audio signal were recorded using Biopac Student Lab 3.6.7. The airflow signal was sampled at 1000 Hz and subsequently low-pass filtered (FLP = 500 Hz). The resultant airflow signal was later used to visually identify actual inspiratory loci, represented by upward peaks in the airflow trace indicating inspiration (Fig. 1), whereas a downward trend in the signal indicated expiration. On the rare occasions where there was uncertainty about an inspiratory location, the first and second authors examined the airflow traces in order to reach a consensus agreement.

_images/fig1.png

Note

Fig. 1

An illustration of the locations of inspiration, indicated by dots, for a recorded speech sample, based on the aerodynamic signal (lower panel). The upper panel shows the corresponding sound pressure of the acoustic signal. The arrows indicate the direction of airflow.
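
As a rough illustration of the aerodynamic criterion described above, the sketch below locates candidate inspiratory loci in a digitized airflow trace by finding upward peaks, as in Fig. 1. The peak-height threshold and minimum peak spacing are illustrative assumptions of mine, not parameters reported in the paper.

    import numpy as np
    from scipy.signal import find_peaks

    def inspiratory_loci_from_airflow(airflow, fs=1000, min_height=None, min_interval_s=0.5):
        """Return sample indices of candidate inspiratory loci.

        airflow: 1-D airflow signal (positive values = inspiration, as in Fig. 1).
        fs: sampling rate in Hz (the study sampled airflow at 1000 Hz).
        min_height: peak-height threshold; defaults to a fraction of the signal maximum.
        min_interval_s: minimum spacing between inspirations (illustrative value).
        """
        airflow = np.asarray(airflow, dtype=float)
        if min_height is None:
            min_height = 0.3 * airflow.max()          # heuristic, not from the paper
        peaks, _ = find_peaks(airflow,
                              height=min_height,
                              distance=int(min_interval_s * fs))
        return peaks

    # Example: synthetic trace with two inspiratory peaks near 3 s and 7 s
    t = np.arange(0, 10, 1 / 1000)
    trace = -0.2 + np.exp(-((t - 3.0) ** 2) / 0.01) + np.exp(-((t - 7.0) ** 2) / 0.01)
    print(inspiratory_loci_from_airflow(trace) / 1000.0)   # approx. [3.0, 7.0] seconds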

Perception  Breath groups for the speech samples were determined perceptually by three judges at the University of Wisconsin–Madison. The judges were native English speakers trained on how to identify breath groups using known perceptual cues that signal the production of inspiratory pauses. Before performing their tasks, the judges were trained to determine the location of breath groups on the basis of the available cues. They were asked to listen to other conversational speech samples and to mark the points on their transcription sheets at which inspiration occurred. When the inspiration was not audible, the judges estimated the inhalation point on the basis of auditory-perceptual impression and various acoustic cues, such as longer pause duration, f0 declination, and longer phrase-final duration, which are fairly reliable indicators of pauses in normal speech and infant vocalization (Nathani & Oller, 2001; Oller & Lynch, 1992). The judges were also provided with a standard set of instructions explaining the task (see the Appendix). In addition, the judges were allowed to listen to the speech samples repeatedly to ensure that they were confident in their determination of the breath group location. The procedures of breath group determination were as follows:

  1. The speech samples were orthographically transcribed by a trained transcriptionist who did not serve as a judge in the determination of inspiratory loci.
  2. Punctuation marks and upper- and lowercase distinctions (except for the pronoun I and proper names) were removed from the orthographic transcripts to prevent the judges from analyzing breath groups on the basis of punctuation and related visual cues in the transcript. Three spaces separated each word to prevent the judges from using word order to separate breath groups.
  3. The speech samples prepared for the judges for the task of breath group determination were randomized for order of speaker, using a table of random numbers.
  4. The judges listened to the speech samples at normal loudness and marked perceived inspiratory loci on the transcripts. The judges were asked to make a best guess of the inhalation location on the basis of their auditory-perceptual impressions when inspirations were not obvious. Therefore, these judgments could be based on multiple cues available to listeners, such as longer pause duration, f0 declination, and longer phrase-final word or syllable duration. The judges were allowed to listen to the digitized speech samples repeatedly until they were satisfied with their determination of the breath group location.
  5. The perceptual judgments of inspiratory loci were compared across each possible pairing of the three judges and across all three judges to gauge interjudge reliability. Measurement reliability was defined as the number of points at which the judges agreed upon an inspiratory location divided by the total number of perceptually determined inspiratory loci across the three judges.
Acoustics

A custom MATLAB algorithm called speech pause analysis, or SPA (Green et al., 2004), determined the acoustically identified locations of the breath groups for the speech samples. The software required that a section of pausing be identified manually to specify the minimum amplitude threshold for speech. The software also required specification of durational threshold values for the minimum pause and speech segment durations. For the present study, five pause duration thresholds were tested: 150, 200, 250, 300, and 350 ms. These were selected to cover the range of pause duration thresholds typically used in previous studies; for example, inspiratory loci have been defined as pauses greater than 150 ms (Yunusova et al., 2005), 250 ms (Walker et al., 1992), or 300 ms (Campbell & Dollaghan, 1995). The minimum threshold for speech segment duration was held constant at 25 ms. Once these parameters were set, the acoustic waveform was rectified, and then signal boundaries were identified on the basis of the portions of the recording that fell below the signal amplitude threshold and above the specified minimum pause duration (e.g., 250 ms). Portions that exceeded the minimum amplitude threshold were identified as speech. Adjacent speech regions were considered to be a single region if a pause region was less than the minimum pause duration. Finally, all the speech and pause regions in the speech samples were calculated by the algorithm.
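
The SPA software itself is not reproduced here, but the pause-detection logic described above (rectification, an amplitude threshold, a minimum pause duration, and a minimum speech-segment duration) can be sketched roughly as follows. Function and parameter names are my own, not those of the SPA implementation.

    import numpy as np

    def detect_pauses(signal, fs, amp_threshold, min_pause_s=0.300, min_speech_s=0.025):
        """Return (start_s, end_s) pause intervals, roughly following the SPA steps."""
        x = np.abs(np.asarray(signal, dtype=float))   # rectify the waveform
        speech = x > amp_threshold                    # above the amplitude threshold -> speech

        def runs(labels):
            out, start = [], 0
            for i in range(1, len(labels) + 1):
                if i == len(labels) or labels[i] != labels[start]:
                    out.append((bool(labels[start]), start, i))
                    start = i
            return out

        # Speech runs shorter than the minimum speech-segment duration are discarded,
        # i.e. treated as silence, before pauses are measured.
        min_speech = int(min_speech_s * fs)
        for label, s, e in runs(speech):
            if label and (e - s) < min_speech:
                speech[s:e] = False

        # Silent runs shorter than the minimum pause duration do not count as pauses,
        # so the speech regions on either side are effectively merged.
        min_pause = int(min_pause_s * fs)
        return [(s / fs, e / fs) for label, s, e in runs(speech)
                if not label and (e - s) >= min_pause]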

Accuracy

The loci of inspiration determined by the airflow signal for all speakers were marked first. Inspiratory loci in the aerodynamic signal were taken as the true inspiratory events because they reflected the physiologic events. The aerodynamically determined inspiratory loci served as the reference for determining the accuracy of the perceptually determined loci and acoustically determined loci. Once inspiratory loci were determined using each of the three methods, the loci between conditions were compared. First, the number of perceptually or acoustically judged inspiratory loci was totaled. These loci were then compared with those identified using the aerodynamic signal. Loci identified perceptually and acoustically were then coded as a true positive when loci identified by the judges matched an inspiration identified by the aerodynamic method. Loci for which judges perceived an inspiration but that were not indicated in the aerodynamic signal were coded as a false positive. Aerodynamically determined loci that were not identified by the judges were coded as a miss.
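
One way to implement the comparison described above is to match each perceptually or acoustically detected locus to the nearest still-unmatched aerodynamic locus within a tolerance window. The 0.5 s window below is an illustrative assumption; the paper does not state the exact matching criterion it used.

    def score_loci(detected, reference, tolerance=0.5):
        """Count true positives, false positives, and misses.

        detected, reference: lists of inspiratory locus times in seconds.
        A detected locus within `tolerance` seconds of a still-unmatched
        reference locus counts as a true positive.
        """
        matched = set()
        tp = fp = 0
        for d in detected:
            best, best_dist = None, tolerance
            for j, r in enumerate(reference):
                if j not in matched and abs(d - r) <= best_dist:
                    best, best_dist = j, abs(d - r)
            if best is None:
                fp += 1
            else:
                matched.add(best)
                tp += 1
        misses = len(reference) - len(matched)
        return tp, fp, misses

    print(score_loci([1.0, 3.1, 6.0], [1.1, 3.0, 4.5]))   # (2, 1, 1)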

Statistical analysis

Signal detection analysis (Macmillan & Creelman, 1991) was used to evaluate which perceptually based method and which pause threshold used in acoustic analysis yielded the most accurate results. Specifically, sensitivity as indicated by the true positive rate (TPR), the false positive rate (FPR; 1 − specificity), accuracy, and d′ values were determined for each perceptual judgment and for each pause threshold.
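
For reference, the small helper below turns the counts from such a comparison into the TPR, FPR, accuracy, and d′ values reported in Tables 1 and 2, using the standard equal-variance Gaussian definition d′ = z(TPR) − z(FPR) from detection theory (Macmillan & Creelman, 1991). Plugging in the J1 counts from Table 1 reproduces the published values.

    from scipy.stats import norm

    def detection_metrics(tp, fp, fn, tn):
        """Return TPR (sensitivity), FPR (1 - specificity), accuracy, and d-prime."""
        tpr = tp / (tp + fn)
        fpr = fp / (fp + tn)
        accuracy = (tp + tn) / (tp + fp + fn + tn)
        d_prime = norm.ppf(tpr) - norm.ppf(fpr)   # z(TPR) - z(FPR)
        return tpr, fpr, accuracy, d_prime

    # Counts reported for judge J1 in Table 1
    print(detection_metrics(tp=1066, fp=22, fn=42, tn=1151))   # ~(0.962, 0.019, 0.972, 3.856)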

Results
Accuracy

The total number of inspirations determined from the airflow signal for all speakers was 1,106. The number of pauses greater than 150 ms detected by the SPA algorithm was 2,281, which was considered the total number of potential inspirations for judges to make their decisions.

Perception

The total number of inspiratory loci determined perceptually by the three judges was 1,177. The number of inspiratory locations determined individually by judge 1 (J1), judge 2 (J2), and judge 3 (J3) was 1,088, 1,094, and 1,054, respectively. The number of consistent judgments between at least two of the three judges (i.e., J1J2, J1J3, J2J3, or J1J2J3) was 1,080. The number of consistent judgments across all three judges was 979. The highest interjudge reliability between two of the three judges was .92 (1,080/1,177). The interjudge reliability across all three judges was .83 (979/1,177).
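
These reliability figures follow directly from the definition given in the Method section (the number of agreed loci divided by the total number of perceptually determined loci); a quick arithmetic check:

    def interjudge_reliability(agreed, total_perceptual):
        return agreed / total_perceptual

    print(interjudge_reliability(1080, 1177))   # at least two of three judges: ~0.92
    print(interjudge_reliability(979, 1177))    # all three judges: ~0.83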

Referenced to the 1,106 actual inspiratory loci, J1 correctly identified 1,066, missed 42, and added 22 (false alarms). J2 correctly identified 1,065, missed 43, and added 29. J3 correctly identified 1,010, missed 98, and added 44. The loci that were consistent between at least two of the three judges were 1,068 correctly identified, 40 missed, and 12 added. The loci that were consistent across all three judges were 976 correctly identified, 132 missed, and 3 added.

Table 1 shows the sensitivity, specificity, accuracy, and d′ data for the perceptual judgments. Inspiratory locations were perceived correctly (TPR) about 95% of the time on average, and the false alarm rate (FPR) varied among the three judges. J1 had the highest sensitivity, specificity, accuracy, and d′. When the decision was based on the agreement across all three judges, the specificity was increased substantially, but the sensitivity and accuracy were decreased to 88%. However, when the decision was based on agreement between at least two of the three judges, the specificity was near 99%, and the sensitivity, accuracy, and d′ were all at their highest. Overall, the best discrimination of the perceptual judgment of inspiratory loci in spontaneous speech was based on the consistency between at least two of the three judges. However, as is shown in the receiver operating characteristic (ROC) curve of Fig. 2, the separate results for the three judges are clustered rather tightly.

Judge(s)          Hits    False alarms   Misses   Correct rejections   TPR      FPR      Accuracy   d-prime   beta (ratio)
J1                1066    22             42       1151                 0.9621   0.0188   0.972      3.856     1.799
J2                1065    29             43       1144                 0.9612   0.0247   0.968      3.729     1.452
J3                1010    44             98       1129                 0.9116   0.0375   0.938      3.131     1.96
At least 2 of 3   1068    12             40       1161                 0.9639   0.0102   0.977      4.116     2.915
All 3             976     3              132      1170                 0.8809   0.0026   0.941      3.979     25.122

Note

Table 1

Sensitivity, specificity, accuracy, and d-prime data for the perceptual judgments, determined for each judge (J1, J2, J3), for agreement between at least two of the three judges, and for agreement across all three judges. The true positive rate (TPR) refers to sensitivity, whereas the false positive rate (FPR) refers to 1 − specificity.

_images/fig2.png

Note

Fig. 2

Receiver operating characteristic curves for the perceptual and acoustic methods of breath group determination. Perceptual results are shown for each judge and for agreement between two judges (2) and across all three judges (3). Acoustic results are shown for various thresholds of pause duration.

Acoustics

The number of pauses acoustically determined by the SPA algorithm is given in parentheses in the following summary for the five different pause thresholds: 150 ms (2,281), 200 ms (1,864), 250 ms (1,657), 300 ms (1,513), and 350 ms (1,406). Table 2 shows the sensitivity, specificity, accuracy, and d′ data for the SPA algorithm results. Figure 2 shows the ROC for the combined perceptual and acoustic results. The TPR (sensitivity) values of the five different pause thresholds were all above 98%, but the FPR differed greatly among different threshold values, with smaller thresholds resulting in greater FPRs. The smaller thresholds had near perfect sensitivity but very poor specificity and, consequently, lower accuracy. Thus, in terms of the d′ value, the SPA acoustically determined inspiratory loci of 300-ms threshold had the best performance.

As compared with the actual inspiratory locations determined by the aerodynamic signal, the perceptually determined method with the best performance had smaller TPR and FPR but larger accuracy and d′ than did the acoustically determined method for this spontaneous speech task (Table 2). Moreover, the sensitivity values of the five different pause thresholds were all higher than those of the perceptual judgments, but the specificity values were much smaller and varied widely (Table 2). Consequently, on the basis of accuracy and d′ analysis, the performance of the perceptually based determination of breath groups is judged to be better than that of the acoustic method of pause detection.

Discussion

The present study indicates that (1) the greatest accuracy in the perceptual detection of inspiratory loci was achieved with agreement between two of the three judges; (2) the most accurate pause duration threshold used for the acoustic detection of inspiratory loci was 300 ms; and (3) the perceptual method of breath group determination was more accurate than the acoustically based determination of pause duration.

For the perceptual approach, the criterion of agreement between two of the three judges yielded the highest TPR, accuracy (.977), and d′ (4.116). This approach had approximately 1.75% (40/2,281) false negatives and 0.53% (12/2,281) false positives. Apparently, the more stringent criterion of consistency across all three judges led to an increase of false negatives that was much larger than the decrease of false positives, thereby reducing both accuracy and d′. In contrast, the most accurate approach for detecting inspiratory loci on the basis of listening in a reading task (Wang, Green, Nip, Kent, Kent, & Ullman, 2010) was agreement across all three judges, which achieved an accuracy of .902, a d′ of 4.140, and a small number of both false negatives (approximately 10%) and false positives (0%). The accuracy of the perceptual approach was better for spontaneous speech in the present study than it was for passage reading in the study by Wang, Green, Nip, Kent, Kent, and Ullman (2010). The differences between spontaneous speech and reading are likely explained by differences in breath group structure, as discussed in Wang, Green, Nip, Kent, and Kent (2010). Breath groups had longer durations for spontaneous speech, as compared with reading. In addition, inspiratory pauses for spontaneous speech are more likely to fall in grammatically inappropriate locations, potentially making the inspirations more perceptually salient to the judges.

Using acoustic algorithms to identify inspiratory loci, the optimal threshold of pause detection in the present study was 300 ms, which achieved an accuracy of .817 and a d′ value of 2.994. With this threshold, the false negative rate is 0.2% (5/2,281), but the false positive rate is much higher, approximately 18% (412/2,281). Wang, Green, Nip, Kent, Kent, and Ullman (2010) reported that the most accurate pause duration threshold for detecting inspiratory loci in the reading task was 250 ms, which achieved an accuracy of .895, a d′ of 3.561, a zero rate of false negatives, and an approximately 10% rate of false positives. Task effects between reading and spontaneous speech occurred for the acoustic method, much as they did for the perceptual method. The accuracy and d′ values in spontaneous speech were lower than those in reading. Furthermore, the false negative rate and false positive rate in spontaneous speech were both raised when compared with reading. Consequently, the acoustically determined method in spontaneous speech performed more poorly than for reading, which is likely related to the task differences in the breath group structure and perhaps in cognitive-linguistic load.

Because the minimum inter-breath-group pause in reading for healthy speakers is 250 ms (Wang, Green, Nip, Kent, & Kent, 2010), the 150- and 200-ms thresholds produced no false negatives but many false positives, which lowered their accuracy. In contrast, with thresholds above 200 ms, the decrease in the number of false positives was substantially more than the increase of the number of false negatives, which increased the accuracy. Generally speaking, the false positive rate differed among different pause thresholds, indicating that the selection of the pause threshold is very sensitive to the detection of false positives in spontaneous speech. Because the spontaneous speech samples in the present study were produced fluently by healthy adults who were familiar with the topics to be addressed, there was negligible occurrence of prolonged cognitive hesitations or articulatory or speech errors. Therefore, the present findings may not apply to speech produced by talkers with neurological or other impairments, whose speech might be characterized by either a faster or a slower speaking rate and with more pauses of long durations unrelated to inspiration. A threshold of 300 ms might potentially be either too short for individuals who speak significantly slower or too long for speakers with faster than typical speaking rates.

Pause threshold (ms)   Hits    False alarms   Misses   Correct rejections   TPR      FPR      Accuracy   d-prime   beta (ratio)
150                    1106    1175           0        0                    0.9995   0.9996   0.485      -0.017    1.056
200                    1106    758            0        417                  0.9995   0.6451   0.668      2.947     0.004
250                    1104    553            2        622                  0.9982   0.4706   0.757      2.983     0.015
300                    1089    317            17       858                  0.9846   0.2698   0.854      2.774     0.117

Note

Table 2

Sensitivity, specificity, accuracy, and d-prime data for the acoustically determined loci (SPA algorithm) at each pause duration threshold. The true positive rate (TPR) refers to sensitivity, whereas the false positive rate (FPR) refers to 1 − specificity.

Taking together the present results and those of Wang, Green, Nip, Kent, Kent, and Ullman (2010), it can be concluded that for both spontaneous speech and passage reading, the perceptual method of breath group determination is more accurate than the acoustic method based on pause duration. The ability of listeners to identify breath groups is no doubt aided by their knowledge that speech is typically produced on a prolonged expiratory phase. Simple acoustic measurements of pauses are naive to this expectation, which is one reason perceptual assessment can be more accurate than acoustic pause detection. The larger d′ obtained for the perceptual approach may indicate that listeners are sensitive to many cues beyond pause duration. Factors related to physiologic needs, cognitive demands, and linguistic accommodations that affect the locations of inspirations and the durations of interbreath-group pauses are possibly perceptible by human ears. Perceptual cues for inspiration include the occurrence of pauses at a major constituent boundary, anacrusis, final syllable lengthening, and final syllable pitch movement (Wozniak et al., 1999). Some of these factors could be included in an elaborated acoustic method that relies on more than just pause duration.

The choice of method for breath group determination should be based on a consideration of the risk–benefit ratio. If errors cannot be tolerated, physiologic methods are preferred, if not mandatory. But if this is not possible (as in the analysis of archived audio signals), the choice between perceptual and acoustic methods should weigh the risk of greater errors (likely to occur with the acoustic method) against the relative costs (in terms of both analysis time and technology). As is shown in Fig. 2, the results for any one judge in the perceptual method were more accurate than those for any of the pause duration thresholds used in the acoustic study. Perceptual determination appears to be a better choice, on the basis of accuracy alone. Of course, these findings pertain to studies interested in identifying only inspiratory pauses, and not those located at phrase and word boundaries; the high false positive rates obtained for the acoustic method suggest that this approach may be well suited for this purpose, although additional research is needed. If it is desired to examine the relationship between breath groups and linguistic structures, preparation of a transcript is necessary for any method of breath group determination. Finally, it should be recognized that the present results and those of Wang, Green, Nip, Kent, Kent, and Ullman (2010) pertain to healthy adult speakers. Generalization of the results to younger or older speakers or to speakers with disorders should be done with caution.

Acknowledgements

This work was supported in part by Research Grant number 5 R01 DC00319, R01 DC009890, and R01 DC006463 from the National Institute on Deafness and Other Communication Disorders (NIDCD-NIH) and NSC 100-2410-H-010-005-MY2 from the National Science Council, Taiwan. Additional support was provided by the Barkley Trust, University of Nebraska–Lincoln, Department of Special Education and Communication Disorders. Some of the data were presented in a poster session at the 5th International Conference on Speech Motor Control, Nijmegen, 2006. We would like to acknowledge Hsiu-Jung Lu and Yi-Chin Lu for data processing.

Appendix

The instruction of breath group determination for conversational speech samples: You will be provided with a transcription of the conversational speech samples, without punctuation, for each speaker in the present study. The task is to mark the points at which speakers stop for a breath. When you identify this point, place a mark on the corresponding location on the transcript. Make your best guess as to where the speaker stops to take a breath. Sometimes you can hear an expiration and/or inspiration, but in other cases you may have to make the judgment based on other cues, such as longer pause duration, f0 declination, and longer phrase-final duration. In this task, you can listen to the sound files repeatedly until you are confident in your determination of the breath group location. Do you have any questions?

References
  • Ainsworth, W. (1973).
    • A system for converting English text into speech.
    • IEEE Transactions on Audio and Electroacoustics, 21, 288–290.
  • Bunton, K. (2005).
    • Patterns of lung volume use during an extemporaneous speech task in persons with Parkinson disease.
    • Journal of Communication Disorders, 38, 331–348.
  • Bunton, K., Kent, R. D., & Rosenbek, J. C. (2000).
    • Perceptuo-acoustic assessment of prosodic impairment in dysarthria.
    • Clinical Linguistics and Phonetics, 14, 13–24.
  • Campbell, T. F., & Dollaghan, C. A. (1995).
    • Speaking rate, articulatory speed, and linguistic processing in children and adolescents with severe traumatic brain injury.
    • Journal of Speech and Hearing Research, 38, 864–875.
  • Che, W. C., Wang, Y. T., Lu, H. J., & Green, J. R. (2011).
    • Respiratory changes during reading in Mandarin-speaking adolescents with prelingual hearing impairment.
    • Folia Phoniatrica et Logopaedica, 63, 275–280.
  • Collyer, S., & Davis, P. J. (2006).
    • Effect of facemask use on respiratory patterns of women in speech and singing.
    • Journal of Speech Language and Hearing Research, 49, 412–423.
  • Forner, L. L., & Hixon, T. J. (1977).
    • Respiratory kinematics in profoundly hearing-impaired speakers.
    • Journal of Speech and Hearing Research, 20, 373–408.
  • Green, J. R., Beukelman, D. R., & Ball, L. J. (2004).
    • Algorithmic estimation of pauses in extended speech samples of dysarthric and typical speech.
    • Journal of Medical Speech-Language Pathology, 12, 149–154.
  • Hammen, V. L., & Yorkston, K. M. (1994).
    • Respiratory patterning and variability in dysarthric speech.
    • Journal of Medical Speech-Language Pathology, 2, 253–261.
  • Hixon, T. J., Goldman, M. D., & Mead, J. (1973).
    • Kinematics of the chest wall during speech production: Volume displacements of the rib cage, abdomen, and lung.
    • Journal of Speech and Hearing Research, 16, 78–115.
  • Hixon, T. J., Mead, J., & Goldman, M. D. (1976).
    • Dynamics of the chest wall during speech production: Function of the thorax, rib cage, diaphragm, and abdomen.
    • Journal of Speech and Hearing Research, 19, 297–356.
  • Hoit, J. D., & Hixon, T. J. (1987).
    • Age and speech breathing.
    • Journal of Speech and Hearing Research, 30, 351–366.
  • Hoit, J. D., Hixon, T. J., Watson, P. J., & Morgan, W. J. (1990).
    • Speech breathing in children and adolescents.
    • Journal of Speech and Hearing Research, 33, 51–69.
  • Huber, J. E., & Darling, M. (2011).
    • Effect of Parkinson’s disease on the production of structured and unstructured speaking tasks: Respiratory physiologic and linguistic considerations.
    • Journal of Speech, Language, and Hearing Research, 54, 33–46.
  • Loudon, R. G., Lee, L., & Holcomb, B. J. (1988).
    • Volumes and breathing patterns during speech in healthy and asthmatic subjects.
    • Journal of Speech and Hearing Research, 31, 219–227.
  • MacLarnon, A. M., & Hewitt, G. P. (1999).
    • The evolution of human speech: The role of enhanced breathing control.
    • American Journal of Physical Anthropology, 109, 341–363.
  • Macmillan, N. A., & Creelman, C. D. (1991).
    • Detection theory: A user’s guide.
    • New York: Cambridge University Press.
  • McFarland, D. H. (2001).
    • Respiratory markers of conversational interaction.
    • Journal of Speech Language and Hearing Research, 44, 128–143.
  • Mitchell, H. L., Hoit, J. D., & Watson, P. J. (1996).
    • Cognitive-linguistic demands and speech breathing.
    • Journal of Speech and Hearing Research, 39, 93–104.
  • Nathani, S., & Oller, D. K. (2001).
    • Beyond ba-ba and gu-gu: Challenges and strategies in coding infant vocalizations.
    • Behavior Research Methods, Instruments, & Computers, 33, 321–330.
  • Oller, D. K., & Lynch, M. P. (1992).
    • Infant vocalizations and innovations in infraphonology: Toward a broader theory of development and disorders.
    • In C. Ferguson, L. Menn, & C. Stoel-Gammon (Eds.), Phonological development: Models, research, implications (pp. 509–536).
    • Parkton, MD: York Press.
  • Oller, D. K., & Smith, B. L. (1977).
    • Effect of final-syllable position on vowel duration in infant babbling.
    • Journal of the Acoustical Society of America, 62, 994–997.
  • Rieger, J. M. (2003).
    • The effect of automatic speech recognition systems on speaking workload and task efficiency.
    • Disability and Rehabilitation, 25, 224–235.
  • Schlenck, K. J., Bettrich, R., & Willmes, K. (1993).
    • Aspects of disturbed prosody in dysarthria.
    • Clinical Linguistics & Phonetics, 7, 119–128.
  • Walker, J. F., Archibald, L. M., Cherniak, S. R., & Fish, V. G. (1992).
    • Articulation rate in 3- and 5-year-old children.
    • Journal of Speech & Hearing Research, 35, 4–13.
  • Wang, Y.-T., Green, J. R., Nip, I. S. B., Kent, R. D., & Kent, J. F. (2010).
    • Breath group analysis for reading and spontaneous speech in healthy adults.
    • Folia Phoniatrica et Logopaedica, 62, 297–302.
  • Wang, Y.-T., Green, J. R., Nip, I. S. B., Kent, R. D., Kent, J. F., & Ullman, C. (2010).
    • Accuracy of perceptually based and acoustically based inspiratory loci in reading.
    • Behavior Research Methods, 42, 791–797.
  • Wang, Y.-T., Kent, R. D., Duffy, J. R., & Thomas, J. E. (2005).
    • Dysarthria in traumatic brain injury: A breath group and intonational analysis.
    • Folia Phoniatrica et Logopaedica, 57, 59–89.
  • Winkworth, A. L., Davis, P. J., Adams, R. D., & Ellis, E. (1995).
    • Breathing patterns during spontaneous speech.
    • Journal of Speech and Hearing Research, 38, 124–144.
  • Winkworth, A. L., Davis, P. J., Ellis, E., & Adams, R. D. (1994).
    • Variability and consistency in speech breathing during reading: Lung volumes, speech intensity, and linguistic factors.
    • Journal of Speech and Hearing Research, 37, 535–556.
  • Wozniak, R. J., Coelho, C. A., Duffy, R. J., & Liles, B. Z. (1999).
    • Intonation unit analysis of conversational discourse in closed head injury.
    • Brain Injury, 13, 191–203.
  • Yorkston, K. (1996).
    • Treatment efficacy: Dysarthria.
    • Journal of Speech and Hearing Research, 39, 546–557.
  • Yunusova, Y., Weismer, G., Kent, R. D., & Rusche, N. M. (2005).
    • Breath-group intelligibility in dysarthria: Characteristics and underlying correlates.
    • Journal of Speech, Language, and Hearing Research, 48, 1294–1310.

Compensating for coarticulation during phonetic category acquisition

  • Naomi Feldman
Introduction

Recent models of phonetic category acquisition have taken a Mixture of Gaussians approach, in which Gaussian categories are chosen to best fit the distribution of speech sounds in the input, where the input typically consists of isolated speech sounds [1] [2] [3]. One limitation of these approaches is that they do not take into account predictable phonetic variability that is based on context. In actual speech, the acoustic realizations of a phoneme depend on the identities of neighboring sounds, due to coarticulation with these sounds (e.g., as demonstrated in [4]). This presents a challenge for models of phonetic category acquisition that assume context-independent Gaussian distributions of sounds.

A more realistic model of phonetic category acquisition should take account of dependencies between neighboring acoustic values that arise due to factors such as gestural overlap and articulatory ease. Despite extensive research on the conditions under which people and animals compensate for coarticulation in speech perception, the problem has only recently begun to be addressed in the context of phonetic category acquisition. There has been a first attempt at solving this problem [5], in which it is assumed that a learner needs to figure out the direction and magnitude of the shift in acoustic values that occurs in a particular context. This is done by taking the mean of all sounds that occur in that context, and comparing it to the mean of all sounds that occur in other contexts.

The authors demonstrate that correcting for this shift in the acoustic values makes it easier for a learner to recover a set of Gaussian phonetic categories from acoustic data. However, there is a circularity here, as categories must be known in order to pick out a particular phonological context that would cause an acoustic shift. Thus, it would be desirable for an algorithm to learn both layers of structure simultaneously. As a first step, this paper explores the category learning problem in a system where coarticulatory influences are present, but where the parametric form of these coarticulatory influences is assumed to be known.

Phonological Constraints in Exponential Family Models

Phonological constraints were first proposed to be characterized by weighted harmony functions in Harmonic Grammar [6]. Phonetic productions of words in this framework are selected to best satisfy a set of weighted constraints, given the underlying phonological properties of those words. Constraints relate underlying properties to phonetic surface properties (e.g., a constraint would assign higher probability to a /p/ pronounced as [p], as opposed to [b]) and favor certain surface properties over others (e.g., a higher probability is assigned to CV syllables than to CCV syllables). This was later put into a maximum entropy learning framework [7], where weights are learned for each constraint to maximize the likelihood of a set of training data. In this work, each function \(\phi(x)\) corresponds to a count of the number of times a particular constraint has been violated.

Note

Harmonic Grammar

Harmonic Grammar is a linguistic model proposed by Geraldine Legendre, Yoshiro Miyata, and Paul Smolensky in 1990. It is a connectionist approach to modeling linguistic well-formedness.

More recently, the same sort of constraint-based framework has been suggested to be useful for characterizing gradient phonetic effects. Flemming [8] proposed constraints that favor acoustic values similar to a given target production, and also favor similar acoustic values in neighboring speech sounds. Specifically, a speaker is assumed to be minimizing the weighted squared error terms

(1)\[\begin{eqnarray} \theta_1(\omega_1-\mu_1)^2 \end{eqnarray}\]
(2)\[\begin{eqnarray} \theta_2(\omega_2-\mu_2)^2 \end{eqnarray}\]
(3)\[\begin{eqnarray} \theta_3(\omega_1-\omega_2)^2 \end{eqnarray}\]

where each \(\theta\) term is the corresponding weight for a particular constraint, the w terms are the acoustic realizations of neighboring phonemes, and the \(\mu\) terms are targets for specific phonological categories.
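
Because the cost in Equations 1–3 is quadratic in \(\omega_1\) and \(\omega_2\), the speaker's optimal productions can be found by setting its gradient to zero and solving a 2×2 linear system. The weights and targets below are made-up values chosen only to show how the coarticulation constraint pulls each realization toward the neighboring sound's target.

    import numpy as np

    # Illustrative weights and category targets (not values from the paper)
    theta1, theta2, theta3 = 1.0, 1.0, 0.5
    mu1, mu2 = -1.0, 1.0

    # Gradient of theta1*(w1-mu1)^2 + theta2*(w2-mu2)^2 + theta3*(w1-w2)^2 set to zero
    A = np.array([[theta1 + theta3, -theta3],
                  [-theta3, theta2 + theta3]])
    b = np.array([theta1 * mu1, theta2 * mu2])
    omega1, omega2 = np.linalg.solve(A, b)
    print(omega1, omega2)   # -0.5 and 0.5: each realization is pulled toward the other's target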

This weighted sum-of-squared-errors cost function corresponds to a particular type of pairwise undirected graphical model (Figure 1). In particular, if the squared error cost functions correspond to the negative log likelihood of a particular set of values for \(\omega_1\) and \(\omega_2\), then the potential functions between each pair of nodes are simply Gaussians with variance \(\frac{1}{2\theta}\).


Figure 1: The graphical model corresponding to the model used in [8] .

This problem can be generalized to a series of \(N\) acoustic values, where the local potential for each sound is Gaussian around the corresponding category mean \(\mu_c\) with a variance \(\Sigma_c\) specific to that category, and all pairwise potentials between neighboring acoustic values are Gaussian with common variance \(\Sigma_S\). The conditional probability \(p(w|z)\) can be expressed as

(4)\[\begin{eqnarray} p(\omega|z) \propto \exp\left\{ - \frac{1}{2} \left[ \sum_{i=1}^{N}(\omega_i - \mu_{z_i})^{T}\Sigma_{z_i}^{-1}(\omega_i -\mu_{z_i}) + \sum_{i=1}^{N-1}(\omega_i-\omega_{i+1})^{T}\Sigma_{S}^{-1}(\omega_i - \omega_{i+1}) \right] \right\} \end{eqnarray}\]

This can equivalently be expressed in the information form

(5)\[\begin{eqnarray} p(\omega | z) \propto \exp \left\{ - \frac{1}{2}\omega^TJ\omega + h^T\omega \right\} \end{eqnarray}\]
where each diagonal entry in the matrix \(J\) is \(\Sigma_{z_i}^{-1} + 2\Sigma_{S}^{-1}\), except for the first and last diagonal entries, which are \(\Sigma_{z_i}^{-1} + \Sigma_{S}^{-1}\); the off-diagonal entries are \(-\Sigma_{S}^{-1}\) for neighboring sounds and zero otherwise.
Each entry in the vector \(h\) is equal to \(\Sigma_{z_i}^{-1}\mu_{z_i}\).
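
In the one-dimensional case the information-form parameters above are easy to write down explicitly. The sketch below builds \(J\) and \(h\) for a chain of sounds with scalar category variances and recovers the marginal means by direct inversion; the category means, variances, and \(\Sigma_S\) value are illustrative.

    import numpy as np

    def chain_information_form(mu_z, sigma_z, sigma_s):
        """Build J and h for p(w | z) on a chain of sounds (1-D acoustic values).

        mu_z, sigma_z: per-sound category means and variances (length N).
        sigma_s: shared coarticulation variance between neighboring sounds.
        """
        N = len(mu_z)
        J = np.zeros((N, N))
        h = np.zeros(N)
        for i in range(N):
            extra = 1 / sigma_s if i in (0, N - 1) else 2 / sigma_s
            J[i, i] = 1 / sigma_z[i] + extra
            h[i] = mu_z[i] / sigma_z[i]
            if i + 1 < N:
                J[i, i + 1] = J[i + 1, i] = -1 / sigma_s
        return J, h

    # Marginal means are J^{-1} h; with coarticulation they are pulled toward their neighbors.
    J, h = chain_information_form(mu_z=[-1.0, 1.0, -1.0], sigma_z=[0.01, 0.01, 0.01], sigma_s=0.05)
    print(np.linalg.solve(J, h))   # ~[-0.75, 0.5, -0.75]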

The acoustic values \(w_i\) are jointly Gaussian and their marginals can be computed straightforwardly using Gaussian belief propagation. Given category assignments and category parameters, the factor graph is simply a chain with local node potentials and pairwise potentials. The messages originate at the ends of the chain at the variable nodes. Messages from variable nodes to factor nodes are

(6)\[\hat{J}_{i \backslash j} = \Sigma_{z_i}^{-1} + \Sigma_{S}^{-1} + J_{k \to i} \quad \mathrm{for\ } i = 1, N\]
(7)\[\begin{split}\hat{J}_{i \backslash j} &= \Sigma_{z_i}^{-1} + 2\Sigma_{S}^{-1} + J_{k \to i} \quad \mathrm{for\ all\ other\ } i\\ \hat{h}_{i \backslash j} &= \Sigma_{z_i}^{-1}\mu_{z_i} + h_{k \to i}\end{split}\]

where the chain structure ensures that there is at most one incoming message to take into account. Messages from factor nodes to variable nodes take the form of

(8)\[J_{i \to j} = -\Sigma_{S}^{-1}\hat{J}_{i \backslash j}^{-1}\Sigma_{S}^{-1}\]
(9)\[h_{i \to j} = -\Sigma_{S}^{-1} \hat{J}_{i \backslash j}^{-1} \hat{h}_{i \backslash j}\]

Marginals on node i can be computed straightforwardly as

(10)\[\hat{J}_{i} = \Sigma_{z_i}^{-1} + \Sigma_{S}^{-1} + \sum_{k}J_{k \to i} \quad \mathrm{for\ } i = 1, N\]
(11)\[\begin{split}\hat{J}_{i} &= \Sigma_{z_i}^{-1} + 2\Sigma_{S}^{-1} + \sum_{k}J_{k \to i} \quad \mathrm{for\ all\ other\ } i\\ \hat{h}_{i} &= \Sigma_{z_i}^{-1}\mu_{z_i} + \sum_{k} h_{k \to i}\end{split}\]
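
A scalar sketch of Gaussian belief propagation on the chain, written against a generic information form \((J, h)\) rather than the paper's exact message notation: each forward and backward message carries a precision term and a potential term, and combining them at a node gives the same marginals as inverting \(J\) directly.

    import numpy as np

    def chain_marginals_gabp(J, h):
        """Marginal means and variances on a chain via scalar Gaussian belief propagation."""
        N = len(h)
        fwd = [(0.0, 0.0)]                       # (precision, potential) message into node i from i-1
        for i in range(N - 1):
            Ji = J[i, i] + fwd[i][0]             # node i's precision incl. incoming message
            hi = h[i] + fwd[i][1]
            Jij = J[i, i + 1]
            fwd.append((-Jij * Jij / Ji, -Jij * hi / Ji))
        bwd = [(0.0, 0.0)] * N                   # message into node i from i+1
        for i in range(N - 1, 0, -1):
            Ji = J[i, i] + bwd[i][0]
            hi = h[i] + bwd[i][1]
            Jij = J[i, i - 1]
            bwd[i - 1] = (-Jij * Jij / Ji, -Jij * hi / Ji)
        J_hat = np.array([J[i, i] + fwd[i][0] + bwd[i][0] for i in range(N)])
        h_hat = np.array([h[i] + fwd[i][1] + bwd[i][1] for i in range(N)])
        return h_hat / J_hat, 1.0 / J_hat

    # Same three-sound example as above: category precision 1/0.01, coupling precision 1/0.05
    J = np.array([[120.0, -20.0,   0.0],
                  [-20.0, 140.0, -20.0],
                  [  0.0, -20.0, 120.0]])
    h = np.array([-100.0, 100.0, -100.0])
    print(chain_marginals_gabp(J, h)[0])         # ~[-0.75, 0.5, -0.75]
    print(np.linalg.solve(J, h))                 # direct inversion agrees
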
Learning Model

Flemming’s model takes the perspective of a speaker producing an utterance, and assumes that speakers select values of w conditioned on the category assignments \(z\). The work does not address the problems of perception or learning. In the perception and learning problems, a listener or a learner would observe w and need to recover the categories \(z\). Whereas a listener would decide between a set of categories with known means and variances, a learner would need to decide how many categories are in their language, learn the category parameters, and assign each sound to its appropriate category.

Without coarticulation, the learning problem can be characterized as an infinite mixture model (Figure 2a). The learner observes acoustic values and needs to recover the set of categories that generated these acoustic values, and decide which sound belongs to which category. Samples from this posterior distribution can be obtained straightforwardly through Gibbs sampling, as described in [3] .

With coarticulation, the non-parametric model used by [3] can be combined with the framework for weighted constraints used by [8]. A graphical model with the necessary properties is shown in Figure 2b. The key difference in this new model is that the probability distribution \(p(w|z)\) no longer factorizes as \(\prod_i p(w_i|z_i)\). Instead, each \(z_i\) is generated independently from the Dirichlet process, but the entire acoustic vector \(w\) is assumed to be sampled jointly conditioned on all the values of \(z_i\).

The specific distributions associated with the Dirichlet process are

\[\begin{split}z_i &\sim \mathrm{DP}(\alpha, G_0) \\ G_0 &: \sigma_c \sim \mathrm{IW}(\nu_0, \sigma_0); \quad \mu_c \sim N\!\left(\mu_0, \frac{\sigma_c}{\nu_0}\right)\end{split}\]

and the conditional probability distribution \(p(w|z)\) is given by Equation 4.

A learner observes the acoustic values and needs to recover the set of categories that generated the data and decide which sound belongs to which category. Whereas inference in the original model could be done using a collapsed Gibbs sampler, integrating out the category parameters \(\mu_c\) and \(\sigma_c\), integrating over category parameters precludes closed-form calculation of the likelihood function \(p(w|z)\) in the new model because the local potential functions are no longer Gaussian. The sampling algorithm used here therefore explicitly samples the parameters \(\mu_c\) and \(\sigma_c\) for each category.

In each iteration, each category \(z_i\) in turn is chosen conditioned on all other current category assignments and all acoustic values. This conditional probability distribution is

(12)\[\begin{eqnarray} p(z_i \mid z_{−i} , \omega, \mu, \sigma) \propto p(z_i \mid z_{−i} ) p(w \mid z, \mu, \sigma) \end{eqnarray}\]

The prior \(p(zi \mid z_{−i} )\) is proportional to the number of sounds already assigned to a particular category, with probability \(\alpha\) of assignment to a new category. The likelihood term \(p(\omega\mid z, \mu, \sigma)\) can be factored as \(p(w_i |z, \mu, \sigma)p(w_{−i} \mid w_i , z_{−i} , \mu, \sigma)\), using the fact that \(w_{−i}\) is independent of \(z_i\) when conditioned on \(w_i\) . The second term does not depend on \(z_i\) and can be ignored. The first term is the marginal probability of \(w_i\) given current category assignments, and can be computed through the message passing algorithm summarized in the previous section. Note, however, that it is not possible to ignore the contribution of \(z_{−i}\) when computing the likelihood term for \(z_i\), as this likelihood term is not conditioned on the values of \(w_{−i}\).

_images/fig21.png

Figure 2: (a) A graphical model of the infinite mixture model. (b) Adapting the model to take into account coarticulation between neighboring sounds.

To estimate the likelihood of a new category, 10 sets of category parameters are sampled from the prior distribution on \(\mu_c\) and \(\sigma_c\), and each of these is assigned a pseudocount of \(\frac{\alpha}{10}\). The likelihood can be computed for each of these tables as though it were an existing category. If the \(z_i\) being sampled is the only instance of a particular category currently assigned in the corpus, then the parameters from that category are used in place of one of the 10 samples from the prior [9].
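A small sketch of that auxiliary-sample step (Neal 1998, Algorithm 8), under two assumptions of my own: the one-dimensional inverse-Wishart prior is read as a scaled inverse-chi-square on the category variance, and each of the 10 auxiliary parameter sets receives pseudocount \(\alpha/10\):

    import numpy as np

    def sample_auxiliary_params(m=10, mu0=0.0, sigma0=0.1, nu0=0.1, alpha=0.1,
                                rng=None):
        """Draw m auxiliary (mu_c, sigma_c) pairs from the prior G_0, returned
        together with the pseudocount alpha / m used when scoring them as
        candidate new categories."""
        rng = rng or np.random.default_rng()
        sigma_c = nu0 * sigma0 / rng.chisquare(nu0, size=m)   # sigma_c ~ IW(nu0, sigma0)
        mu_c = rng.normal(mu0, np.sqrt(sigma_c / nu0))        # mu_c ~ N(mu0, sigma_c / nu0)
        return mu_c, sigma_c, alpha / m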

The parameters \(\mu_c\) and \(\sigma_c\) should then be resampled for each category, but I could not figure out how to do this, because of the dependencies between different acoustic values. I suspect there is a way to find the posterior on \(\mu_c\) with fixed \(\sigma_c\), which I could not figure out within the time frame for this project, but I’m not sure there is a straightforward way to find the posterior on \(\sigma_c\) at all. In the simulations below I did not ever resample the parameters for existing categories, and any new parameters had to be selected by creating a new category.

Simulations

Simulations compared the coarticulation model to the infinite mixture model. For both models, the concentration parameter \(\alpha\) was set at 0.1, and the prior over phonetic categories had parameters \(\mu_0 = 0\), \(\sigma_0 = 0.1\), and \(\nu_0 = 0.1\). The coarticulation model was given a fixed parameter of \(\sigma_S = 0.05\). The samplers were each run for 2,000 iterations. For each simulation, pairwise accuracy and completeness measures were used to compute an overall F-score.
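For concreteness, one way to compute the pairwise F-score, assuming (as is standard, though not spelled out in the text) that accuracy and completeness are precision and recall over pairs of sounds assigned to the same category:

    from itertools import combinations

    def pairwise_f_score(true_labels, found_labels):
        """Pairwise accuracy (precision) and completeness (recall) over all
        pairs of sounds, combined into an F-score."""
        same_true, same_found = set(), set()
        for i, j in combinations(range(len(true_labels)), 2):
            if true_labels[i] == true_labels[j]:
                same_true.add((i, j))
            if found_labels[i] == found_labels[j]:
                same_found.add((i, j))
        hits = len(same_true & same_found)
        accuracy = hits / len(same_found) if same_found else 1.0
        completeness = hits / len(same_true) if same_true else 1.0
        if accuracy + completeness == 0:
            return 0.0
        return 2 * accuracy * completeness / (accuracy + completeness)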

Simulation 1 was conducted to demonstrate that the model can take coarticulatory influences into account, finding the correct categories like a regular infinite mixture model but with greater accuracy in category parameters. This is a non-trivial accomplishment because the category parameters are never resampled in the coarticulation model, whereas they are resampled in the infinite mixture model.

One hundred datapoints were generated from two categories with means at -1 and 1 and variances of 0.01. Each category had a mixing probability of \(\frac{1}{2}\) . \(\sigma_S\) was set at 0.05. The resulting corpus is shown in Figure 3.

Both models recovered the two categories perfectly, and assigned sounds to their respective categories correctly, achieving an F-score of 1. However, the learned category parameters were more accurate in the coarticulation model, where the parameters were \(\mu = 1.05, \sigma = 0.01\) and \(\mu = -0.98, \sigma = 0.01\). The infinite mixture model did substantially worse, recovering \(\mu = 0.72, \sigma = 0.03\) and \(\mu = -0.69, \sigma = 0.02\).

_images/fig3.png

Figure 3: Corpus used for Simulation 1. Connected black dots represent the speech sounds in the corpus, and red dots show the means of the categories that generated the sounds.

_images/fig4.png

Figure 4: Corpus used for Simulation 2. Connected black dots represent the speech sounds in the corpus, and red dots show the means of the categories that generated the sounds.

Simulation 2 was a more difficult problem, for which the infinite mixture model could not distinguish two categories. This corpus was created using the same parameters as the previous corpus, but the categories in this corpus had variance 0.5, equal to the coarticulatory variance. This corpus is shown in Figure 4.

The infinite mixture model failed to separate the two categories, instead assigning all sounds to a single category with \(\mu_c = -0.03, \sigma_c = 0.34\). This corresponded to an F-score of 0.66. The coarticulation model also found an incorrect number of categories, separating the points into three categories, corresponding to parameters \(\mu_c = -0.91, \sigma_c = 0.06\); \(\mu_c = 0.97, \sigma_c = 0.04\); and \(\mu_c = -0.04, \sigma_c = 0.001\). However, this last category contained only 7 of the 100 sounds in the corpus. This solution had an F-score of 0.93.

Finally, Simulation 3 tested the models on a more complex corpus created from five categories with more substantial overlap. Sounds were generated from categories with means at -2, -1, 0, 1, and 2. Variances were all set to 0.01, as in Simulation 1. However, because the categories were closer together, and because there were more of them, this was a more difficult learning problem. The corpus is shown in Figure 5. The infinite mixture model merged all five categories into a single category with \(\mu_c = -0.04, \sigma_c = 1.06\), achieving an F-score of 0.33. The coarticulation model found six categories, with the following parameters:

\[\begin{split}\mu_c &= 0.02, \quad \sigma_c = 0.01\\ \mu_c &= 2.06, \quad \sigma_c = 0.03\\ \mu_c &= -1.85, \quad \sigma_c = 0.02\\ \mu_c &= -1.08, \quad \sigma_c = 0.01\\ \mu_c &= 0.75, \quad \sigma_c = 0.01\\ \mu_c &= -2.52, \quad \sigma_c = 11.71\end{split}\]

The last category appears spurious, and indeed contained only 3 sounds from the corpus. This solution had an F-score of 0.89.

_images/fig5.png

Figure 5: Corpus used for Simulation 3. Connected black dots represent the speech sounds in the corpus, and red dots show the means of the categories that generated the sounds.

Discussion

This project explored the phonetic category learning problem in a system with coarticulation, where sounds are affected by neighboring sounds. Despite a poor inference algorithm in which no category parameters were resampled for existing categories, the coarticulation model showed the ability to recover parameters of phonetic categories in a toy corpus, recovering more accurate parameters than the infinite mixture model and separating categories that were merged by the infinite mixture model.

Immediate future work should address inference of category parameters, finding a way to recover the posterior distribution on \(\mu_c\) and \(\sigma_c\) without the need to invert an \(N \times N\) matrix. It is likely possible to find the posterior distribution on \(\mu_c\) given a fixed value of \(\sigma_c\), as the degree of contribution of neighboring sounds changes only with changes in \(\sigma_c\).

Language learners need to solve several problems at once: learning category assignments and parameters, but also learning the particular coarticulatory patterns of their language, and sequential dependencies between categories. Dependencies between neighboring categories have been especially difficult to deal with because most work in phonetic category acquisition has used exchangeable models, which by definition assign equal probability to any ordering of sounds in the corpus. This paper has proposed a framework for sequential dependencies in which these dependencies are characterized by interacting weighted constraints, following work in formal linguistics [6, 8], and has begun exploring the type of inference that can be performed in such a model.

References
[1] Gautam K. Vallabha, James L. McClelland, Ferran Pons, Janet F. Werker, and Shigeaki Amano. Unsupervised learning of vowel categories from infant-directed speech. Proceedings of the National Academy of Sciences, 104:13273–13278, 2007.
[2] Bob McMurray, Richard N. Aslin, and Joseph C. Toscano. Statistical learning of phonetic categories: insights from a computational approach. Developmental Science, 12(3):369–378, 2009.
[3] Naomi H. Feldman, Thomas L. Griffiths, and James L. Morgan. Learning phonetic categories by learning a lexicon. In N. A. Taatgen and H. van Rijn, editors, Proceedings of the 31st Annual Conference of the Cognitive Science Society, pages 2208–2213. Cognitive Science Society, Austin, TX, 2009.
[4] James L. Hillenbrand, Michael J. Clark, and Terrance M. Nearey. Effects of consonant environment on vowel formant patterns. Journal of the Acoustical Society of America, 109(2):748–763, 2001.
[5] Brian Dillon, Ewan Dunbar, and William Idsardi. A single-stage approach to learning phonological categories: Insights from Inuktitut. In preparation.
[6] G. Legendre, Y. Miyata, and P. Smolensky. Harmonic grammar: A formal multi-level connectionist theory of linguistic well-formedness: Theoretical foundations. Technical Report 90-5, Institute of Cognitive Science, University of Colorado, 1990.
[7] Sharon Goldwater and Mark Johnson. Learning OT constraint rankings using a maximum entropy model. Proceedings of the Workshop on Variation within Optimality Theory, 2003.
[8] Edward Flemming. Scalar and categorical phenomena in a unified model of phonetics and phonology. Phonology, 18:7–44, 2001.
[9] Radford M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Technical Report No. 9815, Department of Statistics, University of Toronto, 1998.

Interword Spacing and Landing Position Effects During Chinese Reading in Children and Adults

  • Chuanli Zang
  • Feifei Liang
  • Xuejun Bai
  • Guoli Yan
  • Simon P. Liversedge
Abstract

本研究は,単語間にスペースのある中国語テキストを読んでいるときと,スペースの区切りのない中国語テキストを読んでいるときの子供と大人の眼球運動を調査した. その結果,単語間のスペースは子供と大人の初回読み時間と再固定確率を減少させており,単語間のスペースが単語の同定を促進することを示した.

The present study examined children and adults’ eye movement behavior when reading word spaced and unspaced Chinese text. The results showed that interword spacing reduced children and adults’ first pass reading times and refixation probabilities indicating spaces between words facilitated word identification. Word spacing effects occurred to a similar degree for both children and adults, though there were differential landing position effects for single and multiple fixation situations in both groups; clear preferred viewing location effects occurred for single fixations, whereas landing positions were closer to word beginnings, and further into the word for adults than children for multiple fixation situations. Furthermore, adults targeted refixations contingent on initial landing positions to a greater degree than did children. Overall, the results indicate that some aspects of children’s eye movements during reading show similar levels of maturity to adults, while others do not.

Learning Phonemes With a Proto-Lexicon

  • Andrew Martin (1, 2), Sharon Peperkamp (1), Emmanuel Dupoux (1)
  • Cognitive Science 37 (2013) 103–124
  • Copyright © 2012 Cognitive Science Society, Inc. All rights reserved.
  • ISSN: 0364-0213 print / 1551-6709 online
  • DOI: 10.1111/j.1551-6709.2012.01267.x
  1. Laboratoire de Sciences Cognitives et Psycholinguistique (EHESS-ENS-CNRS)
  2. Laboratory for Language Development, RIKEN Brain Science Institute
  • Received 26 May 2011; received in revised form 28 November 2011; accepted 5 March 2012
Abstract

Before the end of the first year of life, infants begin to lose the ability to perceive distinctions between sounds that are not phonemic in their native language. It is typically assumed that this developmental change reflects the construction of language-specific phoneme categories, but how these categories are learned largely remains a mystery. Peperkamp, Le Calvez, Nadal, and Dupoux (2006) present an algorithm that can discover phonemes using the distributions of allophones as well as the phonetic properties of the allophones and their contexts. We show that a third type of information source, the occurrence of pairs of minimally differing word forms in speech heard by the infant, is also useful for learning phonemic categories and is in fact more reliable than purely distributional information in data containing a large number of allophones. In our model, learners build an approximation of the lexicon consisting of the high-frequency n-grams present in their speech input, allowing them to take advantage of top-down lexical information without needing to learn words. This may explain how infants have already begun to exhibit sensitivity to phonemic categories before they have a large receptive lexicon.

  • Keywords: First language acquisition; Statistical learning; Phonemes; Allophonic rules
1. Introduction

Infants acquire the fundamentals of their language very quickly and without supervision. In particular, during the first year of life, they converge on the set of phonemic categories for their language (Kuhl et al., 2008; Werker & Tees, 1984), they become attuned to its phonotactics (Jusczyk, Friederici, Wessels, Svenkerud, & Jusczyk, 1993), and they begin to extract words from continuous speech (Jusczyk & Aslin, 1995). Despite a wealth of studies documenting these changes (see reviews in Jusczyk, 2000; Kuhl, 2004; Werker & Tees, 1999), very little is known about the computational mechanisms involved in this rapid acquisition process.

  • Correspondence should be sent to Andrew Martin, Laboratory for Language Development, RIKEN Brain Science Institute, 2-1 Hirosawa, Wako-shi, Saitama, 351-0198, Japan. E-mail: amartin@brain.riken.jp

One reason for this state of affairs is that most computational modeling studies have narrowly focused on one learning sub-problem, assuming the others can be solved independently. For instance, proposed mechanisms for word segmentation have typically assumed that phonemic categories have been acquired beforehand (Brent, 1999; Goldwater, Griffiths, & Johnson, 2009; Venkataraman, 2001). Similarly, models of the phonetic ⁄ phonological buildup of phonemic categories assume that these categories have to be constructed or refined through generalization over lexical items (Pierrehumbert, 2003). Of course, this presupposes that a lexicon has been acquired beforehand. Finally, models of grammar induction for learning stress systems or phonotactics make the assumption that both the phonemic categories and the lexicon of underlying forms have already been learned (Dresher & Kaye, 1990; Tesar & Smolensky, 1998). These circularities illustrate what is known as the ‘‘bootstrapping problem,’’ which is the apparently insoluble problem that faces infants when they have to acquire several co-dependent levels of linguistic structure. Therefore, while the existing computational studies have allowed us to better understand individual pieces of the puzzle, taken together, they do not coalesce into a coherent theory of early language acquisition. This is especially true when one considers that the experimental data show that infants do not learn phonemic categories, lexical entries, and phonotactic regularities one after the other, but rather, start learning these levels almost simultaneously, between 5 and 9 months of age (see Kuhl, 2004 for a review).

A second reason for our lack of comprehension of this learning process is that most proposed algorithms have only been tested on artificial or simplified language inputs, and only assume that they can scale up to more realistic inputs like raw speech; in several instances, however, such assumptions have turned out to be incorrect. For example, it has been claimed that phonetic categories emerge through an unsupervised statistical learning process whereby infants track the modes of the distributions of phonetic exemplars (Maye, Weiss, & Aslin, 2008; Maye, Werker, & Gerken, 2002) or perform perceptual warping of irrelevant phonetic dimensions (Jusczyk, 1993; Kuhl et al., 2008). Modeling studies have found that Self Organizing Maps (Gauthier, Shi, & Xu, 2007a,b; Guenther & Gjaja, 1996), Gaussian mixtures (de Boer & Kuhl, 2003; Mcmurray, Aslin, & Toscano, 2009; Vallabha, McClelland, Pons, Werker, & Amano, 2007), or Hidden Markov Models (Goldsmith & Xanthos, 2009) can converge on abstract categories for tones, vowels, or consonants in an unsupervised fashion, that is, without any kind of lexical or semantic feedback. However, these clustering algorithms have only been applied to a fragment of the phonology or to individual speech dimensions segmented or extracted by hand (such as F0 contours, F1 and F2, vowel duration, or VOT). Do they scale up to the full complexity of unsegmented speech signals? Varadarajan, Khudanpur, and Dupoux (2008) applied unsupervised clustering (a version of Gaussian modeling using successive state splitting) on 40 h of raw conversational speech and found that the algorithm constructed a network of states that failed to correspond in a one-to-one fashion to phonemic categories. Although the states were sufficient to sustain state-of-the-art speech recognition performance, they did so by encoding a very large number of heavily context-dependant discrete acoustic (subphonemic) events. Evidently, blind unsupervised clustering on raw speech needs to be supplemented by further processes in order to adequately model the early construction of phonemic categories by infants.

It is important to realize that the failure to derive abstract phonemic categories from the raw signal using unsupervised clustering in Varadarajan et al. (2008) is not a fluke of the particular algorithm they used but reflects a deep property of running speech signals: The acoustic realization of phonemes is massively context-dependent, creating a great deal of overlap between phoneme categories (see Pierrehumbert, 2003). One could argue that gradient coarticulation effects and abstract linguistic rules are fundamentally different, in that the former are the result of universal principles which could be undone by infants without needing any language-specific knowledge. If this were the case, perhaps the wealth of variants discovered by ASR systems could be reduced to a small, orderly set of allophones. There is substantial evidence, however, that even fine-grained coarticulation varies across languages ¨ (Beddor, Harnsberger, & Lindemann, 2002; Choi & Keating, 1991; Manuel, 1999; Ohman, 1966), and that perceptual compensation on the part of listeners shows language-specific effects (Fowler, 1981; Fowler & Smith, 1986; Lehiste & Shockey, 1972). The boundary between gradient phonetic effects and discrete phonological rules is thus difficult to draw, especially from the viewpoint of infants who have not yet acquired a system of discrete categories.

Constructing a single context-independent Gaussian model for each abstract phoneme is therefore bound to yield more identification errors when compared to finer-grained models that make use of contextual information. This is well known in the speech recognition literature, where HMM models of abstract phonemes yield worse performance than models of contextual diphone- or triphone-based allophones (Lee, 1988; Makhoul & Schwartz, 1995). Typically, a multi-talker recognition system requires between 2,000 and 12,000 contextual allophones (representing between 50 and 300 allophones per phoneme) in order to achieve good performance (Jelinek, 1998).

Of course, one could propose that infants adopt the same strategy—simply compile a large number of such fine-grained allophonic categories and use them for word learning. But this would be unwise, since massive allophony causes the performance of word segmentation algorithms to drop dramatically (Rytting, Brew, & Fosler-Lussier, 2010). Boruta, Peperkamp, Crabbé, and Dupoux (2011) ran two such algorithms—the MBDP-1 of Brent (1999) and the NGS-u of Venkataraman (2001)—on transcripts of child-directed speech in which contextual allophony has been implemented. They found that when the input contains an average of 20 allophones per phoneme, the performance of both algorithms falls below that of a control algorithm which simply inserts word boundaries at random. In addition, there is empirical evidence that infants do not do this, since before they have a large comprehension lexicon toward the end of their first year of life, they have begun to lose the ability to discriminate within-category distinctions (e.g., Dehaene-Lambertz & Baillet, 1998; Werker & Tees, 1984) and pay less attention to allophonic contrasts compared to phonemic ones (Seidl, Cristia, Bernard, & Onishi, 2009).

This does not mean that fine-grained phonetic information is necessarily disregarded, as speakers could make use of both phonetic detail and abstract categories (Cutler, Eisner, McQueen, & Norris, 2010; Ramus et al., 2010). Evidence that phonetic detail is represented includes results showing that both infants and adults use allophonic differences for the purposes of word segmentation (Gow & Gordon, 1995; Jusczyk, Hohne, & Bauman, 1999; Nakatani & Dukes, 1977), and that adults are able to remap allophones to abstract phonemes with little training (Kraljik & Samuel, 2005; Norris, McQueen, & Cutler, 2003). Similarly, both adults and infants make use of within-category phonetic variation during lexical access (McMurray & Aslin, 2005; McMurray, Aslin, Tanenhaus, Spivey, & Subik, 2008).

In this paper, we revisit the issue of early acquisition of phoneme categories, while attempting to address simultaneously the two problems mentioned above, that is, the circularity problem and the scalability problem. We build our approach on the work described in two recent papers: Peperkamp et al. (2006) and Feldman, Griffiths, and Morgan (2009).

Peperkamp et al. (2006) proposed that infants construct phoneme categories in two steps: Starting with raw signal, they first construct a rather large number of detailed phonetic categories (allophones), and in a second step they cluster these allophones into abstract phonemes. They demonstrated that two potential sources of information, distributional and phonetic, are potentially helpful in performing this clustering step. Distributional information is useful for category formation because variants of a single phoneme occur in distinct contexts, while phonetic information is relevant due to the fact that such variants tend to be acoustically or articulatorily similar to each other and to share phonetic features with their contexts. However, the phonetic part of the study may have created a circularity problem since it assumed the prior acquisition of phonetic features (which may itself require learning the phonological system; see Clements, 2009). In addition, the entire study may have a scalability problem, as it only tested a very small number of allophones in each language (7 allophones of French in Peperkamp et al., 2006; 15 allophones of Japanese in Le Calvez, Peperkamp, & Dupoux, 2007). This falls short of the range of allophonic ⁄ coarticulatory variants that are needed to achieve reasonable performance in speech recognition, which are on the order of a few thousand variants. Since we do not know the granularity of the categories constructed by actual infants, in the present study we manipulate the number of variants in a parametric fashion, from a few dozen to a few thousand.

Feldman et al. (2009) showed, using a Bayesian model, that it can be more beneficial to simultaneously learn a lexicon and phoneme categories than to learn phoneme categories alone. While non-intuitive, this result makes an extremely important conceptual point, as it shows that potential circularities between learning at the lexical and sublexical levels can be broken up using appropriate learning algorithms, where bottom-up and top-down information are learned simultaneously and constrain each other (see also Swingley, 2009). However, their study may also rest on a potential circularity problem since the words provided to the model were all segmented by hand (whereas, as we know, segmentation depends on the availability of abstract enough phonemic categories; Boruta et al., 2011). It may also have a scalability problem, as it modeled the acquisition problem with toy examples, consisting of a small number of artificial or phonetically idealized categories and a small number of words. Here, we will expand on Feldman et al.’s (2009) idea of simultaneous learning of phonemic units and words, while using a realistically sized corpus as in Peperkamp et al. (2006), incorporating the full phoneme inventory, phonotactic constraints and lexicon. Importantly, in order to avoid the circularity problem mentioned above, we will use an approximate proto-lexicon derived in an unsupervised way from the sublexical representation, without any kind of hand segmentation.

In brief, we combine ideas from both Peperkamp et al. (2006) and Feldman et al. (2009). Following the former, we assume that some kind of initial clustering has yielded a set of discrete segment candidates (the allophonic set). The model’s task is then to group them into equivalence classes in order to arrive at abstract phonemes (the phonemic set). Our approach is not aimed at producing a realistic simulation or an instantiated theory of phonological acquisition in infants. Rather, we are interested in quantitatively evaluating the usefulness of different kinds of information that are available in infant’s input for the purpose of learning phonological categories.

In Experiment 1 we study the scalability of the bottom-up distributional measure used in Peperkamp et al. (2006) and Le Calvez et al. (2007) when the number of allophones is increased beyond an average of two per phoneme. We also implement two types of allophone: those defined by the following context (Experiments 1 and 2), as in Peperkamp et al. (2006) and Le Calvez et al. (2007), and those conditioned by bilateral contexts (Experiment 3), which mimic the triphones used in ASR and allow us to test an even larger number of allophones per phoneme. In Experiments 2 and 3 we implement a new algorithm, incorporating the idea of Feldman et al. (2009) and Swingley (2009) that feedback from a higher level can help the acquisition of a lower level, even if the higher level has not yet been perfectly learned. To assess this quantitatively, we compare the effect of a perfect (supervised) lexicon with that of an approximate proto-lexicon derived by means of a simple unsupervised segmentation mechanism. We conclude by discussing the predictions of this new model regarding the existence and role of approximate word forms during the first year of life.

2. Experiment 1

In this experiment, we examine the performance of Peperkamp et al.’s (2006) algorithm, which uses Kullback–Leibler (KL) divergence as a measure of distance between contexts to detect pairs of allophones that derive from the same phoneme, on corpora with varying numbers of allophones. A pair of segments with high KL divergence (i.e., having dissimilar distributions) is deemed more likely to belong to the same phoneme than one with low KL divergence. The robustness of this algorithm has been demonstrated using pseudo-random artificial corpora, as well as transcribed corpora of French and Japanese infant-directed speech in which nine and fifteen allophones, respectively, had been added to the phoneme inventory (Le Calvez et al., 2007; Peperkamp et al., 2006). It has also been successfully used on the consonants in the TIMIT English database, whose transcriptions include three allophones in addition to the standard phonemic symbols (Kempton & Moore, 2009). In this experiment we greatly increase the number of allophones in the training data in order to determine whether this method can scale up to systems of realistic complexity.

2.1. Method
2.1.1. Corpora

Starting with a corpus consisting of 7.5 million words of spoken Japanese phonemically transcribed by hand (Maekawa, Koiso, Furui, & Isahara, 2000), we created several versions of the corpus in which artificial rules are used to convert each phoneme into several context-dependent allophones, where a context is defined as a following phoneme or utterance boundary. Given that Japanese has 42 phonemes, the maximum number of distinct contexts for a given phoneme is 43. The corpora we created differ in the number of allophones per phoneme that are implemented, ranging from two (each phoneme has one realization occurring in some contexts and another one occurring in all other contexts) to 43 (each phoneme has as many realizations as there are possible distinct contexts).

For each phoneme p in a corpus with n allophones per phoneme, n rules were generated which convert p to one of n allophones p1…pn depending on context. Which contexts trigger which allophones was determined by randomly partitioning the set of all possible contexts into n partitions, one for each rule, and then randomly assigning contexts to each partition. Note that real allophonic contexts are typically grouped into natural classes based on similarity, a property not shared by our random procedure. Finally, the different versions of the corpus were created by applying the rules to the base corpus. Fig. 1 demonstrates the procedure on one utterance for a corpus with two allophones per phoneme (plus symbols (+) represent word boundaries, which are ignored for the purposes of rule application, and the pound symbol (#) represents an utterance boundary). The notation used in Fig. 1 is read as follows: A rule of the form X → Y / __ {A, B, C} states that phoneme X is realized as allophone Y when followed by A, B, or C.

Many of the rules that do not apply in this utterance apply elsewhere in the corpus. Some rules, however, never apply, due to phonotactic sequencing constraints. For instance, the rule assigning the allophone [g1] to /g/ before any of the segments /w, bj, m, p, z, pj, dj, f/ cannot apply because /g/ only occurs before vowels in Japanese. We therefore measure the allophonic complexity of a given corpus by referring to the total number of distinct segments actually occurring in the corpus. The utterance in Fig. 1, for example, contains five phonemes (a, t, m, g, i) but has an allophonic complexity of seven, because after rule application it contains seven unique segments (a1, a2, t1, m2, g2, i1, i2).
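A hypothetical re-implementation of this corpus-construction procedure (randomly partitioning the following contexts into allophone groups, applying rules while skipping word boundaries, and counting allophonic complexity as the number of distinct segments that actually occur):

    import random

    def make_allophone_rules(phonemes, contexts, n_allophones, rng=None):
        """Randomly partition the set of following contexts into n_allophones
        groups per phoneme; real allophonic contexts form natural classes,
        which this random procedure deliberately does not."""
        rng = rng or random.Random(0)
        rules = {}
        for p in phonemes:
            ctxs = list(contexts)
            rng.shuffle(ctxs)
            groups = [[c] for c in ctxs[:n_allophones]]   # each allophone gets >= 1 context
            for c in ctxs[n_allophones:]:
                rng.choice(groups).append(c)
            for k, group in enumerate(groups, start=1):
                for c in group:
                    rules[(p, c)] = f"{p}{k}"             # p -> pk / __ c
        return rules

    def apply_rules(utterance, rules, boundary="#"):
        """Rewrite a phoneme sequence as allophones conditioned on the following
        segment; '+' word boundaries are ignored for rule application."""
        segs = [s for s in utterance if s != "+"]
        return [rules[(p, segs[i + 1] if i + 1 < len(segs) else boundary)]
                for i, p in enumerate(segs)]

    def allophonic_complexity(corpus, rules):
        """Number of distinct segments occurring after rule application."""
        return len({a for utt in corpus for a in apply_rules(utt, rules)})

    # The utterance from Fig. 1, with two allophones per phoneme.
    phonemes = ["a", "t", "m", "g", "i"]
    rules = make_allophone_rules(phonemes, contexts=phonemes + ["#"], n_allophones=2)
    atama_ga_itai = list("atama") + ["+"] + list("ga") + ["+"] + list("itai")
    print(apply_rules(atama_ga_itai, rules))
    print(allophonic_complexity([atama_ga_itai], rules))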

  • Fig. 1. Example of rule application on the utterance atama ga itai ‘‘(my) head hurts,’’ using artificial rules which assign each phoneme one of two allophones.
2.1.2. Procedure

We ran the learning algorithm of Peperkamp et al. (2006) on the corpora as follows.

For each corpus, we began by constructing a list of the segments occurring in that corpus. We then listed all logically possible pairs of the attested segments. In order to assess the algorithm’s performance on each corpus, the remaining list of segment pairs was divided into two types: those that are allophones of the same phoneme (labeled same) and those that are allophones of different phonemes (labeled different). The task of the algorithm is to assign the correct label to each pair of allophones, given the corpus as input. This is done by computing the symmetrized KL divergence, a measure of the difference between two probability distributions, for each pair (Appendix). Because allophones of the same phoneme occur in complementary sets of environments, the KL values for such pairs should tend to be higher than the KL values of pairs of allophones of different phonemes.
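A sketch of that computation; the add-one smoothing of \(P(c \mid s)\) is my own choice to keep the divergence finite, while the estimator actually used is the one given in the Appendix:

    import math
    from collections import Counter, defaultdict

    def context_distributions(corpus, boundary="#"):
        """Estimate P(c | s), the distribution over following contexts for each
        segment, with add-one smoothing so every probability is positive."""
        counts = defaultdict(Counter)
        for utt in corpus:
            for i, s in enumerate(utt):
                c = utt[i + 1] if i + 1 < len(utt) else boundary
                counts[s][c] += 1
        contexts = {c for dist in counts.values() for c in dist} | {boundary}
        dists = {}
        for s, dist in counts.items():
            total = sum(dist.values()) + len(contexts)
            dists[s] = {c: (dist[c] + 1) / total for c in contexts}
        return dists

    def symmetrized_kl(p, q):
        """Symmetrized Kullback-Leibler divergence between two context
        distributions; high values suggest near-complementary distribution."""
        return sum(p[c] * math.log(p[c] / q[c]) + q[c] * math.log(q[c] / p[c])
                   for c in p)

    # dists = context_distributions(allophonic_corpus)
    # kl = {(s1, s2): symmetrized_kl(dists[s1], dists[s2])
    #       for s1 in dists for s2 in dists if s1 < s2}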

2.2. Results and discussion

The KL measure can be used to label pairs of sounds by selecting a cutoff value and labeling those pairs with a KL higher than the cutoff as same and those pairs with a lower KL as different. We evaluate classification performance by means of the q statistic (Bamber, 1975), which represents the probability that, given one same pair and one different pair, each chosen at random, KL divergence assigns a higher value to the same pair. Chance is thus represented by a q of 0.5, and perfect performance (in which there is no overlap between the two categories) by a q of 1.0. Table 1 lists q values for each of the corpus types in Experiment 1.
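The q statistic is simply the probability of a correct ranking over randomly drawn same/different pairs, i.e. the area under the ROC curve; counting ties as one half (a standard convention, not stated in the text) it can be computed directly:

    def q_statistic(kl_same, kl_different):
        """Bamber's (1975) q: the probability that a randomly chosen same pair
        receives a higher KL value than a randomly chosen different pair."""
        wins = 0.0
        for ks in kl_same:
            for kd in kl_different:
                if ks > kd:
                    wins += 1.0
                elif ks == kd:
                    wins += 0.5
        return wins / (len(kl_same) * len(kl_different))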

These results show that for the corpora with the lowest allophonic complexity, KL divergence is fairly effective at distinguishing same from different. This makes sense, because there are many possible ways to divide up the set of contexts, meaning that the probability of two unrelated allophones in a different pair happening to have complementary distributions is relatively low. For the corpora with the maximum number of rules, however, the algorithm performs much less well. When every segment has an extremely narrow distribution, complementary distribution is the rule rather than the exception, and so it is no longer a reliable indicator of pairs of allophones derived from the same phoneme. Unless infants are first able to greatly reduce the number of allophones, examining the distributions of individual segments is not a very efficient way to assign allophones to the appropriate phonemic categories.

  • Table 1 : Performance of KL divergence as a function of allophonic complexity, expressed as q-scores

Mean Allophonic Complexity    q
79.0                          0.852
164.4                         0.692
269.4                         0.632
425.8                         0.592
567.6                         0.562
737.2                         0.548

  • Note. All values are averaged over five corpora generated with the same parameters.
3. Experiment 2

The solution we propose to the dilemma posed by high allophonic complexity takes advantage of the fact that the linguistic input is composed of words. A pair of phonological rules which change phoneme x into x1 when followed by y and into x2 when followed by z will cause most words that end in the phoneme x to occur in two variants: one that ends in x1 (when the word is followed by y) and one that ends in x2 (when the word is followed by z). Thus, encountering a pair of word forms that differ only in that one ends in x1 and the other ends in x2 is a clue that x1 and x2 are allophones of the same phoneme—conversely, never encountering such a word form pair (in a sufficiently large sample) is a clue that x1 and x2 are allophones of different phonemes. The Japanese word atama "head," for example, appears as a1t1a2m2a2 before the nominative marker ga in the utterance in Fig. 1, but it would appear as a1t1a2m2a1 before the word to "and." The presence of both word forms in the infant’s input is evidence that a1 and a2 are allophones of the phoneme /a/. This is not an infallible learning strategy, since every language contains minimal pairs, different words which by chance differ only in a single segment (e.g., kiku "listen" and kiki "crisis" in Japanese), which could result in allophones of different phonemes being misclassified as belonging to the same phoneme. However, as long as the number of minimal pairs is small relative to the number of word form pairs derived from the same word, the strategy will be effective.

A learner who knows where the word boundaries are could filter out those segment pairs that are not responsible for multiple word forms, before using KL divergence to classify the remaining pairs. Of course, such a word form filter would unrealistically require perfect knowledge of word boundaries, and thus the set of attested word forms, on the part of the learner. Infants who have not discovered word boundaries yet, however, can approximate the set of word forms by compiling a list of high-frequency strings that occur in the input. In this experiment, we implement both a word form filter and an n-gram filter; the latter is identical to the former except that the set of strings used to construct it is made up of the most frequent n-grams occurring in the corpus for a range of values of n.

3.1. Method
3.1.1. Corpora

We used both the same Japanese corpora as in Experiment 1 and Dutch corpora constructed in a similar manner. Dutch was added as a test language to ensure that any results are not due to specific properties of Japanese. The base corpus was a nine-million-word corpus of spoken Dutch (Corpus Gesproken Nederlands—Spoken Dutch Corpus; Oostdijk, 2000). The orthographic transcriptions in the Dutch corpus were converted to phonemic transcriptions using the pronunciation lexicon supplied with the corpus. Rules triggered by the following context were then implemented as described for Japanese in Experiment 1, with the difference that the maximum number of contexts and hence of allophones per phoneme in Dutch is 51 (50 phonemes plus the utterance boundary).

3.1.2. Procedure

For the word form filter, two segments A and B were considered potential allophones of the same phoneme if the corpus contained at least one pair of words XA and XB, where X is a string containing at least three segments. Words shorter than four segments were ignored because of the higher probability of minimally differing pairs occurring by chance among these words. Any segment pairs that did not meet these conditions were labeled as different; then, KL divergence was calculated for all remaining pairs as described in section 2.1. The n-gram filter works in the same way, except that XA and XB are frequent n-grams rather than words. We used the top 10% most frequent strings of lengths 4, 5, 6, 7, and 8 as surrogate word forms (very short strings tend to generate too many false alarms, while very long strings occur too infrequently to be informative).
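A sketch of the two filters, treating word forms and n-grams as sequences of segments; reading "top 10% most frequent strings" as the top 10% of the distinct n-grams of each length is my interpretation:

    from collections import Counter

    def word_form_filter(word_forms, min_stem=3, min_len=4):
        """Pairs of final segments (A, B) licensed by word form pairs XA / XB
        sharing a stem X of at least min_stem segments; forms shorter than
        min_len segments are ignored."""
        finals_by_stem = {}
        for w in word_forms:
            w = tuple(w)
            if len(w) < min_len or len(w) - 1 < min_stem:
                continue
            finals_by_stem.setdefault(w[:-1], set()).add(w[-1])
        passed = set()
        for finals in finals_by_stem.values():
            finals = sorted(finals)
            for i in range(len(finals)):
                for j in range(i + 1, len(finals)):
                    passed.add((finals[i], finals[j]))
        return passed

    def ngram_filter(corpus, lengths=range(4, 9), top_fraction=0.10):
        """Same test, but the surrogate word forms are the most frequent
        n-grams of each length from 4 to 8, extracted without any word
        segmentation."""
        surrogates = []
        for n in lengths:
            counts = Counter(tuple(utt[i:i + n])
                             for utt in corpus
                             for i in range(len(utt) - n + 1))
            keep = max(1, int(top_fraction * len(counts)))
            surrogates += [g for g, _ in counts.most_common(keep)]
        return word_form_filter(surrogates)

    # Pairs not passed by a filter are labeled "different" outright; the
    # remaining pairs are then ranked by KL divergence as in Experiment 1.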

Otherwise, the procedure is the same as in Experiment 1.

3.2. Results and discussion

As in Experiment 1, we evaluated the algorithms by means of the q statistic. For the word form and n-gram filters, KL divergence was computed as in Experiment 1, but only for those pairs passed by the filter. All pairs labeled different by the filter were assigned a KL value of −1, so that they were lower than the values of all other pairs. Fig. 2 compares the results of KL divergence in combination with the word form filter and the n-gram filter to the results of Experiment 1 (KL alone).

As in Experiment 1, the performance of the KL measure alone degrades as allophonic complexity increases, eventually approaching chance level. This degradation appears to be exponential in shape; that is, q drops below 0.7 for corpora containing between 200 and 300 unique segments, showing a rapid loss of performance in the presence of moderate allophonic complexity. In sharp contrast, the performance of the algorithm using the word form filter either increases or slowly decreases with allophonic complexity, with q remaining above 0.7 even on corpora of maximal allophonic complexity. Finally, the performance of the algorithm using the n-gram filter is intermediate, showing only a moderate decrease in performance with allophonic complexity and a q higher than 0.65 on corpora of maximal allophonic complexity.

These results attest to the usefulness of top-down lexical information in learning phonemes, even for data that contain a large number of allophones. Crucially, the n-gram filter, although not as effective as the word form filter, is substantially more resistant to allophonic complexity than the KL measure alone.

  • Fig. 2. Performance of allophone clustering (q-score) as a function of allophonic complexity measured by the number of following-context allophones in the corpus, for three algorithms (KL alone, KL + word form filter, and KL + n-gram filter), on Japanese input (left panel) and Dutch input (right panel).

Each point represents the mean performance of the algorithm on five corpora randomly generated using identical parameters. Error bars indicate the standard error across all five corpora.

In order to assess the added value of the KL measure, we compared the n-gram filter with KL to the n-gram filter with a random measure that simply assigns same and different labels with a probability of 0.5 each. The extent to which KL contributes to the q value above and beyond the contribution made by the filter will be reflected in the size of the difference between the performance of KL and that of the random measure. Table 2 displays this difference (i.e., q(filter + KL measure) – q(filter + random measure)) for each of the corpora described in Fig. 2.

The table demonstrates that the only corpora for which the KL measure substantially contributes to discriminability are the simplest ones, those averaging fewer than two allophones per phoneme. On all other corpora, KL improves performance either only slightly or not at all. Hence, for all but the simplest corpora, the filter does almost all of the work of increasing discriminability, with little or no contribution made by the KL measure.

In the results presented above, we used n-grams ranging from four to eight segments in length. In order to justify this choice of values, and discuss the effects on performance of using n-grams of different lengths, we present in Fig. 3 the discrimination performance achieved if individual values for n are used instead of the combination of multiple lengths we used. Several things are clear from this chart. First, using 3-grams alone results in substantially worse performance than any other n-gram length. Second, although performance improves as n-gram length increases, there is very little difference in the range from 4- to 9-grams. Third, the effects of n-gram length on performance are very similar for the two languages. The fact that n-grams behave similarly in languages as different as Japanese and Dutch offers hope that the 4- to 8-gram range will prove effective for a wide range of languages. Finally, Fig. 3 demonstrates that combining 4- through 8-grams results in better performance than any n-gram length by itself.

  • Table 2 : Difference in performance of allophone clustering (q-score) between KL + n-gram filter and random measure + n-gram filter.

                    Japanese                              Dutch
Mean Allophonic     Mean Advantage      Mean Allophonic   Mean Advantage
Complexity          of KL               Complexity        of KL
79.0                0.223               92.8              0.180
164.4               0.051               220.8             0.051
269.4               0.006               409.6             0.003
425.8               -0.003              741.8             -0.004
567.6               -0.004              1,037.8           -0.002
737.2               0.000               1,334.2           -0.001
                                        1,633.0           0.000
  • Note. All values are averaged over five corpora generated with the same parameters.
  • Fig. 3. Performance of allophone clustering (q-score) as a function of allophonic complexity measured by the number of following-context allophones in the corpus, for n-grams of lengths 3, 4, 5, 6, 7, 8, and 9 on Japanese input (circles) and Dutch input (triangles). The rightmost point indicates the performance of a combination of 4- through 8-grams. Each point indicates the mean performance over the entire range of corpora for that language.

Because both the word and n-gram filters rely on minimally differing pairs of word forms, they are vulnerable to noise caused by the occurrence of pairs of words in the input that have different meanings but happen to differ by a single segment. For example, in Japanese, verbs whose non-past forms end in -ku have a corresponding imperative form ending in -ke, as in aruku "walk" and aruke "walk-imp." Despite the fact that the vowels /u/ and /e/ are different phonemes in Japanese, the existence of such verbal pairs may prevent the word filter from recognizing these vowels as different phonemes. The extent to which this is a problem for the algorithm, of course, depends on the number of such minimally differing word pairs, compared to the number of word pairs created by the phonological rules. Tables 3 (Japanese) and 4 (Dutch) give, for each corpus type, the numbers of hits—allophone pairs correctly passed by the word filter—as well as the number of false alarms—allophone pairs derived from different phonemes that are incorrectly passed by the word filter due to the presence of minimal word pairs.

Two trends may be observed in these results. First, unsurprisingly, the number of hits increases as the allophonic complexity increases. This is a straightforward consequence of the fact that the overall number of allophone pairs increases with complexity. More unusual is the relationship between corpus complexity and false alarms—as the number of allophones in the corpus increases, the number of false alarms triggered by minimally differing words at first increases, and then decreases. This U-shaped trend is caused by two opposing forces: first, as with hits, the number of possible allophone pairs increases with the number of allophones. Second, however, as the number of allophones increases, the range of contexts assigned to each allophone shrinks. This means that the penultimate segments in the two words will be less likely to be grouped in the same phonemic category. To use the Japanese word for "walk" as an example, the only way that aruku and aruke will be mistakenly categorized as variants of the same word is if the allophones of /k/ that occur in each word are also treated as belonging to the same phonemic category, a mistake which is unlikely if /k/ is split into a high number of allophones, and impossible if /k/ is split into the maximum possible number of allophones (since "/k/ before /u/" and "/k/ before /e/" will always be treated as different allophones).

  • Table 3 : Numbers of hits versus false alarms passed by the word form filter in Japanese corpora.

Mean Allophonic Complexity    Same Pairs Passed by Filter (hits)    Different Pairs Passed by Filter (false alarms)
79.0                          11.0                                  210.0
164.4                         109.2                                 942.6
269.4                         474.8                                 2,221.4
425.8                         1,818.8                               3,176.8
567.6                         3,974.0                               2,595.2
737.2                         7,802.8                               0.0

  • Note. All values are averaged over five corpora generated with the same parameters.

  • Table 4 : Numbers of hits versus false alarms passed by the word form filter in Dutch corpora.

Mean Allophonic Complexity    Same Pairs Passed by Filter (hits)    Different Pairs Passed by Filter (false alarms)
92.8                          29.2                                  297.8
220.8                         247.6                                 707.0
409.6                         918.4                                 969.4
741.8                         2,724.6                               1,198.6
1,037.8                       5,215.4                               960.2
1,334.2                       8,099.2                               276.4
1,633.0                       11,557.0                              0.0

  • Note. All values are averaged over five corpora generated with the same parameters.

In short, the more attention the learner pays to the fine phonetic detail of each allophone, the lower the odds of accidentally mistaking a pair of different words for a pair of word form variants of a single word. This type of mistake only becomes dangerous once the infant has constructed fairly large and abstract categories, meaning that minimal pairs like aruku and aruke are unlikely to pose a serious problem in the early stages of category learning.

4. Experiment 3

Experiments 1 and 2 use phonological rules that are unilaterally conditioned, in particular, rules that are triggered by the phoneme’s following context. Actual phonological processes in natural languages, however, are often conditioned by bilateral contexts. In Korean, for example, stop consonants become voiced when both preceded and followed by a voiced segment (Cho, 1990). In this Experiment we therefore test the algorithms used in Experiments 1 and 2 on data in which allophones are dependent on both the preceding and the following segment.

4.1. Method
4.1.1. Corpora

The corpora in Experiment 3 are based on the same Japanese and Dutch corpora used in Experiment 2. Allophonic rules were implemented as in the previous experiment, with the difference that each context consisted of both a preceding and a following segment.

4.1.2. Procedure

The implementation of the word form and n-gram filters was performed as in Experiment 2, with the exception of how relevant word form pairs were identified. Because the corpora in this experiment contain allophones that are conditioned by bilateral contexts, a pair of word forms (or n-grams) was considered relevant if the initial segments, the final segments, or both differ. Thus, if the corpus contains two word forms AXC and BXD, where X is a string containing at least two segments, the segments A and B are considered potential allophones of the same phoneme, as are C and D. This procedure is able to discover both unilaterally conditioned and bilaterally conditioned rules, and it would be effective in a language with both types of rule. In this experiment, however, we implement only bilateral rules in the training data, as the large number of contexts makes this the most complex possible learning scenario.
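A sketch of the bilateral variant of the filter: word forms AXC and BXD that share an interior string X of at least two segments license the pair of initial segments and the pair of final segments whenever the corresponding segments differ (parameter names are mine):

    def bilateral_filter(word_forms, min_mid=2, min_len=4):
        """Segment pairs licensed by word form pairs AXC / BXD with a shared
        interior X of at least min_mid segments."""
        edges_by_mid = {}
        for w in word_forms:
            w = tuple(w)
            if len(w) < min_len or len(w) - 2 < min_mid:
                continue
            edges_by_mid.setdefault(w[1:-1], []).append((w[0], w[-1]))
        passed = set()
        for edges in edges_by_mid.values():
            for i in range(len(edges)):
                for j in range(i + 1, len(edges)):
                    (a, c), (b, d) = edges[i], edges[j]
                    if a != b:
                        passed.add(tuple(sorted((a, b))))
                    if c != d:
                        passed.add(tuple(sorted((c, d))))
        return passed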

4.2. Results and discussion

As in Experiment 2, we compare the results of KL divergence in combination with the word form filter and the n-gram filter to the results of Experiment 1 (KL alone). The results are shown in Fig. 4.

As in the previous experiments, the KL measure alone yields an exponential drop in performance as the allophonic complexity increases, while the algorithms incorporating a word form filter or an n-gram filter display a stronger resistance to allophonic complexity; for the latter algorithms, performance is around 0.8 and 0.7, respectively, for corpora with maximum allophonic complexity. A comparison of these results with the ones of Experiment 2 reveals that increasing the complexity of the rules themselves by making them sensitive to bilateral contexts does not substantially affect the performance of the two filters—in fact, performance is slightly better for bilateral contexts (this Experiment) than for unilateral contexts (Experiment 2).

  • Fig. 4. Performance of allophone clustering (q-score) as a function of allophonic complexity measured by the number of bilateral allophones in the corpus, for three algorithms (KL alone, KL + word form filter, and KL + n-gram filter), on Japanese input (left panel) and Dutch input (right panel). Each point represents the mean performance of the algorithm on five corpora randomly generated using identical parameters. Error bars indicate the standard error across all five corpora.
5. General discussion

The development of phonetic perception over the first year of life poses a conundrum. By the end of this period, despite having little access to semantic information, infants treat semantically meaningful (i.e., phonemic) and meaningless (i.e., allophonic) phonetic distinctions differently. We have argued that one way out of this conundrum for infants involves building phonemic categories, effectively classifying these distinctions as either phonemic or allophonic, using a procedure that exploits the lexical structure of the input.

We have shown, firstly, that when searching for phonemic categories, a bottom-up procedure which looks for sounds that are in complementary distribution becomes extremely inefficient when allophonic complexity (i.e., the number of allophones) increases. Secondly, we found that adding top-down lexical word form information allows for robust discrimination among segment pairs that belong to the same phoneme and those that belong to different phonemes, even in the presence of increased allophonic complexity. Finally, we have shown that lexical word forms can be crudely approximated with n-grams, which still yield results that are both good and resistant to allophonic complexity. These results are obtained with the same types of contextual variants as used in Peperkamp et al. (2006) and Le Calvez et al. (2007), that is, allophones that depend upon the following context (Experiments 1 and 2), as well as with bilaterally conditioned allophones (Experiment 3). Moreover, the results hold for both Dutch and Japanese, two languages that are very different from the viewpoint of syllable structure and phonotactics.

The reason for the lack of robustness of the bottom-up complementary distribution approach is fairly simple to grasp: as the number of allophones increases, the allophones become more and more tied to highly specific contexts. Ultimately, in a language where all possible bilateral allophones are instantiated, every segment is in complementary distribution with almost every other one, rendering distributional information nearly useless for phonemic categorization. Only when the number of allophonic variants is very small (not more than two per phoneme), and complementary distribution of segments thereby relatively rare, is this type of distributional information useful. Unfortunately for the learner, this means that looking for complementary distribution between segments in the input is only an efficient strategy when the problem has already been almost completely solved.

The top-down approach is successful because it relies on the fact that allophony changes not just individual sounds, but entire word forms, and that for sufficiently long words, the probability that two different words happen to be identical except for their final sounds is very low. Crucially, this fact is independent of the allophonic complexity of the input. Of course, this criterion alone is not sufficient, as there are true minimal pairs in many languages, but they are relatively rare (especially for longer words) and non-systematic, unlike the minimal pairs created by contextual allophony. Finally, the reason for the success of the n-gram strategy is that the low probability of true minimal pairs also applies to frequent n-grams, and this probability does not depend on allophonic complexity.

Our algorithm could be improved and extended in a number of ways. First, instead of using top-down information in an all-or-none fashion, we could implement a statistical model of possible word forms (Swingley, 2005) and use it to compute the probability of obtaining the observed pattern of minimal pairs by accident. Such a procedure would allow us to probabilistically integrate the effect of word length instead of postulating a fixed window of word lengths as in the present study, and it would also be less sensitive to the occasional existence of true minimal pairs. Second, the crude n-gram segmentation strategy could be replaced by some less crude—although still sub-optimal—lexical segmentation procedure (e.g., Brent, 1999; Daland & Pierrehumbert, 2010; Goldwater et al., 2009; Monaghan & Christiansen, 2010; Venkataraman, 2001). Third, as noted in Peperkamp et al. (2006), performance can be improved by providing phonetic constraints as to the nature of allophonic processes. Peperkamp et al., 2006. proposed two such constraints, one to the effect that allophones of the same phoneme must be minimally different from a phonetic point of view, and another to the effect that allophones tend to result from contextual assimilation (feature spreading). How such phonetic constraints can be implemented in a language encoded with massive allophony remains to be assessed. Another example of a possibly helpful linguistic constraint is rudimentary semantic knowledge, which could serve as further evidence that two word forms are in fact realizations of a single word. Even if infants do not know many words, the few words they do know could improve the performance of the n-gram filter (Swingley, 2009). Fourth, our model could be extended to learn patterns of phonological variations that go beyond contextual allophony. Word forms can also vary through processes of insertion or deletion of segments, yielding a multiplication of closely related word forms (Greenberg, 1998). A proto-lexicon has the potential of capturing some of these variations, which would be impossible to do in a purely bottom-up fashion (Peperkamp & Dupoux, 2002).

Of course, the procedure we have described represents only the first step in the learning of phonemic categories. Our algorithm assigns each pair of allophones a rating which indicates how likely that pair is to belong to the same phoneme; a real infant would need to then use these ratings to group allophones into phoneme-sized clusters of allophones. This next step is not trivial. Given pairs of allophones and ratings, the infant must decide on an appropriate cutoff value above which allophones will be considered members of the same cluster. Choosing the optimal cutoff will depend on the relative costs of false alarms (allophones incorrectly grouped together) and misses (allophones incorrectly placed in different clusters). At this point, very little is known about what these costs might be or how easy it is for infants to recover from errors at this stage of learning. Although modeling this entire process is thus beyond the scope of the present article, we hope that our results provide a foundation for future progress on these questions.

We also emphasize that, although we have couched our proposal in terms of a specific algorithm, our findings allow us to draw more general conclusions that go beyond the question of whether infants use this exact procedure to learn phonemes. These results demonstrate, first, that the lexical structure of speech input contains information on phonemic categories that is missed by approaches which focus only on sublexical units, and second, that this lexical information can be extracted from data of realistic complexity, even using an extremely simple procedure. It is therefore likely that any approach to learning phonemes would benefit from making use of top-down lexical information.

We should, however, point out that our approach employs a number of simplifications that make it unable to address the entire complexity of the acquisition problem. First, the assumption that the learner starts by establishing a large set of discrete allophones may not adequately capture some of the phonetic effects of between-talker variation, nor the more continuous effects of speaking rate, or variability induced by noise in the transmission channel. Clearly, adequate signal preprocessing ⁄ normalization is needed if a running speech application is envisioned. Second, as mentioned earlier, the use of a minimal pair constraint may be problematic in languages with mono-segmental inflectional affixes that create systematic patterns of word-final or word-initial minimal pairs (as in the Japanese aruku-aruke example mentioned in section 3.2). To solve these cases, the proto-lexicon of word forms must be supplemented with semantic information which may only be acquired later during development (Regier, 2003). This is consistent with the view that the acquisition of phonemic categories is not closed by the end of the first year of life but continues to be refined thereafter (Sundara, Polka, & Genesee, 2006). Third, we should note that our procedure can only discover phonological rules that operate at word boundaries; a context-dependent rule that only ever applies within a word will not create word form variants in the way discussed here and will have to be learned in some other way. But precisely because such rules, if applied consistently, do not create multiple word forms, they do not impede word recognition or segmentation and so are not as crucial for a language-learning infant.

The traditional bottom-up scenario of early language acquisition holds that infants begin by learning the phonemes and constraints upon their sequencing during the first year of life (Jusczyk et al., 1993; Pegg & Werker, 1997; Werker & Tees, 1984) and then learn the lexicon on the basis of an established phonological representation (Clark, 1995). While infants have been shown to be capable of extracting phonological regularities in a non-lexical, bottom-up fashion (Chambers, Onishi, & Fisher, 2003; Maye et al., 2002; Saffran & Thiessen, 2003; White, Peperkamp, Kirk, & Morgan, 2008), our results cast serious doubt on the idea that such a mechanism is by itself sufficient to establish phonological categories. Indeed, we have shown that attempts to re-cluster allophones in a bottom-up fashion based on complementary distributions is inefficient in the face of massive allophony. However, we also showed that it is possible to replace the bottom-up scenario with one that is nearly the reverse, in which an approximate lexicon of word forms is used to acquire phonological regularities. In fact, an interactive scenario could be proposed, in which an approximate phonology is used to yield a better lexicon, which in turn is used to improve the phonology, and so on, until both phonology and the lexicon converge on the adult form (Swingley, 2009).

The present approach opens up novel empirical research questions in infant language acquisition. For instance, does a proto-lexicon exist in infants before 12 months of age? If so, what are its size and growth rate? Ngon et al. (in press) provide preliminary answers to these questions. They found that 11-month-old French-learning infants are able to discriminate highly frequent n-grams from low-frequency n-grams, even when neither set of stimuli contained any actual words. This suggests that at this age, infants have indeed constructed a proto-lexicon of high-frequency sequences which consists of a mixture of words and nonwords. The Ngon et al. (in press) study raises the question of how to estimate the size and composition of the proto-lexicon as a function of age. Such an estimation should be linked to modeling studies like the present one, in order to determine the extent to which the proto-lexicon can help the acquisition of phonological categories.[4]

There are other questions raised by our results that will be more difficult to answer. In particular, does the growth of the proto-lexicon predict the acquisition of phonological categories and of phonological variation? And how does the acquisition of phonology help the acquisition of a proper lexicon of word forms? These questions have been neglected in the past, perhaps because of the belief that a proper lexicon cannot be learned before phonemic categories are acquired. The present results, however, suggest that an understanding of lexical acquisition will be a fundamental component of a complete theory of phonological acquisition. Clearly, more research is needed to understand the mechanisms that could make it possible to simultaneously learn lexical and phonological regularities, and whether infants can use these mechanisms during the first year of life.

Appendix: Kullback-Leibler measure of dissimilarity between two probability distributions

Let \(s\) be a segment, \(c\) a context, and \(P(c \mid s)\) the probability of observing \(c\) given \(s\). Then the Kullback-Leibler measure of dissimilarity between the distributions of two segments \(s_1\) and \(s_2\) is defined as:

\[KL(s_1, s_2) = \sum_c P(c \mid s_1) \log \frac{P(c \mid s_1)}{P(c \mid s_2)} + P(c \mid s_2) \log \frac{P(c \mid s_2)}{P(c \mid s_1)}\]

with

\[P(c \mid s) = \frac{n(c,s) + 1/N}{n(s) + 1}\]

where

  • \(n(c,s)\) is the number of occurrences of the segment \(s\) in the context \(c\) (i.e., the number of occurrences of the sequence \(sc\)),
  • \(n(s)\) is the number of occurrences of the segment \(s\),
  • and \(N\) is the total number of contexts.
  • In order to smooth the probability estimates of the distributions in finite samples, 1/N occurrence of each segment is added in each context, where \(N\) is the total number of contexts.
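As a concreteness check, the appendix can be turned into a few lines of code. The sketch below is only an illustration (not the authors' implementation): the corpus format, the restriction to right-hand contexts, and the symmetric form of the measure are my own assumptions.

```python
# Minimal sketch: smoothed context distributions P(c|s) and their symmetric
# KL dissimilarity for two segments. Corpus format and segment labels are
# hypothetical; only right-hand contexts (sequences sc) are counted here.
import math
from collections import Counter

def context_counts(corpus, segment):
    """Count the contexts c that follow the segment s (occurrences of sc)."""
    counts = Counter()
    for utterance in corpus:                       # an utterance is a list of segments
        for s, c in zip(utterance, utterance[1:]):
            if s == segment:
                counts[c] += 1
    return counts

def smoothed_distribution(counts, contexts):
    """P(c|s) = (n(c,s) + 1/N) / (n(s) + 1): 1/N of an occurrence per context."""
    N = len(contexts)
    total = sum(counts.values()) + 1.0             # adding 1/N in N contexts adds 1
    return {c: (counts.get(c, 0) + 1.0 / N) / total for c in contexts}

def kl_dissimilarity(p, q, contexts):
    """Symmetric KL: sum over contexts of both directed divergences."""
    return sum(p[c] * math.log(p[c] / q[c]) + q[c] * math.log(q[c] / p[c])
               for c in contexts)

# Toy usage with two hypothetical allophone labels "t" and "th"
corpus = [["a", "t", "a", "th", "i"], ["i", "t", "i", "th", "a"]]
contexts = sorted({seg for utt in corpus for seg in utt})
p1 = smoothed_distribution(context_counts(corpus, "t"), contexts)
p2 = smoothed_distribution(context_counts(corpus, "th"), contexts)
print(kl_dissimilarity(p1, p2, contexts))
```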
Notes
  1. Dale and Fenson (1996) found that English-learning 11-month-old infants comprehended an average of 113 words.
  2. Note that contexts for our rules are limited to underlying phonemes. In actual languages, the outputs of some rules can serve as the inputs to other rules, further complicating the learning process.
  3. A number of words present in the Dutch orthographic corpus (largely proper nouns) were not listed in the pronunciation lexicon. We eliminated any utterances containing these words from our corpora, resulting in a roughly 20% reduction in corpus size.
  4. For instance, the present algorithm uses the 10% most frequent n-grams as a protolexicon. Given the size of the corpus, this turns out to be a rather large set (over a million word forms). The use of a more realistic segmentation procedure would certainly cut down this number and bring it closer to the size of the protolexicon as it could be measured in infants.
  5. With bilateral contexts, implementing our algorithm becomes computationally prohibitive on the most complex corpora. The rightmost points in Fig. 4 represent the most complex corpora we were able to process given our available resources.
Acknowledgments

This research was made possible by support from the Centre National de la Recherche Scientifique and the RIKEN Brain Science Institute, in addition to grants ANR-2010BLAN-1901-1 from the Agence Nationale pour la Recherche and ERC-2011-AdG295810 from the European Research Council.

References
  • Bamber, D. (1975). The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology, 12, 387–415.
  • Beddor, P. S., Harnsberger, J. D., & Lindemann, S. (2002). Language-specific patterns of vowel-to-vowel coarticulation: acoustic structures and their perceptual correlates. Journal of Phonetics, 30 (4), 591–627.
  • de Boer, B., & Kuhl, P. K. (2003). Investigating the role of infant-directed speech with a computer model. Acoustics Research Letters Online, 4(4), 129–134.
  • Boruta, L., Peperkamp, S., Crabbe, B., & Dupoux, E. (2011). Testing the robustness of online word segmentation: Effects of linguistic diversity and phonetic variation. Proceedings of CMCL, ACL., 2, 1–9.
  • Brent, M. R. (1999). An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning, 34(1–3), 71–105.
  • Chambers, K. E., Onishi, K. H., & Fisher, C. (2003). Infants learn phonotactic regularities from brief auditory experience. Cognition, 87(2), B69–B77.
  • Cho, Y. Y. (1990). Syntax and phrasing in Korean. In S. Inkelas & D. Zec (Eds.), The phonology-syntax connection (pp. 47–62). Chicago: University of Chicago Press.
  • Choi, J. D., & Keating, P. (1991). Vowel-to-vowel coarticulation in three Slavic languages. UCLA Working Papers in Phonetics, 78, 78–86.
  • Clark, E. V. (1995). The lexicon in acquisition. Cambridge, England: Cambridge University Press.
  • Clements, G. N. (2009). The role of features in phonological inventories. In E. Raimy & C. E. Cairns (Eds.),
  • Contemporary views on architecture and representations in phonological theory (pp. 19–68). Cambridge, MA: MIT Press.
  • Cutler, A., Eisner, F., McQueen, J., & Norris, D. (2010). How abstract phonemic categories are necessary for coping with speaker-related variation. Papers in Laboratory Phonology, 10, 91–111.
  • Daland, R., & Pierrehumbert, J. (2010). Learning diphone-based segmentation. Cognitive Science, 35(1), 119–155.
  • Dale, P., & Fenson, L. (1996). Lexical development norms for young children. Behavior Research Methods, 28(1), 125–127.
  • Dehaene-Lambertz, G., & Baillet, S. (1998). A phonological representation in the infant brain. NeuroReport, 9(8), 1885.
  • Dresher, B. E., & Kaye, J. D. (1990). A computational learning model for metrical phonology. Cognition, 34(2), 137–195.
  • Feldman, N., Griffiths, T., & Morgan, J. (2009). Learning phonetic categories by learning a lexicon. Proceedings of the 31st Annual Conference of the Cognitive Science Society, 2208–2213.
  • Fowler, C. A. (1981). Production and perception of coarticulation among stressed and unstressed vowels. Journal of Speech and Hearing Research, 24, 127–139.
  • Fowler, C. A., & Smith, M. (1986). Speech perception as ‘‘vector analysis’’: An approach to the problems of segmentation and invariance. In J. S. Perkell & D. Klatt (Eds.), Invariance and variability of speech processes (pp. 123–136). Hillsdale, NJ: Lawrence Erlbaum Associates.
  • Gauthier, B., Shi, R., & Xu, Y. (2007a). Learning phonetic categories by tracking movements. Cognition, 103 (1), 80–106.
  • Gauthier, B., Shi, R., & Xu, Y. (2007b). Simulating the acquisition of lexical tones from continuous dynamic input. Journal of the Acoustical Society of America, 121(5), EL190–EL195.
  • Goldsmith, J., & Xanthos, A. (2009). Learning phonological categories. Language, 85(1), 4–38.
  • Goldwater, S., Griffiths, T. L., & Johnson, M. (2009). A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112, 21–54.
  • Gow, D. W. Jr, & Gordon, P. C. (1995). Lexical and prelexical influences on word segmentation: Evidence from priming. Journal of Experimental Psychology: Human Perception and Performance, 21(2), 344–359.
  • Greenberg, S. (1998). Speaking in shorthand—A syllable-centric perspective for understanding pronunciation variation. Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, 47–56.
  • Guenther, F. H., & Gjaja, M. N. (1996). The perceptual magnet effect as an emergent property of neural map formation. Journal of the Acoustical Society of America, 100(2), 1111–1121.
  • Jelinek, F. (1998). Statistical Methods of Speech Recognition. Cambridge, MA: MIT Press.
  • Jusczyk, P. W. (1993). From general to language-specific capacities: The WRAPSA model of how speech perception develops. Journal of Phonetics, 21, 3–28.
  • Jusczyk, P. (2000). The discovery of spoken language. Cambridge, MA: MIT Press.
  • Jusczyk, P. W., & Aslin, R. N. (1995). Infants’ detection of the sound patterns of words in fluent speech. Cognitive Psychology, 29(1), 1–23.
  • Jusczyk, P. W., Friederici, A., Wessels, J., Svenkerud, V., & Jusczyk, A. (1993). Infants’ sensitivity to the sound patterns of native language words. Journal of Memory and Language, 32 (3), 402–420.
  • Jusczyk, P. W., Hohne, E. A., & Bauman, A. (1999). Infants’ sensitivity to allophonic cues for word segmentation. Perception & Psychophysics, 61(8), 1465.
  • Kempton, T., & Moore, R. K. (2009). Finding allophones: An evaluation on consonants in the TIMIT corpus. Interspeech, 2009, 1651–1654.
  • Kraljik, T., & Samuel, A. (2005). Perceptual learning in speech: Is there a return to normal? Cognitive Psychology, 51, 141–178.
  • Kuhl, P. K. (2004). Early language acquisition: cracking the speech code. Nature Reviews Neuroscience, 5(11), 831–843.
  • Kuhl, P. K., Conboy, B. T., Coffey-Corina, S., Padden, D., Rivera-Gaxiola, M., & Nelson, T. (2008). Phonetic learning as a pathway to language: New data and native language magnet theory expanded (NLM-e). Philosophical Transactions of the Royal Society B, 363, 979–1000.
  • Le Calvez, R., Peperkamp, S., & Dupoux, E. (2007). Bottom-up learning of phonemes: a computational study. Proceedings of the Second European Cognitive Science Conference, 2, 167–172.
  • Lee, K.-F. (1988). On the use of triphone models for continuous speech recognition. JASA, 84(S1), S216–S216.
  • Lehiste, I., & Shockey, L. (1972). On the perception of coarticulation effects in English VCV syllables. Journal of Speech, Language, and Hearing Research, 15 (3), 500–506.
  • Maekawa, K., Koiso, H., Furui, S., & Isahara, H. (2000). Spontaneous speech corpus of japanese. Proceedings of LREC, 2, 947–952.
  • Makhoul, J., & Schwartz, R. (1995). State of the art in continuous speech recognition. Proceedings of the National Academy of Sciences, 92, 9956–9963.
  • Manuel, S. (1999). Cross-language studies: relating language-particular coarticulation patterns to other language-particular facts. In W. J. Hardcastle & N. Hewlett (Eds.), Coarticulation: Theory, data and techniques (pp. 179–198). Cambridge, UK: Cambridge University Press.
  • Maye, J., Weiss, D. J., & Aslin, R. N. (2008). Statistical phonetic learning in infants: Facilitation and feature generalization. Developmental Science, 11 (1), 122–134.
  • Maye, J., Werker, J. F., & Gerken, L. A. (2002). Infant sensitivity to distributional information can affect phonetic discrimination. Cognition, 82(3), B101–B111.
  • McMurray, B., & Aslin, R. (2005). Infants are sensitive to within-category variation in speech perception. Cognition, 95(2), B15–B26.
  • McMurray, B., Aslin, R. N., Tanenhaus, M. K., Spivey, M. J., & Subik, D. (2008). Gradient sensitivity to within-category variation in words and syllables. Journal of Experimental Psychology: Human Perception and Performance, 34(6), 1609–1631.
  • Mcmurray, B., Aslin, R., & Toscano, J. (2009). Statistical learning of phonetic categories: insights from a computational approach. Developmental Science, 12(3), 369–378.
  • Monaghan, P., & Christiansen, M. H. (2010). Words in puddles of sound: modelling psycholinguistic effects in speech segmentation. J. Child Lang., 37(03), 545.
  • Nakatani, L. H., & Dukes, K. D. (1977). Locus of segmental cues for word juncture. Journal of the Acoustical Society of America, 62(3), 714–719.
  • Ngon, C., Martin, A. T., Dupoux, E., Cabrol, D., Dutat, M., & Peperkamp, S. (In press). Nonwords, nonwords, nonwords: Evidence for a proto-lexicon during the first year of life. Developmental Science.
  • Norris, D., McQueen, J., & Cutler, A. (2003). Perceptual learning in speech. Cognitive Psychology, 47(2), 204–238.
  • Ohman, S. E. G. (1966). Coarticulation in VCV utterances: Spectrographic measurements. The Journal of the Acoustical Society of America, 39, 151–168.
  • Oostdijk, N. (2000). The Spoken Dutch Corpus. Overview and first evaluation. Proceedings of LREC-2000, Athens, 2, 887–894.
  • Pegg, J. E., & Werker, J. F. (1997). Adult and infant perception of two English phones. Journal of the Acoustical Society of America, 102(6), 3742–3753.
  • Peperkamp, S., & Dupoux, E. (2002). Coping with phonological variation in early lexical acquisition. In I. Lasser (Ed.), The Process of Language Acquisition: Proceedings of the 1999 GALA Conference (pp. 359–385). Frankfurt: Peter Lang.
  • Peperkamp, S., Le Calvez, R., Nadal, J., & Dupoux, E. (2006). The acquisition of allophonic rules: Statistical learning with linguistic constraints. Cognition, 101(3), B31–B41.
  • Pierrehumbert, J. B. (2003). Phonetic diversity, statistical learning, and acquisition of phonology. Language and Speech, 46(2-3), 115–154.
  • Ramus, F., Peperkamp, S., Christophe, A., Jacquemot, C., Kouider, S., & Dupoux, E. (2010). A psycholinguistic perspective on the acquisition of phonology. Papers in Laboratory Phonology, 10, 311–340.
  • Regier, T. (2003). Emergent constraints on word-learning: A computational perspective. Trends in Cognitive Sciences, 7(6), 263–268.
  • Rytting, C., Brew, C., & Fosler-Lussier, E. (2010). Segmenting words from natural speech: subsegmental variation in segmental cues. Journal of Child Language, 37(3), 513.
  • Saffran, J. R., & Thiessen, E. D. (2003). Pattern induction by infant language learners. Developmental Psychology, 39(3), 484–494.
  • Seidl, A., Cristia, A., Bernard, A., & Onishi, K. H. (2009). Allophonic and phonemic constrasts in infants’ learning of sound patterns. Language Learning and Development, 5(3), 191–202.
  • Sundara, M., Polka, L., & Genesee, F. (2006). Language-experience facilitates discrimination of /d-ð/ in monolingual and bilingual acquisition of English. Cognition, 100(2), 369–388.
  • Swingley, D. (2005). Statistical clustering and the contents of the infant vocabulary. Cognitive Psychology, 50, 86–132.
  • Swingley, D. (2009). Contributions of infant word learning to language development. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1536), 3617–3632.
  • Tesar, B., & Smolensky, P. (1998). Learnability in optimality theory. Linguistic Inquiry, 29 (2), 229–268.
  • Vallabha, G. K., McClelland, J. L., Pons, F., Werker, J. F., & Amano, S. (2007). Unsupervised learning of vowel categories from infant-directed speech. PNAS, 104(33), 13273–13278.
  • Varadarajan, B., Khudanpur, S., & Dupoux, E. (2008). Unsupervised learning of acoustic sub-word units. Proceedings of ACL-08: HLT, Short Papers (Companion Volume), 46, 165–168.
  • Venkataraman, A. (2001). A statistical model for word discovery in transcribed speech. Computational Linguistics, 27(3), 351–372.
  • Werker, J. F., & Tees, R. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7, 49–63.
  • Werker, J. F., & Tees, R. C. (1999). Influences on infant speech processing: Toward a new synthesis. Annual Review of Psychology, 50, 509–535.
  • White, K. S., Peperkamp, S., Kirk, C., & Morgan, J. L. (2008). Rapid acquisition of phonological alternations by infants. Cognition, 107(1), 238–265.

Learning Phonetic Categories by Learning a Lexicon

語彙の学習による音声学的カテゴリの学習

  • Naomi H. Feldman (naomi_feldman@brown.edu)
    • Department of Cognitive and Linguistic Sciences, Brown University, Providence, RI 02912 USA
  • Thomas L. Griffiths (tom_griffiths@berkeley.edu)
    • Department of Psychology, University of California at Berkeley, Berkeley, CA 94720 USA
  • James L. Morgan (james_morgan@brown.edu)
    • Department of Cognitive and Linguistic Sciences, Brown University, Providence, RI 02912 USA

注釈

この文章は菊池ゼミ院生ミーティング用に表題の論文を翻訳したものです.

この研究は私の研究と同じような参考文献を使用しており,かつ,方法論が似ていて, 音素の獲得に,ある程度,語彙的なものを導入するとhappyであるということを主張しています.

この論文の具体的なテーマと私のテーマの相違点は,私の研究テーマが持続時間的な音素対立を扱っている点(ただし,この点に関しては表題の論文でもVOTを使用しているので,一部かぶります)と,出現頻度の差による学習の困難さを問題意識にしている点です. この2つに関しては表題の論文では直接問題にはしていません.

音素の獲得という大枠で見た場合の,この論文の私の研究に対する優位点は,学習モデルとしてノンパラメトリックなモデルを使用している点です.

一方,この論文では,上位知識として単語(より精確に言えば,ある単語がどの音素を含んでいるのかの情報)を使用しており,この実装上,ある言語において音素がいくつであるのかを限定的にしている点があります. 私のモデルの場合,クラスタ数の推定を目的としており,具体的にどの音素を発話したか(弁別の問題ですね)より,基礎的な学習を行えているはずです.

そのため,基本的にはこの論文の優位点を私の研究にも応用できればよい(新規性が生まれる)わけで,私としての問題意識は,具体的にどのようなアルゴリズムのモデルであるのかを理解すること,その際のモデルの解釈方法を(できれば)参考にすることの二点です.

それとメタ的な話として,同じ畑の参考文献を引いているので,私の研究の背景を英語で書く場合,どのように書けばよいのかの参考にしたいと思います.

Abstract

乳児は母国語の音声学的なカテゴリを学習するのと同時期に,流暢な発話から単語を切り出すことを学習している. しかし,音声学的なカテゴリ獲得の説明は,典型的には,音声が現れる単語についての情報を無視してきた. 我々はベイジアンモデルを使用して,切り出された単語からのフィードバックがどのように音声学的なカテゴリ学習に制約を与え,オーバーラップした音声学的カテゴリの曖昧性を解消するのを学習者に手助けするのかを例示する. シミュレーションは,人工的なレキシコン由来の情報が英語の母音カテゴリの曖昧性をうまく解消できることを示し,分布の情報のみの場合と比べて,より頑健なカテゴリ学習が行われた.

Infants learn to segment words from fluent speech during the same period as they learn native language phonetic categories, yet accounts of phonetic category acquisition typically ignore information about the words in which speech sounds appear. We use a Bayesian model to illustrate how feedback from segmented words might constrain phonetic category learning, helping a learner disambiguate overlapping phonetic categories. Simulations show that information from an artificial lexicon can successfully disambiguate English vowel categories, leading to more robust category learning than distributional information alone.

注釈

Keywords: language acquisition; Bayesian inference; phonetic categories

Introduction

母国語を学んでいる乳児は,知覚空間における音声学的カテゴリの位置や,流暢な発話から切り出した単語の同定を含む,複数のレベルの構造を抽出する必要がある. 乳児は最初に自分の言語の音声学的なカテゴリを学習し,ついで,それらのカテゴリを単語トークンを語彙項目にマッピングする手がかりとして使うというように,これらのステップが連続的に生じることが,しばしば暗黙的に当然のこととされている. しかし,乳児は流暢な発話からの単語の切り出しを6ヶ月程度で始め(Bortfeld, Morgan, Golinkoff, & Rathbun, 2005),このスキルはその後の数ヶ月にわたって発達し続ける(Jusczyk & Aslin, 1995; Jusczyk, Houston, & Newsome, 1999). 非母語の音声対立の弁別は,同じ時期である6-12ヶ月にかけて減衰する(Werker & Tees, 1984). このことは,乳児が言語音と単語の両方のカテゴライズを同時に学習し,潜在的に2つの学習プロセスが相互作用しうるという,従来とは異なる学習の道筋を示唆している.

Infants learning their native language need to extract several levels of structure, including the locations of phonetic categories in perceptual space and the identities of words they segment from fluent speech. It is often implicitly assumed that these steps occur sequentially, with infants first learning about the phonetic categories in their language and subsequently using those categories to help them map word tokens onto lexical items. However, infants begin to segment words from fluent speech as early as 6 months (Bortfeld, Morgan, Golinkoff, & Rathbun, 2005) and this skill continues to develop over the next several months (Jusczyk & Aslin, 1995; Jusczyk, Houston, & Newsome, 1999). Discrimination of non-native speech sound contrasts declines during the same time period, between 6 and 12 months (Werker & Tees, 1984). This suggests an alternative learning trajectory in which infants simultaneously learn to categorize both speech sounds and words, potentially allowing the two learning processes to interact.

本稿では,我々は乳児が流暢な発話から切り出す単語が,音声学的なカテゴリの獲得に役立つ情報源を提供しうるという仮説を検討する. 我々は,切り出された単語からの情報がフィードバックされ,音声学的なカテゴリ学習を制約できるような相互作用的なシステムにおける,音声学的カテゴリ学習問題の本質を調査するためにベイジアンアプローチを使用する. 我々の相互作用的モデルは基礎的な語彙と音素の目録[1]を同時に学習し,切り出されたトークンの音響的な表現が同じ語彙項目に対応するのか異なる語彙項目に対応するのか(例えば bed vs. bad),また,語彙項目が同じ母音を含むのか異なる母音を含むのか(例えば send vs. act)を判断する.

In this paper we explore the hypothesis that the words infants segment from fluent speech can provide a useful source of information for phonetic category acquisition. We use a Bayesian approach to explore the nature of the phonetic category learning problem in an interactive system, where information from segmented words can feed back and constrain phonetic category learning. Our interactive model learns a rudimentary lexicon and a phoneme inventory[1] simultaneously, deciding whether acoustic representations of segmented tokens correspond to the same or different lexical items (e.g. bed vs. bad) and whether lexical items contain the same or different vowels (e.g. send vs. act).

注釈

[1] We make the simplifying assumption that phonemes are equivalent to phonetic categories, and use the terms interchangeably.

[1] 我々は,音素は音声学的なカテゴリと同等であるという単純化した仮定をおいており,両者の用語を同じ意味で使用します.

シミュレーションでは,セグメントされた単語からの情報を音声学的なカテゴリ獲得の制約に使用することで,より少ないデータ数からでも頑健なカテゴリ学習が可能になることを実証した.これは,どの単語が特定の発話音声を含んでいるかという情報を使ってオーバーラップしたカテゴリの曖昧性を解消できるという,相互作用的な学習者の能力によるものである.

Simulations demonstrate that using information from segmented words to constrain phonetic category acquisition allows more robust category learning from fewer data points, due to the interactive learner’s ability to use information about which words contain particular speech sounds to disambiguate overlapping categories.

本稿は次のように構成されている. まず,我々のモデルのための数学的なフレームワークを導入し,その後,その定性的な性質を示すためにtoyシミュレーションを示す. 続いて,人工的な語彙からの情報が英語の母音カテゴリに関連するフォルマント値の曖昧性を解消できることをシミュレーションで示す. 最後のセクションでは,言語獲得に対する潜在的な含意を議論し,モデルの仮定を再検討し,今後の研究の方向性を示唆する.

The paper is organized as follows. We begin with an introduction to the mathematical framework for our model, then present toy simulations to demonstrate its qualitative properties. Next, simulations show that information from an artificial lexicon can disambiguate formant values associated with English vowel categories. The last section discusses potential implications for language acquisition, revisits the model’s assumptions, and suggests directions for future research.

Bayesian Model of Phonetic Category Learning

音声学的カテゴリの獲得を扱っている最近の研究は,分布学習の重要性に着目してきた. Maye, Werker, and Gerken (2002)は,連続体に沿った音声の特定の頻度分布(バイモーダルかユニモーダルか)が,乳児による連続体の端点 [1] の弁別に影響することを発見した. つまり,乳児はバイモーダルの分布に親しんだ [2] とき,端点の弁別がうまくいくことを示したのだ. この研究は,ガウス混合アプローチを使用する計算モデルにインスピレーションを与えた.これらのモデルでは,音声学的なカテゴリは音声のガウシアン分布,要は正規分布,として表現されると仮定し,学習者は彼らが聞いた音声の分布を最もよく表現するガウスカテゴリのセットを発見すると仮定している. Boer and Kuhl (2003)はEMアルゴリズム(Dempster, Laird, & Rubin, 1977)を,フォルマントデータから3つの母音カテゴリの位置を学習するために使用した. McMurray, Aslin, and Toscano (2009)はEMアルゴリズムに似た最急降下法 [3] を閉鎖子音の有声性対立の学習に導入し,このアルゴリズムは子音と母音両方のデータ用に多次元に拡張された(Toscano & McMurray, 2008; Vallabha, McClelland, Pons, Werker, & Amano, 2007).

Recent research on phonetic category acquisition has focused on the importance of distributional learning. Maye, Werker, and Gerken (2002) found that the specific frequency distribution (bimodal or unimodal) of speech sounds along a continuum could affect infants’ discrimination of the continuum endpoints, with infants showing better discrimination of the endpoints when familiarized with the bimodal distribution. This work has inspired computational models that use a Mixture of Gaussians approach, assuming that phonetic categories are represented as Gaussian, or normal, distributions of speech sounds and that learners find the set of Gaussian categories that best represents the distribution of speech sounds they hear. Boer and Kuhl (2003) used the Expectation Maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) to learn the locations of three such vowel categories from formant data. McMurray, Aslin, and Toscano (2009) introduced a gradient descent algorithm similar to EM to learn a stop consonant voicing contrast, and this algorithm has been extended to multiple dimensions for both consonant and vowel data (Toscano & McMurray, 2008; Vallabha, McClelland, Pons, Werker, & Amano, 2007).

我々のモデルはこれらの先行モデルからガウス混合アプローチを採用したが,ノンパラメトリックベイズのフレームワークを使用した. このフレームワークはモデルを単語レベルまで拡張することを可能にし,複数のレベルの構造が相互作用する際の学習結果を調査することを可能にする. 先行モデルと同様に,我々のモデルにおける発話音声は,定常状態のフォルマント値やVOTなどの音声学的次元を使って表現される. 単語はこれらの音声学的な値 [4] の系列であり,各音素は音声学的な値の1つの離散的なセット(例えば第1,第2フォルマント)に対応する. Toyコーパスの一つのフラグメント [5] を図1に示す. 音素の目録 [6] は4つのカテゴリを持っており,A,B,C,Dというラベルが振られている.5つの単語が示されており,それぞれADA,AB,D,AB,DCという語彙項目を表現している. 学習では,発話音声と,相互作用的学習者の場合には単語の中でどの音がともに現れるかという情報を使って,コーパスを生成した音声学的カテゴリを復元する.

Our model adopts the Mixture of Gaussians approach from these previous models but uses a non-parametric Bayesian framework that allows extension of the model to the word level, making it possible to investigate the learning outcome when multiple levels of structure interact. As in previous models, speech sounds in our model are represented using phonetic dimensions such as steady-state formant values or voice onset time. Words are sequences of these phonetic values, where each phoneme corresponds to a single discrete set (e.g. first and second formant) of phonetic values. A sample fragment of a toy corpus is shown in Figure 1. The phoneme inventory has four categories, labeled A, B, C, and D; five words are shown, representing lexical items ADA, AB, D, AB, and DC, respectively. Learning involves using the speech sounds and, in the case of an interactive learner, information about which other sounds appear with them in words, to recover the phonetic categories that generated the corpus.

図1: モデルに提示したコーパスの一つのフラグメント

注釈

図1の説明

アスタリスクは発話音声を表しており,ラインは単語の境界を示している. モデルはどのカテゴリが発話音声を生成したのかを知らず,データからA,B,C,Dのカテゴリを復元する必要がある.

Asterisks represent speech sounds, and lines represent word boundaries. The model does not know which categories generated the speech sounds, and needs to recover categories A, B, C, and D from the data.

シミュレーションでは,学習者に割り当てる仮説空間が異なる2つのモデルを比較した. 分布モデルでは,学習者の仮説空間は音素の目録を含んでおり,そこでは音素は音声学的空間における発話音声のガウス分布に対応する. 語彙-分布モデルでは,学習者は上記と同じ音素の目録を考えるが,それらを,音素の系列からなる語彙項目を含むレキシコンと組み合わせた場合についてのみ考慮する. そのため,語彙-分布モデルの学習者は,音声学的カテゴリのセットの復元に,音声学的情報のみでなく,それらの音を含む単語についての情報も使用できる.

Simulations compare two models that differ in the hypothesis space they assign to the learner. In the distributional model, the learner’s hypothesis space contains phoneme inventories, where phonemes correspond to Gaussian distributions of speech sounds in phonetic space. In the lexical-distributional model, the learner considers these same phoneme inventories, but considers them only in conjunction with lexicons that contain lexical items composed of sequences of phonemes. This allows the lexical-distributional learner to use not only phonetic information, but also information about the words that contain those sounds, in recovering a set of phonetic categories.

訳者注

[1]多分,特徴量自体は連続して変化するわけだけど,そのカテゴリを知覚するための境界のこと
[2]訓練されたくらいかな
[3]でいいと思う.wikipediaには”For the analytical method called “steepest descent”, see Method of steepest descent”って書いてあるし
[4]多分,特徴量自身のことじゃないかな.
[5]一つの発話区間のことかと.
[6]獲得したい音素の目録のことだと思う.
Distributional Model

分布モデルにおいては,学習者は,我々が音素目録(インベントリ) \(C\) と呼ぶ音声学的カテゴリのセットを,コーパスの発話音声から復元する必要がある [7] . このモデルは単語や単語境界に関するすべての情報を無視し,音声学的空間における発話音声の分布からのみ学習を行う. 発話音声は,音素目録から音声学的カテゴリ \(c\) を選択し,そのカテゴリに関連付けられたガウシアン分布から音声学的な値 [8] を抽出することによって生成されると仮定する. カテゴリは,その平均 \(\mu_c\) ,共分散行列 \(\Sigma_c\) ,および出現頻度において互いに異なる. 形態論(Goldwater, Griffiths, & Johnson, 2006),単語のセグメンテーション(Goldwater, Griffiths, & Johnson, in press),そして文法学習(Johnson, Griffiths, & Goldwater, 2007)の先行研究に従い,学習者の音素目録に関する事前知識は,Dirichlet process(Ferguson, 1973)と呼ばれるノンパラメトリックなベイズモデル \(C \sim DP(\alpha, G_C)\) を使用してエンコードされる. この分布は音素目録のカテゴリ数に対するバイアス [9] と,これらのカテゴリの音声学的パラメータに対するバイアスをエンコードしたものである. 音声学的カテゴリの数についての事前知識は,学習者が潜在的に無限の数のカテゴリを考慮することを可能にするが,少ない数のカテゴリへと向かわせるバイアスを与え,そのバイアスの強さはパラメータ \(\alpha\) により制御される [2] . これは先行モデル(McMurray et al., 2009; Vallabha et al., 2007)で使用された,カテゴリの割り当てにおける「勝者総取り」バイアスに置き換わるものであり,データを表現するのに必要なカテゴリ数の明示的な推定を可能にする.

In the distributional model, a learner is responsible for recovering a set of phonetic categories, which we refer to as a phoneme inventory \(C\), from a corpus of speech sounds. The model ignores all information about words and word boundaries, and learns only from the distribution of speech sounds in phonetic space. Speech sounds are assumed to be produced by selecting a phonetic category \(c\) from the phoneme inventory and then sampling a phonetic value from the Gaussian associated with that category. Categories differ in their means \(\mu_c\) , covariance matrices \(\Sigma_c\) , and frequencies of occurrence. Following previous work in morphology (Goldwater, Griffiths, & Johnson, 2006), word segmentation (Goldwater, Griffiths, & Johnson, in press), and grammar learning (Johnson, Griffiths, & Goldwater, 2007), learners’ prior beliefs about the phoneme inventory are encoded using a nonparametric Bayesian model called the Dirichlet process (Ferguson, 1973), \(C∼DP(\alpha, G_C)\). This distribution encodes biases over the number of categories in the phoneme inventory, as well as over phonetic parameters for those categories. Prior beliefs about the number of phonetic categories allow the learner to consider a potentially infinite number of categories, but produce a bias toward fewer categories, with the strength of the bias controlled by the parameter \(\alpha\).[2] This replaces the winner-take-all bias in category assignments that has been used in previous models (McMurray et al., 2009; Vallabha et al., 2007) and allows explicit inference of the number of categories needed to represent the data.

訳者注

[7]音素目録(インベントリ)C : 要はある言語の音素の集合のこと
[8]Phonetic value : 調べると音価(音楽においての音の長さのこと)って出てくんのよね
[9]biases over A : A にかかってるバイアス

注釈

筆者注2

[2] This bias is needed to induce any grouping at all; the maximum likelihood solution assigns each speech sound to its own category.

このバイアスは,そもそも何らかのグループ化を生じさせるために必要なものである.最尤解は,それぞれの発話音声を別々のカテゴリに割り当ててしまう.

音声学的パラメータに対する事前分布は \(G_C\) によって定義される.このモデルにおいて \(G_C\) は,カテゴリの分散に対する逆ウィシャート事前分布 \(\Sigma_c∼IW(\nu_0,\Sigma_0 )\) と,カテゴリの平均に対するガウス事前分布 \(\mu_c \mid \Sigma_c ∼ N(\mu_0 , {\Sigma_c \over \nu_0} )\) を含む,ガウシアンな音声学的カテゴリに対する分布である. これらの分布のパラメータは擬似データとして考えることができ,\(\mu_0\) , \(\Sigma_0\) , \(\nu_0\) は,学習者がすでに新しいカテゴリに割り当てたと想像する発話音声の平均,共分散,および個数をエンコードする. この音声学的パラメータに対する事前分布は理論モデルの中心ではなく,計算を簡単にするために導入されたものである.擬似データにおける発話音声の数は可能な限り少なくしている[3]ため,事前バイアスは実データによって覆い隠されることになる. 音響的な値の系列を提示された学習者は,これらの音響的な値を生成したガウシアンカテゴリのセットを復元する必要がある. マルコフ連鎖モンテカルロ法の一種であるギブズサンプリング (Geman & Geman, 1984) は,理想的な学習者がコーパスを生成した可能性が高いと考える音素目録の例を復元するために使用される. 発話音声には,はじめはランダムなカテゴリ割り当てが与えられ,コーパスを通る各スイープでは,各発話音声に対して順番に,他のすべての現在の割り当てに基づいて新しいカテゴリの割り当てが与えられる.

The prior distribution over phonetic parameters is defined by \(G_C\) , which in this model is a distribution over Gaussian phonetic categories that includes an Inverse-Wishart prior over category variances, \(\Sigma_c∼IW(\nu_0,\Sigma_0 )\), and a Gaussian prior over category means, \(\mu_c \mid \Sigma_c ∼ N(\mu_0 , {\Sigma_c \over \nu_0} )\). The parameters of these distributions can be thought of as pseudo data, where \(\mu_0\) , \(\Sigma_0\) , and \(\nu_0\) encode the mean, covariance, and number of speech sounds that the learner imagines having already assigned to any new category. This prior distribution over phonetic parameters is not central to the theoretical model, but rather is included for ease of computation; the number of speech sounds in the pseudodata is made as small as possible[3] so that the prior biases are overshadowed by real data. Presented with a sequence of acoustic values, the learner needs to recover the set of Gaussian categories that generated those acoustic values. Gibbs sampling (Geman & Geman, 1984), a form of Markov chain Monte Carlo, is used to recover examples of phoneme inventories that an ideal learner believes are likely to have generated the corpus. Speech sounds are initially given random category assignments, and in each sweep through the corpus, each speech sound in turn is given a new category assignment based on all the other current assignments. The probability of assignment to category \(c\) is given by Bayes’ rule,

\[p(c \mid w_{ij} ) \propto p(w_{ij} \mid c) p(c)\]

where \(w_{ij}\) denotes the phonetic parameters of the speech sound in position \(j\) of word \(i\). The prior \(p(c)\) is given by the Dirichlet process and is

\[p(c) = \begin{cases} \frac{n_c}{\sum_c n_c + \alpha} & \text{for existing categories} \\ \frac{\alpha}{\sum_c n_c + \alpha} & \text{for a new category} \end{cases} \qquad (2)\]

making it proportional to the number of speech sounds \(n_c\) already assigned to that category, with some probability \(\alpha\) of assignment to a new category. The likelihood \(p(w_{ij} \mid c)\) is obtained by integrating over all possible means and covariance matrices for category \(c\) , \(\int\int p(w_{ij} \mid \mu_c , \Sigma_c)p(\mu_c \mid\Sigma_c )p(\Sigma_c )d\mu_c d\Sigma_c\) , where the probability distributions \(p(\mu_c \mid\Sigma_c)\) and \(p(\Sigma_c)\) are modified to take into account the speech sounds already assigned to that category.

注釈

[3] To form a proper distribution, \(\nu_0\) needs to be greater than \(d - 1\), where \(d\) is the number of phonetic dimensions.

This likelihood function has the form of a multivariate t-distribution and is discussed in more detail in Gelman, Carlin, Stern, and Rubin (1995). Using this procedure, category assignments converge to the posterior distribution on phoneme inventories, revealing an ideal learner’s beliefs about which categories generated the corpus.
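To make the sampling procedure more concrete, here is a deliberately simplified one-dimensional sketch of a single collapsed Gibbs sweep. It is not the authors' implementation: the real model places a Normal-Inverse-Wishart prior on category means and covariances and integrates them out, whereas this sketch assumes a known category variance so that the posterior predictive stays a simple Gaussian; the hyperparameter values are arbitrary.

```python
# Simplified 1-D sketch of one collapsed Gibbs sweep for the distributional model:
# each speech sound is reassigned with probability proportional to the CRP prior
# (Equation 2) times a Gaussian posterior-predictive likelihood. The category
# variance is fixed here, unlike the full model.
import math
import random
from collections import defaultdict

ALPHA, SIGMA2, MU0, TAU02 = 1.0, 1.0, 0.0, 10.0   # hypothetical hyperparameters

def predictive(x, members):
    """Posterior-predictive density of x for a category with known variance SIGMA2
    and a N(MU0, TAU02) prior on its mean, given the sounds currently assigned."""
    n = len(members)
    prec = 1.0 / TAU02 + n / SIGMA2
    mean = (MU0 / TAU02 + sum(members) / SIGMA2) / prec
    var = 1.0 / prec + SIGMA2
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gibbs_sweep(data, assignments):
    for i, x in enumerate(data):
        members = defaultdict(list)                # group the other sounds by category
        for j, (a, y) in enumerate(zip(assignments, data)):
            if j != i:
                members[a].append(y)
        labels = list(members) + ["new"]
        weights = [(ALPHA if c == "new" else len(members[c]))   # CRP prior term
                   * predictive(x, members.get(c, [])) for c in labels]
        choice = random.choices(labels, weights=weights)[0]
        assignments[i] = max(assignments) + 1 if choice == "new" else choice
    return assignments

# Toy data loosely based on this section: four categories with unit variance
data = [random.gauss(m, 1.0) for m in (-5, -1, 1, 5) for _ in range(50)]
assignments = [0] * len(data)
for _ in range(20):
    assignments = gibbs_sweep(data, assignments)
print(len(set(assignments)), "categories found")
```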

Lexical-Distributional Model

This non-parametric Bayesian framework has the advantage that it is straightforward to extend to hierarchical structures (Teh, Jordan, Beal, & Blei, 2006), allowing us to explore the influence of words on phonetic category acquisition. In the lexical-distributional model, the learner recovers not only the same phoneme inventory C as in the distributional model, but also a lexicon L with lexical items composed of sequences of phonemes. This creates an extra step in the generative process: instead of assuming that the phoneme inventory generates a corpus directly, as in the distributional model, this model assumes that the phoneme inventory generates the lexicon and that the lexicon generates the corpus. The corpus is generated by selecting a lexical item to produce and then sampling an acoustic value from each of the phonetic categories contained in that lexical item.

The prior probability distribution over possible lexicons is a second Dirichlet process, \(L \sim DP(\beta, G_L)\), where \(G_L\) defines a prior distribution over lexical items. This prior favors shorter lexical items, assuming word lengths to be generated from a geometric distribution, and assumes that a category for each phoneme slot has been sampled from the phoneme inventory C. Thus, the prior probability distribution over words is defined according to the phoneme inventory, and the learner needs to optimize the phoneme inventory so that it generates the lexicon. Parallel to the bias toward fewer phonetic categories, the model encodes a bias toward fewer lexical items but allows a potentially infinite number of lexical items.

Presented with a corpus consisting of isolated word tokens, each of which consists of a sequence of acoustic values, the language learner needs to recover the lexicon and phoneme inventory of the language that generated the corpus. Learning is again performed through Gibbs sampling. Each iteration now includes two sweeps: one through the corpus, assigning each word to the lexical item that generated it, and one through the lexicon, assigning each position of each lexical item to its corresponding phoneme from the phoneme inventory. In the first sweep we use Bayes’ rule to calculate the probability that word \(w_i\) corresponds to lexical item \(k\),

\[p(k \mid w_i ) \propto p(w_i \mid k)\, p(k) \qquad (3)\]

Parallel to Equation 2, the prior is

\[p(k) = \begin{cases} \frac{n_k}{\sum_k n_k + \beta} & \text{for existing categories} \\ \frac{\beta}{\sum_k n_k + \beta} & \text{for a new category} \end{cases} \qquad (4)\]

where \(n_k\) is the number of word tokens already assigned to lexical item \(k\). A word is therefore assigned to a lexical item with a probability proportional to the number of times that lexical item has already been seen, with some probability β reserved for the possibility of seeing a new lexical item. The likelihood is a product of the likelihoods of each speech sound having been generated from its respective category,

\[p(w_i \mid k) = \prod_j p(w_{ij} \mid c_{kj}) \qquad (5)\]

where \(j\) indexes a particular position in the word and \(c_{kj}\) is the phonetic category that corresponds to position \(j\) of lexical item \(k\). Any lexical item with a different length from the word \(w_i\) is given a likelihood of zero, and samples from the prior distribution on lexical items are used to estimate the likelihood of a new lexical item (Neal, 1998). The second sweep uses Bayes’ rule

\[p(c \mid w_{\{k\}j}) \propto p(w_{\{k\}j} \mid c)\, p(c) \qquad (6)\]

to assign a phonetic category to position \(j\) of lexical item \(k\), where \(w_{\{k\}j}\) is the set of phonetic values at position \(j\) in all of the words in the corpus that have been assigned to lexical item \(k\). The prior \(p(c)\) is the same prior over category assignments as was used in the distributional model, and is given by Equation 2. The likelihood \(p(w_{\{k\}j} \mid c)\) is again computed by integrating over all possible means and covariance matrices, \(\int\int \prod_{w_i \in k} p(w_{ij} \mid \mu_c , \Sigma_c )p(\mu_c \mid \Sigma_c )p(\Sigma_c )\,d\mu_c\, d\Sigma_c\), this time taking into account phonetic values from all the words assigned to lexical item \(k\). The sampling procedure converges on samples from the joint posterior distribution on lexicons and phoneme inventories, allowing learners to recover both levels of structure simultaneously.
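As a rough illustration of the first sweep (assigning word tokens to lexical items, Equations 3-5), the following sketch is a simplification rather than the authors' implementation: categories are treated as one-dimensional Gaussians with known parameters instead of being integrated out, and the likelihood of a new lexical item, which the paper estimates by sampling from the prior (Neal, 1998), is replaced here by a crude constant.

```python
# Simplified sketch of assigning a word token to a lexical item: the weight of an
# existing item is its current token count (Equation 4) times the product of the
# per-sound likelihoods (Equation 5); items of the wrong length get likelihood 0.
import math
import random

BETA = 1.0
# Hypothetical 1-D category parameters (mean, variance) for the toy inventory
categories = {"A": (-5.0, 1.0), "B": (-1.0, 1.0), "C": (1.0, 1.0), "D": (5.0, 1.0)}

def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def assign_word(word, lexicon, counts):
    """word: list of phonetic values; lexicon: {item_id: list of category labels};
    counts: {item_id: number of word tokens currently assigned to that item}."""
    ids, weights = [], []
    for k, phones in lexicon.items():
        if len(phones) != len(word):
            continue                               # length mismatch -> likelihood 0
        lik = 1.0
        for value, label in zip(word, phones):
            mean, var = categories[label]
            lik *= gaussian(value, mean, var)
        ids.append(k)
        weights.append(counts.get(k, 0) * lik)
    ids.append("new")
    weights.append(BETA * 1e-6)                    # crude stand-in for a new item
    return random.choices(ids, weights=weights)[0]

# Hypothetical usage: a token that should look like the lexical item AB
lexicon = {0: ["A", "B"], 1: ["D", "C"]}
counts = {0: 100, 1: 100}
print(assign_word([-4.7, -0.8], lexicon, counts))
```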

Qualitative Behavior of an Interactive Learner

このセクションでは,toyシミュレーション [10] により,純粋な分布学習者であれば単一のカテゴリとして解釈してしまうようなオーバーラップしたカテゴリについて,語彙がどのように曖昧性を解消する情報を提供できるのかを示す.

In this section, toy simulations demonstrate how a lexicon can provide disambiguating information about overlapping categories that would be interpreted as a single category by a purely distributional learner. We show that it is not the simple presence of a lexicon, but rather specific disambiguating information within the lexicon, that increases the robustness of category learning in the lexical-distributional learner. Corpora were constructed for these simulations using four categories labeled A, B, C, and D, whose means are located at -5, -1, 1, and 5 along an arbitrary phonetic dimension (Figure 2 (a)). All four categories have a variance of 1. Because the means of categories B and C are so close together, being separated by only two standard deviations, the overall distribution of tokens in these two categories is unimodal. To test the distributional learner, 1200 acoustic values were sampled from these categories, with 400 acoustic values sampled from each of Categories A and D and 200 acoustic values sampled from each of Categories B and C. Results indicate that these distributional data are not strong enough to disambiguate categories B and C, leading the learner to interpret them as a single category (Figure 2 (b)).[4] While this may be due in part to the distributional learner’s prior bias toward fewer categories, simulations in the next section will show that the gradient descent learner from Vallabha et al. (2007), which has no such explicit bias, shows similar behavior.

訳者注

[10]Toy simulation : 多分,この文章でいうtoyはダミーということじゃないかな.
[11]Phonetic value : 調べると音価(音楽においての音の長さのこと)って出てくんのよね
[12]biases over A : A にかかってるバイアス

注釈

[4] Simulations in this section used parameters \(\alpha = \beta = 1\), \(\mu_0 = 0\), \(\Sigma_0 = 1\), and \(\nu_0 = 0.001\); each simulation was run for 500 iterations.

_images/2.png

注釈

Figure 2: Toy data with two overlapping categories as (a) generated, (b) learned by the distributional model, (c) learned by the lexical-distributional model from a minimal pair corpus, and (d) learned by the lexical-distributional model from a corpus without minimal pairs.

Two toy corpora were constructed for the lexical-distributional model from the 1200 phonetic values sampled above. The corpora differed from each other only in the distribution of these values across lexical items. The lexicon of the first corpus contained no disambiguating information about speech sounds B and C. It was generated from six lexical items, with identities AB, AC, DB, DC, ADA, and D. Each lexical item was repeated 100 times in the corpus for a total of 600 word tokens. In this corpus, Categories B and C appeared only in minimal pair contexts, since both AB and AC, as well as both DB and DC, were words. As shown in Figure 2 (c), the lexical-distributional learner merged categories B and C when trained on this corpus. Merging the two categories allowed the learner to condense AB and AC into a single lexical item, and the same happened for DB and DC. Because the distribution of these speech sounds in lexical items was identical, lexical information could not help disambiguate the categories.

The second corpus contained disambiguating information about categories B and C. This corpus was identical to the first except that the acoustic values representing the phonemes B and C of words AC and DB were swapped, converting these words into AB and DC, respectively. Thus, the second corpus contained only four lexical items, AB, DC, ADA, and D, and there were now 200 tokens of words AB and DC. Categories B and C did not appear in minimal pair contexts, as there was a word AB but no word AC, and there was a word DC but no word DB. The lexical-distributional learner was able to use the information contained in the lexicon in the second corpus to successfully disambiguate categories B and C (Figure 2 (d)). This occurred because the learner could categorize words AB and DC as two different lexical items simply by recognizing the difference between categories A and D, and could use those lexical classifications to notice small phonetic differences between the second phonemes in these lexical items.
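For concreteness, here is a small sketch (hypothetical code, not taken from the paper) of how the two toy lexical corpora just described can be generated. The first corpus keeps B and C only in minimal-pair contexts; the second is expressed here simply as extra tokens of AB and DC rather than by literally swapping the sampled values of AC and DB.

```python
# Generate toy word tokens: each token is a sequence of 1-D phonetic values sampled
# from the categories A, B, C, D (means -5, -1, 1, 5; unit variance).
import random

MEANS = {"A": -5.0, "B": -1.0, "C": 1.0, "D": 5.0}

def token(phonemes):
    """One word token: a phonetic value sampled from each phoneme's Gaussian."""
    return [random.gauss(MEANS[p], 1.0) for p in phonemes]

def build_corpus(lexicon, repetitions=100):
    return [token(item) for item in lexicon for _ in range(repetitions)]

corpus1 = build_corpus(["AB", "AC", "DB", "DC", "ADA", "D"])   # minimal pairs for B/C
corpus2 = build_corpus(["AB", "AB", "DC", "DC", "ADA", "D"])   # AC -> AB, DB -> DC
print(len(corpus1), len(corpus2))                              # 600 word tokens each
```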

In this model it is non-minimal pairs, rather than minimal pairs, that help the lexical-distributional learner disambiguate phonetic categories. While minimal pairs may be useful when a learner knows that two similar sounding tokens have different referents, they pose a problem in this model because the learner hypothesizes that similar sounding tokens represent the same word. Thiessen (2007) has made a similar observation with 15-month-olds in a word learning task, showing that infants may fail to notice a difference between similar-sounding object labels, but are better at discriminating these words when familiarized with non-minimal pairs that contain the same sounds.

Learning English Vowels

自然言語におけるオーバーラップしたカテゴリの典型例は母音カテゴリであり,例えば図4(a)に示すHillenbrand, Getty, Clark, and Wheeler (1995)の英語の母音カテゴリがそうである [5] . したがって我々は,実際の音声学的カテゴリパラメータに基づいてオーバーラップしたカテゴリの曖昧性を解消する語彙-分布学習者の能力をテストするために,英語の母音カテゴリを使用した.

The prototypical examples of overlapping categories in natural language are vowel categories, such as the English vowel categories from Hillenbrand, Getty, Clark, and Wheeler (1995) shown in Figure 4 (a).[5] We therefore use English vowel categories to test the lexical-distributional learner’s ability to disambiguate overlapping categories that are based on actual phonetic category parameters.

Hillenbrand et al. (1995)の母音フォルマントデータに基づく音声学的カテゴリを使用して,2つのコーパスを作成した. 最初のコーパスのカテゴリは男性によって発話された母音をベースにしており,適度なオーバーラップしかない(図3 (a)). 2つ目のコーパスのカテゴリは男性,女性,子供によって発話された母音をベースにしており,オーバーラップの程度がはるかに大きい(図4 (a)). どちらの場合も,12個の音声学的カテゴリの平均と共分散行列は対応する母音トークンから算出した. 生成モデルを使用して,それぞれのコーパスのために母音のみからなる語彙項目の仮想的な集合を作成し,対応するガウスカテゴリパラメータのセットから,この語彙に基づいて5000語のトークンを生成した.

Two corpora were constructed using phonetic categories based on the Hillenbrand et al. (1995) vowel formant data. Categories in the first corpus were based on vowels spoken by men, and had only moderate overlap (Figure 3 (a)); categories in the second corpus were based on vowels spoken by men, women, and children, and had a much higher degree of overlap (Figure 4 (a)). In each case, means and covariance matrices for the twelve phonetic categories were computed from corresponding vowel tokens. Using the generative model, a hypothetical set of lexical items consisting only of vowels was generated for each corpus, and 5,000 word tokens were generated based on this lexicon from the appropriate set of Gaussian category parameters.

これらのコーパスは,以下の3つのモデルに学習データとして与えられた.

  • 語彙-分布モデル
  • 分布モデル
  • Vallabha et al.(2007)で使用された多次元勾配降下アルゴリズム[6]

男性発話をベースにしたコーパスを使用した結果を図3に示す. また,すべての話者の発話をベースにしたコーパスに対する結果を図4に示す. それぞれの場合で,語彙-分布学習者は正しい母音カテゴリのセットを復元し,近隣のカテゴリの曖昧性の解消に成功した. 一方,語彙を欠いたモデルは,近隣の母音カテゴリのペアをいくつか誤ってマージしてしまった. そのため,語彙の存在を仮定することは,語彙に含まれる音韻形式が学習者に明示的に与えられなくとも,オーバーラップした母音カテゴリの曖昧性の解消を学習者が行うのを助けるという証拠を示した.

These corpora were given as training data to three models: the lexical-distributional model, the distributional model, and the multidimensional gradient descent algorithm used by Vallabha et al. (2007).[6] Results for the corpus based on men’s productions are shown in Figure 3, and results from the corpus based on all speakers’ productions are shown in Figure 4. In each case, the lexical-distributional learner recovered the correct set of vowel categories and successfully disambiguated neighboring categories. In contrast, the models lacking a lexicon mistakenly merged several pairs of neighboring vowel categories. Positing the presence of a lexicon therefore showed evidence of helping the ideal learner disambiguate overlapping vowel categories, even though the phonological forms contained in the lexicon were not given explicitly to the learner.

ペアごとの正確性(accuracy)と完全性(completeness)の尺度を,モデルの性能の定量的な尺度として各学習者に対して計算した(表1). これらの尺度では,正しく同じカテゴリに入れられた母音トークンのペアをヒットとしてカウントし,同じカテゴリにあるべきなのに誤って別のカテゴリに割り当てられたトークンのペアをミスとしてカウントした. また,異なるカテゴリにあるべきなのに誤って同じカテゴリに割り当てられたトークンのペアを誤警報(false alarm)としてカウントした.

Pairwise accuracy and completeness measures were computed for each learner as a quantitative measure of model performance (Table 1). For these measures, pairs of vowel tokens that were correctly placed into the same category were counted as a hit; pairs of tokens that were incorrectly assigned to different categories when they should have been in the same category were counted as a miss; and pairs of tokens that were incorrectly assigned to the same category when they should have been in different categories were counted as a false alarm.

注釈

[5] These vowel data were obtained through download from http://homepages.wmich.edu/˜hillenbr/.

注釈

[6] Parameters for the Bayesian models were \(\alpha = \beta = 1\), \(\mu_0 = \begin{bmatrix} 500 \\ 1500 \end{bmatrix}\), \(\Sigma_0 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\), and \(\nu_0 = 1.001\), and each simulation was run for 600 iterations.

No attempt was made to optimize these parameters, and they were actually different from the parameters used to generate the data, as alpha = β = 10 was used to help produce a corpus that contained all twelve vowel categories. Using the generating parameters during inference did not qualitatively affect the results. Parameters for the gradient descent algorithm were identical to those used by Vallabha et al. (2007); optimizing the learning rate parameter produced little qualitative change in the learning outcome.

_images/3.png

注釈

Figure 3: Ellipses delimit the area corresponding to 90% of vowel tokens for Gaussian categories (a) computed from men’s vowel productions from Hillenbrand et al. (1995) and learned by the (b) lexical-distributional model, (c) distributional model, and (d) gradient descent algorithm.

_images/4.png

注釈

Figure 4: Ellipses delimit the area corresponding to 90% of vowel tokens for Gaussian categories (a) computed from all speakers’ vowel productions from Hillenbrand et al. (1995) and learned by the (b) lexical-distributional model, (c) distributional model, and (d) gradient descent algorithm.

accuracyの得点は hits / (hits + false alarms) で計算し,completenessの得点は hits / (hits + misses) として計算した. 両方の尺度で語彙-分布学習者は高い得点を示したが,純粋な分布学習者ではaccuracyの得点が大幅に低く,これはそれらのモデルが複数の重複するカテゴリを誤ってマージしてしまったという事実を反映している.

The accuracy score was computed as hits / (hits + false alarms) and the completeness score as hits / (hits + misses). Both measures were high for the lexical-distributional learner, but accuracy scores were substantially lower for the purely distributional learners, reflecting the fact that these models mistakenly merged several overlapping categories.
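The scoring scheme can be written down directly; the sketch below is an illustration of the pairwise measures (not the authors' evaluation code), using hypothetical toy labels in which the learner has merged two of the gold categories.

```python
# Pairwise accuracy and completeness: every pair of tokens is a hit, miss, or
# false alarm depending on whether the gold and learned assignments agree.
from itertools import combinations

def pairwise_scores(gold, learned):
    hits = misses = false_alarms = 0
    for i, j in combinations(range(len(gold)), 2):
        same_gold = gold[i] == gold[j]
        same_learned = learned[i] == learned[j]
        if same_gold and same_learned:
            hits += 1
        elif same_gold:
            misses += 1
        elif same_learned:
            false_alarms += 1
    accuracy = hits / (hits + false_alarms) if hits + false_alarms else 1.0
    completeness = hits / (hits + misses) if hits + misses else 1.0
    return accuracy, completeness

# Hypothetical toy labels: the learner merges the gold categories "i" and "I"
gold    = ["i", "i", "I", "I", "a", "a"]
learned = [ 0,   0,   0,   0,   1,   1 ]
print(pairwise_scores(gold, learned))   # accuracy drops, completeness stays at 1.0
```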

この結果は,予測されたように,音声学的カテゴリに加えて単語のカテゴリを入力から学習するモデルが,音声学的カテゴリのみを学習するモデルよりも優れた音声学的カテゴリの学習結果を生むことを示している. ただし,分布モデルの学習者は,最初の2フォルマント以外の次元も与えられた場合(Vallabha et al., 2007)や,学習中により多くのデータポイントを与えられた場合には,より良い性能を示す可能性があることに注意してほしい. この2つの解決策は実際には互いに相反して働く.つまり,次元を加えるごとに,同じ学習結果を維持するのに必要なデータ数が増えていく. とはいえ,我々は純粋な分布学習モデルが音声学的カテゴリを獲得できないと示唆するつもりはない. ここで紹介したシミュレーションは,音声学的カテゴリに実質的なオーバーラップがある言語では,特定の言語音を含む単語の情報を学習者が使用できる相互作用的なシステムが,音声学的カテゴリ学習の頑健さを高められることを実証するものである.

Results suggest that as predicted, a model that uses the input to learn word categories in addition to phonetic categories produces better phonetic category learning results than a model that only learns phonetic categories. Note that the distributional learners are likely to show better performance if they are given dimensions beyond just the first two formants (Vallabha et al., 2007) or if they are given more data points during learning. These two solutions actually work against each other: as dimensions are added, more data are necessary to maintain the same learning outcome. Nevertheless, we do not wish to suggest that a purely distributional learner cannot acquire phonetic categories. The simulations presented here are instead meant to demonstrate that in a language where phonetic categories have substantial overlap, an interactive system, where learners can use information from words that contain particular speech sounds, can increase the robustness of phonetic category learning.

Discussion

This paper has presented a model of phonetic category acquisition that allows interaction between speech sound and word categorization. The model was not given a lexicon a priori, but was allowed to begin learning a lexicon from the data at the same time that it was learning to categorize individual speech sounds, allowing it to take into account the distribution of speech sounds in words. This lexical-distributional learner outperformed a purely distributional learner on a corpus whose categories were based on English vowel categories, showing better disambiguation of overlapping categories from the same number of data points.

Infants learn to segment words from fluent speech around the same time that they begin to show signs of acquiring native language phonetic categories, and they are able to map these segmented words onto tokens heard in isolation (Jusczyk & Aslin, 1995), suggesting that they are performing some sort of rudimentary categorization on the words they hear. Infants may therefore have access to information from words that can help them disambiguate overlapping categories. If information from words can feed back to constrain phonetic category learning, the large degree of overlap between phonetic categories may not be such a challenge as is often supposed.

Table 1
                     Lexical-Distrib.   Distrib.   Gradient Descent
(a) Men
    Accuracy              0.97            0.63          0.56
    Completeness          0.98            0.93          0.94
(b) All speakers
    Accuracy              0.99            0.54          0.40
    Completeness          0.99            0.85          0.95

注釈

Table 1: Accuracy and completeness scores for learning vowel categories based on productions by (a) men and (b) all speakers. For the Bayesian learners, these were computed at the annealed solutions; for the gradient descent learner, they were based on maximum likelihood category assignments.

In generalizing these results to more realistic learning situations, however, it is important to take note of two simplifying assumptions that were present in our model. The first key assumption is that speech sounds in phonetic categories follow the same Gaussian distribution regardless of phonetic or lexical context. In actual speech data, acoustic characteristics of sounds change in a context-dependent manner due to coarticulation with neighboring sounds (e.g. Hillenbrand, Clark, & Nearey, 2001). A lexical-distributional learner hearing reliable differences between sounds in different words might erroneously assign coarticulatory variants of the same phoneme to different categories, having no other mechanism to deal with context-dependent variability. Such variability may need to be represented explicitly if an interactive learner is to categorize coarticulatory variants together.

A second assumption concerns the lexicon used in the vowel simulations, which was generated from our model. Generating a lexicon from the model ensured that the learner’s expectations about the lexicon matched the structure of the lexicon being learned, and allowed us to examine the influence of lexical information in the best case scenario. However, several aspects of the lexicon, such as the assumption that phonemes in lexical items are selected independently of their neighbors, are unrealistic for natural language. In future work we hope to extend the present results using a lexicon based on child-directed speech.

Infants learn multiple levels of linguistic structure, and it is often implicitly assumed that these levels of structure are acquired sequentially. This paper has instead investigated the optimal learning outcome in an interactive system using a non-parametric Bayesian framework that permits simultaneous learning at multiple levels. Our results demonstrate that information from words can lead to more robust learning of phonetic categories, providing one example of how such interaction between domains might help make the learning problem more tractable.

Acknowledgments.

This research was supported by NSF grant BCS-0631518, AFOSR grant FA9550-07-1-0351, and NIH grant HD32005. We thank Joseph Williams for help in working out the model and Sheila Blumstein, Adam Darlow, Sharon Goldwater, Mark Johnson, and members of the computational modeling reading group for helpful comments and discussion.

References

Boer, B. de, & Kuhl, P. K. (2003). Investigating the role of infant-directed speech with a computer model. Acoustics Research Letters Online, 4(4), 129-134.

Bortfeld, H., Morgan, J. L., Golinkoff, R. M., & Rathbun, K. (2005). Mommy and me: Familiar names help launch babies into speech-stream segmentation. Psychological Science, 16(4), 298-304.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B, 39, 1-38.

Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2), 209-230.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (1995). Bayesian data analysis. New York: Chapman and Hall.

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE-PAMI, 6, 721-741.

Goldwater, S., Griffiths, T. L., & Johnson, M. (2006). Interpolating between types and tokens by estimating power-law generators. Advances in Neural Information Processing Systems 18.

Goldwater, S., Griffiths, T. L., & Johnson, M. (in press). A Bayesian framework for word segmentation: Exploring the effects of context. Cognition.

Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America, 97(5), 3099-3111.

Hillenbrand, J. L., Clark, M. J., & Nearey, T. M. (2001). Effects of consonant environment on vowel formant patterns. Journal of the Acoustical Society of America, 109(2), 748-763.

Johnson, M., Griffiths, T. L., & Goldwater, S. (2007). Adaptor grammars: a framework for specifying compositional nonparametric Bayesian models. Advances in Neural Information Processing Systems 19.

Jusczyk, P. W., & Aslin, R. N. (1995). Infants’ detection of the sound patterns of words in fluent speech. Cognitive Psychology, 29, 1-23.

Jusczyk, P. W., Houston, D. M., & Newsome, M. (1999). The beginnings of word segmentation in English-learning infants. Cognitive Psychology, 39, 159-207.

Maye, J., Werker, J. F., & Gerken, L. (2002). Infant sensitivity to distributional information can affect phonetic discrimination. Cognition, 82, B101-B111.

McMurray, B., Aslin, R. N., & Toscano, J. C. (2009). Statistical learning of phonetic categories: Computational insights and limitations. Developmental Science, 12(3), 369-378.

Neal, R. M. (1998). Markov chain sampling methods for Dirichlet process mixture models. Technical Report No. 9815, Department of Statistics, University of Toronto.

Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101, 1566-1581.

Thiessen, E. D. (2007). The effect of distributional information on children’s use of phonemic contrasts. Journal of Memory and Language, 56(1), 16-34.

Toscano, J. C., & McMurray, B. (2008). Using the distributional statistics of speech sounds for weighting and integrating acoustic cues. In B. C. Love, K. McRae, & V. M. Sloutsky (Eds.), Proceedings of the 30th Annual Conference of the Cognitive Science Society (p. 433-438). Austin, TX: Cognitive Science Society.

Vallabha, G. K., McClelland, J. L., Pons, F., Werker, J. F., & Amano, S. (2007). Unsupervised learning of vowel categories from infant-directed speech. Proceedings of the National Academy of Sciences, 104, 13273-13278.

Werker, J. F., & Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7, 49-63.

Motherese in Interaction: At the Cross-Road of Emotion and Cognition? (A Systematic Review)

注釈

  • Citation: Saint-Georges C, Chetouani M, Cassel R, Apicella F, Mahdhaoui A, et al. (2013) Motherese in Interaction: At the Cross-Road of Emotion and Cognition? (A Systematic Review). PLoS ONE 8(10): e78103. doi:10.1371/journal.pone.0078103

  • Editor: Atsushi Senju, Birkbeck, University of London, United Kingdom

  • Received July 19, 2013; Accepted September 6, 2013; Published October 18, 2013

  • Copyright: © 2013 Saint-Georges et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • Funding: The authors have no support or funding to report.

  • Competing interests: The authors have declared that no competing interests exist.
Abstract (要約)

Various aspects of motherese also known as infant-directed speech (IDS) have been studied for many years. As it is a widespread phenomenon, it is suspected to play some important roles in infant development. Therefore, our purpose was to provide an update of the evidence accumulated by reviewing all of the empirical or experimental studies that have been published since 1966 on IDS driving factors and impacts. Two databases were screened and 144 relevant studies were retained. General linguistic and prosodic characteristics of IDS were found in a variety of languages, and IDS was not restricted to mothers. IDS varied with factors associated with the caregiver (e.g., cultural, psychological and physiological) and the infant (e.g., reactivity and interactive feedback). IDS promoted infants’ affect, attention and language learning. Cognitive aspects of IDS have been widely studied whereas affective ones still need to be developed. However, during interactions, the following two observations were notable:

    1. IDS prosody reflects emotional charges and meets infants’ preferences, and
    2. mother-infant contingency and synchrony are crucial for IDS production and prolongation.

Thus, IDS is part of an interactive loop that may play an important role in infants’ cognitive and social development.

注釈

翻訳

対乳児音声 ( Infant-Directed Speech : IDS ) としても知られるマザリーズの様々な側面は長い間研究されてきた. これは広く見られる現象であるため, 乳児の発達において重要な役割を果たすのではないかと考えられてきた. したがって, 我々の目的は, IDS を生じさせる要因とその影響について 1966 年以降に公開された経験的・実験的研究のすべてをレビューすることによって, 蓄積された証拠を更新することである. 2つのデータベースをスクリーニングした結果, 144 の関連研究が収集された. IDS の一般的な言語学的・韻律的特徴は様々な言語で見られ, IDS は母親に限られるものではなかった. IDS は養育者側の要因 (例えば, 文化的, 心理的, 生理的要因) と乳児側の要因 (例えば, 反応性やインタラクティブなフィードバック) によって変化する. IDS は乳児の情動, 注意, 言語学習を促進する. IDS の認知的側面は広く研究されてきたが, 情動的側面の研究はまだ発展の途上にある. しかし, インタラクションの間には, 以下の2つの観察が注目に値する.

  1. IDSの韻律は感情的なものを反映しており、乳幼児の好みを満たしている
  2. 母-子 の偶発性と同期性は IDS 発話と延長にとって重要なものである

そのため, IDS は乳児の認知的・社会的な発達に重要な役割を果たす相互作用的なループの一部である.

Introduction (導入)

Motherese, also known as infant-directed speech (IDS) or “baby-talk”, refers to the spontaneous way in which mothers, fathers, and caregivers speak with infants and young children. In a review of the various terms used to denote young children’s language environments, Saxton suggested the preferential use of “infant or child-directed speech” [1] . In 1964, a linguist [2] defined “baby-talk” as “a linguistic subsystem regarded by a speech community as being primarily appropriate for talking to young children”. He reported that “baby talk” was a well-known, special form of speech that occurred in a number of languages and included the following 3 characteristics:

    1. intonational and paralinguistic phenomena (e.g., a higher overall pitch),
    2. words and constructions derived from the normal language (e.g., the use of third person constructions to replace first and second person constructions),
    3. a set of lexical items that are specific for baby talk.

He provided a precise, documented study of IDS across several different languages. Since then, infant-directed speech has been studied extensively across a number of interactive situations and contexts, especially by researchers interested in understanding language acquisition. A recent review of “babytalk” literature focused on phonological, lexical and syntactic aspects of the input provided to infants from the perspective of language acquisition and comprehension [3] . Although Snow, in a review of the early literature on motherese [4] , claimed that “language acquisition is the result of a process of interaction between mother and child, which begins early in infancy, to which the child makes as important a contribution as the mother, and which is crucial to cognitive and emotional development as well as language acquisition”, few experimental findings have sustained this assertion. Recent progresses in cognitive science and in interactional perspective suggest, however, that infant cognitive development is linked with social interaction (e.g., Kuhl et al., 2003). Motherese could be a crossroad for such a linkage. Here, we aim to review the available evidence relevant to motherese from an interactional perspective, with a specific focus on children younger than 2 years of age. In contrast with Soderstrom’s review (2007), we focus more preferentially on motherese’s prosodic and affective aspects to determine the factors, including interactive ones, associated with its production and variations, its known effects on infants and its suspected functions aside from language acquisition.

注釈

IDS や “ベイビー・トーク” としても知られているマザリーズとは, 母親, 父親, その他介護者が乳幼児と自然に話した際の話し方を指す. 乳児の言語環境を指すために使用される様々な用語のレビューにおいて, Saxton は ” 乳児-児童発話 ” の優先的な使用を示唆している. 1964 年に, ある言語学者が “ベイビー・トーク” を “幼い子どもに話しかけるための主要で適切であると発話コミュニティによってみなされる言語的なサブシステム” であると定義した. 彼は “ベイビー・トーク” は一般に広く知られている, 多くの言語で発生する, 以下の3つの特徴が含まれている特別な発話の形式であると報告した.

  1. イントネーションやパラ言語情報 (例えば, ピッチが高い)
  2. 通常の言語から派生した単語や構造 (例えば, 第一, 第二人称と交換される第三人称の使用)
  3. ベイビー・トーク のための特別な語彙のセット

彼は、複数の異なる言語を横断した、IDS に関する正確で文書化された研究を提供した。 それ以来、IDS は、特に言語獲得の理解に関心を持つ研究者によって、多くのインタラクション状況や文脈にわたって広範囲に研究されてきた。 最近の “ベイビー・トーク” に関する文献レビューは、言語獲得と理解の観点から、乳児に提供される入力音声の音韻的、語彙的、文法的側面に注目している [3] 。 Snow はマザリーズに関する初期の文献のレビュー [4] の中で、“言語獲得とは乳幼児期の初期に始まる母親と子供のインタラクション過程の結果であり、そこでは子供は母親と同等に重要な貢献を果たし、この過程は言語獲得だけでなく認知的・感情的発達にも不可欠である” と主張したが、この主張を支持する実験的知見はほとんどなかった。 しかし、認知科学とインタラクション的視点における最近の進展は、乳児の認知的発達が社会的インタラクションと関連していることを示唆している(例えば、Kuhl et al., 2003)。 マザリーズはこのような関連のための交差点であるかもしれない。 そこで、我々は2歳未満の乳児に焦点を絞り、インタラクションの視点からマザリーズに関連する利用可能な証拠を整理することを目的とする。 Soderstrom のレビュー (2007) とは対照的に、我々はマザリーズの韻律的・感情的側面により重点を置き、その産出と変化に関連する(インタラクティブなものを含む)要因、乳児への既知の効果、および言語獲得以外に想定される機能を明らかにする。

Methods (方法)

We searched the PubMed and PsycInfo databases from January 1966 to March 2011 using the following criteria: journal article or book chapter with ‘‘motherese’’ or ‘‘infant-directed speech’’ within the title or abstract, published in the English language and limited to human subjects. A diagram summarizing the literature search process is provided in :num:`Figure #diagram-flow` . We found 90 papers with PubMed and 134 with PsycInfo, of which 59 were shared across the databases, for a total of 165 papers. We excluded 50 papers because 11 were reviews or essays and 39 were experimental studies that did not aim to improve knowledge on IDS as they addressed other aims (see details in Annex S1). We found an additional 29 references by screening the reference lists of the 115 papers, leading to a total of 144 relevant papers.

注釈

我々は PubMed と PsycInfo データベースを対象に、1966 年 1 月から 2011 年 3 月まで以下の基準で検索を行った。

  • “マザリーズ” もしくは “infant-directed speech” という単語をタイトルもしくはアブストラクトに含む ジャーナルあるいは本の章
  • 英語で出版されているもの
  • 人間を対象にしているもの

文献検索の過程は :num:`Figure #diagram-flow` に示す。

_images/fig12.png

Diagram flow of the literature search.

  • doi: 10.1371/journal.pone.0078103.g001

注釈

我々は PubMed から 90 本、PsycInfo から 134 本の論文を見つけた。 そのうち 59 本は両データベースで重複していたため、合計は 165 本であった。 このうち 50 本を除外した。その内訳は、11 本がレビューやエッセイ、39 本が(他の目的を扱っており)IDS に関する知見の向上を目的としない実験的研究であった(詳細は付録 S1 を参照)。 残った 115 本の論文の引用文献リストをスクリーニングすることで、さらに 29 本の文献を追加し、合計 144 本の関連論文を得た。
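
参考として、上記の文献数の集計(90+134 本から重複 59 本を除いて 165 本、50 本を除外して 115 本、引用スクリーニングで 29 本を追加して 144 本)を確認するだけの簡単な例を示す(数値はすべて本文による)。

```python
# 本文に示された文献数の集計を確認するだけの例(数値は本文のとおり)
pubmed, psycinfo, shared = 90, 134, 59
total = pubmed + psycinfo - shared   # 重複を除いた論文数 -> 165
after_exclusion = total - 50         # レビュー・対象外研究を除外 -> 115
final = after_exclusion + 29         # 引用スクリーニングで追加 -> 144
print(total, after_exclusion, final)  # 165 115 144
```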

Results (結果)
1: General comments (総括)

Table 1 lists the relevant studies and the number of subjects included according to each domain of interest. The following observations are evident:

    1. certain points are well documented (e.g., IDS’s effect on language acquisition), whereas others have received less support (e.g., IDS production according to gender and the course of infants’ preference for IDS);
    2. the sample sizes between studies range from 1 to 276, with 1/3 of the studies having N≤15, 1/3 having 15<N<40, and 1/3 having N≥40;
    3. methodologies vary greatly between studies with regard to design and sample characteristics (e.g., the type of locator and infants’ ages).

The results are presented in several sections. Concerning IDS production, we will first review its general characteristics and, then, its variations according to maternal language, infants’ age, gender, vocalizations, abilities and reactivities, and parental individual differences. For IDS’s effects on infants, we listed the following 4 main functions of IDS: communicating affect, facilitating social interaction through infants’ preferences, engaging and maintaining infants’ attention, and facilitating language acquisition. The discussion incorporates selected articles dealing with theoretical considerations and those that included the boundaries of the concept of motherese.

注釈

表1は,関連研究と,それぞれの関心領域に含まれる被験者数の一覧である. 以下の点が見て取れる.

  1. 特定の点(例えば,言語獲得に対する IDS の効果)はよく文書化されている一方,他の点はあまり検討されていない.
    • 例えば,性別による IDS の産出や,乳児の IDS 選好の経時的な変化はあまり研究されていない.
  2. 研究間のサンプルサイズの範囲は 1 人から 276 人まで
    • 三分の一の研究は15名以下,三分の一の研究は 15人 より大きく 40人より小さい,残りの三分の一は40名以上
  3. デザインとサンプルの特性に応じて方法論は研究間で大きく異る
    • 例えば,話し手の種類や乳児の年齢など

結果は複数の節で示す. IDS の産出に関しては,まずその一般的な特徴をレビューし,次に母親の言語,乳児の年齢,性別,発声,能力,反応性,および親の個人差に応じた変化をレビューする. 乳児に対する IDS の影響に関しては,以下の IDS の4つの主要な機能をリストした.

  • 情動を伝達する機能
  • 乳児の選好を通して社会的相互作用を促進する機能
  • 乳児の注目を引き,維持する機能
  • 言語獲得を容易にする機能

議論では,理論的な考察を扱った論文や,マザリーズという概念の境界に関わる論文も取り上げる.

Table 1. Characteristics of the studies included in the review.
Author-year N* subjects Design of the study Main objective to explore or assess… (motherese features and variations)
Durkin 1982 18 Cross-sectional observational Functions of use of proper names
Fernald 1984 24 Paired comparisons IDS/simulated IDS/ADS Prosodic features according to infant feed-back
Fisher 1995 20 Paired comparisons IDS/ADS Prosodic features on new/given words
Soderstrom 2008 2 Longitudinal case series Prosodic and linguistic features
Fernald 1991 18 Paired comparisons IDS/ADS Prosodic features on focused words
Ogle 1993 8 Cross-overIDS/ADS (electrolaryngography) Prosodic features (F0 measures)
Fernald 1989a 30 Paired comparisons IDS/ADS Prosodic features for mothers and fathers across 6 languages
Niwano 2003b 3 Paired comparisons IDS/ADS Prosodic features for mothers and fathers + infant’s responses
Shute 1999 16 Paired comparisons IDS/ADS Prosodic features for fathers speaking and reading aloud
Shute 2001 16 Paired comparisons IDS/ADS Prosodic features for grandmothers speaking and reading aloud
Nwokah 1987 16 Case-control (Mother/maid) Linguistic and functional features of maids’ IDS
Katz 1996 49 Paired comparisons with pragmatic categories of IDS Prosodic contours according to intention
Stern 1982 6 Case-series Prosodic contours according to intention & grammar and context
Papoušek 1991 20 Case-control (Chinese/English) Prosodic contours according to context in different languages
Slaney 2003 12 Paired comparisons (IDS with various intentions) Acoustic measures according to affect (automatic classification)
Trainor 2000 96 Paired comparisons IDS/ADS with various emotions Links between IDS and affective expression
Inoue 2011 24 Paired comparisons IDS/ADS Whether Mel-frequency cepstral coefficients discriminate IDS from ADS
Mahdhaoui 2011 11 Paired comparisons IDS/ADS Automatic detection based on prosodic and segmental features
Cristia 2010 55 Paired comparisons IDS/ADS Enhancement of consonantal categories
Albin 1996 16 Paired comparisons IDS/ADS Lengthening of word-final syllables
Swanson 1992 15 Paired comparisons IDS/ADS Vowel duration of content words as opposed to function words
Swanson 1994 22 Paired comparisons IDS/ADS Vowel duration of function-word in utterance final position
Englund 2006 6 Longitudinal Paired comparisons IDS/ADS Vowels and consonant specification throughout the first semester
Englund 2005a 6 Longitudinal Paired comparisons IDS/ADS Spectral attributes and duration of vowels throughout a semester
Englund 2005b 6 Longitudinal Paired comparisons IDS/ADS Evolution of voice onset time in stops throughout a semester
Lee 2010 10 Case-control (IDS/ADS) Segmental distribution patterns in English IDS
Shute 1989 8 Paired comparisons IDS/ADS Pitch Variations in British IDS (compared to American IDS)
Segal 2009 11 Longitudinal descriptive study Prosodic and lexical features in Hebrew IDS
Lee 2008 10 Paired comparisons IDS/ADS Segmental distribution patterns in Korean IDS
Grieser 1988 8 Paired comparisons IDS/ADS Prosodic features in a tone language IDS (Mandarin Chinese)
Liu 2007 16 Paired comparisons IDS/ADS Exaggeration of lexical tones in Mandarin IDS
Fais 2010 10 Paired comparisons IDS/ADS Vowel devoicing in Japanese IDS
Masataka 1992 8 Paired comparisons IDS/ADS Rhythm : repetition and gestural exaggeration in Japanese sign language
Reilly 1996 15 Longitudinal descriptive study Competition between affect and grammar in American sign language
Werker 2007 30 Cross-language comparison Differences in distributional properties of vowel phonetic categories
Kitamura 2003 12 Longitudinal Paired comparisons IDS/ADS Pitch and communicative intent according to age
Stern 1983 6 Longitudinal case-series Prosodic features evolution
Niwano 2002b 50 Longitudinal case-series Pitch and Prosodic contours according to age
Liu 2009 17 Longitudinal Paired comparisons IDS/ADS Prosodic and phonetic features according to age
Kajikawa 2004 2 Longitudinal case-series Adult conversational style (Speech overlap) emergence in Japanese IDS
Amano 2006 5 Longitudinal case-series Changes in F0 according to infant age and language acquisition stage
Snow 1972 12/24/6 Paired comparisons IDS/CDS Linguistic features according to children age
Kitamura 2002 22 Longitudinal Paired comparisons IDS/ADS Pitch according to infant age and gender in English and Thaï languages
Braarud 2008 32 Paired comparisons synchrony/dyssynchrony IDS quantity according to infant feed-back and synchrony
Smith 2008 18 Controlled trial (2 experimental groups) Pitch variations according to infant feed-back from the pitch
Shimura 1992 8 Correlation study Between mother and infant vocalizations (pitch & duration & latency & melody)
Van Puyvelde 2010 15 Correlation study Between mother and infant vocalizations (pitch & melody)
McRoberts 1997 1 Longitudinal case-study Mother & father and infant adjustment of pitch vocalizations during interaction
Reissland 1999 13 Case-control (premature/term infants) Timing and reciprocal vocal responsiveness of mothers and infants
Niwano 2003a 1 Paired comparisons (mother with twins) Pitch and contours variations according to infant reactivity
Reissland 2002 48 Case-control (age) + Correlation study Pitch of IDS surprise exclamation according to infant age/reaction to surprise
Lederberg 1984 15 Paired comparisons deaf/hearing children Adult adjustment in interaction with deaf children
Fidler 2003 36 Case-control (Down syndrome/other MR) Pitch’s mean and variance in parental IDS to Down syndrome/other MR
Gogate 2000 24 Case-control (5-8;9-17;21-30 months) Multimodal IDS according to infants’ levels of lexical-mapping development
Kavanaugh 1982 4 Longitudinal case-series Mother/father linguistic input according to apparition of productive language
Bohannon 1977 20 Correlation study MLU of IDS according to child’s feed-back of comprehension
Bohannon 20 Paired comparisons (manipulating feed-back)  
Bergeson 2006 27 Case-control (cochlear implant/control) IDS adjustment (pitch & MLU & rhythm) according to childs’ hearing experience
Kondaurova 2010 27 Longitudinal case-control IDS adjustment according to child’s hearing experience and age
Ikeda 1999 61 Paired comparisons IDS/ADS Variations according to various life experience (especially having sibling)
Hoff 2005 63 Prospective study Variations of linguistic input and teaching practices according to parental socioeconomic status or education & repercussions on child vocabulary-
Hoff 2005 662 Cross-sectional study  
Hoff-Ginsberg 1991 63 Cross-sectional study Variations of input according to parental socio-economic status (SES)
Matsuda 2011 65 Correlation study Functional MRI of adults listening to IDS according to gender & parental status
Gordon 2010 160 Prospective study Oxytocin level according to infant’s age and correlation with parenting
Bettes 1998 36 Case-control Maternal behavior (including IDS prosody) according to depression status
Herrera 2004 72 Case-control IDS content and touching according to maternal depression status
Kaplan 2001 44 Correlation study Variations according to maternal age and depression status
Wan 2008 50 Case-control Variations of IDS characteristics according to maternal schizophrenia status
Nwokah 1999 13 Case-control IDS amount & structure & content in maids compared with mothers
Burnham 2002 12 Paired comparisons IDS/ADS/petDS Pitch : affect (intonation + rhythm) and hyperarticulation in IDS versus petDS
Green 2010 25 Paired comparisons IDS/ADS Lip movements
Rice 1986 2 Case-series Description of speech in educational television programs compared with CDS
Fernald 1989b 5 Paired comparisons IDS/ADS with various intentions Adult’s detection of communicative intent according to prosodic contours
Bryant 2007 8 Paired comparisons IDS/ADS with various intentions Adult’s detection of communicative intent according to prosodic contours
Fernald 1993 120 Paired comparisons IDS/ADS with various intentions Communication of affect ( to infants) through prosodic contours
Papousek 1990 32 Paired comparisons approval/disapproval intent Communicating affect (looking response) through prosodic contours
Santesso 2007 39 Paired comparisons with various affects Psycho-physiological (ECG & EEG) responses to IDS with various affects
Monnot 1999 52 Correlation study IDS effects on infant’s development level and growth parameters
Santarcangelo 1988 6/4 Correlation study + paired comparisons IDS/ADS Developmentally disabled children’s preference (responsiveness & eye-gaze)
Werker 1989 60 Paired comparisons IDS/ADS with males/females Infant’s preference (looking & facial expression) for male and female IDS
Schachner 2010 20 Paired comparisons IDS/ADS Subsequent visual infant’s preference for the speaker
Masataka 1998 45 Paired comparisons IDS/ADS Infant’s preference for infant-directed (versus adult-directed) Sign Language
Cooper 1993 96 Paired comparisons IDS/ADS 1 month-old infant’s preference for IDS
Cooper 1990 28 Paired comparisons IDS/ADS Experimental (looking producing IDS) testing of 0-1 month-olds’ preference
Pegg 1992 92 Paired comparisons IDS/ADS Young infant’s attentional and affective preference for male and female IDS
Niwano 2002a 40 Paired comparisons with manipulated IDS Infant’s preference (through eliciting vocal response)
Hayashi 2001 8 Longitudinal paired comparisons IDS/ADS Developmental change in infant’s preference (according to age)
Newman 2006 90 Paired comparisons IDS/ADS at 3 ages/2 noise levels Change in infant’s preference according to developmental age and to noise
Panneton 2006 48 Paired comparisons with manipulated IDS at 2 ages  
Cooper 1997 20/20/23 3 Paired comparisons IDS/ADS in various conditions Change in infant’s preference according to age and speaker (mother/stranger)
Hepper 1993 30 Paired comparisons IDS/ADS New-born’s preference for maternal IDS or ADS
Kitamura 2009 24 3 Paired comparisons IDS with various contours Change in determinants of infant’s preference according to developmental age
Kaplan 1994 45/80 2 Paired comparisons IDS with various contours Change in determinants of infant’s preference according to developmental age
Spence 2003 42 3 Paired comparisons IDS with various intents Intent categorization ability according to age (4 months/6 months)
Johnson 2002 210 Paired comparisons IDS/ADS (prosody or content) Adult’s preference for IDS/ADS according to history of head injury
Cooper 1994 12/20/20/16 4 Paired comparisons manipulated IDS/ADS Do pitch contours determine 1-month-olds’ preference for IDS?
Fernald 1987 20 Paired comparisons with manipulated IDS Do pitch & amplitude or rhythm determine 4-month-olds’ preference for IDS?
Leibold 2007 57 Paired comparisons with manipulated sounds Acoustic determinants of 4-month-olds’ preference for IDS
Trainor 1998 16 Paired comparisons low or high pitched songs Acoustic determinants of infant’s preference for IDS
Singh 2002 36 Paired comparisons IDS/ADS with various affects Does affect (emotional intensity) determine infant’s preference for IDS ?
McRoberts 2009 144/62/24/48 4 Paired comparisons with manipulated IDS /ADS Does repetition influence infant’s preference for age-inappropriate IDS/ADS?
Saito 2007 20 Paired comparisons IDS/ADS Does IDS activate brain of neonates (near-infra-red spectroscopy)?
Kaplan 1996 104/78/80 3 Paired comparisons IDS/ADS Does IDS (paired with what facial expressions) increase conditioned learning?
Kaplan 1995 77/26 Paired comparisons IDS/ADS Does IDS engage and maintain infant’s attention?
Senju 2008 20 Paired comparisons IDS/ADS Does IDS engage infant’s joint attention (eye-tracking) ?
Nakata 2004 43 Paired comparisons maternal IDS/maternal singing Does IDS engage and maintain infant’s attention over singing?
Kaplan 2002 12 Paired comparisons depressed/non depressed IDS Does IDS increase conditioned learning & according to mother depression?
Kaplan 1999 225 Controlled trials with IDS varying in quality Does IDS increase conditioned learning & according to mother depressiveness?
Kaplan 2010a 134 Case-control Does mother depression duration affect infant’s learning with normal IDS?
Kaplan 2004 40 Paired comparisons with maternal/female/male IDS Does IDS speaker’s gender affect learning by infants of depressed mothers?
Kaplan 2010b 141 Case-control (2x2 ANOVA) How marital status and mother depression affect learning with male IDS?
Kaplan 2007 39 Case-control Does father depression affect infant’s conditioned learning with paternal IDS?
Kaplan 2009 55 Correlation study Does maternal sensitivity affect infant’s learning with maternal IDS?
Karzon 1985 192 Controlled trials: IDS/manipulated IDS/ADS Do supra-segmental features of IDS help polysyllabic discrimination?
Karzon 1989 64 Controlled trials: falling/rising contours Does IDS prosody help syllabic discrimination and how?
Vallabha 2007   Automatic computed vowels categorization Does IDS prosody help categorization of sounds from the native language?
Trainor 2002 96 Controlled trials How IDS high pitch / IDS exaggerated contours help vowel discrimination?
Hirsh-Pasek 1987 16/24 Paired comparisons with manipulated IDS Does IDS prosody help to segment speech into clausal units?
Kemler Nelson 1989 32 Randomized controlled trials with IDS/ADS Does IDS/ADS prosody help to segment speech into clausal units?
Thiessen 2005 40 Controlled trials with IDS/ADS Does IDS prosody help word segmentation?
D’Odorico 2006 18 Case-control late-talker/typical peers Does (prosodic and linguistic) maternal input help language acquisition?
Curtin 2005 24 Serie of 5 experiments Does lexical stress help language acquisition (speech segmentation)?
Singh 2008 40 Serie of 4 experiments (controlled trials) Does IDS vocal affect help word recognition?
Colombo 1995 27 Paired comparisons with manipulated sounds Does F0 modulation in IDS help words recognition in a noisy ambient?
Zangl 2007 19/17 Paired comparisons IDS/ADS at 2 ages Does IDS/ADS prosody activate brain for familiar and unfamiliar words?
Song 2010 48 Paired comparisons IDS/manipulated IDS Does IDS rhythm/hyper-articulation/pitch amplitude help word recognition?
Bard 1983 94 4 Paired comparisons IDS/ADS with adult listeners Does IDS help word recognition & according to word contextual predictability?
Bortfeld 2010 16/32/24/80 4 paired comparisons IDS words with various stress Does emphatic stress in IDS prosody help word recognition ?
Kirchhoff 2005 Automatic Paired comparisons IDS/ADS words Does IDS prosody help automatic speech recognition ?
Singh 2009 32 Longitudinal paired comparisons (?) IDS/ADS Does IDS prosody help word recognition over the long-term?
Golinkoff 1995 61/79 Randomized controlled trials IDS/ADS Does IDS prosody help adult word recognition in an unfamiliar language?
Newport 1977 12 Longitudinal prospective correlation study Does maternal IDS linguistic properties predict child language acquisition?
Gleitman 1984 6/6 Same as Newport 1977 New analyses on the same data but with 2 age-equated groups
Scarborough 1986 9 Longitudinal prospective correlation study Does maternal IDS linguistic properties predict child language acquisition ?
Furrow 1979 7 Longitudinal prospective correlation study Does maternal IDS linguistic properties predict child language acquisition ?
Rowe 2008 47 Prospective study Does input according to parental SES affect child’s vocabulary?
Hampson 1993 45 Longitudinal prospective study Does maternal IDS linguistic properties predict language acquisition ?
Waterfall 2010 12 Longitudinal study + computational analysis Does IDS linguistic properties help language acquisition?
Onnis 2008 44/29 Randomized controlled trials Overlap/not Does IDS properties (overlapping sentences) help word/grammar acquisition?
Fernald 2006 24 Paired comparisons with words isolated/not Which properties (isolated words/short sentences) help language acquisition?
Kempe 2005 72/168 Randomized controlled trials Invariance/not Does IDS diminutives (final syllable invariance) help word segmentation?
Kempe 2007 486 Randomized controlled trials Invariance/not Does IDS diminutives (final syllable invariance) help word segmentation?
Kempe 2003 46 Paired comparisons with diminutives/not Does IDS diminutives help gender categorization?
Seva 2007 24/22 Paired comparisons with diminutives/not Does IDS diminutives help gender categorization?
2: Motherese characteristics (マザリーズの特徴)

The general linguistic and paralinguistic characteristics of motherese have been described in several previous works. Compared with Adult Directed Speech (ADS), IDS is characterized by shorter [5] [6] [7] , linguistically simpler, redundant utterances, which include isolated words and phrases, a large number of questions [7] , and the frequent use of proper names [8] . Regarding rhythm and prosody, longer pauses, a slower tempo, more prosodic repetitions, and a higher mean f0 (fundamental frequency: pitch) and wider f0-range have been reported [5] [6] [9] , with these findings supported by electrolaryngographic measures [10] . Similar patterns of IDS have been observed for fathers and mothers across various languages [11] [12] [13] , except with regard to the wider f0-range, and also for grandmothers interacting with their grandchildren [14] . In contrast, a maid’s IDS differs significantly from a mother’s IDS with regard to the amount and types of utterances present [15] .

注釈

マザリーズの一般的な言語学的・パラ言語学的特徴はいくつかの先行研究において記述されてきた。 ADS と比較して、IDS は短く [5] [6] [7] 、言語的に単純で冗長な発話であり、単独の単語やフレーズ、多くの質問を含み [7] 、固有名詞を頻繁に使用する [8] 。 リズムや韻律に関しては、より長いポーズ、より遅いテンポ、より多くの韻律的な繰り返し、より高い平均 f0(基本周波数:ピッチ)、より広い f0 レンジが報告されており [5] [6] [9] 、これらの知見は電気声門図(electrolaryngography)による計測によっても支持されている [10] 。 同様の IDS のパターンは、広い f0 レンジを除いて、様々な言語の母親と父親で観察されており [11] [12] [13] 、孫とインタラクションする祖母においても見られる [14] 。 対照的に、メイドの IDS は、発話の量と種類に関して母親の IDS と大きく異なる [15] 。
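
参考として、上で挙げられている平均 f0 や f0 レンジなどの要約指標を、抽出済みの f0 系列から計算する場合の最小限のスケッチを示す(論文の実装ではなく、無声フレームを 0 とする抽出済み f0 系列を仮定した numpy による例。数値はダミーである)。

```python
# 平均 f0・f0 の標準偏差・f0 レンジを,抽出済みの f0 系列(Hz)から計算する
# 最小限の例(本文で参照されている論文の実装ではない).無声区間は 0 とみなす.
import numpy as np

def f0_summary(f0_track: np.ndarray) -> dict:
    voiced = f0_track[f0_track > 0]  # 有声フレームのみを対象にする
    return {
        "mean_f0": float(voiced.mean()),                  # 平均 f0
        "sd_f0": float(voiced.std()),                     # f0 の標準偏差
        "f0_range": float(voiced.max() - voiced.min()),   # f0 レンジ
    }

# ダミーデータによる使用例(IDS の方が高く,変動が大きいという想定)
ids_track = np.array([0, 320, 340, 400, 380, 0, 290, 450])
ads_track = np.array([0, 180, 190, 200, 185, 0, 175, 210])
print("IDS:", f0_summary(ids_track))
print("ADS:", f0_summary(ads_track))
```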

Prosodic contours vary according to mothers’ intentions. Adults hearing content-filtered speech [16] or a language that they do not speak [17] were able to use the intonation to identify a mother’s intent (e.g., attention bid, approval, and comfort) with higher accuracy in IDS than in ADS. The prosodic patterns of IDS are more informative than those of ADS, and they provide infants with reliable cues about a speaker’s communicative intent. Indeed, f0 contour shape and f0 summary features (i.e., mean, standard deviation, and duration) discriminate the pragmatic categories (e.g., attention, approval, and comfort) from each other [18] . Mothers of 2- to 6-month-old infants use rising contours when seeking to initiate attention and eye contact, but they use sinusoidal and bell-shaped contours when seeking to maintain eye contact and positive affect with an infant who is already gazing and smiling. They also use specific contours for different sentence types, such as rise contours for yes-no questions, fall contours for “wh” questions and commands, and sinusoidal-bell contours for declarative sentences [19] . Moreover, across different languages, the same types of contours convey the same types of meanings, which include arousing/soothing, turn-opening/turn-closing, approving/disapproving, and didactic modeling [20] . Using pitch and spectral-shape measures, a Gaussian mixture-model discriminator designed to track affect in speech classified ADS (neutral affect) and IDS with more than 80% accuracy and further classified the affective message of IDS with 70% accuracy [21] . Indeed, the prosodic features of IDS are related to the widespread expression of emotion towards infants compared with the more inhibited expression of emotion evident in typical adult interactions. Few acoustic differences exist between IDS and ADS when expressing love, comfort, fear, and surprise, yet robust differences exist across these emotions [22] . Furthermore, in contrast with ADS, speech and laughter often co-occur in IDS [23] . Finally, IDS directed at 6-month-old human infants and pet-directed speech (PDS) [24] are similar in terms of heightened pitch and greater affect (i.e., intonation and rhythm). However, only IDS contains hyperarticulated vowels, which most likely aids in the emergence of language in human infants with both pragmatic and language teaching functions. Thus, IDS prosody appears to be crucial for communicating parents’ affect and intentions in a non-verbal way.

注釈

韻律の輪郭は母親の意図によって異なる。 内容をフィルタリングした音声 [16] や自分が話せない言語 [17] を聞いた大人は、ADS よりも IDS において高い精度で、イントネーションから母親の意図(例えば、注意喚起、承認、なだめ)を特定することができた。 IDS の韻律パターンは ADS のそれよりも情報量が多く、話し手の伝達意図についての信頼できる手がかりを乳児に提供する。 実際、f0 輪郭の形状と f0 の要約特徴(すなわち、平均、標準偏差、持続時間)は、語用論的カテゴリ(例えば、注意、承認、なだめ)を互いに弁別する [18] 。 2〜6ヶ月児の母親は、注意とアイコンタクトを開始しようとする際には上昇調の輪郭を使用するが、既に見つめて微笑んでいる乳児とのアイコンタクトと肯定的情動を維持しようとする際には、正弦波状やベル状の輪郭を使用する。 また、イエス・ノー疑問文では上昇調、”wh” 疑問文や命令文では下降調、平叙文では正弦波・ベル状の輪郭を使用するというように、文のタイプによって特定の輪郭を使い分ける [19] 。 さらに、異なる言語間でも、同じ種類の輪郭は同じ種類の意味を伝える。これには、喚起/なだめ、ターン開始/ターン終了、承認/不承認、教示的なモデリングが含まれる [20] 。 ピッチとスペクトル形状の指標を用いて、発話中の情動を追跡するように設計されたガウス混合モデルの識別器は、ADS(中立的な情動)と IDS を 80% 以上の精度で分類し、さらに IDS の情動メッセージを 70% の精度で分類した [21] 。 実際、IDS の韻律的特徴は、典型的な大人同士のインタラクションに見られるより抑制された感情表現と比較して、乳児に向けられた感情の広範な表出と関連している。 愛情、なだめ、恐怖、驚きを表現する際には IDS と ADS の間に音響的な違いはほとんど存在しないが、これらの感情の間には頑健な差が存在する [22] 。 さらに、ADS とは対照的に、IDS では発話と笑いがしばしば同時に生じる [23] 。 最後に、6ヶ月児に向けられた IDS と対ペット発話(PDS) [24] は、高いピッチやより大きな情動表現(すなわち、イントネーションとリズム)という点ではよく似ている。 しかし、過剰調音された母音を含むのは IDS のみであり、これは語用論的機能と言語教育機能の両方によって、人間の乳児における言語の出現を助けている可能性が高い。 したがって、IDS の韻律は、親の情動と意図を非言語的な方法で伝えるうえで重要であると思われる。
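
参考として、上の段落 [21] で述べられているガウス混合モデルによる IDS/ADS 識別の考え方を示す最小限のスケッチを示す(論文の実装そのものではなく、ピッチ・スペクトル形状の特徴量が既に抽出済みであることを仮定した scikit-learn による例。データはダミーである)。クラスごとに GMM を学習し、対数尤度の大小で分類するという設計は、生成モデルに基づく最も単純な識別方法の一つである。

```python
# 最小限のスケッチ(論文 [21] の実装ではない):クラスごとに 1 つの
# ガウス混合モデル (GMM) を学習し,対数尤度の比較で IDS / ADS を分類する.
# 特徴量(例:平均 f0,f0 の標準偏差,スペクトル重心など)は抽出済みと仮定する.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(features: np.ndarray, n_components: int = 4) -> GaussianMixture:
    """一方のクラス(IDS または ADS)の特徴ベクトル集合に GMM を当てはめる."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full", random_state=0)
    gmm.fit(features)
    return gmm

def classify(utt_features: np.ndarray,
             gmm_ids: GaussianMixture, gmm_ads: GaussianMixture) -> str:
    """各 GMM のもとでの平均対数尤度を比較して発話にラベルを付ける."""
    ll_ids = gmm_ids.score_samples(utt_features).mean()
    ll_ads = gmm_ads.score_samples(utt_features).mean()
    return "IDS" if ll_ids > ll_ads else "ADS"

# 乱数によるダミーデータでの使用例(行 = フレーム,列 = 特徴量)
rng = np.random.default_rng(0)
gmm_ids = train_gmm(rng.normal(loc=1.0, size=(200, 3)))
gmm_ads = train_gmm(rng.normal(loc=0.0, size=(200, 3)))
print(classify(rng.normal(loc=1.0, size=(50, 3)), gmm_ids, gmm_ads))
```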

In motherese, prosodic and phonetic cues highlight syntax and lexical units, and prosody provides cues regarding grammatical units at utterance boundaries and even at utterance-internal clause boundaries [7] . Indeed, mothers reading to their children lengthen vowels for content words [25] and function words when they appear in a final position [26] . Mothers also position target words on exaggerated pitch peaks in the utterance-final position [27] but lengthen final syllables, even in utterance-internal positions [28] .

注釈

マザリーズでは、韻律的・音素的な手がかりが構文や語彙の単位を際立たせ、韻律は発話境界だけでなく発話内部の節境界においても文法的単位についての手がかりを提供する [7] 。 実際、母親は子供に読み聞かせをする際、内容語 [25] や末尾位置に現れる機能語 [26] の母音を長くする。 また、母親は発話末の位置では誇張されたピッチのピークにターゲット語を配置する一方 [27] 、発話内部の位置でも末尾音節を延長する [28] 。

Although IDS analyses generally focus on supra-segmental prosodic cues, recent works aiming to computerize the recognition of motherese show that IDS’s segmental and prosodic characteristics are intertwined [29] [30] . The vocalic and consonantal categories are enhanced even when controlling for typical IDS prosodic characteristics [31] . Throughout the first 6 months, the vowel space is smaller and the vowel duration is longer, with some consonants also differing in duration and voice onset time. These characteristics may enhance both auditory and visual aspects of speech [32] , [33] [34] . Along with acoustic characteristics, visual cues seem to be a part of motherese, which suggests that hyperarticulation in natural IDS may visually and acoustically enhance speech. Indeed, lip movements are larger during IDS than ADS [35] .

注釈

IDS の分析は一般に超分節的な韻律キューに焦点を当てているが、マザリーズの認識を計算機化することを目指す最近の研究は、IDS の分節的特性と韻律的特性が密接に絡み合っていることを示している [29] [30] 。 母音と子音のカテゴリは、典型的な IDS の韻律特性を統制した場合でも強調されている [31] 。 最初の6ヶ月を通して、母音空間はより小さく、母音の持続時間はより長く、いくつかの子音では持続時間やボイスオンセットタイムも異なる。 これらの特性は、音声の聴覚的・視覚的側面の両方を強化しうる [32] [33] [34] 。 音響的特性に加えて、視覚的な手がかりもマザリーズの一部であると思われ、自然な IDS における過剰調音が音声を視覚的にも音響的にも強化している可能性を示唆する。 実際、唇の動きは ADS より IDS の方が大きい [35] 。

3: Variations in motherese characteristics (マザリーズの特徴のバリエーション)
3.1. According to language (言語的側面)

Specific forms of IDS are evident across various languages, including Western European languages [11] [36] [37] , Hebrew [38] , Korean [39] , Mandarin [40] [41] , Japanese [42] and even American Sign Language (ASL) between deaf mothers and their deaf children [43] [44] [45] . Although general trends in the form of IDS exist, they may be mediated by linguistic and cultural factors. French, Italian, German, Japanese, British English and American English IDS share some general features (i.e., higher mean f0, greater f0 variability, shorter utterances, and longer pauses) but maintain distinct characteristics. For example, American IDS exhibits the most extreme prosodic modifications [11] , whereas British IDS exhibits smaller increases in vocal pitch [37] and has language-specific segmental distribution patterns when compared with Korean IDS [36] . Moreover, observations suggest that mothers adapt their IDS to the language-specific needs of their infants, for example, Japanese mothers alter phonetic cues that are more relevant in Japanese, whereas English mothers alter cues that are more relevant in English [46] . However, when a conflict arises between motherese features and language specificities (because some IDS features may disturb language salience), IDS tends to preserve the cues that are essential in ADS. Indeed, IDS prosody does not distort the acoustic cues essential to word meaning at the syllable level in Mandarin, which is a “tone language” [41] , and this is also evident for the Japanese vowel devoicing [42] . When there is a conflict between grammatical and affective facial expressions in ASL IDS, mothers shift from stressing affect to grammar around the time of their children’s second birthday [45] .

注釈

IDS の特定の形式は、様々な言語(西ヨーロッパ言語 [11] [36] [37] 、ヘブライ語 [38] 、韓国語 [39] 、マンダリン [40] [41] 、日本語 [42] 、さらには聴覚障害の母親とその聴覚障害児の間のアメリカ手話(ASL) [43] [44] [45] )にわたって明確に見られる。 IDS の形式には一般的な傾向が存在するが、それらは言語的・文化的要因によって調整されている可能性がある。 フランス語、イタリア語、ドイツ語、日本語、英国英語、アメリカ英語の IDS はいくつかの一般的特徴(すなわち、より高い平均 f0、より大きな f0 の変動、より短い発話、より長いポーズ)を共有しているが、それぞれ異なる特徴も保持している。 例えば、アメリカ英語の IDS は最も極端な韻律的変化を示すが [11] 、英国英語の IDS では声のピッチの上昇がより小さく [37] 、韓国語の IDS と比較すると言語固有の分節的分布パターンを持つ [36] 。 さらに、観察からは、母親が IDS を乳児の言語固有のニーズに適応させていることが示唆される。例えば、日本の母親は日本語においてより重要な音素的キューを変化させ、英語の母親は英語においてより重要なキューを変化させる [46] 。 しかし、マザリーズの特徴と言語の特殊性の間に競合が生じた場合(IDS のいくつかの特徴は言語的な顕著性を損ないうるため)、IDS は ADS において不可欠なキューを保持する傾向がある。 実際、IDS の韻律は、「声調言語」であるマンダリンにおいて音節レベルで語の意味に不可欠な音響的キューを歪めない [41] し、これは日本語の母音の無声化についても同様である [42] 。 ASL の IDS において文法的表情と感情的表情が競合する場合、母親は子供の2歳の誕生日の頃に、感情の強調から文法の強調へと移行する [45] 。

3.2. According to infants’ age and gender (乳児の年齢、性別)

IDS quality and quantity vary as children develop. The mean f0 seems to increase from birth, peak at approximately 4 to 6 months, and decrease slowly until the age of two years or older [47] [48] . Acoustic exaggeration is also smaller in child-directed speech (CDS) than in IDS [49] . Prosodic contours vary with infants’ age [50] , with “comforting” prevalent between 0 and 3 months and then decreasing with age, “expressing affection” and “approval” peaking at 6 months and being least evident at 9 months, and “directive” utterances, which are rare at birth, peaking at 9 months of age [47] . This is consistent with a change in pragmatic function between 3 and 6 months of age, as parental speech becomes less affective and more informative [3] . Variations in the mean length of utterances (MLU) are more controversial, and Soderstrom emphasized that some properties, such as linguistic simplifications, could be beneficial at one age but problematic at another age. In fact, two small sample studies suggest that mothers adjust their IDS as a function of their children’s language ability. Around the two-word utterance period, an adult-like conversational style with frequent overlaps emerges in Japanese IDS [51] , which has a mean f0 that reaches approximately the same value as that of ADS [52] . Mothers may continue to adjust their speech syntax to their children’s age and to the child’s feedback as children grow older [53] , but more longitudinal studies investigating the evolution of the linguistic aspects of IDS are needed.

注釈

IDS の質と量は子供の発達に伴って変化する。 平均 f0 は誕生から上昇し、およそ 4〜6 ヶ月でピークに達し、2歳以降までゆっくりと低下するようである [47] [48] 。 音響的な誇張も、IDS に比べて対児童音声(CDS)では小さい [49] 。 韻律輪郭は乳児の年齢によって変化し [50] 、「なだめ」は 0〜3 ヶ月に多く見られその後年齢とともに減少し、「愛情表現」と「承認」は 6 ヶ月でピークに達して 9 ヶ月で最も少なくなり、誕生時にはまれな「指示的」発話は 9 ヶ月でピークに達する [47] 。 これは、親の発話が感情的でなくなり情報的になるという、3〜6 ヶ月の間の語用論的機能の変化と一致している [3] 。 平均発話長(Mean Length of Utterance: MLU)の変化についてはより議論が分かれており、Soderstrom は、言語的単純化などのいくつかの特性が、ある年齢では有益でも別の年齢では問題になりうることを強調している。 実際、サンプルサイズの小さい2つの研究は、母親が子供の言語能力に応じて IDS を調整していることを示唆している。 二語発話期の頃には、日本語の IDS において発話の重なりが頻繁に生じる大人のような会話スタイルが現れ [51] 、平均 f0 は ADS とほぼ同じ値に達する [52] 。 母親は子供が成長するにつれて、子供の年齢とフィードバックに合わせて発話の統語を調整し続けている可能性があるが [53] 、IDS の言語的側面の変化を調査する縦断的研究がさらに必要である。

Infants’ gender may also modify IDS characteristics. When using IDS with their 0 to 12-month-old infants, Australian mothers used higher f0 and f0 ranges and more rising utterances for girls than boys, whereas Thai mothers used a more subdued mean f0 and more falling utterances for girls than boys [54] . Given that the gender of the infant is not neutral in interactional processes (see, for example 55), its impact on motherese should be further explored in motherese studies.

注釈

乳児の性別も IDS の特徴を変化させうる。 0〜12 ヶ月の乳児に IDS を使用する際、オーストラリアの母親は男児よりも女児に対して、より高い f0 と f0 レンジ、より多くの上昇調の発話を用いたのに対し、タイの母親は男児よりも女児に対して、より抑えた平均 f0 とより多くの下降調の発話を用いた [54] 。 乳児の性別がインタラクションの過程において中立ではないことを考慮すると(例えば 55 を参照)、マザリーズへのその影響は今後のマザリーズ研究でさらに検討されるべきである。

3.3. According to infant vocalizations, ability and reactivity (乳児の発話、能力と準備に関して)

Gleason suggested that children’s feedback helps shape the language behavior of those who speak to them [56] . Indeed, Fernald, in a comparison of IDS, simulated IDS (to an absent baby) and ADS, showed that an infant’s presence facilitates IDS production. In simulated IDS, the mean f0 did not rise significantly compared with that of ADS, and other features, though they differed significantly from ADS, were intermediately between those of IDS and ADS [9] . In fact, IDS is dynamically affected by infants’ feedback. For example, IDS is reduced when the contingency of an infant’s responses is disturbed by decoupling TV sequences of the mother-infant interaction [57] . Furthermore, mothers produce higher IDS pitch when, through an experimental manipulation, IDS high pitch seems to elicit infants’ engagement, compared to another manipulation in which low pitch seems to strengthen infant’s engagement [58] . Mothers may also match their pitch to infants’ vocalizations. In the first 3 months, IDS and infants’ vocalizations are correlated in pitch, and even melody types are correlated in some mother-infant pairs [59] with tonal synchrony [60] . This correlation may be due to the parents, given that, in a longitudinal case study, parents consistently adjusted their vocal patterns to their 3 to 17-month-old infants’ vocal patterns, whereas infants did not adjust their vocal patterns to their parents’ vocal patterns [61] .

注釈

Gleason は、子供のフィードバックが、子供に話しかける人の言語行動を形作るのを助けると示唆している [56] 。 実際、Fernald は、IDS、(不在の乳児に向けた)模擬 IDS、ADS の比較において、乳児の存在が IDS の産出を促進することを示した。 模擬 IDS では平均 f0 は ADS と比較して有意に上昇せず、他の特徴は ADS とは有意に異なるものの、IDS と ADS の中間的な値であった [9] 。 実際、IDS は乳児のフィードバックに動的に影響される。 例えば、母子インタラクションの TV 映像の同期をずらすことで乳児の反応の随伴性が乱されると、IDS は減少する [57] 。 さらに、実験的操作によって IDS の高いピッチが乳児の関与を引き出すように見える条件では、低いピッチが乳児の関与を強めるように見える条件と比べて、母親はより高いピッチの IDS を産出する [58] 。 母親はまた、自分のピッチを乳児の発声に合わせている可能性がある。 最初の3ヶ月では、IDS と乳児の発声はピッチにおいて相関し、一部の母子ペアではメロディーのタイプまで相関しており [59] 、音調の同期が見られる [60] 。 縦断的ケーススタディにおいて、両親は一貫して 3〜17 ヶ月の乳児の発声パターンに自分の発声パターンを合わせていたのに対し、乳児は親の発声パターンに合わせていなかったことを考慮すると、この相関は親に起因すると考えられる [61] 。

In addition, mothers adapt their IDS to infants’ abilities and needs. A number of studies have shown that mothers strengthen their IDS according to the perceived lack of communicative abilities of their child. Although full-term infants more often followed their mothers’ utterances with a vocalization than preterm infants did, mothers of premature babies more often followed their infants’ vocalizations with an utterance directed at the infants than did mothers of full-term babies [62] . A mother of two 3-month-old fraternal twins accommodated her IDS by using a higher mean f0 and rising intonation contours when she spoke to the infant whose vocal responses were less frequent [63] . Similarly, playing with a Jack-in-the-box, mothers exclaimed in surprise with a higher pitch when their children did not show a surprise facial expression. Infants’ expressions were a stronger predictor of maternal vocal pitch than their ages [64] . Mothers interacting with an unfamiliar deaf 5-year-old child used more visual communicative devices, touches, simpler speech, and frequent initiations than when communicating with an unfamiliar hearing 4.5-year-old child. Although each initiation toward the deaf child was less successful than the previous one, interactions occurred as frequently as with the hearing child [65] . Finally, parents of children with Down syndrome (which is a visible disability) spoke with a significantly higher f0 mean and variance than did parents of children with other types of mental retardation [66] .

注釈

加えて、母親は IDS を乳児の能力とニーズに適応させている。 多くの研究が、母親は子供のコミュニケーション能力の欠如を知覚すると、それに応じて IDS を強めることを示している。 満期産の乳児は早産の乳児よりも頻繁に母親の発話に続けて発声したが、早産児の母親は満期産児の母親よりも頻繁に、乳児の発声に続けて乳児に向けた発話を行った [62] 。 3ヶ月の二卵性双生児の母親は、発声による反応が少ない方の乳児に話しかける際、より高い平均 f0 と上昇調のイントネーション輪郭を用いて IDS を調整した [63] 。 同様に、びっくり箱で一緒に遊んでいる際、子供が驚きの表情を示さなかった場合には、母親はより高いピッチで驚きの声を上げた。 乳児の表情は、乳児の年齢よりも母親の声のピッチの強い予測因子であった [64] 。 初対面の聴覚障害の5歳児とインタラクションする母親は、初対面の健聴の4.5歳児とコミュニケーションをとるときよりも、視覚的なコミュニケーション手段、接触、より単純な発話、頻繁な働きかけを多く用いた。 聴覚障害児への働きかけは回を追うごとに成功しにくくなったものの、インタラクションは健聴児の場合と同じ頻度で生じた [65] 。 最後に、ダウン症(外見から分かる障害である)の子供を持つ親は、他のタイプの精神遅滞の子供を持つ親に比べて、有意に高い f0 の平均と分散で話した [66] 。

Mothers tailor their communication to their infants’ levels of lexical-mapping development. When teaching their infants target words for distinct objects, mothers used target words more often than non-target words in synchrony with the object’s motion and touch. This mothers’ use of synchrony decreased with infants’ decreasing reliance on synchrony as they aged [67] . Similarly, IDS’s semantic content shows strong relationships with changes in children’s language development from zero to one-word utterances [68] , and a clear signal of non-comprehension from children results in shorter utterances [69] . In the IDS directed toward their profoundly deaf infants with cochlear implants, mothers tailored pre-boundary vowel lengthening to their infants’ hearing experience (i.e., linguistic needs) rather than to their chronological age, yet they all exaggerated the prosodic characteristics of IDS (i.e., affective needs) regardless of their infants’ hearing status [70] [71] . Thus, we conclude that IDS largely depends on the child given that it increases with infants’ presence and engagement, is influenced by infants’ actual preferences and vocalizations and depends on mothers’ perceptions of their infants’ overall abilities and needs.

注釈

母親は乳児の語彙マッピングの発達レベルに合わせてコミュニケーションを調整する。 異なる物体のターゲット語を乳児に教える際、母親は物体の動きや接触と同期させて、非ターゲット語よりもターゲット語を多く使用した。 乳児が年齢とともに同期への依存を減らすにつれて、母親の同期の使用も減少した [67] 。 同様に、IDS の意味内容は、0語発話から1語発話への子供の言語発達の変化と強い関係を示し [68] 、子供からの明確な非理解のシグナルは発話を短くする [69] 。 人工内耳を装用した重度聴覚障害児に向けた IDS では、母親は境界前の母音延長を、暦年齢ではなく乳児の聴覚経験(すなわち言語的ニーズ)に合わせて調整したが、IDS の韻律的特徴(すなわち情動的ニーズ)については、乳児の聴覚の状態にかかわらず全員が誇張していた [70] [71] 。 したがって、IDS は乳児の存在と関与によって増加し、乳児の実際の好みや発声に影響され、母親が知覚する乳児の全体的な能力とニーズに依存することから、IDS は大部分が子供に依存していると結論づけられる。

3.4. Do parental individual differences modify motherese quality? (両親の個人差はマザリーズの質に影響するのか)

Whether a mother had siblings could explain some individual variability in IDS, given that women who grew up with siblings were more likely to show prosodic modifications when reading picture books to a young child than those who did not have siblings [72] . Social class and socio-economic status (as measured by income and education) impact mothers’ CDS [73] [74] [75] , and this impact is mediated by parental knowledge of child development [75] . However, main effects of communicative setting (e.g., mealtime, dressing, book reading, or toy play) and the amount of time that mothers spend interacting with their children may be important influences [74] .

注釈

母親に兄弟姉妹がいるかどうかは、IDS の個人差の一部を説明するかもしれない。兄弟姉妹と一緒に育った女性は、そうでない女性に比べ、幼い子供に絵本を読み聞かせる際に韻律的な変化を示す可能性が高い [72] 。 社会階級と(収入と教育で測られた)社会経済的状態は母親の CDS に影響し [73] [74] [75] 、この影響は子供の発達に関する親の知識によって媒介される [75] 。 しかし、コミュニケーション場面(例えば、食事、着替え、絵本の読み聞かせ、おもちゃ遊び)の主効果や、母親が子供とのインタラクションに費やす時間の長さも重要な影響要因でありうる [74] 。

Neural and physiological factors may be relevant to the parenting of young children and to IDS production. When listening to IDS, mothers of preverbal infants (unlike mothers of older children) showed enhanced activation in the auditory dorsal pathway of the language areas in functional MRIs. Higher cortical activation was also found in speech-related motor areas among extroverted mothers [76] . Additionally, in the first 6 months, the maternal oxytocin level is related to the amount of affectionate parenting behavior shown, including “motherese” vocalizations, the expression of positive affect, and affectionate touch [77] . Finally, maternal pathology may affect IDS. The influence of maternal depression on IDS has been the main focus of previous studies. Results show that depressed mothers fail to modify their behavior according to the behavior of their 3- to 4-month-old infants, are slower to respond to their infants’ vocalizations, and are less likely to produce motherese [78] . Depressed mothers also speak less frequently with fewer affective and informative features with their 6- and 10-month-old infants, and the affective salience of their IDS fails to decrease over time [79] . Moreover, depressed mothers show smaller IDS f0 variance except when taking antidepressant medication and being in partial remission [80] . Mothers with schizophrenia also show less frequent use of IDS compared to other mothers with postnatal hospitalizations [81] .

注釈

神経的・生理的要因は、幼い子供の養育や IDS の産出に関係している可能性がある。 IDS を聞いている際、(年長児の母親とは異なり)まだ言葉を話さない乳児の母親では、機能的 MRI において言語野の聴覚背側経路の活性化の増強が見られた。 また、外向的な母親では、発話に関連する運動野でより高い皮質活性が見られた [76] 。 加えて、最初の6ヶ月において、母親のオキシトシンレベルは、「マザリーズ」の発声、肯定的情動の表出、愛情のこもった接触を含む、愛情のこもった養育行動の量と関連している [77] 。 最後に、母親の病理は IDS に影響しうる。 IDS に対する母親の鬱病の影響は、先行研究の主要な焦点であった。 鬱病の母親は 3〜4 ヶ月児の行動に応じて自分の行動を変化させることができず、乳児の発声への反応が遅く、マザリーズを産出しにくいことが示されている [78] 。 また、鬱病の母親は 6 ヶ月児や 10 ヶ月児に対して、情動的・情報的特徴の少ない発話をより少ない頻度で行い、IDS の情動的顕著性が時間とともに減少しない [79] 。 さらに、鬱病の母親は、抗うつ剤を服用して部分寛解にある場合を除き、IDS の f0 の分散が小さい [80] 。 統合失調症の母親も、産後に入院している他の母親と比べて IDS の使用頻度が低い [81] 。

Thus, in addition to factors associated with the infants, various maternal factors (i.e., familial, socio-economic, physiological, and pathologic) can modulate IDS production.

注釈

従って、乳児側の要因に加えて、様々な母親側の要因(すなわち、家族的、社会経済的、生理学的、病理学的要因)も IDS の産出を変化させうる。

4: Motherese effects on the infant (乳児に対するマザリーズの影響)

As hypothesized, IDS may function developmentally to communicate affect, regulate infants’ arousal and attention, and facilitate speech perception and language comprehension [16] [82] .

注釈

仮説として、IDS は発達上、情動を伝達し、乳児の覚醒と注意を調整し、音声知覚と言語理解を促進する機能を持つと考えられる [16] [82] 。

4.1. Communication of affect and physiological effects (情動の伝達と生理学的影響)

Though communication of affect is crucial with regard to communicating with very young infants without linguistic knowledge, few studies have addressed it. Despite the lack of available studies, IDS may convey mothers’ affect and influence infants’ emotions. As reported previously, prosodic patterns are more informative in IDS, and the variations in prosodic contours provide infants with reliable cues for determining their mothers’ affect and intentions. Indeed, when hearing an unfamiliar language in IDS, 5-month-olds smile more often to approvals and display negative affect in response to prohibitions, and these responses were not evident in ADS [83] . Similarly, IDS approval contours elevate infants’ looking, whereas disapproval contours inhibit infants’ looking [84] . Also, 14- to 18-month-old infants use prosody to understand intentions [85] . At a psycho-physiological level, a deceleration in heart rate was observed in 9-month-old infants listening to IDS, and EEG power, specifically in the frontal region, was linearly related to the affective intensity of IDS [86] . Finally, one study [87] reported an astonishing physiological correlation: 3- to 4-month-old infants (N=52) grew more rapidly when their primary caregivers spoke high quality/quantity IDS. This could be influenced by other intermediate physiological factors, but this work needs to be replicated.

注釈

情動の伝達は、言語知識を持たない非常に幼い乳児とのコミュニケーションにおいて極めて重要であるにもかかわらず、これを扱った研究は少ない。 利用可能な研究は少ないものの、IDS は母親の情動を伝え、乳児の情動に影響を与えている可能性がある。 これまでに述べたように、韻律パターンは IDS の方が情報量が多く、韻律輪郭の変化は、母親の情動と意図を推定するための信頼できる手がかりを乳児に提供する。 実際、なじみのない言語の IDS を聞いた際、5ヶ月児は承認に対してより多く微笑み、禁止に対しては否定的な情動を示したが、これらの反応は ADS では見られなかった [83] 。 同様に、IDS の承認の輪郭は乳児の注視を増加させ、不承認の輪郭は注視を抑制する [84] 。 また、14〜18 ヶ月児は意図の理解に韻律を利用する [85] 。 精神生理学的なレベルでは、IDS を聞いている 9 ヶ月児において心拍数の低下が観察され、特に前頭領域の EEG パワーは IDS の情動の強さと線形に関連していた [86] 。 最後に、ある研究 [87] は驚くべき生理学的相関を報告している。3〜4 ヶ月児(N=52)は、主たる養育者が質・量ともに高い IDS を話す場合に、より急速に成長した。 これは他の中間的な生理学的要因の影響を受けている可能性もあり、この研究は追試が必要である。

4.2. Facilitation of social interactions through infants’ preference for IDS

Infants prefer to listen to IDS when compared to ADS [88] , and they show greater affective responsiveness to IDS than ADS [89] . This finding is also evident for deaf infants seeing infant-directed signing [44] and even for severely handicapped older hearing children [89] . Moreover, infants remember and look longer at individuals who have addressed them with IDS [90] . Finally, this greater responsiveness makes them more attractive to naïve adults, which helps maintain positive adult-infant interactions [91] .

Infants’ preferences follow a developmental course, in that they are present from birth and may not depend on any specific postnatal experience (though prenatal auditory experience with speech may play a role). One-month-old and even newborn infants prefer the IDS from an unfamiliar woman to the ADS from the same person [92] [93] [94] . While neonates sleep, the frontal cerebral blood flow increases more with IDS than with ADS, which suggests that IDS alerts neonates’ brains to attend to utterances even during sleep [95] . IDS preferences change with development, in that the preference for IDS decreases by the end of the first year [96] [97] . Thereafter, infants may be more inconsistent, in that one study found a preference for IDS [96] but another did not [97] . Thus, more studies are needed to understand the precise course of infants’ preferences for IDS after 9 months of age. With regard to the speech of their own mothers, only 4-month-old infants (and not 1-month-olds) prefer IDS to ADS [98] , and newborns prefer their mothers’ normal speech to IDS [99] . With regard to the quality of IDS, infants’ preferences also follow a developmental course. Four-month-olds prefer slow IDS with high affect, whereas 7-month-olds prefer normal to slow IDS regardless of its affective level [100] . The developmental course of infants’ preferences is consistent with the type of affective intent used by mothers at each age [47] . The terminal falling contour of IDS (e.g., a comforting utterance) may serve to elicit a higher rate of vocal responses in 3-month-old infants [101] . Infants’ preferences shift between 3 and 6 months from comforting to approving, and between 6 and 9 months from approving to directing [102] . Rising, falling, and bell-shaped IDS contours arouse 4- to 8-month-olds’ attention [103] . However, 6-month-olds, but not 4-month-olds, are able to categorize IDS utterances into approving or comforting [104] .

Finally, adults prefer ADS (i.e., in content and prosody) to IDS [105] . What are the acoustic determinants of infants’ preference for IDS? When lexical content is eliminated, young infants show an auditory preference for the f0 patterns of IDS, but not for the amplitude (correlated to loudness) or duration patterns (rhythm) of IDS [88] [106] [107] . This pattern is consistent with the finding that infants prefer higher pitched singing [108] . However, deaf infants also show greater attention and affective responsiveness to infant-directed signing than to adult-directed signing [44] . Although an auditory stimulus with IDS characteristics was more easily detected in noise than one that resembled ADS characteristics [109] and mothers accentuate some IDS characteristics in a noisy context [97] , infants’ preference is independent of background noise [97] . Actually, IDS preference relies on a more general preference for positive affect in speech. When affect is held constant, 6-month-olds do not prefer IDS. They even prefer ADS if it contains more positive affect than IDS. Having a higher and more variable pitch is neither necessary nor sufficient for determining infants’ preferences, although f0 characteristics may modulate affect-based preferences [110] . This result may be linked with the finding that IDS’s prosody is driven by the widespread expression of emotion toward infants compared with the more inhibited manner of adult interactions [22] . However, though this issue may be very fruitful for future study, as was evident in the previous section, there is currently a lack of studies addressing the affective and emotional effects of motherese (for example, the immediate effects on infants’ expressions, variations according to infants’ age, later effects on infants’ attachment, and so on, as for mother-infant synchrony, the immediate and later effects of which are now well documented). In contrast, many studies simply address the more behavioral concept of “infants’ preference” for motherese. Finally, preferences depend on linguistic needs. Six-month-olds, for example, prefer IDS that is directed at older infants when the frequency of repeated utterances is greater, thus matching the IDS directed at younger infants [111] . This preference for repetitiveness may explain why 6-month-olds prefer audiovisual episodes of their mothers singing rather than speaking IDS [112] [113] .

In summary, the preference for IDS, which is characterized by better attention, gaze and responsiveness from infants, is less prevalent for the infant’s own mother, and is generally related to the affective intensity of the voices. Moreover, this preference is modulated by the age of the infant, which is most likely due to infants’ affective and cognitive abilities and needs.

4.3. Arousing infants’ attention and learning

IDS has arousing properties and facilitates associative learning. In contrast to ADS, IDS elicits an increase in infants’ looking time between the first and second presentations. Similarly, when alternating ADS and IDS, infants’ responses to ADS are stronger if preceded by IDS, whereas their responses to IDS are weaker if preceded by ADS [114]. In a conditioned-attention paradigm with IDS or ADS as the signal for a face, only IDS elicited a significant positive summation, and only when presented with a smiling or a sad face (not a fearful or an angry one) [115]. IDS may in fact serve as an ostensive cue, alerting a child to the referential communication directed at him or her. Eye-tracking techniques revealed that 6-month-olds followed an adult’s gaze (which is a potential communicative-referential signal) toward an object (i.e., joint attention) only when it was preceded by ostensive cues, such as IDS or a direct gaze [116]. Likewise, the prosodic pattern of motherese (which is similar to other cues such as eye contact, saying the infant’s name and contingent reactivity) triggered 14-month-olds to attend to others’ emotional expressions that were directed toward objects [117]. Thus, IDS may help infants learn about objects from others and, more specifically, about others’ feelings toward these objects, which may pave the way for developing a theory of mind and intersubjectivity.

Yet, experience-dependent processes also influence the effects of IDS. Kaplan conducted several studies using the same conditioned-attention paradigm with face reinforcers to assess how parental depression affected infants’ learning. We know that depression reduces IDS quantity [78] and quality [80] , which may explain why infants of depressed mothers do not learn from their mother’s IDS yet still show strong associative learning in response to IDS produced by an unfamiliar, non-depressed mother [118] [119] . However, this learning was poorer when maternal depression lasted longer (e.g., with 1-year-old children of mothers with perinatal onset) [120] . Nevertheless, infants of chronically depressed mothers acquired associations from the IDS of non-depressed fathers [121] . Paternal involvement may also affect infants’ responsiveness to male IDS. In contrast with infants of unmarried mothers, infants of married mothers learned in response to male IDS, especially if their mothers were depressed [122] . However, as expected, infants of depressed fathers showed poorer learning from their fathers’ IDS [123] . Finally, current mother-infant interactions influence infants’ learning from their mothers’ IDS. In fact, f0 modulations, though smaller in depressed mothers’ IDS, did not predict infants’ learning, whereas maternal sensitivity did, even when accounting for maternal depression [124] . In summary, IDS learning facilitation is affected by past and current experiences (such as long durations of time with a depressed mother, having an involved father, and having a sensitive mother).

4.4. Facilitation of language acquisition
4.4.1. Does IDS’s prosody aid in language acquisition, and, if so, how?

The supra-segmental characteristics of IDS (i.e., f0, amplitude, and duration) can facilitate syllable discrimination [125] [126]. When given vowel tokens that were drawn from either English or Japanese IDS, an algorithm successfully discovered the language-specific vowel categories, thereby reinforcing the theory that native language speech categories are acquired through distributional learning [127]. Trainor observed that, although the exaggerated pitch contours of IDS aid in the acquisition of vowel categories, the high pitch of IDS might impair infants’ ability to discriminate vowels (thereby serving a different function, such as attracting infants’ attention or aiding in their emotional communication) [128]. Nevertheless, IDS’s prosody facilitates syllabic discrimination and vowel categorization in the first 3 months.
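
As an illustration of the distributional-learning account behind [127], the following minimal sketch fits a Gaussian mixture model to unlabeled, synthetic F1/F2 vowel tokens and recovers the two underlying categories from their distribution alone. It assumes Python with NumPy and scikit-learn and uses invented formant values; it is not the incremental algorithm of [127], only a toy analogue::

   # Toy distributional learning of vowel categories from unlabeled tokens.
   # Formant values are invented for illustration only.
   import numpy as np
   from sklearn.mixture import GaussianMixture

   rng = np.random.default_rng(0)

   # Hypothetical mean formants (F1, F2 in Hz) for two vowel categories.
   category_means = np.array([[300.0, 2300.0],   # /i/-like: low F1, high F2
                              [750.0, 1200.0]])  # /a/-like: high F1, mid F2
   category_sd = np.array([60.0, 150.0])

   # 200 unlabeled tokens per category, mimicking the input a learner hears.
   tokens = np.vstack([rng.normal(loc=m, scale=category_sd, size=(200, 2))
                       for m in category_means])

   # Distributional learning: fit a two-component Gaussian mixture model.
   gmm = GaussianMixture(n_components=2, covariance_type="diag",
                         random_state=0).fit(tokens)
   print("Recovered category means (F1, F2 in Hz):")
   print(np.round(gmm.means_))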

IDS’s prosody may also help pre-linguistic infants segment speech into clausal units that carry grammatical structure, and the pitch peaks of IDS, especially at the ends of utterances, may assist in word segmentation and recognition, which facilitates speech processing. Indeed, 7- to 10-month-olds prefer listening to speech samples segmented at clause boundaries over samples with pauses inserted at within-clause locations [129], but only for IDS samples, not for ADS samples [130]. Infants can distinguish words from syllable sequences that span word boundaries after exposure to nonsense sentences spoken with IDS’s prosody, but not with ADS’s prosody [131]. Moreover, mothers of 20-month-old late talkers marked fewer nouns with a pitch peak and used more flat pitch contours than mothers of typically developing children [132]. In a review of previous research, Morgan suggested that prosody is an important contributor to early language understanding and assists infants in developing the root processes of parsing [133].

Stress information shapes how statistics are calculated from the speech input and is encoded in the representations of the parsed speech sequences. For example, to parse sequences from an artificial language, 7- and 9-month-olds adopted a stress-initial syllable strategy and appeared to encode the stress information as part of their proto-lexical representations [134]. In fluent speech, 7.5-month-olds prefer to listen to words produced with emphatic stress, although recognition was most enhanced when the degree of emphatic stress was identical during familiarization and recognition tasks [135]. Does word learning with IDS’s prosody impair word recognition in ADS? The high affective variation in IDS appears to help preverbal infants recognize repeated encounters with words, which creates both generalizable representations and phonologically precise memories for the words. Conversely, low affective variability appears to degrade word recognition in both respects, thereby compromising infants’ ability to generalize across different affective forms of a word and to detect similar-sounding items [136]. Automatic isolated-word speech recognizers trained on IDS did not always generate better recognition performance, but, for mismatched data, their relative loss in performance was less severe than that of recognizers trained on ADS, which may be due to the larger class overlaps in IDS [137]. Additionally, 7- to 8-month-old infants succeeded on word recognition tasks when words were introduced in IDS but not when they were introduced in ADS, regardless of the register of the recognition stimuli [138]. Furthermore, IDS may be more easily detected than ADS in noisy environments [109]. Finally, clarity may vary with the age of the listener. A slow speaking rate and vowel hyper-articulation improved 19-month-olds’ ability to recognize words, but a wide pitch range did not [139]. For adult listeners, words isolated from parents’ speech to their 2- to 3-year-olds were less intelligible than words produced in ADS [140].
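
As a brief methodological aside, the vowel hyper-articulation mentioned in connection with [139] is often summarized as the area of the /i/-/a/-/u/ triangle in F1-F2 space, with a larger area indicating more peripheral vowels. The sketch below computes that area with the shoelace formula; the formant values are invented placeholders, not data from any cited study::

   # Quantifying vowel hyper-articulation as the /i/-/a/-/u/ triangle area
   # in F1-F2 space (shoelace formula). Formant values are invented.
   def vowel_space_area(corner_vowels):
       """Polygon area (Hz^2) of corner-vowel means given as (F1, F2) pairs."""
       area = 0.0
       n = len(corner_vowels)
       for i in range(n):
           f1_a, f2_a = corner_vowels[i]
           f1_b, f2_b = corner_vowels[(i + 1) % n]
           area += f1_a * f2_b - f1_b * f2_a
       return abs(area) / 2.0

   # Hypothetical mean formants (Hz) for /i/, /a/, /u/ in ADS and in IDS.
   ads_corners = [(320, 2200), (750, 1250), (360, 900)]
   ids_corners = [(290, 2450), (820, 1300), (330, 780)]  # more peripheral

   print("ADS vowel-space area (Hz^2):", vowel_space_area(ads_corners))
   print("IDS vowel-space area (Hz^2):", vowel_space_area(ids_corners))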

Thus, IDS’s prosody facilitates vowel categorization, syllabic discrimination, speech segmentation in words and grammatical units, and word recognition. Moreover, IDS’s prosody may serve as an attentional spotlight that increases brain activity to potentially meaningful words [141] . Indeed, event-related potentials increased more for IDS than ADS (only in response to familiar words for 6-month-olds and to unfamiliar words for 13-month-olds).

4.4.2. Do the linguistic properties of IDS aid in language acquisition, and, if so, how?

In response to both Chomsky’s view that motherese is a form of degenerate speech and the resulting theoretical impetus toward nativist explanations of language acquisition, several researchers have sought evidence that language input to children is highly structured and possibly highly informative for the learner. There has been a lively debate between the proponents of motherese as a useful tool for language acquisition and those who contend that it does not aid language acquisition. First, Newport [142] claimed that motherese is not a syntax-teaching language, given that it may be an effect rather than a cause of language learning. Newport and colleagues found few correlations between the syntax evident in caregivers’ speech and language development. Responding to Furrow [143], one study with two groups of age-matched children (18- to 21-month-olds and 24- to 27-month-olds) also found few effects of the syntax of mothers’ IDS on children’s language growth, with the few effects found restricted to the very young age group, suggesting that the complexity of maternal speech is positively correlated with child language growth only within that narrow age range [144]. Scarborough [145] also found that maternal speech type did not influence language development.

However, other studies that considered children’s level of language at the time of maternal speech assessment found a relationship between the semantic and syntactic categories of maternal IDS and children’s language development. Several characteristics (e.g., MLU and pronoun use) of mothers’ IDS with their 18-month-olds predicted the children’s subsequent speech at 27 months; specifically, the mothers’ choice of simple constructions facilitated language growth [143] [146]. Rowe, in a study controlling for toddlers’ previous vocabulary abilities, found that CDS at 30 months of age predicted children’s vocabulary ability one year later [75]. As early as 13 months of age, pre-existing differences were found between mothers of earlier and later talkers. When individual differences in style of language acquisition (i.e., expressive versus non-expressive styles) were examined, several associations emerged in the “non-expressive” group between the type of IDS at 13 months of age and the mean length of utterance at 20 months of age [147].

Which linguistic characteristics of motherese may aid in language acquisition? First, the statistically prominent structural properties of CDS that may facilitate language acquisition are present in realistic CDS corpora [148]. In particular, the partial overlap of successive utterances, which is well known in CDS, enhances adults’ acquisition of syntax in an artificial language [149]. CDS contains isolated words and short, frequently used sentence frames. A familiar sentence context may aid in word acquisition, given that 18-month-olds are slower to interpret target words (i.e., familiar object names) in isolation than when these words are preceded by a familiar carrier phrase [150]. The tendency in IDS to put target words in sentence-final positions may help infants segment the linguistic stream: when hearing IDS in Chinese, English-speaking adults learned the target words only when the words were placed in the final position, but not when they were placed in a medial position [151]. Finally, the use of diminutives (a pervasive feature of CDS that is evident in many languages) facilitates word segmentation in adults hearing an unfamiliar language [152] [153], and enhances gender categorization [154] and gender agreement even in languages that use few diminutives [155].

In summary, these results support the idea that both prosodic and linguistic aspects of IDS play an important role in language acquisition. One possibility is that prosodic components play a major part in the very early stages of language acquisition, while linguistic aspects play an increasingly important part later in development, once children gain some verbal abilities.

Discussion (ディスカッション)
1: Summary

Our review has some limitations. Some studies may not have been identified because they were not indexed in our two databases. Studies without significant results may not have been published (a risk of publication bias). Moreover, some results of the included studies should be considered with caution because they have not been replicated or derive from small samples of participants. Some highlights nevertheless emerge from this review. IDS transcends specific languages. Mothers, fathers, grandmothers and other caregivers all modify their speech when addressing infants, and infants demonstrate a preference for IDS. Nonetheless, various factors related either to the caregiver or to the infant influence the quality of the IDS. Although present from birth, IDS, like infants’ preference for IDS, follows a developmental course that can be influenced by the infant’s experience (see Kaplan’s work). IDS consists of linguistic and supra-linguistic modifications. The linguistic modifications include shorter utterances, vocabulary and syntactic simplifications, and the use of diminutives and repetition, all of which are designed to facilitate comprehension and aid in language acquisition. Prosodic modifications may serve more ubiquitous functions. Using a higher pitch matches infants’ preferences, and using a wider f0 range may facilitate infants’ arousal and learning. Prosodic contours convey caregivers’ affect and intentions, and some of these contours stimulate infants’ responsiveness. Finally, exaggerated pitch contours and phonetic modifications facilitate vowel discrimination, phonetic categorization, speech segmentation, word recognition and syntax acquisition.

2: Positioning IDS within a More Global Communication Phenomenon

We observed that mothers adjust their IDS to their infants’ abilities. From a broader communication perspective, IDS may be part of a more general phenomenon of adaptation to a partner during communication. First, other cases of speech adjustment to the listener exist. Adults simplify their vocabulary choices when speaking with children who are up to 12 years of age [156]. In speech directed at elderly adults, CDS (which clarifies instructions by giving them in an attention-getting manner) is often used and may improve elderly adults’ performance and arousal in difficult tasks [157]. Even in normal ADS, new words are highlighted with prosodic cues. In both IDS and ADS, repeated words are shorter, quieter, lower pitched, and less variable in pitch than the same words at their first mention, and they are placed in less prominent positions relative to new words in the same utterance [5]. Even in master–dog dyads, the structural properties of “doggerel” (pet-directed speech) are strikingly similar to the structural properties of motherese except in functional and social areas [158]. Second, speakers other than human mothers and caregivers adjust their speech to infants. Four-year-old children modify some of their prosodic characteristics when speaking to infants, in that they speak more slowly, tend to lower their f0, and change their amplitude variability [159]. The linguistic content of educational children’s programs also generally follows the linguistic constraints and adjustments that are evident in adults’ CDS [160]. The use of IDS by humans has been compared with the “caregiver call” (which is almost exclusively infant-directed) of squirrel monkeys, in which the variability of several acoustic features, most notably pitch range and contour, is associated with particular contexts of infant care, such as nursing or retrieval [161]. Similarly, tamarins are calmed by music with the “acoustical characteristics of tamarin affiliation vocalizations” [162]. In a comparison of the mother-infant gestural and vocal interactions of chimpanzees and humans, Falk [163] suggested that pre-linguistic vocal substrates for motherese evolved as females gave birth to relatively undeveloped neonates and adopted new strategies that entailed maternal quieting, reassuring, and controlling of the behaviors of physically removed infants (who were unable to cling to their mothers’ bodies). The characteristic vocal melodies of human mothers’ speech to infants might thus be biologically relevant signals that have been shaped by natural selection [164], placing IDS within a more general field of human and nonhuman communication.

3: Integrating IDS into the Nature of Mother-Infant Interactions

IDS implies emotion sharing, mother-infant adjustment, synchrony and multimodal communication. Indeed, IDS is part of a multimodal, synchronous communication style used with infants to sustain interactions and highlight messages. Mothers support their vocal communication with other modalities (e.g., gestural, tactile, and visual). At a gestural level (“gesturese”), mothers of 16- and 20-month-old infants employ mainly concrete deictic gestures (e.g., pointing) that are redundant with the message being conveyed in speech to disambiguate and emphasize the verbal utterance. Moreover, children’s verbal and gestural productions and vocabulary size may be correlated with maternal gesture production [165] [166]. Mothers’ demonstrations of the properties of novel objects to infants are higher in interactiveness, enthusiasm, proximity to the partner, range of motion, repetitiveness and simplicity than their demonstrations to adults, thereby indicating that mothers modify their infant-directed actions in ways that likely maintain infants’ attention and highlight the structure and meaning of an action [167]. Moreover, mothers’ singing and synchronous behaviors with the beat (“songese”) segment the temporal structure of the interaction, such that 3- to 8-month-old infants show sensitivity to their mothers’ emphasis by producing more synchronous behaviors on some beats than on others. The multimodal sensory information provided by mothers shares the characteristics of “motherese” and may ensure effective learning in infants [168]. Mothers also use contingency and synchrony (both intrapersonal and interpersonal) to reinforce dialogues and exchanges. By highlighting focal words using the nonlinguistic contextual information that is available to the listener and by producing frequent repetitions and formulaic utterances, IDS may be a form of “hyper-speech” that facilitates comprehension by modifying the phonetic properties of individual words and providing contextual support on perceptual levels that are accessible to infants even in the earliest stages of language learning [169]. Pragmatic dimensions of IDS may provide contingent support that assists in language comprehension and acquisition. In a case study, both parents used approximately equal amounts of language with their infant, but the functions of the mother’s speech differed importantly from those of the father’s speech in that it provided more interactive negotiations, which could be crucial to language development [170]. Thus, IDS appears to be part of a maternal interactive style that supports the affective and verbal communication systems of the developing infant.

IDS should be regarded as an emotional form of speech. Several studies highlight the impact of emotion on both motherese production and its effects, particularly with regard to prosodic characteristics that are conditioned by vocal emotions [22] . In general, acoustic analyses of f0 are positively associated with subjective judgments of emotion [171] . Thus, prosody (which is linked with f0 values and contours) reveals affective quantity and quality. The literature on infants’ perception of facial and vocal expressions indicates that infants’ recognition of affective expressions relies first on multimodally presented information, then on recognition of vocal expressions and finally on facial expressions [172] . Moreover, IDS’s affective value determines infants’ preferences [110] . Therefore, mothers’ affective pathologies, which include maternal depression, alter motherese and impair infants’ conditioned learning with IDS. Could IDS, music and emotion be linked before birth through prenatal associations between a mother’s changing emotional state, concomitant changes in hormone levels in the placental blood and prenatally audible sounds? These links may be responsible for infants’ sensitivity to motherese and music [173] .

Finally, IDS highlights mother-infant adjustments during interactions. Mothers adjust their IDS to infants’ age, cognitive abilities and linguistic level. Therefore, IDS may arouse infants’ attention by signaling speech that is specifically addressed to them, with content and form that are adapted for them. Mothers also adapt their IDS to infants’ reactivity and preferences. Mothers’ continuous adjustments to their infants result in the facilitation of exchanges and interactions, with positive consequences for sharing emotions and for learning and language acquisition. Thus, maternal sensitivity predicts infants’ learning better than f0 ranges do [124] . Infants’ reactivity is also important given that their presence increases motherese [9] , and infants’ positive, contingent feedback makes them more attractive [91] , which in turn increases the quality of the motherese [57] [58] . Mother-infant contingency and synchrony are crucial for IDS production and prolongation.

In :num:`Figure #summary-of-the-motherese-interactive-loop`, we summarize the main points discussed above. We suggest that motherese mediates and reflects an interactive loop between the infant and the caregiver, such that each person’s response may increase the initial stimulation of the other partner. This interactive loop at the behavioral level is underpinned by the emotional charge of the affective level and, at the cognitive level, it affects attention, learning and the construction of intersubjective tools, such as joint attention and communicative skills. Direct evidence of this intertwinement of the cognitive and interactive levels is offered by Kuhl’s finding that infants’ learning of the phonetic properties of a language requires interaction with a live linguistic partner [174], as audiovisual input alone is insufficient. Regarding this impact of social interaction on natural speech and language learning, Kuhl wondered whether the underlying mechanism could be increased motivation, the enriched information that social settings provide, or a combination of both factors [175]. Given that autistic children and children raised in social deprivation do not develop normal language, Kuhl suggested that the social brain “gates” language acquisition. As an outcome of our review, we suggest that the co-construction that emerges from reciprocal infant-maternal adaptation and reinforcement via the interactive loop could be crucial to the development of infants’ cognitive and verbal abilities, which would be consistent with humans’ fundamental social nature.

Conclusion (結論)

Some authors have held the view that, beyond language acquisition, IDS significantly influences infants’ cognitive and emotional development (e.g., [4] [176]). Our systematic review supports this view. However, more studies are needed to understand how IDS affects infants’ emotional states and how these effects are linked with infants’ cognitive development. An interesting approach may be to investigate how this process is altered by infants’ communicative difficulties, such as early signs of autism spectrum disorder, and how these alterations may affect infants’ development [177].

_images/fig22.png

Summary of the motherese interactive loop (2A) and its socio-cognitive implications (2B).

  • 2A: The motherese interactive loop implies that motherese is both a vector and a reflection of mother-infant interaction.
  • 2B: Motherese affects intersubjective construction and learning. Its implications for infants’ early socio-cognitive development are evident in affect transmission and sharing, and in infants’ preferences, engagement, attention, learning and language acquisition.

doi: 10.1371/journal.pone.0078103.g002

注釈

Supporting Information

Annex S1. Rejected papers and reasons for their exclusion. (DOCX)

Checklist S1. PRISMA Checklist. (DOCX)

注釈

Author Contributions

Conceived and designed the experiments: DC MC CSG MCL FM. Performed the experiments: CSG RC AM FA. Analyzed the data: CSG MC MCL DC. Contributed reagents/materials/analysis tools: AM MC. Wrote the manuscript: CSG DC MC AM FA MCL RC FM.

References (参照)
[1]Saxton M (2008) What’s in a name? Coming to terms with the child’s linguistic environment. J Child Lang 35: 677-686. PubMed: 18588720.
[2]Ferguson CA (1964) Baby Talk in Six Languages. Am Anthropol 66: 103-114. doi:10.1525/aa.1964.66.suppl_3.02a00060.
[3](1, 2, 3) Soderstrom M, Morgan JL (2007) Twenty-two-month-olds discriminate fluent from disfluent adult-directed speech. Dev Sci 10: 641-653. doi: 10.1111/j.1467-7687.2006.00605.x. PubMed: 17683348.
[4](1, 2) Snow CE, Ferguson CA (1977) Talking to children. Cambridge, UK: Cambridge University Press.
[5](1, 2, 3, 4, 5) Fisher C, Tokura H (1995) The given-new contract in speech to infants. J Mem Lang 34: 287-310. doi:10.1006/jmla.1995.1013.
[6](1, 2, 3, 4) Grieser DL, Kuhl PK (1988) Maternal speech to infants in a tonal language: Support for universal prosodic features in motherese. Dev Psychol 24: 14-20. doi:10.1037/0012-1649.24.1.14.
[7](1, 2, 3, 4, 5) Soderstrom M, Blossom M, Foygel R, Morgan JL (2008) Acoustical cues and grammatical units in speech to two preverbal infants. J Child Lang 35: 869-902. doi:10.1017/S0305000908008763. PubMed: 18838016.
[8](1, 2) Durkin K, Rutter DR, Tucker H (1982) Social interaction and language acquisition: Motherese help you. First Lang 3: 107-120. doi: 10.1177/014272378200300803.
[9](1, 2, 3, 4, 5) Fernald A, Simon T (1984) Expanded intonation contours in mothers’ speech to newborns. Dev Psychol 20: 104-113. doi: 10.1037/0012-1649.20.1.104.
[10]Ogle SA, Maidment JA (1993) Laryngographic analysis of child-directed speech. Eur J Disord Commun 28: 289-297. doi:10.1111/j. 1460-6984.1993.tb01570.x. PubMed: 8241583.
[11](1, 2, 3, 4, 5, 6) Fernald A, Taeschner T, Dunn J, Papousek M, de Boysson-Bardies B et al. (1989) A cross-language study of prosodic modifications in mothers’ and fathers’ speech to preverbal infants. J Child Lang 16: 477-501. doi:10.1017/S0305000900010679. PubMed: 2808569.
[12](1, 2) Niwano K, Sugai K (2003) Pitch Characteristics of Speech During Mother-Infant and Father-Infant Vocal Interactions. Jpn J Spec Educ 40: 663-674.
[13](1, 2) Shute B, Wheldall K (1999) Fundamental frequency and temporal modifications in the speech of British fathers to their children. Educ Psychol 19: 221-233. doi:10.1080/0144341990190208.
[14](1, 2) Shute B, Wheldall K (2001) How do grandmothers speak to their grandchildren? Fundamental frequency and temporal modifications in the speech of British grandmothers to their grandchildren. Educ Psychol 21: 493-503. doi:10.1080/01443410120090858.
[15](1, 2) Nwokah EE (1987) Maidese versus motherese–is the language input of child and adult caregivers similar? Lang Speech 30 ( 3): 213-237. PubMed: 3503947.
[16](1, 2, 3, 4) Fernald A (1989) Intonation and communicative intent in mothers’ speech to infants: is the melody the message? Child Dev 60: 1497-1510. doi:10.2307/1130938. PubMed: 2612255.
[17](1, 2) Bryant GA, Barrett HC (2007) Recognizing intentions in infant-directed speech: evidence for universals. Psychol Sci 18: 746-751. doi: 10.1111/j.1467-9280.2007.01970.x. PubMed: 17680948.
[18]Katz GS, Cohn JF, Moore CA (1996) A combination of vocal fo dynamic and summary features discriminates between three pragmatic categories of infant-directed speech. Child Dev 67: 205-217. doi: 10.1111/j.1467-8624.1996.tb01729.x. PubMed: 8605829.
[19](1, 2) Stern DN, Spieker S, MacKain K (1982) Intonation contours as signals in maternal speech to prelinguistic infants. Dev Psychol 18: 727-735. doi:10.1037/0012-1649.18.5.727.
[20](1, 2) Papoušek M, Papoušek H, Symmes D (1991) The meanings of melodies in motherese in tone and stress languages. Infant Behav Dev 14: 415-440. doi:10.1016/0163-6383(91)90031-M.
[21](1, 2) Slaney M, McRoberts G (2003) BabyEars: A recognition system for affective vocalizations. Speech Commun 39: 367-384. doi:10.1016/ S0167-6393(02)00049-3.
[22](1, 2, 3, 4) Trainor LJ, Austin CM, Desjardins RN (2000) Is infant-directed speech prosody a result of the vocal expression of emotion? Psychol Sci 11: 188-195. doi:10.1111/1467-9280.00240. PubMed: 11273402.
[23](1, 2) Nwokah EE, Hsu HC, Davies P, Fogel A (1999) The integration of laughter and speech in vocal communication: a dynamic systems perspective. J Speech Lang Hear Res 42: 880-894. PubMed: 10450908.
[24](1, 2) Burnham D, Kitamura C, Vollmer-Conna U (2002) What’s new, pussycat? On talking to babies and animals. Science 296: 1435-1435. doi:10.1126/science.1069587. PubMed: 12029126.
[25](1, 2) Swanson LA, Leonard LB, Gandour J (1992) Vowel duration in mothers’ speech to young children. J Speech Hear Res 35: 617-625. PubMed: 1608253.
[26](1, 2) Swanson LA, Leonard LB (1994) Duration of function-word vowels in mothers’ speech to young children. J Speech Hear Res 37: 1394-1405. PubMed: 7877296.
[27](1, 2) Fernald A, Mazzie C (1991) Prosody and focus in speech to infants and adults. Dev Psychol 27: 209-221. doi:10.1037/0012-1649.27.2.209.
[28](1, 2) Albin DD, Echols CH (1996) Stressed and word-final syllables in infantdirected speech. Infant Behav Dev 19: 401-418. doi:10.1016/ S0163-6383(96)90002-8.
[29]Inoue T, Nakagawa R, Kondou M, Koga T, Shinohara K (2011) Discrimination between mothers’ infant and adult-directed speech using hidden Markov models. Neurosci Res, 70: 62–70. PubMed: 21256898.
[30]Mahdhaoui A, Chetouani M, Cassel RS, Saint-Georges C, Parlato E, et al. (2011) Computerized home video detection for motherese may help to study impaired interaction between infants who become autistic and their parents. Int J Methods Psychiatr Res 20: e6-e18. doi:10.1002/mpr. 332. PubMed: 21574205.
[31]Cristià A (2010) Phonetic enhancement of sibilants in infant-directed speech. J Acoust Soc Am 128: 424-434. doi:10.1121/1.3436529. PubMed: 20649236.
[32](1, 2) Englund K, Behne D (2006) Changes in Infant Directed Speech in the First Six Months. Infant Child Dev 15: 139-160. doi:10.1002/icd.445.
[33](1, 2) Englund KT, Behne DM (2005) Infant directed speech in natural interaction–Norwegian vowel quantity and quality. J Psycholinguist Res 34: 259-280. doi:10.1007/s10936-005-3640-7. PubMed: 16050445.
[34](1, 2) Englund K (2005) Voice onset time in infant directed speech over the first six months. First Lang 25: 219-234. doi: 10.1177/0142723705050286.
[35]Green JR, Nip IS, Wilson EM, Mefferd AS, Yunusova Y (2010) Lip movement exaggerations during infant-directed speech. J Speech Lang Hear Res 53: 1529-1542. doi:10.1044/1092-4388(2010/09-0005) PubMed: 20699342
[36](1, 2, 3, 4) Lee SA, Davis B, Macneilage P (2010) Universal production patterns and ambient language influences in babbling: a cross-linguistic study of Korean and English-learning infants. J Child Lang England, 37: 293-318. PubMed: 19570317.
[37](1, 2, 3, 4) Shute B, Wheldall K (1989) Pitch alterations in British motherese: some preliminary acoustic data. J Child Lang 16: 503-512. doi:10.1017/ S0305000900010680. PubMed: 2808570.
[38](1, 2) Segal O, Nir-Sagiv B, Kishon-Rabin L, Ravid D (2009) Prosodic patterns in Hebrew child-directed speech. J Child Lang 36: 629-656. doi:10.1017/S030500090800915X. PubMed: 19006600.
[39](1, 2) Lee S, Davis BL, Macneilage PF (2008) Segmental properties of input to infants: a study of Korean. J Child Lang 35: 591-617. PubMed: 18588716.
[40]Grieser DL, Kuhl K (1988) Maternal speech to infants in a tonal language: Support for universal prosodic features in motherese. Dev Psychol 24: 14-20. doi:10.1037/0012-1649.24.1.14.
[41](1, 2, 3) Liu HM, Tsao FM, Kuhl PK (2007) Acoustic analysis of lexical tone in Mandarin infant-directed speech. Dev Psychol 43: 912-917. doi: 10.1037/0012-1649.43.4.912. PubMed: 17605524.
[42](1, 2, 3, 4) Fais L, Kajikawa S, Amano S, Werker JF (2010) Now you hear it, now you don’t: vowel devoicing in Japanese infant-directed speech. J Child Lang England, 37: 319-340. PubMed: 19490747.
[43](1, 2) Masataka N (1992) Motherese in a signed language. Infant Behav Dev 15: 453-460. doi:10.1016/0163-6383(92)80013-K.
[44](1, 2, 3, 4) Masataka N (1998) Perception of motherese in Japanese sign language by 6-month-old hearing infants. Dev Psychol 34: 241-246. doi:10.1037/0012-1649.34.2.241. PubMed: 9541776.
[45](1, 2, 3, 4) Reilly JS, Bellugi U (1996) Competition on the face: affect and language in ASL motherese. J Child Lang 23: 219-239. PubMed: 8733568.
[46](1, 2) Werker JF, Pons F, Dietrich C, Kajikawa S, Fais L et al. (2007) Infantdirected speech supports phonetic category learning in English and Japanese. Cognition 103: 147-162. doi:10.1016/j.cognition. 2006.03.006. PubMed: 16707119.
[47](1, 2, 3, 4, 5) Kitamura C, Burnham D (2003) Pitch and communicative intent in mother’s speech: Adjustments for age and sex in the first year. Infancy 4: 85-110. doi:10.1207/S15327078IN0401_5.
[48](1, 2) Stern DN, Spieker S, Barnett RK, MacKain K (1983) The prosody of maternal speech: infant age and context related changes. J Child Lang 10: 1-15. PubMed: 6841483.
[49](1, 2) Liu HM, Tsao FM, Kuhl PK (2009) Age-related changes in acoustic modifications of Mandarin maternal speech to preverbal infants and five-year-old children: a longitudinal study. J Child Lang 36: 909-922. doi:10.1017/S030500090800929X. PubMed: 19232142.
[50](1, 2) Niwano K, Sugai K (2002) Intonation contour of Japanese maternal infant-directed speech and infant vocal response. Jpn J Spec Educ 39: 59-68.
[51](1, 2) Kajikawa S, Amano S, Kondo T (2004) Speech overlap in Japanese mother-child conversations. J Child Lang 31: 215-230. doi:10.1017/ S0305000903005968. PubMed: 15053091.
[52](1, 2) Amano S, Nakatani T, Kondo T (2006) Fundamental frequency of infants’ and parents’ utterances in longitudinal recordings. J Acoust Soc Am 119: 1636-1647. doi:10.1121/1.2161443. PubMed: 16583908.
[53](1, 2) Snow CE (1972) Mothers’ speech to children learning language. Child Dev 43: 549-565. doi:10.2307/1127555.
[54](1, 2) Kitamura C, Thanavishuth C, Burnham D, Luksaneeyanawin S (2002) Universality and specificity in infant-directed speech: Pitch modifications as a function of infant age and sex in a tonal and nontonal language. Infant Behav Dev 24: 372-392.
[55]Feldman R (2007) Parent–infant synchrony and the construction of shared timing; physiological precursors, developmental outcomes, and risk conditions. J Child Psychol Psychiatry 48: 329-354. doi:10.1111/j. 1469-7610.2006.01701.x. PubMed: 17355401.
[56](1, 2) Gleason JB (1977) Talking to children: some notes on feed-back. In: F Sa. Talking to Children: Language Input and Acquisition. Cambridge, UK: Cambridge University Press.
[57](1, 2, 3) Braarud HC, Stormark KM (2008) Prosodic modification and vocal adjustments in mothers’ speech during face-to-face interaction with their two to four-month-old infants: A double video study. Soc Dev 17: 1074-1084. doi:10.1111/j.1467-9507.2007.00455.x.
[58](1, 2, 3) Smith NA, Trainor LJ (2008) Infant-directed speech is modulated by infant feedback. Infancy 13: 410-420. doi: 10.1080/15250000802188719.
[59](1, 2) Shimura Y, Yamanoucho I (1992) Sound spectrographic studies on the relation between motherese and pleasure vocalization in early infancy. Acta Paediatr Jpn 34: 259-266. doi:10.1111/j.1442-200X. 1992.tb00956.x. PubMed: 1509871.
[60](1, 2) Van Puyvelde M, Vanfleteren P, Loots G, Deschuyffeleer S, Vinck B et al. (2010) Tonal synchrony in mother-infant interaction based on harmonic and pentatonic series. Infant Behav Dev 33: 387-400. doi: 10.1016/j.infbeh.2010.04.003. PubMed: 20478620.
[61](1, 2) McRoberts GW, Best CT (1997) Accommodation in mean f0 during mother-infant and father-infant vocal interactions: a longitudinal case study. J Child Lang 24: 719-736. doi:10.1017/S030500099700322X. PubMed: 9519592.
[62](1, 2) Reissland N, Stephenson T (1999) Turn-taking in early vocal interaction: a comparison of premature and term infants’ vocal interaction with their mothers. Child Care Health Dev 25: 447-456. doi: 10.1046/j.1365-2214.1999.00109.x. PubMed: 10547707.
[63](1, 2) Niwano K, Sugai K (2003) Maternal accommodation in infant-directed speech during mother’s and twin-infants’ vocal interactions. Psychol Rep 92: 481-487. doi:10.2466/pr0.2003.92.2.481. PubMed: 12785629.
[64](1, 2) Reissland N, Shepherd J, Cowie L (2002) The melody of surprise: maternal surprise vocalizations during play with her infant. Infant Child Dev 11: 271-278. doi:10.1002/icd.258.
[65]Lederberg AR (1984) Interaction between deaf preschoolers and unfamiliar hearing adults. Child Dev 55: 598-606. doi:10.2307/1129971. PubMed: 6723449.
[66]Fidler DJ (2003) Parental vocalizations and perceived immaturity in down syndrome. Am J Ment Retard 108: 425-434. doi: 10.1352/0895-8017(2003)108. PubMed: 14561106.
[67]Gogate LJ, Bahrick LE, Watson JD (2000) A study of multimodal motherese: The role of temporal synchrony between verbal labels and gestures. Child Dev 71: 878-894. doi:10.1111/1467-8624.00197. PubMed: 11016554.
[68]Kavanaugh RD, Jirkovsky AM (1982) Parental speech to young children: A longitudinal analysis. Merrill-Palmer. Q: J of Dev Psych 28: 297-311.
[69]Bohannon JN, Marquis AL (1977) Children’s control of adult speech. Child Dev 48: 1002-1008. doi:10.2307/1128352.
[70]Bergeson TR, Miller RJ, McCune K (2006) Mothers’ Speech to Hearing-Impaired Infants and Children With Cochlear Implants. Infancy 10: 221-240. doi:10.1207/s15327078in1003_2.
[71]Kondaurova MV, Bergeson TR (2010) The effects of age and infant hearing status on maternal use of prosodic cues for clause boundaries in speech. J Speech Lang Hear Res
[72](1, 2) Ikeda Y, Masataka N (1999) A variable that may affect individual differences in the child-directed speech of Japanese women. Jpn Psychol Res 41: 203-208. doi:10.1111/1468-5884.00120.
[73](1, 2) Hoff E, Tian C (2005) Socioeconomic status and cultural influences on language. J Commun Disord 38: 271-278. doi:10.1016/j.jcomdis. 2005.02.003. PubMed: 15862810.
[74](1, 2, 3, 4) Hoff-Ginsberg E (1991) Mother-child conversation in different social classes and communicative settings. Child Dev 62: 782-796. doi: 10.2307/1131177. PubMed: 1935343.
[75](1, 2, 3, 4, 5) Rowe ML (2008) Child-directed speech: relation to socioeconomic status, knowledge of child development and child vocabulary skill. J Child Lang 35: 185-205. PubMed: 18300434.
[76](1, 2) Matsuda YT, Ueno K, Waggoner RA, Erickson D, Shimura Y et al. (2011) Processing of infant-directed speech by adults. NeuroImage 54: 611-621. doi:10.1016/j.neuroimage.2010.07.072. PubMed: 20691794.
[77](1, 2) Gordon I, Zagoory-Sharon O, Leckman JF, Feldman R (2010) Oxytocin and the development of parenting in humans. Biol Psychiatry 68: 377-382. doi:10.1016/j.biopsych.2010.02.005. PubMed: 20359699.
[78](1, 2, 3) Bettes BA (1988) Maternal depression and motherese: temporal and intonational features. Child Dev 59: 1089-1096. doi:10.2307/1130275. PubMed: 3168616.
[79](1, 2) Herrera E, Reissland N, Shepherd J (2004) Maternal touch and maternal child-directed speech: effects of depressed mood in the postnatal period. J Affect Disord 81: 29-39. doi:10.1016/j.jad. 2003.07.001. PubMed: 15183597.
[80](1, 2, 3) Kaplan PS, bachorowski J-A, Smoski MJ, Zinser M (2001) Role of clinical diagnosis and medication use in effects of maternal depression on infant-directed speech. Infancy 2: 537-548. doi:10.1207/ S15327078IN0204_08.
[81](1, 2) Wan MW, Penketh V, Salmon MP, Abel KM (2008) Content and style of speech from mothers with schizophrenia towards their infants. Psychiatry Res 159: 109-114. doi:10.1016/j.psychres.2007.05.012. PubMed: 18329722.
[82](1, 2) McLeod PJ (1993) What studies of communication with infants ask us about psychology: Baby-talk and other speech registers. Canadian Psychology/Psychologie canadienne 34: 282-292
[83](1, 2) Fernald A (1993) Approval and disapproval: Infant responsiveness to vocal affect in familiar and unfamiliar languages. Child Dev 64: 657-674. doi:10.2307/1131209. PubMed: 8339687.
[84](1, 2) Papoušek M, Bornstein MH, Nuzzo C, Papoušek H (1990) Infant responses to prototypical melodic contours in parental speech. Infant Behav Dev 13: 539-545. doi:10.1016/0163-6383(90)90022-Z.
[85](1, 2) Elena S, Merideth G (2012) Infants infer intentions from prosody. Cogn Dev 27: 1-16. doi:10.1016/j.cogdev.2011.08.003.
[86](1, 2) Santesso DL, Schmidt LA, Trainor LJ (2007) Frontal brain electrical activity (EEG) and heart rate in response to affective infant-directed (ID) speech in 9-month-old infants. Brain Cogn 65: 14-21. doi:10.1016/ j.bandc.2007.02.008. PubMed: 17659820.
[87](1, 2) Monnot M (1999) Function of infant-directed speech. Hum Nat 10: 415-443. doi:10.1007/s12110-999-1010-0.
[88](1, 2) Fernald A, Kuhl P (1987) Acoustic determinants of infant preference for motherese speech. Infant Behav Dev 10: 279-293. doi: 10.1016/0163-6383(87)90017-8.
[89](1, 2) Santarcangelo S, Dyer K (1988) Prosodic aspects of motherese: effects on gaze and responsiveness in developmentally disabled children. J Exp Child Psychol 46: 406-418. doi:10.1016/0022-0965(88)90069-0. PubMed: 3216186.
[90]Schachner A, Hannon EE (2010) Infant-directed speech drives social preferences in 5-month-old infants. Dev Psychol, 47: 19–25. PubMed: 20873920.
[91](1, 2) Werker JF, McLeod PJ (1989) Infant preference for both male and female infant-directed talk: a developmental study of attentional and affective responsiveness. Can J Psychol 43: 230-246. doi:10.1037/ h0084224. PubMed: 2486497.
[92]Cooper RP (1993) The effect of prosody on young infants’ speech perception. Advances Infancy Res 8: 137-167.
[93]Cooper RP, Aslin RN (1990) Preference for infant-directed speech in the first month after birth. Child Dev 61: 1584-1595. doi: 10.2307/1130766. PubMed: 2245748.
[94]Pegg JE, Werker JF, McLeod PJ (1992) Preference for infant-directed over adult-directed speech: Evidence from 7-week-old infants. Infant Behav Dev 15: 325-345. doi:10.1016/0163-6383(92)80003-D.
[95]Saito Y, Aoyama S, Kondo T, Fukumoto R, Konishi N et al. (2007) Frontal cerebral blood flow change associated with infant-directed speech. Arch Dis Child Fetal Neonatal Ed England, 92: F113-F116. PubMed: 16905571.
[96](1, 2) Hayashi A, Tamekawa Y, Kiritani S (2001) Developmental change in auditory preferences for speech stimuli in Japanese infants. J Speech Lang Hear Res 44: 1189-1200. doi:10.1044/1092-4388(2001/092). PubMed: 11776357.
[97](1, 2, 3, 4) Newman RS, Hussain I (2006) Changes in Preference for InfantDirected Speech in Low and Moderate Noise by 4.5 to 13-Month-Olds. Infancy 10: 61-76. doi:10.1207/s15327078in1001_4.
[98]Cooper RP, Abraham J, Berman S, Staska M (1997) The development of infants’ preference for motherese. Infant Behav Dev 20: 477-488. doi:10.1016/S0163-6383(97)90037-0.
[99]Hepper PG, Scott D, Shahidullah S (1993) Newborn and fetal response to maternal voice. J Reprod Infant Psychol 11: 147-153. doi: 10.1080/02646839308403210.
[100]Panneton R, Kitamura C, Mattock K, Burnham D (2006) Slow Speech Enhances Younger But Not Older Infants’ Perception of Vocal Emotion. Res Hum Dev 3: 7-19. doi:10.1207/s15427617rhd0301_2.
[101]Niwano K, Sugai K (2002) Acoustic determinants eliciting Japanese infants’ vocal response to maternal speech. Psychol Rep 90: 83-90. doi:10.2466/pr0.2002.90.1.83. PubMed: 11899017.
[102]Kitamura C, Lam C (2009) Age-specific preferences for infant-directed affective intent. Infancy 14: 77-100. doi:10.1080/15250000802569777.
[103]Kaplan PS, Owren MJ (1994) Dishabituation of visual attention in 4month-olds by infant-directed frequency sweeps. Infant Behav Dev 17: 347-358. doi:10.1016/0163-6383(94)90027-2.
[104]Spence MJ, Moore DS (2003) Categorization of infant-directed speech: development from 4 to 6 months. Dev Psychobiol 42: 97-109. doi: 10.1002/dev.10093. PubMed: 12471640.
[105]Johnson DM, Dixon DR, Coon RC, Hilker K, Gouvier WD (2002) Watch what you say and how you say it: differential response to speech by participants with and without head injuries. Appl Neuropsychol 9: 58-62. doi:10.1207/S15324826AN0901_7. PubMed: 12173751.
[106]Cooper RP, Aslin RN (1994) Developmental differences in infant attention to the spectral properties of infant-directed speech. Child Dev 65: 1663-1677. doi:10.2307/1131286. PubMed: 7859548.
[107]Leibold LJ, Werner LA (2007) Infant auditory sensitivity to pure tones and frequency-modulated tones. Infancy 12: 225-233. doi:10.1111/j. 1532-7078.2007.tb00241.x.
[108]Trainor LJ, Zacharias CA (1998) Infants prefer higher-pitched singing. Infant Behav Dev 21: 799-805. doi:10.1016/S0163-6383(98)90047-9.
[109](1, 2) Colombo J, Frick JE, Ryther JS, Coldren JT (1995) Infants’ detection of analogs of ‘motherese’ in noise. Merrill-Palmer. Q: J of Dev Psych 41: 104-113.
[110](1, 2) Singh L, Morgan JL, Best CT (2002) Infants’ Listening Preferences: Baby Talk or Happy Talk? Infancy 3: 365-394. doi:10.1207/ S15327078IN0303_5.
[111]McRoberts GW, McDonough C, Lakusta L (2009) The role of verbal repetition in the development of infant speech preferences from 4 to 14 months of age. Infancy 14: 162-194. doi:10.1080/15250000802707062.
[112]Nakata T, Trehub SE (2004) Infants’ responsiveness to maternal speech and singing. Infant Behav Dev 27: 455-464. doi:10.1016/ j.infbeh.2004.03.002.
[113]Trehub SE, Nakata T (2001) Emotion and music in infancy. Musicae Sci Spec Issue: 2001-2002: 37-61
[114]Kaplan PS, Goldstein MH, Huckeby ER, Cooper RP (1995) Habituation, sensitization, and infants’ responses to motherese speech. Dev Psychobiol 28: 45-57. doi:10.1002/dev.420280105. PubMed: 7895923.
[115]Kaplan PS, Jung PC, Ryther JS, Zarlengo-Strouse P (1996) Infantdirected versus adult-directed speech as signals for faces. Dev Psychol 32: 880-891. doi:10.1037/0012-1649.32.5.880.
[116]Senju A, Csibra G (2008) Gaze following in human infants depends on communicative signals. Curr Biol 18: 668-671. doi:10.1016/j.cub. 2008.03.059. PubMed: 18439827.
[117]Gergely G, Egyed K, Király I (2007) On pedagogy. Dev Sci 10: 139-146. doi:10.1111/j.1467-7687.2007.00576.x. PubMed: 17181712.
[118]Kaplan PS, Bachorowski JA, Smoski MJ, Hudenko WJ (2002) Infants of depressed mothers, although competent learners, fail to learn in response to their own mothers’ infant-directed speech. Psychol Sci 13: 268-271. doi:10.1111/1467-9280.00449. PubMed: 12009049.
[119]Kaplan PS, Bachorowski JA, Zarlengo-Strouse P (1999) Child-directed speech produced by mothers with symptoms of depression fails to promote associative learning in 4-month-old infants. Child Dev 70: 560-570. doi:10.1111/1467-8624.00041. PubMed: 10368910.
[120]Kaplan PS, Danko CM, Diaz A, Kalinka CJ (2010) An associative learning deficit in 1-year-old infants of depressed mothers: Role of depression duration. Infant Behav Dev.
[121]Kaplan PS, Dungan JK, Zinser MC (2004) Infants of chronically depressed mothers learn in response to male, but not female, infantdirected speech. Dev Psychol 40: 140-148. doi: 10.1037/0012-1649.40.2.140. PubMed: 14979756.
[122]Kaplan PS, Danko CM, Diaz A (2010) A Privileged Status for Male Infant-Directed Speech in Infants of Depressed Mothers? Role of Father Involvement. Infancy 15: 151-175. doi:10.1111/j. 1532-7078.2009.00010.x.
[123]Kaplan PS, Sliter JK, Burgess AP (2007) Infant-directed speech produced by fathers with symptoms of depression: effects on infant associative learning in a conditioned-attention paradigm. Infant Behav Dev 30: 535-545. doi:10.1016/j.infbeh.2007.05.003. PubMed: 17604106.
[124](1, 2) Kaplan PS, Burgess AP, Sliter JK, Moreno AJ (2009) Maternal Sensitivity and the Learning-Promoting Effects of Depressed and NonDepressed Mothers’ Infant-Directed Speech. Infancy 14: 143-161. doi: 10.1080/15250000802706924. PubMed: 20046973.
[125]Karzon RG (1985) Discrimination of polysyllabic sequences by one to four-month-old infants. J Exp Child Psychol 39: 326-342. doi: 10.1016/0022-0965(85)90044-X. PubMed: 3989467.
[126]Karzon RG, Nicholas JG (1989) Syllabic pitch perception in 2 to 3month-old infants. Percept Psychophys 45: 10-14. doi:10.3758/ BF03208026. PubMed: 2913563.
[127]Vallabha GK, McClelland JL, Pons F, Werker JF, Amano S (2007) Unsupervised learning of vowel categories from infant-directed speech. Proc Natl Acad Sci U S A 104: 13273-13278. doi:10.1073/pnas. 0705369104. PubMed: 17664424.
[128]Trainor LJ, Desjardins RN (2002) Pitch characteristics of infant-directed speech affect infants’ ability to discriminate vowels. Psychon Bull Rev 9: 335-340. doi:10.3758/BF03196290. PubMed: 12120797.
[129]Hirsh-Pasek K, Kemler Nelson DG, Jusczyk PW, Cassidy KW (1987) Clauses are perceptual units for young infants. Cognition 26: 269-286. doi:10.1016/S0010-0277(87)80002-1. PubMed: 3677573.
[130]Kemler Nelson DG, Hirsh-Pasek K, Jusczyk PW, Cassidy KW (1989) How the prosodic cues in motherese might assist language learning. J Child Lang 16: 55-68. doi:10.1017/S030500090001343X. PubMed: 2925815.
[131]Thiessen ED, Hill EA, Saffran JR (2005) Infant-Directed Speech Facilitates Word Segmentation. Infancy 7: 53-71. doi:10.1207/ s15327078in0701_5.
[132]D’Odorico L, Jacob V (2006) Prosodic and lexical aspects of maternal linguistic input to late-talking toddlers. Int J Lang Commun Disord 41: 293-311. doi:10.1080/13682820500342976. PubMed: 16702095.
[133]Morgan JL (1996) Prosody and the roots of parsing. Lang Cogn Processes 11: 69-106. doi:10.1080/016909696387222.
[134]Curtin S, Mintz TH, Christiansen MH (2005) Stress changes the representational landscape: evidence from word segmentation. Cognition 96: 233-262. doi:10.1016/j.cognition.2004.08.005. PubMed: 15996560.
[135]Bortfeld H, Morgan JL (2010) Is early word-form processing stress-full? How natural variability supports recognition. Cogn Psychol 60: 241-266. doi:10.1016/j.cogpsych.2010.01.002. PubMed: 20159653.
[136]Singh L (2008) Influences of high and low variability on infant word recognition. Cognition 106: 833-870. doi:10.1016/j.cognition. 2007.05.002. PubMed: 17586482.
[137]Kirchhoff K, Schimmel S (2005) Statistical properties of infant-directed versus adult-directed speech: insights from speech recognition. J Acoust Soc Am 117: 2238-2246. doi:10.1121/1.1869172. PubMed: 15898664.
[138]Singh L, Nestor S, Parikh C, Yull A (2009) Influences of infant-directed speech on early word recognition. Infancy 14: 654-666. doi: 10.1080/15250000903263973.
[139]Song JY, Demuth K, Morgan J (2010) Effects of the acoustic properties of infant-directed speech on infant word recognition. J Acoust Soc Am 128: 389-400. doi:10.1121/1.3419786. PubMed: 20649233.
[140]Bard EG, Anderson AH (1983) The unintelligibility of speech to children. J Child Lang 10: 265-292. PubMed: 6874768.
[141]Zangl R, Mills DL (2007) Increased Brain Activity to Infant-Directed Speech in 6-and 13-Month-Old Infants. Infancy 11: 31-62. doi:10.1207/ s15327078in1101_2.
[142]Newport E, Gleitman H, Gleitman L (1977) Mother, I’d rather do it myself: Some effects and non-effects of maternal speech style. . In: CSaC, Ferguson. Talking to Children: Language Input and Acquisition. Cambridge: Cambridge University Press.
[143](1, 2) Furrow D, Nelson K, Benedict H (1979) Mothers’ speech to children and syntactic development: Some simple relationships. J Child Lang 6: 423-442. PubMed: 536408.
[144]Gleitman LR, Newport EL, Gleitman H (1984) The current status of the motherese hypothesis. J Child Lang 11: 43-79. PubMed: 6699113.
[145]Scarborough H, Wyckoff J (1986) Mother, I’d still rather do it myself: some further non-effects of ‘motherese’. J Child Lang 13: 431-437. PubMed: 3745343.
[146]Furrow D, Nelson K (1986) A further look at the motherese hypothesis: a reply to Gleitman, Newport & Gleitman. J Child Lang 13: 163-176. PubMed: 3949896.
[147]Hampson J, Nelson K (1993) The relation of maternal language to variation in rate and style of language acquisition. J Child Lang 20: 313-342. PubMed: 8376472.
[148]Waterfall HR, Sandbank B, Onnis L, Edelman S (2010) An empirical generative framework for computational modeling of language acquisition. J Child Lang 37: 671-703. doi:10.1017/ S0305000910000024. PubMed: 20420744.
[149]Onnis L, Waterfall HR, Edelman S (2008) Learn locally, act globally: learning language from variation set cues. Cognition 109: 423-430. doi: 10.1016/j.cognition.2008.10.004. PubMed: 19019350.
[150]Fernald A, Hurtado N (2006) Names in frames: infants interpret words in sentence frames faster than words in isolation. Dev Sci 9: F33-F40. doi:10.1111/j.1467-7687.2005.00460.x. PubMed: 16669790.
[151]Golinkoff RM, Alioto A (1995) Infant-directed speech facilitates lexical learning in adults hearing Chinese: implications for language acquisition. J Child Lang 22: 703-726. PubMed: 8789520.
[152]Kempe V, Brooks PJ, Gillis S (2005) Diminutives in child-directed speech supplement metric with distributional word segmentation cues. Psychon Bull Rev 12: 145-151. doi:10.3758/BF03196360. PubMed: 15945207.
[153]Kempe V, Brooks PJ, Gillis S, Samson G (2007) Diminutives facilitate word segmentation in natural speech: cross-linguistic evidence. Mem Cogn 35: 762-773. doi:10.3758/BF03193313. PubMed: 17848033.
[154]Kempe V, Brooks PJ, Mironova N, Fedorova O (2003) Diminutivization supports gender acquisition in Russian children. J Child Lang 30: 471-485. doi:10.1017/S0305000903005580. PubMed: 12846306.
[155]Seva N, Kempe V, Brooks PJ, Mironova N, Pershukova A et al. (2007) Crosslinguistic evidence for the diminutive advantage: gender agreement in Russian and Serbian children. J Child Lang 34: 111-131. doi:10.1017/S0305000906007720. PubMed: 17340940.
[156]Hayes DP, Ahrens MG (1988) Vocabulary simplification for children: a special case of ‘motherese’? J Child Lang 15: 395-410. doi:10.1017/ S0305000900012411. PubMed: 3209647.
[157]Bunce VL, Harrison DW (1991) Child or adult-directed speech and esteem: effects on performance and arousal in elderly adults. Int J Aging Hum Dev 32: 125-134. doi:10.2190/JKA7-15D2-0DFT-U934. PubMed: 2055658.
[158]Hirsh-Pasek K, Treiman R (1982) Doggerel: motherese in a new context. J Child Lang 9: 229-237. PubMed: 7061632.
[159]Weppelman TL, Bostow A, Schiffer R, Elbert-Perez E, Newman RS (2003) Children’s use of the prosodic characteristics of infant-directed speech. Lang Commun 23: 63-80. doi:10.1016/ S0271-5309(01)00023-4.
[160]Rice ML, Haight PL (1986) “Motherese” of Mr. Rogers: a description of the dialogue of educational television programs. J Speech Hear Disord 51: 282-287. PubMed: 3736028.
[161]Biben M, Symmes D, Bernhards D (1989) Contour variables in vocal communication between squirrel monkey mothers and infants. Dev Psychobiol 22: 617-631. doi:10.1002/dev.420220607. PubMed: 2792572.
[162]Snowdon CT, Teie D (2010) Affective responses in tamarins elicited by species-specific music. Biol Lett 6: 30-32. doi:10.1098/rsbl.2009.0593. PubMed: 19726444.
[163]Falk D (2004) Prelinguistic evolution in early hominins: whence motherese? Behav Brain Sci 27: 483-503. PubMed: 15773427.
[164]Fernald A (1992) Human maternal vocalizations to infants as biologically relevant signals: An evolutionary perspective. In: Barkow, Cosmides, Tooby, editors. The Adapted Mind. pp. 391-428.
[165]Iverson JM, Capirci O, Longobardi E, Caselli MC (1999) Gesturing in mother–child interactions. Cogn Dev 14: 57-75. doi:10.1016/ S0885-2014(99)80018-5.
[166]O’Neill M, Bard KA, Linnell M, Fluck M (2005) Maternal gestures with 20-month-old infants in two contexts. Dev Sci 8: 352-359. doi:10.1111/j. 1467-7687.2005.00423.x. PubMed: 15985069.
[167]Brand RJ, Baldwin DA, Ashburn LA (2002) Evidence for ‘motionese’: Modifications in mothers’ infant-directed action. Dev Sci 5: 72-83. doi: 10.1111/1467-7687.00211.
[168]Longhi E (2009) ‘Songese’: Maternal structuring of musical interaction with infants. Psychol Music 37: 195-213. doi: 10.1177/0305735608097042.
[169]Fernald A (2000) Speech to infants as hyperspeech: knowledge-driven processes in early word recognition. Phonetica 57: 242-254. doi: 10.1159/000028477. PubMed: 10992144.
[170]Matychuk P (2005) The role of child-directed speech in language acquisition: A case study. Lang Sci 27: 301-379. doi:10.1016/j.langsci. 2004.04.004.
[171]Monnot M, Orbelo D, Riccardo L, Sikka S, Rossa E (2003) Acoustic analyses support subjective judgments of vocal emotion. Ann N Y Acad Sci 1000: 288-292. PubMed: 14766639.
[172]Walker-Andrews AS (1997) Infants’ perception of expressive behaviors: differentiation of multimodal information. Psychol Bull 121: 437-456. doi:10.1037/0033-2909.121.3.437. PubMed: 9136644.
[173]Parncutt R (2009) Prenatal and infant conditioning, the mother schema, and the origins of music and religion. Musicae Sci special issue: 119-150.
[174]Kuhl PK, Tsao FM, Liu HM (2003) Foreign-language experience in infancy: effects of short-term exposure and social interaction on phonetic learning. Proc Natl Acad Sci U S A United States, 100: 9096-9101. PubMed: 12861072.
[175]Kuhl PK (2007) Is speech learning gated by the social brain? Dev Sci 10: 110-120. doi:10.1111/j.1467-7687.2007.00572.x. PubMed: 17181708.
[176]Papousek H, Papousek M (1983) Biological basis of social interactions: implications of research for an understanding of behavioural deviance. J Child Psychol Psychiatry 24: 117-129. doi:10.1111/j. 1469-7610.1983.tb00109.x. PubMed: 6826670.
[177]Cohen D, Cassel RS, Saint-Georges C, Mahdhaoui A, Laznik M-C, et al. (2013) Do Parentese Prosody and Fathers’ Involvement in Interacting Facilitate Social Interaction in Infants Who Later Develop Autism? PLOS ONE 8: e61402. doi:10.1371/journal.pone.0061402. PubMed: 23650498.

affiliation

  • (1, 2, 3) Department of Child and Adolescent Psychiatry, Pitié-Salpêtrière Hospital, Université Pierre et Marie Curie, Paris, France
  • (1, 2, 3, 4) Institut des Systèmes Intelligents et de Robotique, Centre National de la Recherche Scientifique 7222, Université Pierre et Marie Curie, Paris, France
  • Laboratoire de Psychopathologie et Processus de Santé (LPPS, EA 4057), Institut de Psychologie de l’Université Paris Descartes, Paris, France
  • (1, 2) IRCCS Scientific Institute Stella Maris, University of Pisa, Pisa, Italy
  • Department of Child and Adolescent Psychiatry, Association Santé Mentale du 13ème, Centre Alfred Binet, Paris, France

Phonological theory informs the analysis of intonational exaggeration in Japanese infant-directed speech

  • Yosuke Igarashi - Graduate School of Letters, Hiroshima University, - 1-2-3 Kagamiyama, Higashihiroshima-shi,Hiroshima 739-8522, Japan
  • Ken’ya Nishikawa, Kuniyoshi Tanaka, and Reiko Mazuka - Laboratory for Language Development, Brain Science Institute, RIKEN, - 2-1 Hirosawa, Wako-shi,Saitama 351-0198, Japan
abstract

To date, the intonation of infant-directed speech (IDS) has been analyzed without reference to its phonological structure. Intonational phonology should, however, inform IDS research, discovering important properties that have previously been overlooked. The present study investigated “intonational exaggeration” in Japanese IDS using the intonational phonological framework. Although intonational exaggeration, which is most often measured by pitch-range expansion, is one of the best-known characteristics of IDS, Japanese has been reported to lack such exaggeration. The present results demonstrated that intonational exaggeration is in fact present and observed most notably at the location of boundary pitch movements, and that the effects of lexical pitch accents in the remainder of the utterances superficially mask the exaggeration. These results not only reveal dynamic aspects of Japanese IDS, but also in turn contribute to the theory of intonational phonology, suggesting that paralinguistic pitch-range modifications most clearly emerge where the intonation system of a language allows maximum flexibility in varying intonational contours.

I.INTRODUCTION
A.Background

During the past few decades, studies of intonation in various languages have demonstrated that the intonation patterns of a language possess a phonological organization (Pierrehumbert, 1980; Ladd, 1996). Research on typologically different languages has revealed that each language’s way of organizing intonation reflects both universal and language-specific properties (Gussenhoven, 2004; Jun, 2005). To date, the theoretical frameworks that have been presented for intonational phonology have been largely based on the analysis of speech produced in idealized conditions, such as controlled laboratory speech. In natural speech, however, which occurs in live communication among speakers under a variety of conditions and contexts, para- or extra-linguistic factors such as the emotional intent of speakers and the cognitive constraints of speakers or listeners can impact intonation. In many languages, for example, fundamental frequency (F0) is raised when the speaker is angry and lowered when s/he is sad (e.g., Williams and Stevens, 1972). Utterances spoken by individuals with autism reportedly have monotonous F0 contours (cf. McCann and Peppe, 2003). These variations in F0 contour can be seen as phonetic implementations of a phonological representation of intonation (Gussenhoven, 2004). Better understanding of these factors will increase our knowledge of intonational phonology in general. To fully capture the nature of the intonation of human language, therefore, it will be useful to further expand the scope of research beyond idealized speech and include more speech data produced under various real-life conditions.

To this end, analyses of how intonation is modified in specialized registers of speech, such as infant-directed speech (IDS), could shed new light on intonation. As discussed further below, intonation of IDS tends to be “exaggerated.” The fact that the intonation of a language can be modified systematically in a given register indicates that the system of intonation is dynamic; that is, it can shift the realization of intonation dynamically to accommodate the specific paralinguistic factors of a register. Such dynamic properties, which must be one of the ways the phonological structure of intonation is implemented phonetically, cannot be observed in idealized speech. In turn, the development of an intonational phonological framework gives IDS research (and language acquisition research broadly) a tool to capture the universal and language-specific ways intonation is modified in IDS. In the present paper, specifically, we will demonstrate that exaggeration of intonation is actually present in Japanese IDS, a fact which has been overlooked in previous studies. This discovery was made possible only when the phonological structure of Japanese intonation was taken into consideration.

B.Infant-directed speech

IDS is known to play several important roles in communication between infants and caregivers, such as capturing the infants’ attention, communicating affect, and facilitating the infants’ language development as a result of certain distinctive properties (e.g., Fernald, 1989; Kitamura et al., 2002). An extensive body of research has examined in what ways IDS differs from adult-directed speech (ADS) among the world’s languages (cf. Soderstrom, 2007, for a review). Of the various properties of IDS, modification of intonation is arguably the best known. Specifically, caregivers are reported to use “exaggerated” or “sing-songy” intonation in IDS across many languages, and this intonation is often argued to be one of the universal properties of IDS (Ferguson, 1977; Fernald et al., 1989; Grieser and Kuhl, 1988; Kitamura et al., 2002). Intonational exaggeration in IDS, which is defined in general as the expansion of the pitch range of utterances as compared to ADS, is an example of the dynamic properties of the intonation structures, as discussed above, where the realization of intonation is modified or “stretched” from a typical (ADS) intonation by the paralinguistic factor of the caregiver’s attempt to communicate with the infant.

Interestingly, however, significant cross-linguistic differences are also known to exist in the characteristics of IDS (Bernstein Ratner and Pye, 1984; Fernald et al., 1989; Grieser and Kuhl, 1988; Kitamura et al., 2002; Papousek et al., 1991). Fernald et al. (1989) compared intonational modifications in six languages/varieties (French, Italian, German, British English, American English, and Japanese), and found that all of them except Japanese showed pitch-range expansion in IDS. Fernald (1993) also reported that although English-learning infants responded appropriately to positive and negative emotional prosody in IDS in English, German, and Italian, they failed to do so with Japanese IDS. These findings suggest that intonational modifications in Japanese IDS may differ substantially from those of Germanic and Romance languages.

It might be possible for differences in intonational modifications in IDS between languages to arise from cultural differences in mother–infant interaction (Bornstein et al., 1992; Fernald and Morikawa, 1993; Ingram, 1995; Toda et al., 1990). For example, Fernald and Morikawa (1993) revealed that Japanese mothers interact with their infants quite differently from their American counterparts. The implication of this account is that intonation in Japanese IDS is not exaggerated. For Japanese adults, however, the intonational characteristics of Japanese IDS are clearly distinct from those of ADS, and to them, perceptually, it does sound exaggerated (Horie et al., 2008). For Japanese infants as well, behavioral studies show that Japanese IDS is preferred over ADS (Hayashi et al., 2001). It is possible that these Japanese adults and infants were responding to characteristics of Japanese IDS not related to its intonation (cf. Inoue et al., 2011). Nonetheless, the fact that the infants’ data are comparable to findings from among English-learning infants (Cooper et al., 1997; Newman and Hussain, 2006; Pegg et al., 1992) and that the pitch characteristics of IDS seem to be the critical factor in infants’ preferences both in English-learning infants (Fernald and Kuhl, 1987) and Japanese-learning infants (Hayashi, 2004), is sufficient to prompt us to explore the possibility that Japanese IDS is in fact exaggerated.

C.The present study

In the present paper, we will pursue an alternative account of the apparent lack of intonational exaggeration in Japanese IDS. Given that the intonation of a language has a phonological organization (Ladd, 1996; Gussenhoven, 2004; Jun, 2005), it is reasonable to assume that the phonology of a language restricts the way the realization of intonation is affected by paralinguistic factors. For example, languages differ in the way, and the degree to which, pitch changes are utilized lexically as opposed to intonationally. In previous studies, pitch-range expansion in IDS in tone languages, such as Chinese and Thai, has been demonstrated to be significantly smaller than that in English (Grieser and Kuhl, 1988; Kitamura et al., 2002; Papousek et al., 1991), arguably because lexical use of pitch in a language restricts the flexibility to exaggerate pitch range in IDS. Japanese, which is not a tone language but a pitch accent language (Pierrehumbert and Beckman, 1988) may represent a case in which utterance-level differences in intonation due to the speech register (ADS vs IDS) are superficially camouflaged by the impact of lexical pitch accent.

Previous studies have described the intonational modification of IDS without reference to the internal phonological structure of intonation: They have simply measured the mean, minimum, and maximum values of the F0 contour of the overall utterance (e.g., Ferguson, 1977; Fernald et al., 1989; Grieser and Kuhl, 1988; Papousek et al., 1991; Kitamura et al., 2002). However, if the mechanism of register-induced modification of intonation differs across languages due to language-specificity in intonational phonology, then these conventional measurements may not have been sufficient to capture the nature of intonational exaggeration for each language. In the present study, we analyze a large corpus of IDS in Japanese (Mazuka et al., 2006), and demonstrate (1) that Japanese mothers in fact expand their pitch ranges when they talk to their infants; (2) that this pitch-range expansion or intonational exaggeration in Japanese IDS is observed locally at specific structural positions; and (3) that the profile of the intonational exaggeration cannot be captured unless the phonological structure of Japanese intonation is taken into account.

The structure of this paper is as follows. In Sec. II, we will first describe the phonological structure of Japanese intonation and discuss the phonological entities that should be considered for the valid measurement of pitch modification in Japanese. Section III describes the design and characteristics of the Japanese IDS corpus we used for the analysis, and Sec. IV describes the results of the analysis. Finally, in Sec. V, we will discuss the significance of our results in light of the universal and language-specific characteristics of IDS.

II.THE JAPANESE INTONATION SYSTEM
A.The phonology of Japanese intonation

The description of the Japanese intonation system in this paper is based on the framework called X-JToBI (Maekawa et al., 2002). It is an extended version of the original Japanese Tone and Break Indices, or J_ToBI, framework (Venditti, 2005), which owes its theoretical foundation to the major study of Japanese intonation by Pierrehumbert and Beckman (1988). In this section, we will describe two aspects of Japanese intonation that are relevant to the present study: (1) Lexical pitch accent (and related phenomena) and (2) boundary pitch movement (BPM).

2.BPM

The second major element of Japanese intonation is the inventory of BPMs. BPMs are tones that can occur at the end of an AP and contribute to the pragmatic interpretation of the phrase, indicating features such as questioning, emphasis, and continuation (Venditti et al., 2008). Not all APs have BPMs; in fact, most APs are not marked by BPMs. The occurrence of BPMs is not restricted to IP-final or utterance-final APs. They can also occur at the end of IP-medial APs.

The inventory of BPMs indicated in the X-JToBI system is H% (rise), LH% (scooped rise), HL% (rise-fall), and HLH% (rise-fall-rise), as well as their variations (Maekawa et al., 2002). Figure 4 depicts these four main types of BPM. As can be seen from the figure, all types of BPM begin with a rise. In most cases, the rise starts around the onset of the AP-final mora.

Which pragmatic intention each of the BPMs conveys is not without controversy. Briefly speaking, H% gives prominence to the constituent with which it associates, LH% is generally exploited to signal a question, and HL% is often used in a context where a speaker is explaining some point to a listener (Venditti et al., 2008). HLH% occurs quite infrequently, and, according to Venditti et al. (2008), it gives a wheedling or cajoling quality to the utterance. In Figs. 2 and 3 above, the final morae of the utterances are marked by an LH% (scooped rise) BPM.

_images/fig13.png

注釈

FIG. 1.

Waveforms and F0 contours of the unaccented accentual phrase (AP) amai ame “sweet candy” (left) and the accented AP uma’i ame “tasty candy” (right). Vertical lines indicate AP boundaries. The adjective amai “sweet” and the noun ame “candy” in the phrase amai ame “sweet candy” (left) are both lexically specified as unaccented, while the adjective uma’i “tasty” in the phrase uma’i ame “tasty candy” (right) is lexically specified as accented on the second mora /ma/, which exhibits an F0 fall starting near its end.

_images/fig23.png

注釈

FIG. 2.

Waveform and F0 contour of utterances without and with downstep: an utterance without downstep yubiwa-o wasure-ta onna’-wa dare-desu-ka? “Who is the woman that left the ring behind?” (top), and an utterance with downstep on the third AP yubiwa-o era’nda onna’-wa da’re-desu-ka? “Who is the woman that chose the ring?” (bottom). Dotted vertical lines stand for AP boundaries, and solid vertical lines for IP boundaries. In the utterance in the top panel, an accented AP onna’-wa “woman-TOP” follows the unaccented APs yubiwa-o “ring-ACC” and wasure-ta “left behind,” and thus downstep does not occur. In the utterance in the bottom panel, on the other hand, the accented AP onna’-wa “woman-TOP” follows an accented AP era’nda “chose.” The F0 peak of the AP onna’-wa “woman-TOP” is therefore reduced due to downstep caused by the preceding accented AP. In both utterances, pitch range is reset at the beginning of the last AP da’re-desu-ka “who-COPULA-Q,” whose peak is as high as that in the preceding AP. We can therefore posit an IP boundary between onna’-wa and da’re-desu-ka “who-COPULA-Q” in both utterances.

_images/fig31.png

注釈

FIG. 3.

Waveform and F0 contour of an utterance with four successive downsteps, ao’i ya’ne-no ie’-o era’nda onna’-wa da’re-desu-ka? “Who is the woman who chose the house with a blue roof?” Vertical lines indicate AP boundaries and solid vertical lines indicate IP boundaries. Five accented APs, ao’i ya’ne-no ie-o era’nda onna’-wa “blue roof-GEN house-ACC woman-TOP” constitute a single IP. The F0 peaks of the last four APs are iteratively lowered, i.e., downstepped, until pitch reset occurs at the beginning of the following IP. Also, anticipatory raising is observed in the F0 peak of the first AP, which is notably high.

_images/fig41.png

注釈

FIG. 4.

Waveforms and F0 contours of the four BPM types—H%, LH%, HL%, and HLH%—appearing in the same utterance I’ma-ne… “Just now…”. Squares mark the BPMs.

B.Possible impacts of pitch accent and BPM on pitch range

There are two elements that can impact pitch ranges in Japanese intonation: Lexical pitch accent (and associated downstep and anticipatory raising) and BPMs. As will be clearer below, each of these two factors functions to expand the pitch range for different reasons. This leads us to hypothesize (1) that the intonational exaggeration in IDS most clearly occurs in those tones that the speaker chooses for pragmatic reasons, which only occur at BPMs in Japanese, (2) that pitch ranges can also be enlarged by pitch accents, whose effect grows as the IPs become longer and the number of accents within the IP increases, and (3) that the interaction of these two factors can superficially mask the intonational exaggeration that may exist in Japanese IDS.

We begin by discussing the first part of this hypothesis. When speakers express pragmatic intent, they may do so by varying the intonation contour. Languages do not, however, typically permit speakers to vary F0 contours without limits—rather, they are restricted by the phonology of the language (cf. Ladd, 1996). In other words, in order to convey specific pragmatic information, speakers are free to choose from an inventory of intonational/pragmatic tones (Ward and Hirschberg, 1985; Pierrehumbert and Hirschberg, 1990), but such pragmatically chosen tones are structured according to the phonology of the particular language. It follows not only that the inventory of these pragmatic tones is languagespecific, but also that the locations in the utterance in which pragmatically chosen tones appear differ across languages. Given that variability in the f0 contour is restricted by the phonology of the language, it is reasonable to assume that intonational exaggeration will be most clearly observed at those locations where the intonation system of a language allows maximum flexibility in varying F0 contours (i.e., the location where pragmatically chosen tones can appear), rather than to assume that it is uniformly distributed over the utterance.

Cross-linguistic differences in the locations of pragmatic tones (as well as in their inventory) may become clear by comparing the intonation system of Japanese to that of English. Figure 5(a) shows the finite-state grammar for English of Pierrehumbert (1980), which can generate all the intonational contours in that language. In this language, all of the tonal categories (pitch accent, phrase accent, and boundary tone) involve a choice of tones. For each type of structure speakers choose one of these alternatives in order to convey different types of pragmatic information (Ward and Hirschberg, 1985; Pierrehumbert and Hirschberg, 1990), and thus these tones are pragmatically chosen ones.

The structure of Japanese intonation (Pierrehumbert and Beckman, 1988; Maekawa et al., 2002; Venditti, 2005) is summarized in the finite-state grammar in Fig. 5(b), which shows that except for the end of the phrase, the F0 contour does not allow much variability. The only tonal options available in this part of the utterance involve the presence or absence of H*+L (a lexical pitch accent). As discussed above, however, this is an intrinsic part of the lexical representation of a word and does not vary according to the speaker’s pragmatic intent.

The contour at the end of the phrase, in contrast, can be much more variable. This is the location allocated to various types of BPM. Unlike lexical pitch accent, the choice of BPMs depends on pragmatic factors. When speakers wish to express some pragmatic information, they have a choice as to whether they assign a BPM to a given phrase and as to what type of BPM they use.
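As a rough illustration of this contrast, the toy sketch below enumerates the tone strings a single AP can carry under the simplified reading of Fig. 5(b) given here: the BODY tones are essentially fixed (apart from the lexically determined presence or absence of H*+L), and the only pragmatically free choice is whether to add a BPM and which one. This is only a schematic illustration, not the full X-JToBI inventory; all names in the snippet are my own.

```python
from itertools import product

# Toy enumeration of the tone strings of a single accentual phrase (AP) under
# the simplified grammar sketched above: a fixed BODY skeleton (%L, H-, an
# optional lexical H*+L, L%) plus an optional, pragmatically chosen BPM.
# Illustrative only; the real X-JToBI inventory is richer.
ACCENT_OPTIONS = [(), ("H*+L",)]                            # unaccented vs. accented word
BPM_OPTIONS = [(), ("H%",), ("LH%",), ("HL%",), ("HLH%",)]  # no BPM vs. one of the four BPMs

def ap_tunes():
    """Yield every tone string the toy grammar allows for one AP."""
    for accent, bpm in product(ACCENT_OPTIONS, BPM_OPTIONS):
        yield ("%L", "H-") + accent + ("L%",) + bpm

for tune in ap_tunes():
    print(" ".join(tune))
```

Running the sketch prints ten tunes, differing only in the lexical accent slot and the AP-final BPM slot, which is where the pragmatic choice lies.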

If, as we hypothesized above, intonational exaggeration in IDS should emerge at the pragmatically chosen tones, then in English it should appear anywhere in the contour: It should be observed not only in stressed syllables where pragmatic pitch accents occur but also at the edges of phrases where phrasal accents and boundary tones appear. In Japanese, by contrast, exaggeration should be confined to a more restricted part of the contour, namely, in the BPM at the end of the AP, which is the only location where pragmatically chosen tones are realized in this language.

In Japanese, the section of the F0 contour which does not include the BPM, which we will refer to as the BODY (we treat %L, H-, H*+L, and L% collectively as the BODY), is largely determined by the lexical specifications of the words in the phrase, and thus any register-induced pitch-range expansion in the BODY is expected to be of minimum magnitude for this language. This prediction is consistent with the findings mentioned above from tone languages such as Chinese and Thai, in which lexically specified tones are densely distributed in utterances and register-induced pitch-range expansion has been shown to be significantly smaller than that in English (Grieser and Kuhl, 1988; Kitamura et al., 2002; Papousek et al., 1991). Thus, the BODY in Japanese patterns with utterances in tone languages with respect to flexibility in varying F0 contours.

We now turn to the second part of our hypothesis, which concerns the effects of pitch accents, and of the accompanying downstep and anticipatory raising, in enlarging the pitch range of the BODY.

One of the effects lexical pitch accent has on pitch range is that it lowers the minimum point of the F0 contour and thus enlarges the overall pitch range of a single AP. As can be seen from Fig. 1, the pitch range of the accented AP uma’i ame (right) is larger than that of the unaccented AP amai ame (left). Downstep is also responsible for the enlargement of the pitch range. Although the primary consequence of downstep is the lowering of the F0 peaks (H- or H*+L) of APs, as shown in Fig. 3, F0 valleys (L%) are also lowered, although to a lesser degree. Because of this lowering of F0 valleys, the pitch range of the IP overall becomes larger every time an accentual fall occurs. Finally, when anticipatory raising occurs in association with downstep, it may also impact the pitch-range expansion. As the number of pitch accents in an IP increases, the F0 peak of the initial AP becomes higher and consequently the pitch range of the IP is expected to grow. It is important to note here that pitch-range expansion brought about by factors associated with lexical pitch accent occurs even in the most idealized speech, and therefore intonation is not “exaggerated” here.
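A small numeric illustration of this mechanism, with made-up F0 values rather than corpus data: anticipatory raising keeps the initial peak high while successive accentual falls push the valleys lower, so the max/min ratio, and hence the semitone range, grows with the number of accents.

```python
import math

def range_st(f_max_hz, f_min_hz):
    """Pitch range in semitones: 12 * log2(max / min)."""
    return 12 * math.log2(f_max_hz / f_min_hz)

# Hypothetical F0 values (not from the corpus): the anticipatory-raised initial
# peak stays near 300 Hz, while successive accentual falls (downstep) push the
# phrase-final valley from 180 Hz down to 150 Hz, widening the BODY range.
print(round(range_st(300, 180), 1))  # ~8.8 st (few accents in the IP)
print(round(range_st(300, 150), 1))  # 12.0 st (more accents in the IP)
```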

_images/fig51.png

注釈

FIG. 5.

Finite-state grammars for English (top) and Japanese (bottom) intonational tunes.

The crucial prediction of this hypothesis is that the effects associated with pitch accents on pitch ranges should be larger in ADS than IDS. It is well documented that utterances in ADS in general contain more words and are longer in duration than those in IDS (Fernald, 1992; Fernald et al., 1989; Grieser and Kuhl, 1988; Newport et al., 1977; Snow, 1977, among others). Assuming that the proportion of accented to unaccented words is not significantly larger in IDS than in ADS (an assumption that is borne out, as discussed in Sec. IV D), ADS utterances, which in general contain more words than IDS utterances, should on average contain more pitch accents than IDS utterances. This leads to the prediction that ADS utterances should have a larger average pitch range of BODY than IDS, if there is no register-induced pitch-range expansion in the IDS utterance. Or, said differently, when we compare utterances in IDS and ADS with the same length, we should find the same pitch range if there is no intonational exaggeration in IDS. If, on the other hand, there is intonational exaggeration, IDS utterances should have larger pitch ranges than ADS utterances of equivalent length.2

Finally, we will discuss the third part of the hypothesis. The pitch-range expansion effect that lexical pitch accents have on the BODY should be larger in the long utterances characteristic of ADS than in the short utterances characteristic of IDS. This effect could thus superficially mask intonational exaggeration that should be observed in BPMs of IDS, if these two phonological entities, pitch accent and BPM, are not taken into consideration. In order to find intonational exaggeration in Japanese, it is therefore necessary first to analyze the BODY and BPMs separately, and second to compare utterances of the same length between ADS and IDS.

III.DATA: JAPANESE IDS CORPUS
A.Participants

Twenty-two mothers [age 25–43, average age 33.0, standard deviation (SD) 3.6] and their children participated in the recording of the RIKEN Japanese Mother-Infant Conversation Corpus (R-JMICC) (Mazuka et al., 2006). All the mothers were from Tokyo or its adjacent prefectures, and were native speakers of Standard Japanese. The children’s ages ranged from 18 to 24 months (average 20.4, SD 2.7 months, 10 girls). The data from one mother was excluded from the analyses of the present paper because of difficulty in F0 extraction due to her overly creaky phonation.

B.Recordings

Mother-child dyads were brought into a sound-attenuated room at the Laboratory of Language Development, RIKEN Brain Science Institute, Japan. A head-mounted dynamic microphone was used to record each mother’s speech. These audio recordings were saved onto digital audio tapes (16 bits, 41 kHz). Three tasks were involved in the recordings. The mother was first asked to play with her infant with a number of picture books for approximately 15 min. After 15 min, the books were removed and replaced with a set of toys. The mother was asked to play with them for an additional 15 min. Afterwards, a female experimenter, aged 33, who was also a mother of a girl of similar age, entered the room and talked with the mother for about 15 min. The topic of the conversation was not specified in advance, but the conversation tended to focus on topics related to child-rearing. Approximately 45 min of recording per mother-child dyad were thus made. The final sample consisted of a total of three hours of ADS (approximately 24 000 words) and eight hours of IDS (47 000 words) from 21 mothers.

C.Linguistic annotations in the R-JMICC

The R-JMICC contains various linguistic annotations including segmental, morphological and intonational labels (Mazuka et al., 2006). Segmental and intonational labeling was undertaken using the PRAAT software (Boersma and Weenink, 2006). The basic coding schema for the R-JMICC was adopted from the criteria used in the Corpus of Spontaneous Japanese (CSJ), a large-scale annotated speech database of Japanese spontaneous speech (Maekawa, 2003). Segmental labels, representing the types and durations of vowels and consonants, were time-locked to the speech signals. Morphological labels used in the R-JMICC provide information about the boundaries of words, their part of speech, and their conjugation. Consistent with the criteria of the CSJ, two types of word-sized units, short unit words (SUW) and long unit words (LUW), were identified. Each SUW is either a mono-morphemic word or else made up of two consecutive morphemes, and is identical to or close to items listed in ordinary Japanese dictionaries. LUWs, on the other hand, are compounds. The SUW and LUW categories are both defined independently of prosodic cues. The analyses in this study exploited SUW but not LUW in measuring the length of various prosodic units, since LUW does not always constitute a hierarchical structure with the other prosodic units. Intonational labeling was based on the X-JToBI scheme (Maekawa et al., 2002), which provides, among other things, information on two levels of prosodic phrasing (AP and IP), lexical pitch accents, and BPMs, as discussed in Sec. II. The labeling was done by three trained phoneticians, including the first author (a highly experienced X-JToBI labeler). To ensure reliability, the labeling of the entire corpus was double-checked by the first author.3
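Since the labeling was done in Praat, one way to reproduce this kind of pipeline in Python is the third-party parselmouth interface to Praat (praat-parselmouth on PyPI). The sketch below shows only F0 extraction; the file name and analysis settings are placeholders I chose for illustration, not those used for the R-JMICC.

```python
import parselmouth  # third-party Python interface to Praat (pip install praat-parselmouth)

# Rough sketch of pulling an F0 track from a recording with Praat via
# parselmouth. File name and pitch settings are illustrative only.
snd = parselmouth.Sound("mother_ids_sample.wav")
pitch = snd.to_pitch(time_step=0.01, pitch_floor=75.0, pitch_ceiling=600.0)

times = pitch.xs()                          # frame times (s)
f0 = pitch.selected_array["frequency"]      # F0 in Hz, 0 where unvoiced
voiced = f0 > 0
print(f"{voiced.sum()} voiced frames, mean F0 = {f0[voiced].mean():.1f} Hz")
```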

D.Measurements

For the purpose of the present study, an utterance is defined as an IP or a sequence of IPs followed by a pause longer than 200 ms, following the coding scheme developed for the CSJ (Maekawa et al., 2002). Henceforth, we will refer to this operationally defined unit as an Utterance, with the first letter capitalized. The utterance in a general sense, which may be described as a “stretch of speech preceded and followed by silence or a change of speaker” (Crystal, 2008: 505), is written without a capital letter. Note that in the coding schema of X-JToBI, the highest prosodic unit is IP, and the Utterance is not independently defined. For the measurement of duration of various prosodic units such as AP and IP, we used the segmental labels. For the second analysis (Sec. IV B), we first identified where the BPM occurs and then divided the F0 contours into BODY and BPM parts. The occurrence/non-occurrence of a BPM and its temporal location was detected by means of the intonation labels and segmental labels of X-JToBI. In R-JMICC, the intonation labels for BPMs, such as H%, LH%, and HL%, were aligned exactly with the end of the AP in which the BPM occurred. The ending time of the BPM was thus identified as the offset of the AP accompanying a BPM label. The starting time of the BPM, on the other hand, was identified generally as the onset of the final mora of the AP containing the BPM label, because BPMs usually start from the onset of the AP-final mora. A BPM can also start at the penultimate mora. When this occurred, it was coded using X-JToBI labels, and the beginning of the penultimate mora of the AP was identified as the onset of the BPM. Once the starting and ending times of each BPM were determined, we measured the mean, maximum, minimum, and range of F0 between the two temporal points. In order to examine pitch modification during the BODY section, we first deleted the F0 points between the starting and ending times of the BPM as defined above. We then measured the mean, maximum, minimum, and range of the modified F0 contours of both Utterance and IP.
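The measurement logic described above (grouping IPs into Utterances at pauses longer than 200 ms, then splitting each contour into BPM and BODY spans before computing F0 statistics) could be sketched roughly as follows. The label times and the F0 contour here are made-up toy data, and the helper names are hypothetical, not taken from the paper.

```python
import numpy as np

# Sketch of the measurement logic: group IPs into Utterances separated by
# pauses longer than 200 ms, and compute F0 statistics inside / outside the
# BPM spans (the "outside" part corresponding to the BODY).
PAUSE_THRESHOLD = 0.200  # seconds

def group_utterances(ip_spans):
    """Group (start, end) IP spans into Utterances; a new Utterance starts
    whenever the silent gap to the previous IP exceeds PAUSE_THRESHOLD."""
    utterances, current = [], [ip_spans[0]]
    for prev, cur in zip(ip_spans, ip_spans[1:]):
        if cur[0] - prev[1] > PAUSE_THRESHOLD:
            utterances.append(current)
            current = []
        current.append(cur)
    utterances.append(current)
    return utterances

def f0_stats(times, f0, spans, exclude=False):
    """Mean/max/min (Hz) and range (st) of voiced frames inside `spans`,
    or outside them when exclude=True (e.g., the BODY after removing BPMs)."""
    in_span = np.zeros(times.shape, dtype=bool)
    for start, end in spans:
        in_span |= (times >= start) & (times <= end)
    mask = (~in_span if exclude else in_span) & (f0 > 0)
    vals = f0[mask]
    return {"mean": vals.mean(), "max": vals.max(), "min": vals.min(),
            "range_st": 12 * np.log2(vals.max() / vals.min())}

# Toy example: three IPs, and one AP-final BPM span from 1.10 s to 1.25 s.
ips = [(0.0, 1.25), (1.60, 2.40), (2.55, 3.10)]
print(group_utterances(ips))        # -> [[(0.0, 1.25)], [(1.6, 2.4), (2.55, 3.1)]]

times = np.arange(0.0, 1.25, 0.01)
f0 = np.linspace(250.0, 180.0, times.size)                                 # falling BODY contour
f0[times >= 1.10] = np.linspace(200.0, 320.0, int((times >= 1.10).sum()))  # BPM rise
print(f0_stats(times, f0, [(1.10, 1.25)]))                 # stats over the BPM span
print(f0_stats(times, f0, [(1.10, 1.25)], exclude=True))   # stats over the BODY
```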

IV.ANALYSIS

A.Introduction

The analysis was carried out in three steps. First, in analysis 1, we examined pitch-range modification using the same methodology as in Fernald et al. (1989). Second, in analysis 2, we examined the pitch ranges of the BPMs. Third, in analysis 3, we investigated the pitch ranges of the BODY, controlling for its length. As discussed above, the IP is the relevant domain for intonational phenomena such as downstep, pitch reset, and pitch-range expansion. In the previous studies that investigated IDS prosody, however, utterances have been used as the unit for measurements. In the present paper, the calculations are carried out using both Utterances and IPs as units of reference, and the same pattern of results is obtained for both analyses. For analysis 1, we report the results based on Utterances, so that our results can be compared directly with previous studies. For the other analyses, we report the results based on IPs. In all the statistical analyses, differences were considered significant when p < 0.05. Unless otherwise stated, all of the post hoc tests reported in this paper used the Bonferroni method to control for multiple comparisons, and they are significant at least at the p < 0.05 level.
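For reference, Bonferroni-style control over a family of post hoc comparisons of the kind mentioned above can be done, for example, with statsmodels; the p-values below are invented for illustration and are not results from the paper.

```python
from statsmodels.stats.multitest import multipletests

# Bonferroni adjustment of a set of post hoc p-values (invented numbers).
raw_p = [0.0004, 0.012, 0.020, 0.400]
reject, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
for p, pa, r in zip(raw_p, p_adj, reject):
    print(f"raw p = {p:.4f}  Bonferroni-adjusted p = {pa:.4f}  significant: {r}")
```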

B.Analysis 1: Replication of Fernald et al. (1989)

The first analysis measured the F0 maximum, mean, minimum, and range of the Utterance as a whole. Following the convention in IDS pitch studies (cf. Fernald et al., 1989; Kitamura et al., 2002), we report mean, maximum, and minimum F0 in Hz, while we report pitch ranges in semitones (st). Results of paired t-tests revealed that although IDS showed a significantly higher mean F0 [t(20) = 7.60, p < 0.001], maximum F0 [t(20) = 5.46, p < 0.001], and minimum F0 [t(20) = 5.46, p < 0.001] for overall Utterances, the F0 range did not differ significantly between the two registers [t(20) = 0.27, not significant (n.s.)]. These results replicate the findings in Fernald et al. (1989); that is, mothers used a higher pitched voice when talking to infants than to adults, but did not alter their overall pitch range. Moreover, all averages were comparable to those reported in Fernald et al. (1989).
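An analysis-1 style comparison reduces each mother to one value per register and runs a paired t-test across the 21 mothers. The sketch below uses randomly generated placeholder values rather than the corpus data; F0 ranges in semitones (12 · log2(max/min)) would be compared in the same way.

```python
import numpy as np
from scipy.stats import ttest_rel

# Paired comparison of per-mother F0 means in ADS vs IDS (placeholder data).
rng = np.random.default_rng(0)
ads_mean_f0 = rng.normal(210, 15, size=21)              # hypothetical per-mother ADS means (Hz)
ids_mean_f0 = ads_mean_f0 + rng.normal(25, 8, size=21)  # hypothetical per-mother IDS means (Hz)

res = ttest_rel(ids_mean_f0, ads_mean_f0)
print(f"mean F0, IDS vs ADS: t(20) = {res.statistic:.2f}, p = {res.pvalue:.4g}")
```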

C.Analysis 2: Measurement of BPM

In this analysis, we analyzed the pitch modification during the BPM sections.4 Table I shows means for the F0 mean, maximum, minimum (Hz) and F0 range (st) of each BPM type. F0 mean, maximum, minimum (Hz), and F0 range (st) data were separately submitted to 2 × 3 repeated measures analyses of variance (ANOVAs), with register (ADS and IDS) and BPM types (H%, LH%, and HL%) as within-subjects factors. As shown in Table II, there was a significant main effect of register, with IDS averages generally higher than ADS averages for each acoustic dimension. Importantly, the pitch ranges in IDS were significantly larger than those of ADS.5 Note that the pitch-range expansion here cannot be accounted for in terms of the slower speech rate in IDS, as the duration of syllables with a BPM in IDS was not any longer than that of ADS.6 An example of pitch-range expansion at a BPM (LH%) is shown in Fig. 6.

TABLE I. Pitch modification in BPM. Standard deviation (SD) in parentheses. Values are given as ADS / IDS.

  • Mean (Hz): H% 210.04 (17.02) / 263.06 (23.93); LH% 209.30 (34.26) / 237.67 (23.23); HL% 206.33 (19.00) / 234.88 (27.49)
  • Max (Hz): H% 224.15 (18.61) / 289.87 (26.23); LH% 242.65 (41.90) / 311.78 (37.78); HL% 224.77 (21.27) / 261.62 (32.62)
  • Min (Hz): H% 196.05 (15.84) / 237.60 (21.90); LH% 191.33 (28.28) / 202.17 (19.86); HL% 181.60 (17.17) / 195.43 (22.75)
  • Range (st): H% 2.25 (0.57) / 3.46 (0.60); LH% 3.91 (1.01) / 7.37 (1.62); HL% 3.54 (0.76) / 4.92 (1.42)

In the original table, cells were marked where IDS was significantly higher (larger) than ADS in the post-hoc comparisons of means with Bonferroni correction (at less than 0.1% or less than 1%).

TABLE II. Results of ANOVAs for pitch modification in BPMs. “Register * BPM” means an interaction between Register and BPM.

  • Mean (Hz): Register F(1, 18) = 71.60, p < 0.001; BPM F(2, 18) = 6.46, p < 0.01; Register * BPM F(2, 36) = 0.00, p < 0.001
  • Maximum (Hz): Register F(1, 18) = 89.91, p < 0.001; BPM F(2, 18) = 15.22, p < 0.001; Register * BPM F(2, 36) = 7.01, p < 0.001
  • Minimum (Hz): Register F(1, 18) = 47.74, p < 0.001; BPM F(2, 18) = 26.84, p < 0.001; Register * BPM F(2, 36) = 12.55, p < 0.001
  • Range (st): Register F(1, 18) = 69.18, p < 0.001; BPM F(2, 18) = 74.84, p < 0.001; Register * BPM F(2, 36) = 30.25, p < 0.001

FIG. 6. Waveforms and F0 contours of utterances with LH% BPMs. ADS Bu’ranko to ka? “Things like a seat swing?” (left) and IDS A’nyo syuru no? “Will you walk?” (right), produced by the same mother. The BPM is marked by a rectangle. Apostrophes indicate the locations of pitch accents.

D.Analysis 3: Measurement of the BODY as a function of its length

Utterances and IPs were longer in ADS than IDS no matter how we measured them.7 Thus we examined the pitch range of the ADS and IDS BODY, controlling for the length of the IP. Of the various measures, the number of pitch accents per IP was chosen for this analysis, because the IP is the domain of pitch range specification, i.e., the domain of downstep (see Table III). Since very few of the IPs in IDS contained four or more accents, analyses are constrained to IPs with three or fewer accents.8 We carried out a series of four repeated measures two-way ANOVAs using F0 mean, maximum, minimum, and F0 range as dependent variables. Register (ADS and IDS) and the number of accents within an IP (0, 1, 2, or 3) were the within-subject variables. First, the results of the ANOVA using mean F0 as the dependent variable revealed a significant main effect of register [F(1, 20) = 43.43, p < 0.001], a significant main effect of the number of accents [F(3, 60) = 255.04, p < 0.001], and a significant interaction [F(3, 60) = 14.04, p < 0.001]. Post hoc tests revealed that the F0 mean was significantly higher in IDS than ADS in every condition except three-accent IPs. This shows that the mean F0 tends to decrease as the number of accents within the IP increases, and also that the mean F0 is in general higher in IDS than ADS. Second, for maximum F0, the results revealed a significant main effect of register [F(1, 20) = 22.50, p < 0.001] and a significant main effect of the number of accents [F(3, 60) = 78.86, p < 0.001], but not a significant interaction [F(3, 60) = 1.09, n.s.]. They showed that as the number of accents increased, the maximum F0 also increased. This confirms the presence of the anticipatory raising effect of downstep (Laniran and Clements, 2003) in Japanese. Third, the ANOVA with minimum F0 as the dependent variable showed a significant main effect of register [F(1, 20) = 43.43, p < 0.001], a significant main effect of the number of accents [F(3, 60) = 255.04, p < 0.001], and a significant interaction between the two [F(3, 60) = 14.04, p < 0.001]. Post hoc tests revealed that the F0 minimum was significantly higher in IDS than ADS in every condition except three-accent IPs. This shows that minimum F0 is higher in IDS than in ADS in almost all conditions, and it decreases as the number of accents increases.

Fourth, the ANOVA with F0 range as the dependent variable revealed no significant main effect of register [F(1, 20) = 1.31, n.s.]. There was, however, a significant main effect of the number of accents [F(3, 60) = 255.04, p < 0.001]. An interaction between the two factors was not significant [F(3, 60) = 14.04, n.s.]. Post hoc tests revealed that the F0 range in IDS was significantly larger than in ADS for one-, two-, and three-accent IPs. However, the difference was not significant when the IP contained no accent. The results showed that the F0 range of the IP BODY is determined predominantly by the number of accents within it, while the contribution of the register was not significant. The larger pitch range in the ADS BODY is not attributable to the more frequent occurrence of accented words in ADS. In fact, a significantly smaller proportion of ADS words were accented than IDS words.9 In summary, the results of these analyses showed that the pitch range of the BODY becomes larger as the number of accents within an IP increases. This is true in both ADS and IDS. When the effects of this length-induced pitch-range expansion were factored out, we found that the pitch range of the BODY in ADS is not larger than in IDS.

TABLE III. Pitch modifications of IPs sorted by the number of pitch accents within the IP. SD in parentheses. Values are given as ADS / IDS for 0, 1, 2, and 3 accents within an IP.

  • Mean (Hz): 0: 230.06 (19.57) / 256.30 (24.08); 1: 225.34 (16.71) / 254.73 (27.23); 2: 221.50 (16.95) / 245.34 (22.83); 3: 227.89 (24.64) / 236.22 (24.01)
  • Max (Hz): 0: 247.67 (21.45) / 277.47 (27.90); 1: 259.74 (19.32) / 295.10 (35.54); 2: 288.18 (26.59) / 321.87 (36.09); 3: 317.14 (41.92) / 338.47 (49.43)
  • Min (Hz): 0: 209.04 (17.41) / 231.73 (19.83); 1: 187.54 (15.32) / 209.63 (18.39); 2: 170.95 (14.26) / 183.76 (15.42); 3: 167.25 (17.66) / 170.23 (18.81)
  • Range (st): 0: 2.86 (0.54) / 3.01 (0.68); 1: 5.47 (0.69) / 5.75 (1.06); 2: 8.69 (1.24) / 9.48 (1.50); 3: 10.68 (2.09) / 11.41 (2.72)

In the original table, cells were marked where IDS was significantly higher (larger) than ADS at less than 0.1% in the post-hoc comparisons of means with Bonferroni correction; the remaining comparisons were not significant.

V.DISCUSSION AND CONCLUSION

The present study investigated the dynamic aspects of a language’s intonation by examining pitch-range expansion, or intonational exaggeration, in IDS in Japanese, which has been reported to be absent in this language (Fernald et al., 1989). We found that (1) when measured as the difference between maximum and minimum F0 of whole utterances, Japanese IDS showed no pitch-range expansion, replicating the findings of Fernald et al. (1989); (2) while pitch ranges at the locations of BPMs were significantly expanded in IDS, pitch ranges for the rest of the utterance, which the paper calls BODY, were larger in ADS than IDS; (3) the pitch range for BODY is most strongly determined by its length (i.e., the number of pitch accents it contains), and once length is accounted for, the pitch range of the BODY in ADS is in fact no larger than that of IDS. On the basis of these findings, we argue that Japanese IDS does show register-induced pitch-range expansion.


First, we found robust pitch-range expansion at the locations of BPMs, which occur more frequently in IDS than ADS. As discussed in Sec. II, BPMs are tones that are associated with pragmatic interpretations (Venditti et al., 2008), and IDS utterances containing BPMs often involve mothers’ attempts to engage infants by addressing questions to them or seeking agreement by using the sentence-final particle ne (cf. Fernald and Morikawa, 1993). Crucially, when BPMs occur in IDS, each of the tones is produced with an expanded pitch range. This type of pitch-range expansion is likely to be heard as “exaggerated” by listeners and may account for the results of previous studies showing that the intonation of Japanese IDS is perceived to be exaggerated by adults (Horie et al., 2008) and that Japanese IDS is preferred over ADS by infants (Hayashi et al., 2001).

Second, as shown in the third analysis (Sec. IV D), the length-induced pitch-range expansion in the ADS BODY is not an “exaggeration” of the intonation. The pitch range of an intonation phrase (IP) is determined primarily by the number of accents the IP contains—the longer the IP (and thus the more accented words it contains), the larger the pitch range. This is independent of register-induced pitch-range modification and occurs regardless of the difference between ADS and IDS. When a speaker produces a long utterance with a pitch range that is normal for its length, she/he is under no pressure to “exaggerate” the intonation, nor is it likely to be heard as “exaggerated” by a listener. In fact, when ADS and IDS utterances of equal length were compared, there was no difference in pitch ranges between the two registers.

These results highlight the usefulness of an intonational phonological framework for describing how intonation is modified in a specialized speech register. Our analysis has shown that pitch-range expansion in Japanese IDS has previously been overlooked, because no reference has been made in the prior scholarly literature to phonological events specific to Japanese—specifically, BPMs and lexical pitch accents, as well as utterance length and the interactions between these factors. We do not mean to argue that pitch-range expansion in IDS is not universal. On the contrary, our findings provide additional support to the view that it is. Our findings are novel in that they successfully demonstrate that not all phonological tones are subject to the paralinguistic modification characteristic of a specialized speech register. Specifically, our analysis suggested that pitch-range expansions in IDS are not realized in the same way in every language, but are instead implemented within a language-specific system of intonation. When there is a desire or pressure to exaggerate the intonation, speakers seem to do so by expanding the pitch range at the location where flexibility in varying contours is most tolerated. In phonological terms, this is the location where pragmatically chosen tones are realized. In the case of Japanese, these are the BPMs at the boundaries of prosodic phrases, while in the case of English, they are not only the phrase accents and boundary tones at phrasal boundaries, but also the pitch accents at the locations of stressed syllables. It has been commonly assumed that paralinguistic pitch-range modifications (which should include intonational exaggeration in IDS) can occur globally, irrespective of what tones are present in the utterance [cf. Ladd (1996), Chap. 7]. The present results, however, showed that only certain tones can undergo paralinguistic modifications. Our study, therefore, promises to shed light on the phonetics of pitch-range variation.

One implication of the present study is that cross-linguistic differences in IDS intonation may be better captured by re-examining them with reference to the intonation system of each language. The intonational exaggeration in Japanese is camouflaged by the pitch range of the BODY, which increases linearly as the number of pitch accents in an IP increases. In English, by contrast, such an increase is not expected, and thus the register-induced exaggeration can be captured straightforwardly by the conventional, phonologically uninformed, method of measuring the pitch range—the maximum minus the minimum F0 of an entire utterance. The same method, however, is not sufficient to capture the two competing forms of pitch-range expansion in Japanese. This leads us to speculate that the magnitude of intonational exaggeration in some tone languages (Thai and Chinese) is in fact larger than what has previously been reported. In these languages, the pitch range in IDS is generally reported to be smaller (e.g., 6 to 7 semitones; Grieser and Kuhl, 1988; Kitamura et al., 2002; Papousek et al., 1991) than in English and other Germanic and Romance languages (10 to 12 semitones; Fernald et al., 1989; Kitamura et al., 2002). Examining the intonation of these languages with reference to their intonation systems may allow us to better understand the specific ways these languages modulate their intonation. It might also provide a clue as to why the intonation of Mayan-Quiche-speaking mothers does not show the typical pitch characteristics of IDS (Bernstein Ratner and Pye, 1984; Ingram, 1995).

At the same time, our data robustly showed that Japanese IDS is produced with higher pitch than ADS. This is consistent with many previous studies of English and other Germanic/Romance languages (Fernald, 1989; Fernald et al., 1989), Japanese (Fernald et al., 1989), as well as tone languages (Papousek et al., 1991; Liu et al., 2007; Kitamura et al., 2002). It has sometimes been argued that the speech modulation seen in IDS may be driven by biological factors common across species, and that both adults’ tendency to produce higher-pitched speech when addressing infants and infants’ preference for higher-pitched voices may be part of a biologically driven phenomenon (Morton, 1977; Trainer and Zacharias, 1998). Thus, the results of the present study help show that there is a complex interplay of universal and language-specific factors that contribute to the pitch modulation that is characteristic of IDS intonation.

ACKNOWLEDGMENTS

We thank the children and mothers for their participation in the recordings of IDS and ADS used in the present study. We also thank Akira Utsugi, Kikuo Maekawa, and Andrew Martin for their helpful comments. The study reported in this paper was supported in part by a Japanese government Grant-in-Aid for Young Scientists B #23720207 to Y.I. and a Grant-in-Aid for Scientific Research A #21610028 to R.M.


注釈

1. Various linguistic factors bring about pitch reset at the IP boundary; these include syntactic constituency and focus (Pierrehumbert and Beckman, 1988; Selkirk and Tateishi, 1991; Kubozono, 1993). In the case of the utterances in Figs. 2 and 3, pitch reset was induced by the focus indicated by the wh-element da’re “who.”
2. We do not mean to argue that the pitch range of English (as well as, arguably, those of many other languages) is not influenced by downstep and anticipatory raising. English utterances can have staircase-like contours, such as one having multiple H*+L pitch accents. The contour of such an utterance would resemble the Japanese one in Fig. 3. In this case, a large pitch range would be expected due to the progressive lowering of the F0 bottoms and the anticipatory raising of the peaks. The difference is that these contours constitute only a small part of the large inventory of contours in English, and the majority of utterances, which have other types of contours, are not subject to downstep or downstep-induced pitch-range expansion. Consequently, simply because utterances are longer in ADS in English and other languages, their pitch ranges are not expected to increase to the same degree as those in Japanese.
3. See Supplementary Material 1 at http://lang-dev-lab.brain.riken.jp/igarashi-jasa-suppl.html for inter-labeler reliability with X-JToBI.
4. See Supplementary Material 2 at http://lang-dev-lab.brain.riken.jp/igarashi-jasa-suppl.html for the frequency of each BPM type.
5. See Supplementary Material 3 at http://lang-dev-lab.brain.riken.jp/igarashi-jasa-suppl.html for post-hoc comparisons of the ANOVAs.
6. See Supplementary Material 4 at http://lang-dev-lab.brain.riken.jp/igarashi-jasa-suppl.html for the duration of morae which bear BPMs.
7. We also analyzed the pitch characteristics of the BODY regardless of its length. The results revealed that although the pitch of IDS is higher than ADS, the pitch ranges of the BODY are larger in ADS than in IDS. See Supplementary Material 5 at http://lang-dev-lab.brain.riken.jp/igarashi-jasa-suppl.html for details. The larger pitch range in ADS is accounted for by the effect of the length of the utterances, as shown in Sec. IV D.
8. See Supplementary Material 5 at http://lang-dev-lab.brain.riken.jp/igarashi-jasa-suppl.html for analyses of the average length of Utterances and Intonation Phrases based on several measures.
9. See Supplementary Material 6 at http://lang-dev-lab.brain.riken.jp/igarashi-jasa-suppl.html for the proportion of accented versus unaccented words.

references

  • Bernstein Ratner, N., and Pye, C. (1984). “Higher pitch in BT is not universal: Acoustic evidence from Quiche Mayan,” J. Child Lang. 11(3), 515–522.
  • Boersma, P., and Weenink, D. (2006). “Praat, a system for doing phonetics by computer,” Glot International, Vol. 5, pp. 341–345.
  • Bornstein, M. H., Tamis-LeMonda, C. S., Tal, J., Ludemann, P., Toda, S., Rahn, C. W., Pêcheux, M. G., Azuma, H., and Vardi, D. (1992). “Maternal responsiveness to infants in three societies: the United States, France, and Japan,” Child Dev. 63(4), 808–821.
  • Cooper, R. P., Abraham, J., Berman, S., and Staska, M. (1997). “The development of infants’ preference for motherese,” Infant Behav. Dev. 20(4), 477–488.
  • Crystal, D. (2008). A Dictionary of Linguistics and Phonetics, 6th ed. (Blackwell Publishing, Malden, MA).
  • Ferguson, C. A. (1977). “Baby talk as a simplified register,” in Talking to Children: Language Input and Acquisition, edited by C. E. Snow and C. A. Ferguson (Cambridge University Press, London), pp. 209–235.
  • Fernald, A. (1989). “Intonation and communicative intent in mothers’ speech to infants: Is the melody the message?,” Child Dev. 60, 1497–1510.
  • Fernald, A. (1992). “Meaningful melodies in mothers’ speech to infants,” in Nonverbal Vocal Communication: Comparative and Developmental Approaches, edited by H. Papousek, U. Jurgens, and M. Papousek (Cambridge University Press, Cambridge), pp. 262–282.
  • Fernald, A. (1993). “Approval and disapproval: Infant responsiveness to vocal affect in familiar and unfamiliar languages,” Child Dev. 64, 657–674.
  • Fernald, A., and Kuhl, P. K. (1987). “Acoustic determinants of infant preference for motherese speech,” Infant Behav. Develop. 10, 279–293.
  • Fernald, A., and Morikawa, H. (1993). “Common themes and cultural variations in Japanese and American mothers’ speech to infants,” Child Dev. 64, 637–656.
  • Fernald, A., Taeschner, T., Dunn, J., Papousek, M., de Boysson-Bardies, B., and Fukui, I. (1989). “A cross-language study of prosodic modifications in mothers’ and fathers’ speech to preverbal infants,” J. Child Lang. 16, 477–501.
  • Grieser, D. L., and Kuhl, P. K. (1988). “Maternal speech to infants in a tonal language: Support for universal prosodic features in motherese,” Dev. Psychol. 24(1), 14–20.
  • Gussenhoven, C. (2004). The Phonology of Tone and Intonation (Cambridge University Press, Cambridge).
  • Hayashi, A. (2004). “Zen-gengoki no onsei chikaku hattatsu to gengoshuutoku ni kansuru jikken-teki kenkyuu” (“An experimental study of preverbal infants’ speech perception and language acquisition”), Doctoral Dissertation, Tokyo Gakugei University, Tokyo, Japan.
  • Hayashi, A., Tameka, Y., and Kiritani, S. (2001). “Developmental change in auditory preferences for speech stimuli in Japanese infants,” J. Speech Lang. Hear. Res. 44, 1189–1200.
  • Horie, R., Hayashi, A., Shirasawa, K., and Mazuka, R. (2008). “Mother, I don’t really like the high-pitched, slow speech of Motherese: Crosslinguistic differences in infants’ reliance on different acoustic cues in infant directed speech,” XVIth Biennial International Conference on Infant Studies, Vancouver, Canada.
  • Ingram, D. (1995). “The cultural basis of prosodic modifications to infants and children: A response to Fernald’s universalist theory,” J. Child Lang. 22(1), 223–233.
  • Inoue, T., Nakagawa, R., Kondou, M., Koga, T., and Shinohara, K. (2011). “Discrimination between mothers’ infant- and adult-directed speech using hidden Markov models,” Neurosci. Res. 70, 62–70.
  • Jun, S.-A. (2005). Prosodic Typology: The Phonology of Intonation and Phrasing (Oxford University Press, New York).
  • Kitamura, C., Thanavishuth, C., Burnham, D., and Luksaneeyanawin, S. (2002). “Universality and specificity in infant-directed speech: Pitch modifications as a function of infant age and sex in a tonal and non-tonal language,” Infant Behav. Dev. 24, 372–392.
  • Kubozono, H. (1993). The Organization of Japanese Prosody (Kurosio Publishers, Tokyo).
  • Ladd, D. R. (1996). Intonational Phonology (Cambridge University Press, Cambridge).
  • Laniran, Y. O., and Clements, G. N. (2003). “Downstep and high raising: Interacting factors in Yoruba tone production,” J. Phonetics 31, 203–250.
  • Liu, H.-M., Tsao, F.-M., and Kuhl, P. K. (2007). “Acoustic analysis of lexical tone in Mandarin infant-directed speech,” Dev. Psychol. 43(4), 912–917.
  • Maekawa, K. (2003). “Corpus of Spontaneous Japanese: Its design and evaluation,” in Proceedings of ISCA and IEEE Workshop on Spontaneous Speech Processing and Recognition, Tokyo, pp. 7–12.
  • Maekawa, K., Kikuchi, H., Igarashi, Y., and Venditti, J. (2002). “X-JToBI: An extended J_ToBI for spontaneous speech,” in Proceedings of the 7th International Conference on Spoken Language Processing, Denver, CO, pp. 1545–1548.
  • Mazuka, R., Igarashi, Y., and Nishikawa, K. (2006). “Input for learning Japanese: RIKEN Japanese Mother-Infant Conversation Corpus,” Technical report of IEICE, TL2006-16 106(165), pp. 11–15.
  • McCann, J., and Peppe, S. (2003). “Prosody in autism spectrum disorders: A critical review,” Int. J. Lang. Commun. Disord. 38, 325–350.
  • Morton, E. S. (1977). “On the occurrence and significance of motivation-structural rules in some bird and mammal sounds,” Am. Nat. 111(981), 855–869.
  • Newman, R. S., and Hussain, I. (2006). “Changes in preference for infant-directed speech in low and moderate noise by 5- to 13-month-olds,” Infancy 10(1), 61–76.
  • Newport, E., Gleitman, H., and Gleitman, L. (1977). “Mother, I’d rather do it myself: Some effects and non-effects of maternal speech style,” in Talking to Children: Language Input and Acquisition, edited by C. E. Snow and C. A. Ferguson (Cambridge University Press, London), pp. 109–150.
  • Papousek, M., Papousek, H., and Symmes, D. (1991). “The meanings of melodies in motherese in tone and stress languages,” Infant Behav. Dev. 14(4), 415–440.
  • Pegg, J. E., Werker, J. F., and McLeod, P. J. (1992). “Preference for infant-directed over adult-directed speech: Evidence from 7-week-old infants,” Infant Behav. Dev. 15, 325–345.
  • Pierrehumbert, J. (1980). “The phonology and phonetics of English intonation,” Ph.D. dissertation, MIT.
  • Pierrehumbert, J., and Beckman, M. E. (1988). Japanese Tone Structure (MIT Press, Cambridge).
  • Pierrehumbert, J., and Hirschberg, J. (1990). “The meaning of intonational contours in the interpretation of discourse,” in Intentions in Communication, edited by P. Cohen, J. Morgan, and M. Pollack (MIT Press, Cambridge, MA), pp. 271–311.
  • Selkirk, E. O., and Tateishi, K. (1991). “Syntax and downstep in Japanese,” in Interdisciplinary Approaches to Language: Essays in Honor of S.-Y. Kuroda, edited by C. Georgopoulos and R. Ishihara (Kluwer Academic, Dordrecht), pp. 519–543.
  • Snow, C. E. (1977). “Mothers’ speech research: From input to interaction,” in Talking to Children: Language Input and Acquisition, edited by C. E. Snow and C. A. Ferguson (Cambridge University Press, London), pp. 31–49.
  • Soderstrom, M. (2007). “Beyond babytalk: Re-evaluating the nature and content of speech input to preverbal infants,” Dev. Rev. 27, 501–532.
  • Toda, S., Fogel, A., and Kawai, M. (1990). “Maternal speech to three-month-old infants in the United States and Japan,” J. Child Lang. 17(2), 279–294.
  • Trainer, L. J., and Zacharias, C. A. (1998). “Infants prefer higher-pitched singing,” Infant Behav. Dev. 21(4), 799–806.
  • Venditti, J. (2005). “The J_ToBI model of Japanese intonation,” in Prosodic Typology: The Phonology of Intonation and Phrasing, edited by S. A. Jun (Oxford University Press, New York), pp. 172–200.
  • Venditti, J., Maekawa, K., and Beckman, M. E. (2008). “Prominence marking in the Japanese intonation system,” in Handbook of Japanese Linguistics, edited by S. Miyagawa and M. Saito (Oxford University Press, New York), pp. 456–512.
  • Ward, G., and Hirschberg, J. (1985). “Implicating uncertainty: The pragmatics of fall-rise,” Language 61, 747–776.
  • Williams, C. E., and Stevens, K. N. (1972). “Emotions and Speech: Some acoustical correlates,” J. Acoust. Soc. Am. 52, 1238–1250.

Igarashi et al.: Exaggerated prosody in infant-directed speech

Redistribution subject to ASA license or copyright; see http://acousticalsociety.org/content/terms. Download to IP: 133.9.93.26 On: Fri, 07 Feb 2014 07:08:56

Repetitive Extralinguistic, Prosodic and Linguistic Behavior in Autism Spectrum Disorders-High Functioning (ASD-HF)

  • Hila Green and Yishai Tobin
  • Ben-Gurion University of the Negev,
  • Israel
1. Introduction

Restricted repetitive behavior has been a defining feature of the Autism Spectrum Disorders (ASD) since the original description of autism (Kanner, 1943), and by diagnostic convention, all individuals with ASD display some form of these “restricted repetitive and stereotyped patterns of behavior, interests, and activities” (Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition [DSM-IV], American Psychiatric Association [APA], 1994:71). Although ASD is associated with a wide range of specific forms of atypical repetition, this issue has received far less research attention than social and communication deficits. Indeed, it was not our original intention to examine the prosody of ASD high functioning (ASD-HF) children from the perspective of the presence or the absence of repetitive behavior; we were concentrating on “prosody” within the context of linguistic behavior - whether or not the manifestation of the “different” prosody by ASD-HF individuals may reflect “delays and deficits in language and communication”, which is another core feature of ASD. However, the data we collected in our research brought this issue into focus and raised new questions regarding the centrality of the restricted repetitive behaviors in ASD.

注釈

制限された反復行動は, 自閉症の最初の記述 (Kanner, 1943) 以来, 自閉症スペクトラム障害 (ASD) の決定的な特徴である [1]_ . また, 診断上の慣例として, ASD をもつすべての個人は “制限された反復的で常同的な行動, 興味, 活動のパターン” (Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition [DSM-IV], American Psychiatric Association [APA], 1994:71) を何らかの形で示す. ASD は広範囲の特定の形態の非定型な繰り返しと関連しているが, この問題は社会性やコミュニケーションの障害に比べてはるかに研究の注目を集めてこなかった. 確かに, 我々の元々の意図は, 高機能 ASD (ASD-HF) の子供の韻律を反復行動の有無という視点から調べることではなかった. 我々はむしろ言語行動の文脈の中での ” 韻律 ” に注目しており, ASD-HF の個人の ” 異なった ” 韻律の表出が, ASD のもう一つの中核的特徴である “言語やコミュニケーションの遅れや欠陥” を反映しているのか否かを調べようとしていた.

しかし, 我々の研究で収集されたデータはこの問題を焦点として浮かび上がらせ, ASD における制限された反復行動の中心性に関する新たな疑問を提起した.

This chapter is based on results and insights from linguistic research. This research (Green, 2010), comparing and contrasting the prosodic features of 20 peer-matched 9-13 year old male Israeli Hebrew-speaking participants (10 ASD-HF subjects and 10 controls without developmental disorders (WDD)), strongly indicated that the prosodic features that were examined exhibited a limited and repetitive repertoire in the ASD-HF population compared with the prosodic features of the WDD control population (Green, 2005; Green & Tobin 2008a, b, 2009 a, b, c; Green, 2010). Furthermore, this significant limited repetitive repertoire of behavior patterns was also exhibited in the extra-linguistic and the linguistic (lexical) domains of the ASD-HF participants.

注釈

本章は言語学的研究の結果と洞察に基づくものである. この研究 (Green, 2010) では, ペアマッチされた 9-13 歳のイスラエル・ヘブライ語を話す男性参加者 20 名 (ASD-HF の被験者 10 名と発達障害のないコントロール (WDD) 10 名) の韻律的特徴を比較・対比した. その結果, 検討した韻律的特徴は, WDD コントロール群の韻律的特徴と比べて, ASD-HF 群では制限された反復的なレパートリーを示すことが強く示された (Green, 2005; Green & Tobin 2008a, b, 2009 a, b, c; Green, 2010). 加えて, この有意に制限された反復的な行動パターンのレパートリーは, ASD-HF 参加者の非言語的および言語的 (語彙) 領域においても現れた.

2. The experimental research

As already noted, this chapter is based on experimental research and deals with the “restricted repetitive behavior” phenomenon. In the original linguistic-oriented research there were four major goals:

  1. To describe, compare and contrast the phonetic realization of the fundamental frequency and the prosodic features of intonation in the language of children with ASD-HF and WDD children,

  2. To establish a methodology which allows the analysis of more than one feature of prosody simultaneously,

  3. To make use of instrumental measurements, (using recently developed speech technology tools) as well as perceptual analysis, and

  4. To explain the results within the context of the theory of Phonology as Human Behavior (PHB) (e.g. Diver, 1979, 1995; Tobin, 1997, 2009), a linguistic theory which declares that:
    1. Language is a symbolic tool, whose structure is shaped both by its communicative function and the characteristics of its users (Tobin, 1990, 1993, 1994, 2009), and
    2. Language represents a compromise in the struggle to achieve maximum communication using minimal effort as presented in the theory of Phonology as Human Behavior (PHB) (Diver, 1979,1995; Tobin, 1997, 2009).

Our empirical data were drawn from the speech samples of 20 children between the ages of 9 and 13 years, in two main groups:

  1. Research group: subjects diagnosed clinically with ASD-HF (N=10)
  2. Control group: participants without developmental disorders (WDD, N=10)

The research group includes ten children with ASD aged 9-13 years. They were recruited from mainstream schools that have special education classes for children with ASD. The ASD diagnosis was made by a child psychiatrist who determined that the child met the DSM-IV, APA (1994) criteria for autism. Each child’s special needs were discussed and defined by an “Evaluation Committee”, entrusted with the placement of special needs pupils in appropriate class settings. For all of the children in this group the committee determined that a special class for children within the ASD spectrum was required. IQ scores were re-assessed by the school psychologist within the current year using the Wechsler Intelligence Scale for Children - Revised Edition [WISC-R]. For the purposes of this research, High Functioning is defined by an IQ of 85 and above. All ASD subjects have typical, within-average school performance in the mainstream class in language and reading, as reported by their teachers.

The control group was composed of children without developmental disorders (WDD) and was drawn from the same schools as the research group. Similar to the research group subjects, in their teachers’ judgment, all the children are average students and do not exhibit any particular academic difficulties or exceptional abilities. The group members have not been tested to determine their IQ scores, but from the information received in interviews with the teachers and parents it can be assumed that they have intelligence in the normal range. Their parents report that they have not been referred to a specialist for any developmental reasons. In our study, two language measures were used for peer matching. In addition to similar chronological ages (within two months), the peers were matched on the basis of (a) language fluency in a spontaneous-speech sample as measured in MLU-W (within the minimal linguistic unit of one word) and (b) a standardized score on the verbal part of the IQ test within the norm or above. Matches between subjects and controls are presented in Table 1.
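
(訳者による参考スケッチ) ここで使われている MLU-W(語単位の平均発話長)の計算自体は単純で, 例えば以下のような形になると思われる. 発話の区切り方や語の数え方は説明のための単純化であり, 原著の手続きそのものではない.

    # 参考: MLU-W (mean length of utterance in words) を計算する最小限の例
    def mlu_w(utterances):
        """utterances: 発話ごとの語リストのリスト."""
        total_words = sum(len(u) for u in utterances)
        return total_words / len(utterances)

    # 仮のデータでの使用例
    sample = [["ani", "ohev", "mada'im"], ["be'iqar", "mada'im"]]
    print(mlu_w(sample))  # -> 2.5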

The analysis of the speech samples of this group provides the basis for the characterization of the prosodic features of Israeli Hebrew (Green, 2009a; Green & Tobin 2009b, c; Green, 2010). All participants were male and at least second generation Israeli-born, and were monolingual speakers of Israeli Hebrew (IH). All participants were from comparable socioeconomic backgrounds and attend mainstream schools. Their mothers all have at least 12 years of education, an indication of socioeconomic status, since maternal education level is the most significant predictor of language functioning in children (Dollaghan et al., 1999). None of the members of the participants’ immediate families has learning or other known disabilities.

Research group (ASD-HF subjects)                 Control group (WDD)
Subject    Mo   VIQ   PIQ   Age     MLU-W        Subject    Mo   Age     MLU-W
1-ADR      13   109    94   9:0     5.62         11-AVY     13   9:0     5.28
2-ITE      17   121    91   9:2     5.34         12-NIS     15   9:3     4.04
3-UDX      15   108    94   10:8    7.86         13-IDR     17   10:7    6.94
4-YOL      17   111   101   11:1    5.77         14-YVO     16   10:11   5.16
5-RAE      12    86    97   11:6    5.68         15-ITS     17   11:6    4.04
6-BAB      17   109    97   11:11   7.42         16-AVS     16   11:9    7.18
7-ETR      16   100   102   12:5    5.14         17-LIS     13   12:3    5.22
8-TOB      15   108    99   12:8    6.42         18-IDW     16   12:6    6.20
9-NOR      14    90    99   13:0    6.5          19-OMX     14   12:11   5.86
10-OMG(*)  14    89    85   13:0    4.8          20-IDS     17   12:11   6.1

Mo=Mother’s years of Education, VIQ/PIQ= verbal/performance Intelligence score (WISC-R) (*) Participants 10 and 20 do not meet the requirements of the definitions used for peer matching, and are consequently excluded from comparison between the groups. Their results are, however, included when the discussion is about differences within the group.

Table 1. Matched peers and subjects’ characteristics

The speech samples were collected at the participant’s house, in his own room. There were three types of elicitation tasks: (a) Repetition: this task comprised four sentence pairs, a WH-question and its answer, (b) Reading Aloud: participants were asked to read a short story, and (c) Spontaneous speech: these were elicited spontaneous speech sequences in response to open questions, relevant to the child’s daily life. In order to conduct acoustic analyses the speech files were digitized at a rate of 44.1 kHz with 16-bit resolution, directly into a laptop computer (HP Compaq 6710b), using the speech-recording software Audacity (a software package for recording and editing sound files) and a small microphone. The data were subsequently analyzed using the speech analysis program Praat, version 5.0.30 (computer program, from http://www.praat.org/). Scripts were written to extract data from the transcriptions. A script is a short program that is used to automate Praat activities and enables the analysis of large data sets, quick processing of information and results, preparation for the use of simple statistics tools, and generation of summary information for control purposes, i.e. to identify errors in the manual transcription process.
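
(訳者による参考スケッチ) 本文で言及されている Praat スクリプトそのものではないが, Praat を Python から操作できる praat-parselmouth を使うと, F0 抽出の部分はおおよそ次のように書けると思われる. ファイル名 sample.wav は仮のものである.

    # 参考: parselmouth (Praat の Python バインディング) で F0 を取り出す例
    import parselmouth  # pip install praat-parselmouth

    snd = parselmouth.Sound("sample.wav")    # "sample.wav" は仮のファイル名
    pitch = snd.to_pitch()                   # 既定の設定で F0 を推定
    f0 = pitch.selected_array["frequency"]   # 各フレームの F0 (無声区間は 0)

    voiced = f0[f0 > 0]
    print("F0 mean = %.1f Hz, min = %.1f Hz, max = %.1f Hz"
          % (voiced.mean(), voiced.min(), voiced.max()))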

3.Restricted repetitive behavior

Restricted repetitive behaviors are a heterogeneous group of behaviors and a wide range of specific forms of atypical repetition that have been identified and described with relation to ASD (e.g. APA, 1994; Bodfish et al., 2000; Esbensen et al., 2009; Kanner 1943; Lewis & Bodfish, 1998; Militerni et al., 2002; Richler et al., 2007; Rutter, 1996; Szatmari et al., 2006; Turner, 1999). This restricted repetitive behavior can be observed across individuals with ASD, and multiple categories of abnormal repetition can occur within the individual with autism (e.g. Lewis & Bodfish, 1998; Wing & Gould, 1979). These behaviors can be socially inappropriate and stigmatizing, and can increase the likelihood of living in a more restricted environment (Bonadonna, 1981; Durand & Carr 1987; Varni et al., 1979).

Several researchers who examined age related aspects of repetitive behavior patterns in ASD suggested that age and level of functioning are associated with variation in the manifestation of restricted repetitive behaviors in individuals with ASD (e.g. Esbensen et al., 2009; Militerni et al., 2002; Lam & Aman, 2007). The overall severity of the ASD has been shown to be significantly positively correlated with the overall severity of repetitive behaviors (e.g. Campbell et al., 1990; Prior & MacMillan, 1973). Esbensen et al. (2009) examined the restricted repetitive behaviors among a large group of children and adults with ASD in order to describe age related patterns of symptom expression and examine if age related patterns are different for the various types of restricted repetitive behaviors. In this research, they combined data from several previous studies to have a large sample size (n = 712), spanning a broad age range (age 2–62), and they measured restricted repetitive behaviors using a single instrument, the Repetitive Behavior Scale-Revised (RBS-R: Bodfish et al., 2000) with the modification of the subscales (Lam & Aman, 2007). The empirically derived subscales include: Stereotyped Behavior (movements with no obvious purpose that are repeated in a similar manner), Self-injurious Behavior (actions that cause or have the potential to cause redness, bruising, or other injury to the body), Compulsive Behavior (behavior that is repeated and performed according to a rule or involves things being done ‘‘just so’’), Ritualistic/sameness Behavior (performing activities of daily living in a similar manner; resistance to change, insisting that things stay the same), and Restricted Interests (limited range of focus, interest, or activity). Their analyses suggest that repetitive behaviors are less frequent and less severe among older individuals than among younger individuals, regardless of whether one examines the total display of restricted repetitive behaviors or each of the various subtypes. One may ask whether restricted repetitive behaviors decrease with age or whether they merely take a different form, a thought previously raised by Piven et al. (1996). Piven’s idea was that the manifestation of ASD changes as the individual develops.

Other research has suggested that the expression of restricted repetitive behaviors may be influenced by level of functioning (e.g. Bartak & Rutter, 1976; Campbell et al., 1990; Gabriels et al., 2005; Le Couteur et al., 2003; Turner, 1999). Low IQ or presence of mental retardation has been shown to be associated with increased occurrence of repetitive behaviors in autism including stereotypy and self-injury (Bartak & Rutter 1976; Campbell et al., 1990). Turner (1997) proposed a taxonomy of repetitive behavior; consisting of eleven categories and in a later review (Turner, 1999) suggested that human repetitive behaviors can be divided into (a) lower-level and (b) higher-level categories. Lower-level repetitive behaviors include dyskinesia (involuntary, repetitive movements), tics, repetitive manipulation of objects, repetitive forms of self-injurious behavior and stereotyped movements. Turner’s review indicates that although some stereotyped movements and repetitive manipulation of objects might be differentiating features of autism, there are some lower-level repetitive behaviors that may rather be related to ability level or the presence of organic pathology (e.g. Bishop et al., 2006; Bodfish et al., 2000; Cuccaro et al., 2003; Esbensen et al., 2009; Fecteau et al., 2003; Militerni et al., 2002; Lam & Aman, 2007; Szatmari et al., 2006). Irrespective of whether these low-level repetitive behavioral characteristics are unique to ASD or exist in a wider range of organic pathological conditions, they are all repetitive extra-linguistic behaviors.

The high-level repetitive behaviors include circumscribed interests, attachments to objects, insistence on maintenance of sameness and repetitive language. Turner (1999) suggested that certain types of higher-level behavior may be characteristic of and restricted to individuals with ASD once a certain level of development has been achieved.

3.1 Repetitive language behavior

During the data analysis phase we could not ignore the proliferation of word repetition and repetition of contents. Repetitive language behavior has been reported in the literature (e.g. Perkins et al., 2006), but as far as we can determine there has not been a comprehensive study of questions raised by this phenomenon. The following is an example of the lexical repetition found in the spontaneous speech of BAB-ASD (age 11:11) regarding his “interest” (hitanyenut) in the “sciences” (mada‘im). The data are taken from sequential utterances in the same short conversation:

U3: [ ani mi# hahit‘anyenut sheli be‘ika(r) mada‘im ]
I from my INTEREST ESPECIALLY LIKE SCIENCE
U4: [ hit‘anyenti bemada‘im kvar begil ca‘r ]
I was INTERESTED in SCIENCE since I was young
U5: [ meod ahavti mada‘im ]
I LIKED very much SCIENCE
U6: [mada‘im # shama‘ati shemada‘im # ze ha‘olam shemisvivenu]
SCIENCE—I heard that SCIENCE is the world around us
U13: [ ani ohev et kol hamikco‘ot aval be‘iqar mada‘im ]
I LIKE all the subjects BUT ESPECIALLY SCIENCE
U14: [ be‘iqar mada‘im ]
ESPECIALLY SCIENCE
U15: [ ani yoter beqeTa shel mada‘im ]
I am more into SCIENCE
3.2 Repetitive prosodic behavior

The term ‘prosody’ is derived from the Greek ‘prosodia ’, which is a musical term. Metaphorically, in linguistic contexts, it is implied that prosody is the musical accompaniment to the words themselves. The term “prosody” describes the way one says a particular utterance and covers a wide range of phenomena including: intonation patterns, stress and accent, and pauses and junctions, etc. in speech.

Atypical prosody has been reported in a wide range of developmental conditions including dysarthria (e.g. Brewester, 1989; Crystal, 1979; Vance, 1994), aphasia (e.g. Bryan, 1989; Cooper & Klouda, 1987; Moen 2009), hearing impairment (e.g. Parkhurst & Levitt, 1978; Monsen, 1983; Most & Peled, 2007), developmental speech and language disorders and/or learning disabilities (e.g. Garken & McGregor, 1989; Hargrove, 1997; Hargrove & McGarr, 1994; Wells & Peppé, 2003), Williams Syndrome (e.g. Setter et al., 2007; Stjanovik et al., 2007), and ASD.

In ASD, atypical prosody has been identified as a core feature, and since the initial descriptions by Kanner (1943) and Asperger (1944, as cited in Frith 1991), the “unnatural” prosody has been described with various labels such as “monotonous”, “odd”, “sing-song”, “exaggerated”, and more. Asperger, translated in Frith (1991), wrote: “Sometimes the voice is soft and far away, sometimes it sounds refined and nasal but sometimes it is too shrill and ear-splitting. In yet other cases, the voice drones on in a sing-song and does not even go down at the end of the sentence. However many possibilities there are, they all have one thing in common: the language feels unnatural” (Frith, 1991:70)

Research on prosody within the ASD population, has shown that even when other aspects of language improve, prosodic deficits tend to be persistent and show little change over time (e.g. Kanner, 1971; Simmons & Baltaxe, 1975). This persistence of prosodic deficits seems to limit the social acceptance of children with ASD-HF mainstreamed into the larger community since they sound strange to their peers (McCann & Peppé, 2003; Paul et al., 2001).

Adapting Fujisaki’s definition, “Prosody is the systematic organization of various linguistic units into an utterance or coherent group of utterances in the process of speech production. Its realization involves both segmental and suprasegmental features of speech and serves to convey not only linguistic information, but also paralinguistic and non-linguistic information” (Fujisaki, 1997:28). By this definition, Fujisaki characterized prosody through two major components that can be measured: (a) the word accent and (b) the intonation, both of which are manifested in the contour of the voice F0 (the frequency of the vibration of the vocal folds). Hence, in order to understand the results and the insights from the presented research, we will first explore the nature of these two components from both a conceptual and an operative view.

3.2.1 Word accent and the intonation

Bolinger (1958) formulates the relations of stress-accent. He argues that the main means to express stress is pitch and proposed the term accent for prominence in the utterance. Following Bolinger, Pierrehumbert (1980) represents the F0 contour as a linear sequence of phonologically distinctive units - pitch accents and edge tones. The occurrence of these features within the sequence can be described linguistically as a grammar, within the Autosegmental-Metrical (AM) theory (Ladd, 1996; Liberman & Pierrehumbert, 1984; Pierrehumbert, 1980; Pierrehumbert & Hirschberg, 1990).

The AM theory is a generative phonological framework in which the tone is specified using an independent string of tonal segments and the prosody of an utterance is viewed as a hierarchically organized structure of phonologically defined features. Following the AM theory, Pierrehumbert (1980) proposes a description of intonation that consists of three parts:

  1. The grammar of phrasal tones,
  2. The metrical representation of the text,
  3. The rules of assigning association lines.

Pierrehumbert assumes that the tonal units are morphemes of different kinds and that phonetic rules translate the abstract representations into concrete F0 contours. Thus, phonological aspects of intonation can be categorized according to the inventory of the phonological tones and to the meanings assigned to the phonological tones of a specific language. However, it is the ToBI (Tones and Break Indices; Beckman & Hirschberg, 1994; Beckman & Ayers, 1997) transcription system that was designed for this representation of the phonological tones within the AM theory.

ToBI was first designed for Mainstream American English and then expanded into a general framework for the development of prosodic annotation systems of different typological languages (Jun, 2005). ToBI has been applied to a wide variety of languages that vary geographically, typologically and according to their degree of lexical specifications, and to tone languages. For the purpose of the present research, an IH-ToBI was established in order to create a systematic procedure for transcribing data for Israeli Hebrew (Green, 2009a, 2010; Green & Tobin, 2008a).

3.2.2 The inventory of the IH prosodic features of Intonation (IH-ToBI)

The starting point for the analysis of the prosodic pitch contour i.e. intonation, is the notion of an intonation unit. This unit can be defined by its phonetic-phonological characteristics: (a) there is a “unity of pattern” within the intonation unit i.e. the intonation unit has a distinct intonation pitch pattern, and (b) the intonation unit is delimited by a boundary tone. In IH-ToBI, the intonational structure of the utterance is represented by a “Tone Tier” and three types of tonal events can be identified: (a) pitch accents, the event that associates with stressed syllables and two types of phrasal tones; (b) phrase accents and (c) boundary tones. Therefore, on the “Tone Tier” the perceived pitch contour is transcribed in terms of: (1) Pitch Accents (PAs) (2) Phrase Accents, and (3) Edge Tones (the last phrase accent and the final boundary tone).

In IH-ToBI every intonational phrase contains at least one Pitch Accent (PA). PAs are localized pitch events that associate with the stressed syllable, but in contrast to stress (which is lexically determined), in the tone domain it is not expected that every stressed syllable will be accented. In the AM theory, PAs are perceptually significant changes in F0 aligned with particular words in an utterance, and give them prominence. IH-ToBI identified five PAs: two mono-tonal high (H) and low (L) tonal patterns, H* and L*, and three bi-tonal patterns, L+H*, H*+L and L*+H. As in descriptions of other languages, the H and L tones are described as high or low relative to each speaker’s pitch range. The H*, a high pitch accent starting from the speaker’s middle range and realized as an F0 peak preceded by a small rise, is by far the most frequently used pitch accent in IH.

Phrase accents and Boundary tones: IH-ToBI identifies two levels of phrasing: (a) the intermediate phrase and (b) the intonation unit. Each intonation unit contains at least one intermediate phrase. The edge tones for these phrases determine the contour from the last tone of the last pitch accent until the end of the phrase. There are two types of phrase accents in IH: (a) ‘Hp’ and (b) ‘Lp’. Hp has roughly the same F0 value as the peak corresponding to the most recent H tone, which creates a plateau at the end of the phrase. Lp can either be a F0 minimum low in the range, or be down-stepped in relation to a previous tone.

Concerning the boundary tones, IH-ToBI identified three types: (a) an initial boundary tone ‘%’, (b) a high boundary tone ‘H%’, and (c) a low boundary tone ‘L%’. The two final boundary tones combine with the phrase accents in four different combinations i.e. the last intermediate phrase accent (Hp or Lp) combines with the intonational boundary tones to yield the configuration of LpL%, LpH%, HpL% or HpH%. These boundaries appear to have specific pragmatic functions. By analyzing the distribution of these configurations appearing in the spontaneous speech and the reading aloud corpus of our data, it was evident that LpL% is the most frequently used boundary tone in IH and the L-boundary tone signals finality. The absence of finality i.e. signaling a continuation, is marked by a high (H) boundary tone or high phrase accent (Hp) with a L-boundary tone i.e., LpH%, HpH%, HpL%.
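
(訳者による参考スケッチ) ここで述べられている edge tone の組み合わせ(LpL%, LpH%, HpL%, HpH%)の分布は, ラベル付けされたデータがあれば次のように数えられると思われる. 入力のデータ形式は説明のための仮定である.

    # 参考: 各イントネーション単位の末尾の (句アクセント, 境界音) から
    # edge tone パターンの分布を数える例. データは仮のもの.
    from collections import Counter

    edge_tones = [("Lp", "L%"), ("Lp", "L%"), ("Lp", "H%"),
                  ("Hp", "L%"), ("Lp", "L%"), ("Hp", "H%")]

    counts = Counter(pa + bt for pa, bt in edge_tones)
    total = sum(counts.values())
    for pattern in ("LpL%", "LpH%", "HpL%", "HpH%"):
        n = counts.get(pattern, 0)
        print("%s: %d (%.0f%%)" % (pattern, n, 100.0 * n / total))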

To conclude, the richness of the prosodic features (five pitch accents, two phrase accents, three boundary tones and all their possible combinations) serves as the basis for comparing and contrasting the speech prosody of the ASD-HF subjects with that of their peers - the WDD controls.

Regarding our investigation of the realization of pitch accents in the speech of children with ASD-HF, our research concentrated on three variables to be analyzed: (1) frequency of high PAs occurrences, (2) distribution of the different IH PAs, and (3) PAs per word (PAs/W), followed by a case investigation of one subject and his matched peer, in order to explore the differences found at the lexical word level.
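
(訳者による参考スケッチ) ここで挙げられている3つの変数(高い PA の頻度, PA タイプの分布, 語あたりの PA 数 PAs/W)は, 語ごとに PA ラベルを付けたデータがあれば次のように計算できると思われる. データ構造と「H* を含む PA を高い PA とみなす」という扱いは説明のための仮定である.

    # 参考: 語ごとの PA ラベルから (1) 高い PA の割合, (2) PA タイプの分布,
    # (3) PAs/W を計算する例. データ構造は説明のために仮定したもの.
    from collections import Counter

    words = [
        {"word": "yom",     "pas": ["H*"]},
        {"word": "baxacer", "pas": ["H*", "L+H*"]},  # 1語に複数 PA が付く例
        {"word": "lefet'a", "pas": []},              # PA なしの語
        {"word": "kadur",   "pas": ["L*"]},
    ]

    all_pas = [pa for w in words for pa in w["pas"]]
    high_pas = [pa for pa in all_pas if "H*" in pa]  # 仮定: H* を含む PA を「高い PA」とする

    print("high PA ratio  :", len(high_pas) / len(all_pas))
    print("PA distribution:", Counter(all_pas))
    print("PAs per word   :", len(all_pas) / len(words))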

We found that the children with ASD-HF produced more high PAs than the control group of WDD children in both the reading aloud and spontaneous speech elicitation tasks, without statistical significance, but with high standard deviation within the research group. This high standard deviation shows that the variability within the ASD-HF research group is much greater than that within the WDD control group. In a comparison of peers within the groups, in seven of the nine matched peers, the ASD-HF participant showed a greater use of high PAs in the spontaneous speech task and in six matched peers, the ASD-HF participant shows a greater use of high PAs in the reading aloud task.

In the WDD control group only two participants demonstrate above 80% use of high PAs, while in the ASD-HF research group four participants produced above 80% use of high PAs. No participants in the ASD-HF research group produced less than 70% high PAs while in the WDD control group there are three participants with less than 70% high PAs. The differences arise when comparing the research group and the controls as a group in contrast with a comparison of peers – as a “group of case-studies”. These intergroup differences lead to the conclusion that the characteristic of heterogeneity (e.g. Beglinger, 2001; Firth, 2004; Happe’ & Frith, 1996a) within the ASD classification has methodological implications for research procedure in general and in the present research in particular: i.e. it was the aggregation of peer comparisons that motivated the exploration of the prosodic behavioral features in the group of subjects diagnosed with ASD-HF.

Concerning the PA prosodic feature, the most prominent results deal with PAs/W and the placement of PAs. In a peer-case-investigation it was evident that the ASD-HF subject produced PAs in function words almost twice as often as his matched peer did, and in particular produced more than one PA per word, while his WDD peer hardly ever added more than one PA to a word (15.69% of the words in the ASD-HF speech sample and 1.96% of the words in his WDD peer’s speech sample). These results are illustrated in the following example, a sentence from the reading aloud elicitation task: yom ‘exad yac’a ‘Orit lesaxek ba-xacer lefet’a ra’ata kadur qatan umuzar munax ba-gina. (Translation: One day Orit went to play in the yard and suddenly saw a small, strange ball in the garden). In this example, the ASD-HF subject (1a below) produced the sentence with three intonation units. Every word has a PA. Function words (FW) are emphasized with a PA as well as content words (CW). The words /baxacer/ (yard) and /qatan/ (small) have two PAs each. In contrast, the matched peer (1b below) produced the same sentence with only two intonation units. Not every word has a PA and none of the words has more than one PA.

(1a) 1-ADR-ASD IU-1: /yom ‘exad yac’a ‘Orit lesaxeq / Gloss: day one to go out (name) to play

FW CW CW CW

IU-2: / ba- xacer/ Gloss: in+yard

FW+CW

IU-3: /lefet’a ra`ata Kadur qa-Tan umuzar munax ba- gina/ Gloss: suddenly to see ball small and+strange placed in+garden

FW CW CW CW FW+CW CW FW+CW

(1b) 11-AVY-WDD IU-1: /yom ‘exad yac’a ‘Orit lesaxek ba-xacer/ IU-2: /lefet’a ra`ata kadur qatan umuzar munax ba-gina/

As was previously found (e.g. Baltaxe, 1984; Baltaxe & Guthrie, 1987; Fosnot & Jun, 1999; MacCaleb & Prizant, 1985), and as extended in the current research, it can be concluded that among the ASD individuals who exhibit atypicality in prosody, ‘accents’ are likely to be affected.

Regarding the investigation of boundary tones and phrase accents, a variation in the distribution of edge contour patterns arises when comparing the edge contours of the matched peers within the groups. The research group subjects may be divided into two sub-groups:

  1. ASD-HF subjects that produced a full repertoire of edge contour patterns, similar to the control group (4 subjects). Figure 2 is an example of the full repertoire prosodic behavior by the ASD-HF subject compared with his matched peer in the spontaneous speech elicitation task.
  2. ASD-HF subjects that produced a limited, repeatedly used repertoire of the edge contour patterns: of the nine matched peers, five subjects did so in the spontaneous speech elicitation task. Figure 3 presents the distribution of the edge contour patterns of two subjects in the reading aloud elicitation task and Figure 4 presents the results of five subjects in the spontaneous speech elicitation task.

Fig. 2. Full repertoire of edge contour patterns. This figure shows the comparison between the edge contour patterns of 7-ETR-ASD, age: (12:5) and his matched peer LIS-WDD, age: (12:3) in the spontaneous speech task. The ASD-HF subject uses the same patterns as his peer and has the full repertoire of edge contour patterns (Green, 2010:106)

Fig. 3. Limited repeatedly used repertoire in the reading aloud elicitation task (Green, 2010:106)

Fig. 4. Limited repeatedly used repertoire in the spontaneous speech elicitation task (Green, 2010:107)

In conclusion we will emphasize certain aspects that were manifested in the present study. The starting point of our study was the need to characterize the prosodic features of children diagnosed with ASD-HF who are mainstreamed in regular schools. The research established a methodology which allows the analysis of more than one feature of prosody simultaneously, and described, compared and contrasted the phonetic realization of the fundamental frequency and the prosodic features of intonation in the language of 10 children with ASD-HF and 10 WDD children. By using recently developed speech technology tools, we performed an extensive investigation of the prosody of children with ASD-HF between the ages of 9-13 years. The speech sample analysis yielded quantitative results of group comparison, peer comparison and of subjects within the ASD-HF group.

The peer comparison highlights the greater variation within the ASD-HF subjects, as compared with their peers and among themselves. From this study we can conclude that not all ASD-HF subjects present atypicality in each of the different prosodic features examined, but no subject performed in the same way as his WDD peer.

It was found that ASD-HF subjects produce more high PAs and fewer low PAs. If the variations in intonation are a result of differences in the kinds of PAs and transitions between the prominent components, then when the prominence in the ASD-HF subjects rests on a more frequent single high PA and there are consequently fewer transitions, a monotonous accent is created. The ASD-HF subjects present repetitive behavior expressed in the use of pitch accents within a word - a repetitiveness that was not observed in the control group.

One of the most significant findings concerns the use of edge tones, i.e., the tonal events at the edge of prosodic domains. The ASD-HF subjects primarily use three different edge tone patterns, although they do make a very limited use of all the other patterns. Thus, the problem is not the absence of patterns due to a lack of competence to produce them; rather, it is the nature of the behavior that the ASD-HF subjects exhibited. Although the ASD-HF subjects are capable of producing a wide range of prosodic patterns, they concentrate on a limited repertoire of the most basic prosodic patterns. Both the monotonous accent and the repetitiveness of edge tones create a stiff-sounding prosody in subjects within the ASD-HF group.

Our claim, based on all the data collected and the results of our research, is that the restricted repetitive behavior of the ASD-HF subjects appears in a parallel way across the board in the extralinguistic, paralinguistic (prosody) and linguistic (lexical choice) domains. Thus, Turner’s distinction between higher and lower level behavioral categories may only reflect the observable symptoms of ASD behavior rather than their fundamental motivation. We suggest that the concept of limited and repetitive behavior found on all levels of extralinguistic, paralinguistic and linguistic behaviors in a parallel way among different populations with ASD should play a more central role in research to help us better understand ASD.

4. References

The interplay of linguistic structure and breathing in German spontaneous speech

  • INTERSPEECH 2013
  • Amélie Rochet-Capellan : GIPSA-lab, UMR 5216 CNRS/Université de Grenoble –France
  • Susanne Fuchs : Centre for General Linguistics, Berlin – Germany
Abstract

本稿は自発音声における breath group の言語学的構造と呼吸の運動学の関係を調査した. 26名のドイツ人女性話者をインダクタンス・プレチスモグラフ [1] を用いて記録した. breath group は単一の呼気において産出される発話のインターバルと定義した. それぞれのグループについて,いくつかの言語学的パラメータ(句 [2] の数と種類,シラブル数,言いよどみ)を計測し,対応する吸気を特徴付けた. breath group の平均持続時間はおよそ3.5 secであった. breath group のほとんどは1-3個の句から構成されていた. 約53%はマトリックス句(主節)で開始しており,約24%は埋め込み句で,約23%は不完全な句(継続,反復,ためらい)で開始していた. 吸気の深さと長さは,最初の句のタイプと breath group の長さに応じて変化し,発話計画と呼吸制御の間に相互作用があることが示された. 有声のためらいは話者特有であり,より深い吸気を伴っていた. これらの結果は,自発音声での発話計画と呼吸制御の相互作用をより良く理解するために有益である. また,調査結果はスピーチセラピーや音声技術への応用にも関連している.

This paper investigates the relation between the linguistic structure of the breath group and breathing kinematics in spontaneous speech. 26 female speakers of German were recorded by means of an Inductance Plethysmograph. The breath group was defined as the interval of speech produced on a single exhalation. For each group several linguistic parameters (number and type of clauses, number of syllables, hesitations) were measured and the associated inhalation was characterized. The average duration of the breath group was ~3.5 s. Most of the breath groups consisted of 1-3 clauses; ~53% started with a matrix clause; ~24% with an embedded clause and ~23% with an incomplete clause (continuation, repetition, hesitation). The inhalation depth and duration varied as a function of the first clause type and with respect to the breath group length, showing some interplay between speech-planning and breathing control. Vocalized hesitations were speaker-specific and came with deeper inhalation. These results are informative for a better understanding of the interplay of speech-planning and breathing control in spontaneous speech. The findings are also relevant for applications in speech therapies and technologies.

注釈

Index Terms

spontaneous speech, breathing kinematics, breath group, inhalation pauses, syntactic clause, hesitation

訳者注

[1]plethysmograph : 呼吸用の計測器っぽいです.
[2]ここで言う「句」は clause(節)のことかと思います.
1. Introduction

数秒のタイムスケールで見ると,音声産出とは,短い吸気ポーズとそれに続く発声を伴う長い呼気の連続である. 一つの呼気において産出される発話のインターバルは一般的に ” breath group ” として定義されている. これは言語学的,コミュニケーション的,生理的な制約に依存している. breath group は韻律や発話の知覚にとっても重要なユニットである[1]. 本稿ではドイツ語の自発音声における breath group を以下の2つの問いに注目して解析した.

  1. Breath group の言語学的な構造はどのようなものか?
  2. この構造は吸気の間に予期されているのか?

On a time-scale of several seconds, speech production is a sequence of short inhalations pauses followed by long exhalations with phonation. The interval of speech produced on a single exhalation is commonly defined as the breath group. It relies on linguistic, communicative and physiological constraints. The breath group is also an important unit for prosody and speech perception [1]. The present paper analyses the breath group in German spontaneous speech with respect to two main questions:

  1. What is the linguistic structure of the breath group?
  2. Is this structure anticipated during inhalation?

吸気の深さと持続時間が,続く breath group の言語学的構造とどのように関係するかは,発話計画と換気の相互作用を反映している[2-8]. これらの関係は読み上げ音声でも自発音声でも調査されてきた. これらの研究は異なる発話タスク(例えば,センテンスやテキストの読み上げ,異なる認知負荷での自発音声など)を含んでおり,異なる方法で呼吸のパラメータを推定している(呼吸音の検出 例えば[2,9],口や鼻からの気流の計測 例えば[10],胸壁の運動のモニタリング 例えば[3-8,11-16],音響的手法と運動学的手法の比較については[17]を参照).

The relation of the inhalation depth and duration to the linguistic structure of the upcoming breath group reflects the interplay of speech-planning with ventilation [2-8]. These relations have been investigated in both read and spontaneous speech. These studies involved different speech tasks (e.g. sentences and texts reading, spontaneous speech with different cognitive load) and estimated breathing parameters with different methods (detection of breath noises, e.g. [2,9]; measurement of the air flow from the mouth and nose, e.g.[10]; monitoring of the kinematics of the chest wall, e.g. [3-8, 11-16], see also [17] for a comparison between acoustic and kinematic methods).

いくつかの研究は,センテンスやテキストの読み上げにおいて,直前の吸気の間に breath group の長さが予期されていることを示している. 吸気の深さと長さはセンテンスの長さと共に増加する[6,11-14,16]. 一方,文読み上げの際の吸気は,次に来る breath group の文法的な複雑さ(句の数)とは明確な関係を持たない[12-13]. テキストリーディングでは,ほぼ100%の吸気ポーズが,句読点や接続語(例えば and など)によって示される文法的な境界において生じる. これらは breath group が統語的に構造化されていることを示す結果である[2-6, 8-10, 15]. テキストリーディングにおいては,吸気の深さも長さも構文上のマークの種類によって異なる(例えば 段落 > ピリオド > コンマ)[5-6].

Several studies show an anticipation of the breath group length during the preceding inhalation for sentence and text reading. The inhalation depth and duration increase with the sentence length [6, 11-14, 16]. Furthermore, inhalations in sentence reading are not clearly related to the syntactic complexity (number of clauses) of the upcoming breath group [12-13]. In text reading, almost 100 % of the inhalation pauses occurs at syntactic boundaries, indicated by punctuation marks or conjunctions (e.g. and). These results show that the breath groups are syntactically structured [2-6, 8-10, 15]. In text reading, the inhalation depth and duration also differed with respect to syntactic marks (e.g. paragraph > period > comma) [5-6].

自発音声においては,呼吸ポーズはシンタックスのみではなく,言語学的なコンテンツを生成するために必要な認知プロセスにも支配されている[2,4,7-8,15]. このプロセスは発話の流れに言いよどみを持ち込む. 自発音声においては約80%の呼吸ポーズが統語的な構成素の境界で生じる. 吸気の平均的な振幅と持続時間はテキスト読み上げと似ており,次にくる breath group の長さを反映している. breath group の平均持続時間はテキスト読み上げよりも長い[4,7,15,18-19を参照]. これらのパラメータのばらつきの範囲はテキスト読み上げと比較して自発音声のほうが大きい. 自発音声はまた,有声のためらい(uh, um)の産出によって特徴付けられる. これらは様々な機能を持つと想定されており,呼吸とも関連付けられてきた[20,21].

In spontaneous speech, the breathing pauses are not only governed by syntax but also by the cognitive processing required to generate the linguistics content [2,4,7-8,15]. This process introduces disfluencies in the speech flow. In spontaneous speech about 80% of the breathing pauses occur at syntactic constituents; the average amplitude and duration of inhalation are similar to text reading and are reflecting the length of the upcoming breath group. The average duration of breath groups is also longer than in text reading [see: 4, 7, 15, 18-19]. The ranges of variability of these parameters are larger in spontaneous speech as compared to text reading. Spontaneous speech is also characterized by the production of vocalized hesitations (uh, um) that have been assumed to have different functions and have been related to breathing [20, 21].

本稿では,ドイツ語の自発音声における,呼吸の運動学と breath group の言語学的構造との関係を評価した. 先行研究に倣って,我々は breath group の中の統語的な構造(句の数)とシラブルの数を考慮した. breath group 内の句の種類(マトリックス句,埋め込み句)と言いよどみ(ためらい uh, um,繰り返し,言い直しなど)も調査した. breath group における最初の句の種類(マトリックス句か,埋め込み句か)は,言語学的構造に対する吸気の位置の指標である. 言いよどみ,特に有声のためらいと呼吸との関連は,発話計画を含む認知プロセスについて有益な情報を持っている.

This paper evaluates the relationship between the kinematics of breathing and the linguistic structure of the breath group in German spontaneous speech. As in previous studies we consider the syntactic structure (number of clauses) and the number of syllables in the breath group. We also analyzed the type of clauses (matrix, embedded clause) and disfluencies (hesitations – uh, um, repetition, repairs...) in the breath group. The type of the first clause (matrix clause or embedded clause) in the breath group is an indicator of the location of inhalation relative to the linguistic structure. The association of breathing to disfluencies, and especially vocalized hesitations, is informative about the cognitive process involved in speech planning.

2. Experiment
2.1. Subjects

参加者は26人の女性のドイツ語母語話者である(年齢: 25歳(平均) ±3.1(標準偏差), BMI 21.5 ±2.1). 全ての参加者に音声・言語・聴覚障害の既往はない.

The participants were 26 female, native speakers of German (age: 25 years (mean) ±3.1 (standard deviation), body mass index 21.5 ±2.1). All participants had no known history of speech, language or hearing disorders.

2.2. Experimental settings and procedure

参加者は,指向性マイクと2つのスピーカーの正面に立った(Figure 1.A). 自発音声タスクは,より大きな実験プロトコルの一部である. 安静時の呼吸と短い音読を収録したあと,参加者は,ドイツ語の男性あるいは女性のネイティブスピーカーによって読み上げられた10個の短い文章(151±22.1シラブル)の録音音声を注意深く聞くように指示された. 各トラックはスピーカーを通して再生された. それぞれの文章を聞いたあと,参加者は内容の簡単な要約を行った. 呼吸運動の計測を妨げる可能性のある動きを制限するため,参加者は体幹に沿って手を下ろしたままにするように指示された. VCによって誘導される胸郭と腹部の変位を推定するため,手順の最後に肺活量(VC)マヌーバを実施した. その際,被験者はできる限り多くの空気を吐き出し,その後できる限り多くの空気を吸い込んだ.

Participants were standing up in front of a directional microphone and two loudspeakers (Figure 1.A). The spontaneous speech task was part of larger experimental protocol. After a short recording of breathing at rest and short reading, participants were instructed to listen attentively to the audio recordings of ten brief texts (151±22.1 syllables), read by a male or a female native speaker of German. The tracks were played back through the loudspeakers. After listening to each text, participants briefly summarized the story. In order to limit the movements that could interfere with the monitoring of breathing kinematics, participants were instructed to keep their hands along their trunk. Vital capacity (VC) maneuvers were run at the end of the procedure to estimate the displacement of the rib cage and the abdomen induced by VC. To do so, subjects exhaled as much air as they could and then inhaled as much air as they could.

_images/fig14.png

Figure 1: - (A) 実験セットアップ - (B) 吸気フェーズ(I)と呼気フェーズ(E)の呼吸運動の例. - (C) breath groupsのラベリングとシラブル,句数.

  • Hはためらい発話部分を示している.詳細は本文参照.
2.3. Data acquisition, processing and labeling

胸郭と腹部の運動学はインダクタンス・プレチスモグラフ(RespitraceTM)を用いて記録した. バンドのひとつは脇の下(胸郭)のレベルに,もうひとつのバンドはへそ(腹部,Figure 1.A参照)のレベルに装着した. 音響信号と呼吸信号は6チャンネルの電圧データ収集システムを用いて同期収録した. ゲインは胸部・腹部とも,また全ての参加者で同一であった. 全ての信号は11030Hzでサンプリングした.

The rib cage and the abdominal kinematics were recorded by means of an Inductance Plethysmograph (RespitraceTM). One band was positioned at the level of the axilla (rib cage) and the other band at the level of the umbilicus (abdomen, see Figure 1.A). The acoustic and the breathing signals were recorded synchronously by means of a six-channel voltage data acquisition system. The gains were the same for the thorax and the abdomen and for all the participants. All signals were sampled at 11030 Hz.

収録後,呼吸データは200Hzでサブサンプリングし,[1-40Hz]の帯域通過フィルタをかけた. 発話呼吸への胸郭と腹部の寄与は話者によって異なっていた. 一部の話者では腹部の呼吸サイクルが明瞭でなかった. これらの理由から,我々は胸郭と腹部の変位の和を分析した. RespitraceTM は較正されていないため,我々の測定は肺気量に対する胸郭の寄与を腹部の寄与に対して過大評価あるいは過小評価している可能性があり,肺気量の直接的な推定とみなすべきではない[22-23]. 話者間・条件間の比較を可能にするため,変位は各被験者ごとに %MD(最大変位)で表した. MD は VC マヌーバ中の胸郭と腹部の変位(エクスカーション)に対応する. 吸気の開始と終了は,速度プロファイルとゼロ交差を用いて呼吸信号上で自動検出し,その後目視で確認して必要に応じて修正した. 呼吸サイクルは吸気相と呼気相に分割した(Figure 1.B).

After the recording, the breathing data were sub-sampled at 200 Hz and pass-band filtered [1-40Hz]. The contribution of the rib cage and the abdomen to speech breathing varied according to the speaker. For some speakers, breathing cycles were not clear for the abdomen. For these reasons, we analyzed the sum of the rib cage and the abdomen displacements. As RespitraceTM was not calibrated, our measures could over- or sub-estimate the contribution of the thorax relative to the contribution of the abdomen to lung volume and should not be considered as a direct estimation of lung volume [22-23]. To allow comparison between speakers and conditions, displacements were expressed for each subject in %MD (Maximal Displacement). MD was the displacement corresponding to the excursion of the rib cage and the abdomen during the VC maneuver. The onset and offset of inhalations were automatically detected on the breathing signal using the velocity profiles and zero crossing. The detection was then visualized and corrected when required. The breathing cycle was divided into an inhalation and an exhalation phase (Figure 1.B).
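
(訳者による参考スケッチ) 上の段落にある「速度プロファイルとゼロ交差による吸気の開始・終了の自動検出」は,おおまかには次のような処理だと思われる. 平滑化の窓幅などは仮の値で,原著の実装とは異なる可能性が高い.

    # 参考: 胸郭+腹部の和の呼吸信号 (200 Hz) から, 速度の符号反転 (ゼロ交差) で
    # 吸気の開始・終了を検出する例. パラメータは仮の値.
    import numpy as np

    def detect_inhalations(signal, fs=200, smooth_win=0.2):
        """signal: 呼吸変位 (1次元配列). (開始, 終了) のサンプル index のリストを返す."""
        win = max(1, int(smooth_win * fs))
        smoothed = np.convolve(signal, np.ones(win) / win, mode="same")  # 移動平均
        velocity = np.gradient(smoothed) * fs

        rising = velocity > 0
        onsets = np.where(~rising[:-1] & rising[1:])[0] + 1   # 速度が負から正へ = 吸気開始
        offsets = np.where(rising[:-1] & ~rising[1:])[0] + 1  # 正から負へ = 吸気終了

        cycles = []
        for on in onsets:
            later = offsets[offsets > on]
            if len(later):
                cycles.append((int(on), int(later[0])))
        return cycles

    # ダミー信号 (0.25 Hz の正弦波) での使用例
    t = np.arange(0, 20, 1.0 / 200)
    dummy = 10 * np.sin(2 * np.pi * 0.25 * t)
    print(detect_inhalations(dummy)[:3])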

Speech productions were labeled in Praat [24] by detecting the onset and offset of vocalizations and by transcribing the spoken text for each breath group. The vocalized hesitations (e.g. uh, um) and the non-breathing pauses were distinguished (see Figure 1.C). On the basis of this transcription, the number of syllables was derived automatically from the output of the BALLOON toolkit [25]. The syntactic labeling of the breath groups was done by a trained phonetician. The clauses were marked by distinguishing between matrix and embedded clauses. German is a language where the position of the auxiliary verb (verb second or verb final) defines the type of clause. Mainly, the clauses with a verb in a second position were considered as matrix (also called main) clauses and those with a verb final position were considered as embedded clauses. For instance, m-e1-e2 characterized a breath group that included one matrix clause followed by two embedded clauses, with the first one (e1) referring to the matrix clause (m), and the second one (e2) referring to the first embedded clause (e1), see Figure 1.C. The third category, uncompleted clauses (u), included words or groups of words corresponding to hesitations, repetitions or repairs.
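
(訳者による参考スケッチ) 本文の "m-e1-e2" のような統語ラベルから,後の分析で使われる n_clauses と f_clause を取り出す処理は,例えば次のようになると思われる. ラベルの書式は本文の例に合わせた仮定である.

    # 参考: breath group の統語ラベル (例 "m-e1-e2") から
    # 句の数 (n_clauses) と最初の句のタイプ (f_clause) を取り出す例.
    def parse_breath_group_label(label):
        clauses = label.split("-")
        return {"n_clauses": len(clauses),
                "f_clause": clauses[0],
                "clauses": clauses}

    print(parse_breath_group_label("m-e1-e2"))
    # -> {'n_clauses': 3, 'f_clause': 'm', 'clauses': ['m', 'e1', 'e2']}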

2.4. Data selection

Our data set included 1467 breath groups. We discarded 45 groups that were perturbed by laugh, cough or body movements. The number of clauses ranged from 1 to 7 (2.11 (mean) ±1.13 (standard error)). The dataset was restricted to groups with 1-3 clauses. They represented 88% of the observations and were produced by all subjects. Only groups starting with m, e1, e2 or u were considered in this study (99% of the groups with 1-3 clauses).

2.5. Measures and analyses

We estimated: (1) the duration of the breath group (dur_g), as the time interval from speech onset to speech offset; (2) the amplitude (amp_I) and duration (dur_I) of inhalation; (3) the relationship between amp_I and the amplitude of exhalation (amp_IE, amp_I divided by the amplitude of exhalation). This last measure evaluates if speakers exhale more air (amp_IE < 1), less air (amp_IE > 1) or the same amount of air (amp_IE = 1) than they have just inhaled to produce the breath group. This measure could not be taken as an indicator of the reserve volume consumption, as displacements values were not expressed relative to a zero volume.
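
(訳者による参考スケッチ) 2.5節で定義されている dur_g, dur_I, amp_I, amp_IE は,検出済みの時刻と変位 (%MD) があれば次のように計算できると思われる. 変数名と入力形式は説明のための仮定である.

    # 参考: 1 つの breath group についての dur_g, dur_I, amp_I, amp_IE の計算例.
    def breath_group_measures(speech_on, speech_off, inhal_on, inhal_off,
                              disp_inhal_on, disp_inhal_off, disp_exhal_end):
        dur_g = speech_off - speech_on           # breath group の持続時間 [s]
        dur_I = inhal_off - inhal_on             # 吸気の持続時間 [s]
        amp_I = disp_inhal_off - disp_inhal_on   # 吸気の振幅 [%MD]
        amp_E = disp_inhal_off - disp_exhal_end  # 呼気の振幅 [%MD]
        return {"dur_g": dur_g, "dur_I": dur_I,
                "amp_I": amp_I, "amp_IE": amp_I / amp_E}

    # 仮の値での使用例
    print(breath_group_measures(speech_on=1.2, speech_off=4.8,
                                inhal_on=0.5, inhal_off=1.1,
                                disp_inhal_on=20.0, disp_inhal_off=38.0,
                                disp_exhal_end=22.0))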

We considered four main factors: (1) the number of clauses in the breath group (n_clauses, 1, 2, 3); (2) the number of syllables n_syll (continuous factor); (3) the type of first clause f_clause (m, e1, e2, u); (4) the type of hesitation: t_hesi (levels: none, at least one at onset: onset, at least one not at onset: elsewhere).

The effects of n_syll, n_clauses and f_clause on the different parameters were tested as fixed factors effects using Linear Mixed Models (LMM), with subject as a random factor. The interactions between factors were not significant and therefore, additive models were calculated. For dur_I and amp_IE the log values were used to satisfy normality. An analysis of hesitation was introduced in a second step with subject as random factor and n_syll and t_hesi as fixed factors. All the effects reported significant were satisfying the criteria pMCMC <.01.
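
(訳者による参考スケッチ) 原著は pMCMC を報告しているので LMM の当てはめには R (lme4 など) が使われたと思われるが,被験者をランダム効果とする同種のモデルは Python の statsmodels でも次のように書ける. データフレームは仮のダミーデータである.

    # 参考: subject をランダム効果, n_syll / n_clauses / f_clause を固定効果とした
    # 線形混合モデルの例 (statsmodels). データはダミー.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 200
    df = pd.DataFrame({
        "subject": rng.integers(1, 27, n).astype(str),
        "n_syll": rng.integers(1, 50, n),
        "n_clauses": rng.integers(1, 4, n),
        "f_clause": rng.choice(["m", "e1", "e2", "u"], n),
    })
    df["dur_I"] = 0.5 + 0.005 * df["n_syll"] + rng.normal(0, 0.1, n)  # ダミーの吸気持続時間 [s]

    model = smf.mixedlm("np.log(dur_I) ~ n_syll + n_clauses + C(f_clause)",
                        data=df, groups=df["subject"])
    print(model.fit().summary())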

3. Results

Table I. Description of the breath groups according to the number of clauses and to the type of the first clause. NB: Number of breath groups; n_syll: Average number of syllables; dur_g: average duration (± one standard error).

_images/tab1.png
3.1. Linguistic structure of the breath group

The average characteristics of the breath groups and their repartition according to the first clause and to the number of clauses are displayed in Table 1. Speakers produced from 13 to 99 breath groups (47.4 (mean) ±4.5 (standard error), Figure 2). Half of the breath groups (53%) started with a matrix clause (m), a quarter (24%) with an embedded clause and the last quarter (23%) with an uncompleted clause (u). On average, the breath group included ~15.9 syllables (range: 1 to 50), and lasted ~3.5 s (range: .17 to 12.1). The number of syllables and the duration of the groups significantly increased with the number of clauses (~+7.5 syllables and +1.5 s per supplementary clause), but were similar for groups starting with a matrix as compared to an embedded clause. Groups starting with an uncompleted clause were ~6 syllables and 1.1 s shorter than the other groups.

_images/fig24.png

Figure 2. Number of breath groups for each speaker with repartition of groups in: no hesitation (none), at least one hesitation at onset or elsewhere

The percentage of breath groups with vocalized hesitations ranged from 0 to more than 50% according to the subject (average 40%, see Figure 2). Among the breath groups with at least one hesitation (n=482), 40% started with a hesitation. Note that the groups with at least one hesitation not at the onset of the group were longer than the groups starting with a hesitation (~+3 syllables and ~+749 ms) and longer than the groups without hesitation (~+3 syllables and ~+1246 ms). The effects of hesitation type (t_hesi) on the number of syllables and the duration of the group were significant but did not interact with the effect of the first clause.

_images/fig32.png

Figure 3. Correlations between n_syll in the breath group with: dur_I and amp_I and amp_IE, all values (top), average (bottom). Correlations values for amp_IE are indicated for log(amp_IE), see text for details.

3.2. Breathing kinematics

On average, the duration of inhalation was 676 ms (±8.5) and the amplitude was 17.6 %MD (± 0.2). The amplitude and the duration of inhalation depended both on the length of the breath group and on the type of the first clause. These values were also positively correlated with the number of syllables (r = ~.20 for all values and r = ~.60 for average correlations, see Figure 3, first two columns). LMM showed a significant effect of n_syll on both amp_I and dur_I.

_images/fig42.png

Figure 4. Average and standard errors of dur_I, amp_I and amp_IE according to n_clauses and f_clause (white panels) and to the type of hesitation in the breath group (gray panel)

The duration of inhalation (Figure 4.A) significantly increased from 1 to 2 (+26 ms) and 2 to 3 (+36 ms) clauses. Dur_I was also longer for groups starting with a matrix clause as compared to other types of clauses (+197 ms). Inhalation (Figure 4.B) was significantly deeper when the first clause of the upcoming group was a matrix clause (+3.5 %MD) than any other clauses. Yet, amp_I did not significantly depend on the number of clauses. The analysis of the inhalation displacement relative to the exhalation displacement (amp_IE, Figure 3 and 4) shows: (1) that amp_IE was close to 1 for groups with 2 clauses and groups with 15-18 syllables; (2) a significant linear correlation between the logarithm of amp_IE with the number of syllables (-.48 for all values, -.83 for average, significant effect of n_syll); (3) an effect of the number of clauses (1 > 2 > 3); (4) no significant effect of the type of the first clause. Hence, on average, the inhalation displacement was similar to the exhalation displacement for groups with 2 clauses or 15-18 syllables, larger for shorter groups and smaller for longer groups. Inhalations were deeper (+2.54 %MD) and longer (+41 ms) for the breath groups with at least one hesitation as compare with no hesitation (Figure 4). The effect of t_hesi on amp_IE was not significant when the number of syllables was taken into account.

4. Discussion

The present study investigated the linguistic structure of the breath group in German spontaneous speech and evaluated if this structure is reflected in breathing kinematics. The important findings are:

    1. Inhalations occur at syntactic boundaries (before a matrix or an embedded clause) or before a disfluency (uncompleted clause, repetition, hesitation, repair);
    2. Inhalation depth and duration reflect: (2.1) the length of the breath group (number of syllables); (2.2) the type of the first clause, with deeper and longer inhalation for groups starting with a matrix clause as compared to the other groups; (2.3) vocalized hesitations, with deeper and longer inhalations for groups that include at least one vocalized hesitation as compared to none;
    3. Syntactic complexity (number of clauses) is reflected only in the duration but not in the amplitude of inhalation;
    4. On average the amplitude of exhalation is similar to the amplitude of inhalation for groups with 2 clauses or 15-18 syllables.

The observation that most of the inhalation pauses respect the syntactic organization of speech is consistent with previous work on English spontaneous speech [7,15]. The average duration (3.5 s), the number of syllables in the breath group (16 syllables) and the duration of inhalation (~.7 s) are also similar to values reported in the literature on English language ([7,8,15]).

As described in the introduction, previous studies found deeper and longer inhalations for longer utterances. Our dataset also shows these relations. However, we also found that inhalations were deeper and longer for the breath groups starting with a matrix clause and for the groups including hesitations as compared to the other groups. To our knowledge, the relationships of the type of the first clause and of hesitation to inhalation parameters have not been investigated so far for spontaneous speech. This relation is important with respect to the understanding of speech planning. It suggests that speakers inhale more air: (1) when they are starting a matrix clause that may come with other related clauses; (2) when they produce hesitations and do not know exactly what they are going to say. In this case, they can use vocalized hesitations as fillers during the exhalation phase, which could help to preserve ventilation and speech at the same time [21]. The fact that the breath groups with a hesitation at the onset were shorter than groups with a later hesitation shows that when a hesitation came at the onset of the group, the speaker probably inhaled again soon after it.

We also found that groups with an average number of syllables (15-18) show similar exhalation and inhalation amplitudes. These breath groups correspond to 2 clauses and could be a “favored” association between linguistic structure and breathing. This hypothesis should be tested by considering inter-speaker variability and speaker-specific lung volume capacities.

The speech task used in the present study required speakers to summarize a story they had just heard. This task is cognitively demanding and could have influenced the production of hesitations and the breathing profiles. This is in line with the inter-speaker variability we found with respect to the number of breath groups and hesitations produced in the current task. To our knowledge, only [8] has investigated the possible effect of cognitive load on breathing kinematics during spontaneous speech. We think it is important to distinguish between speaker-specific behaviors according to the task (e.g. variation in disfluency, hesitations).

5. Limits and perspectives

This study is a first analysis of a larger corpus of breathing kinematics in German spontaneous speech that now includes more than 50 speakers. Our global aim is to understand the interplay of speech planning and breathing in unconstrained speech. Several initial issues emerge from the current study: (1) it is difficult to distinguish between the effect of the number of syllables and the effect of the number of clauses. Note that the quartiles of the number of syllables (10-15-21) were close to the average number of syllables in 1, 2, and 3 clauses, respectively (10-17-25 syllables); (2) uncompleted clauses should be analyzed in more detail by distinguishing between hesitations, repairs and repetitions, which could have specific effects on breathing; (3) the amplitude of inhalation anticipates the upcoming breath group, but may also rely on what happened before [9]. This may be especially true for groups starting with an embedded clause. The next step is also to characterize the breath group in spontaneous speech not only as an individual unit but as a temporal sequence that depends on the preceding and following speech.

Speaker-specific behavior and context effects should also be considered. Previous studies on read and spontaneous speech found that the properties of the breath group and their relations to inhalation parameters are speaker-specific [10,13] and vary with age [11], cognitive load [8], speech rate [3] and loudness [16,19]. A large variability has also been observed for the same subject across repetitions and according to her emotional state [6-7, 10]. The sensitivity of speakers’ breathing to these multiple influences is important for understanding the interplay between linguistics and respiration and may provide a fundamental tool for pathological diagnostics and speech therapy. Furthermore, implementing breathing in speech synthesis may improve the naturalness of speech synthesizers.

6. Acknowledgements

This work was funded by a grant from the BMBF (01UG0711) and the French-German University to the PILIOS project. The authors wish to thank Jörg Dreyer, Anna Sopronova and Uwe Reichel for their help with data collection and labeling.

7. References
  • [1] Lieberman, P., Intonation, Perception and Language. (1967), Cambridge MA: MIT Press.
  • [2] Henderson, A., Goldman-Eisler, F., & Skarbek, A. (1965). Temporal patterns of cognitive activity and breath control in speech. Lang Speech, 8, 236–242.
  • [3] Grosjean, F. & Collins, M. (1979). Breathing, pausing and reading. Phonetica, 36(2), 98–114.
  • [4] Conrad, B. & Schonle, P. (1979). Speech and respiration. Arch Psychiatr Nervenkr, 226, 251–268.
  • [5] Conrad, B., Thalacker, S., & Schonle, P. (1983). Speech respiration as an indicator of integrative contextual processing. Folia Phoniatr (Basel), 35, 220–225.
  • [6] Winkworth, A. L., Davis, P. J., Ellis, E., & Adams, R. D. (1994). Variability and consistency in speech breathing during reading: lung volumes, speech intensity, and linguistic factors. J Speech Hear Res, 37, 535–556.
  • [7] Winkworth, A. L., Davis, P. J., Adams, R. D., & Ellis, E. (1995). Breathing patterns during spontaneous speech. J Speech Hear Res, 38, 124–144.
  • [8] Mitchell, H. L., Hoit, J. D., & Watson, P. J. (1996). Cognitivelinguistic demands and speech breathing. J Speech Hear Res, 39, 93–104.
  • [9] Bailly, G. & Gouvernayre, C. (2012). Pauses and respiratory markers of the structure of book reading. In Proceedings of Interspeech 2012, Portland, OR.
  • [10] Teston, B. and Autesserre, D. (1987). L’ aérodynamique du souffle phonatoire utilisé dans la lecture d’un texte en français. in International Congress of Phonetic Sciences (ICPhS). Estonia, University of Tallin. p. 33-36.
  • [11] Sperry, E. E. & Klich, R. J. (1992). Speech breathing in senescent and younger women during oral reading. J Speech Hear Res, 35, 1246–1255.
  • [12] Whalen, D. H. & Kinsella-Shaw, J. M. (1997). Exploring the relationship of inspiration duration to utterance duration. Phonetica, 54, 138–152.
  • [13] Fuchs, S., Petrone, C. Krivokapic, J. & Hoole, P. (2013). Acoustic and respiratory evidence for utterance planning in German. Journal of Phonetics 41. 29-47.
  • [14] McFarland, D. H. & Smith, A. (1992). Effects of vocal task and respiratory phase on prephonatory chest wall movements. J Speech Hear Res, 35, 971–982.
  • [15] Wang, Y. T., Green, J. R., Nip, I. S., Kent, R. D., & Kent, J. F. (2010). Breath group analysis for reading and spontaneous speech in healthy adults. Folia Phoniatr Logop, 62, 297–302.
  • [16] Huber, J. E. (2008). Effects of utterance length and vocal loudness on speech breathing in older adults. Respir Physiol Neurobiol, 164, 323–330.
  • [17] Wang, Y.T., Nip, I.S.B., Green, J.R., Kent, R.D., Kent, J.F., & Ullman, C. (2012). Accuracy of perceptual and acoustic methods for the detection of inspiratory loci in spontaneous speech. Behavior Research Methods, 44(4), 1121–1128.
  • [18] McFarland, D. H. (2001). Respiratory markers of conversational interaction. J. Speech Lang. Hear. Res., 44, 128
  • [19] Huber, J. E. (2007). Effect of cues to increase sound pressure level on respiratory kinematic patterns during connected speech. J. Speech Lang. Hear. Res., 50, 621–634.
  • [20] Ferreira, F. and Bailey K. G.D. (2004). Disfluencies and human language comprehension. Trends in Cognitive Sciences, 8(5), 231–237.
  • [21] Schonle, P. W. & Conrad, B. (1985). Hesitation vowels: a motor speech respiration hypothesis. Neurosci. Lett., 55, 293–296.
  • [22] Konno, K. & Mead, J. (1967). Measurement of the separate volume changes of rib cage and abdomen during breathing. Journal of Applied Physiology 22(3), 407–422.
  • [23] Banzett, R. B., Mahan, S. T., Garner, D. M., Brughera, A. & Loring, S. H. (1995). A simple and reliable method to calibrate respiratory magnetometers and Respitrace. Journal of Applied Physiology, 79(6), 2169-2176.
  • [24] Boersma, P. & Weenink, D. (1996). Praat, a system for doing phonetics by computer, version 3.4. Institute of Phonetic Sciences of the University of Amsterdam, Report 132, 182 pages.
  • [25] Reichel, U.D. (2012). PermA and Balloon: Tools for string alignment and text processing. Proceedings of Interspeech, Portland, paper 346.

Weak semantic context helps phonetic learning in a model of infant language acquisition

  • Edinburgh Research Explorer

  • Stella Frank :
    • sfrank@inf.ed.ac.uk
    • ILCC, School of Informatics
    • University of Edinburgh
    • Edinburgh, EH8 9AB, UK
  • Naomi H. Feldman :
    • nhf@umd.edu
    • Department of Linguistics
    • University of Maryland
    • College Park, MD, 20742, USA
  • Sharon Goldwater :
モデルについて
ディリクレ過程事前分布
  • 音素カテゴリのセットと語彙は ノンパラメトリックなディリクレ過程 (DP) 事前分布でモデリングされている.
    • これは, 潜在的には無限個のカテゴリや語彙を返す.

    • 一つの DP は パラメタ \(DP(\alpha, H)\) によって規定される.
  • \(G ∼ DP(\alpha, H)\) からの描画は, \(H\) からの描画のセット上の分布, 例えば \(H\) によって生成されたカテゴリや語彙素のセットに対する離散分布, を返す.

  • 混合モデル設定
    • \(H\) 由来の対応するコンポーネントによって生成されたデータポイントを持つ, カテゴリの割り当てを \(G\) から生成する
    • \(H\) が無限の場合, \(DP\) のサポートもまた無限である.
    • 推論の間は, \(G\) を周辺化して消去する.
IGMM (ベースモデル)
  • 母音学習の先行研究 (de Boer and Kuhl, 2003; Vallabha et al., 2007; McMurray et al., 2009; Dillon et al., 2013) にしたがって, 我々は, 母音トークンはガウス混合分布モデルから生成されると仮定した.
  • Infinite Gaussian Mixture Model (IGMM) は上述の DP を含む - 基本分布 \(H_C\) : Normal Inverse-Wishart 事前分布から描かれる多変量ガウス分布
  • 各観測では, フォルマントベクトル \(w_{ij}\) はカテゴリ割り当て \(c_{ij}\) に対応するガウシアンより描かれる.
  • 上記のモデルはそれぞれの母音トークン \(w_{ij}\) に対するカテゴリ割り当て \(c_{ij}\) を生成する.
  • これが, ボトムアップな分布の情報のみを使用して母音トークンをクラスタ化する IGMM モデルである.
(13)\[\begin{split} \mu_c , \Sigma_c ∼ H_C &=& NIW(\mu_0 , \Sigma_0 , \nu_0 )\end{split}\]
(14)\[\begin{split} G_C &∼& DP (\alpha_c , H_C )\end{split}\]
(15)\[\begin{split} c_{ij} &∼& G_C\end{split}\]
(16)\[\begin{split} w_{ij} | c_{ij} &=& c ∼ N (\mu_c , \Sigma_c )\end{split}\]
Lexicon

LD モデルは, トークンレベルではなく語彙の中でカテゴリを割り当てることで, トップダウンの情報を追加する.

  • 単語トークン = \(x_i\)

  • 単語トークンのフレーム = \(f_i\)
    • これは子音のリストと母音のスロットと, 母音トークン \(w_i\) のリストで構成されている ( TLD モデルでは, 以下に記載する追加の対象も含む ).
  • 母音トークン = \(w_{ij}\) = 第一,第二フォルマントのベクトル

  • 母音カテゴリ = \(w_{ij} =c\)

  • 中間の語彙 = \(f_{\ell}\)
    • これは フレーム \(f_{\ell}\) と 母音カテゴリの割り当て \(\nu_{\ell j} = c\) を含む.
  • 単語トークンがある語彙に割り当てられた場合 = \(x_i = \ell\)
    • その語に含まれる母音は語彙の母音カテゴリに割り当てられる = \(w_{ij} = \nu_{\ell j} =c\)
  • その単語フレームと語彙フレームは一致する = \(f_i = f_{\ell}\) .

注釈

\(f_i\) の具体例

\(x_i\) = “Kitty”

  • 2つの子音音素 /k/, /t/ と, その間に2つの母音スロットを持つフレーム \(f_i = \text{k\_t\_}\) を持ち, 2つの母音ベクトル \(w_{i0} = [464, 2294] \text{ and } w_{i1} = [412, 2760]\) を持つ.

注釈

語彙を導入する利点と欠点

語彙情報はカテゴリの重度な重複の曖昧さをなくすため,音素のカテゴリゼーションを手助けする. \(ae-eh\) 領域のデータポイントの中心を観察する純粋な分布学習者はカテゴリの分布が非常に似ているため, これらすべてのポイントは単一のカテゴリであると割り当てるだろう. しかし, 語彙コンテキストに注目する学習者は \(ae\)\(ae-eh\) 領域の一部で 観察されるというコンテキストと, 一方 \(eh\) は異なる(ただし部分的には重複する) 空間でのみ観察されると いうコンテキストがあるため, 違いに気がつくことができる. そのため, 学習者は2つの異なるカテゴリが異なる語彙のセットにおいて生じるという証拠を持っている.

LDモデルのシミュレーションは音素学習を制約する語彙情報を用いることで大幅に分類精度を向上させることができることを示しており(Feldman et al., 2013a), それはまた、エラーを導くことにもなりうる. 異なる母音で同じ子音のフレームを含む二語のトークン(すなわち、ミニマルペア)では, モデルは、これら二つの母音を同じものとして分類する可能性が高い. したがって, このモデルではミニマルペアの区別は難しい. 一方, 乳児もまたミニマルペアに関しては問題があり (Stager and Werker, 1997; Thiessen, 2007), LD モデルは問題の程度を過大評価できる. 我々は学習者が(子供がそうであるように)その使用の文脈で言葉を関連付けすることができる場合には, その正確な意味を知らなくても ミニマルペアの曖昧性に対する弱い情報源を提供することができると仮定した. これは,学習者が異なる状況のコンテキストで \(k V_1 t\)\(k V_2 t\) という語を聞いた場合, それらの間の語彙的類似性にもかかわらず, それらは異なる語彙項目である可能性が高く (そして \(V_1\)\(V_2\) は異なる音素である) と判断できる.

  • LDモデルでは、母音音素は、語彙から引き出された単語中に現れる.

  • それぞれの語彙素性はフレームと母音カテゴリ \(\nu_{\ell}\) のリストとして表現される.

  • 語彙素性は語彙生成基本分布 \(H_L\) を持つ DP から描かれる.

  • 単語に含まれるそれぞれの母音トークンのカテゴリは語彙素によって決定される.

  • IGMMのように, フォルマントの値は対応するガウシアンから描かれる.

  • \(H_L\) は, 幾何分布から描かれる音素の数と, 二項分布から描かれる子音音素の数とによって, 語彙素性を生成する.

  • 上記の IGMM DP \(\nu_{\ell j} ∼ GC\) から 母音音素 \(\nu_{\ell}\) が生成される一方,
    • 子音はその後, 均一的な基本分布を持つ DP から生成される
  • \(H_L\) からの2つの描画が同一の語彙素になる場合があることに注目.
    • これらは, それにもかかわらず、別々の(同音)語彙素であると考えられる.
(17)\[G_L ∼ DP (\alpha_l , H_L )\]
(18)\[x_i = \ell ∼ G_L\]
(19)\[w_{ij} | \nu_{\ell j} = c ∼ N (\mu_c , \Sigma_c )\]
Topic-Lexical-Distributional Model
  • 状況の情報を利用する利点を実証するために, 我々は 話題-単語-分布モデルを開発した.

  • これはその語が 類似した話題の文章で生じるという Topic モデルを仮定することにより LD モデルを拡張したものである.

  • それぞれの状況 = \(h\)

  • 観察された話題 = \(\theta_h\)

  • ある状況 \(h\) における \(i\) 番目のトークン = \(x_{hi}\)
    • そのフレーム \(f_{hi}\), 母音 \(w_{hi}\), そして Topic ベクトル \(\theta_h\) となる.
  • TLD モデル は IGMM の母音コンポーネントを保ってはいるが, 話題に限定的な語彙によって LD モデルの語彙を拡張した.
    • 語彙素性の確率は話題に由来しているという考えをとったものである.
    • 特に, TLD モデルはディリクレ過程の語彙を, 階層的ディリクレ過程に置き換えたものである(HDP; Teh (2006)).
  • HDP 語彙では, LD モデルの場合のように、トップレベルでグローバルな語彙を生成する.

  • その後, 話題に限定的な語彙をグローバルな語彙の部分集合として取り出す (ただし, グローバルな語彙のサイズに上限がないため, 話題に限定的な語彙にも上限はない)
    • これらの話題に限定的な語彙は LD モデルと似た方法でトークンを生成するために使用される
  • 低レベルの話題語彙の数は固定されている.
    • これらは話題分布を推論するために使用される LDA モデルの中でトピックの数に一致する(6.4章参照)。
    • グローバルな語彙はトップレベル \(DP: G_L ∼ DP (\alpha_{\ell} , H_L )\) として作成される
  • \(G_L\) は話題レベルの DPs \(G_k ∼ DP (\alpha_k , G_L )\) において基本分布として使用される.

  • HDPs を記述するためにフランチャイズの中華料理屋の比喩がよく使われる.
    • \(G_L\) は皿(語彙素性)のグローバルな一覧である.
    • 話題に限定された語彙はレストランであり,それぞれの皿に対する分布を持っている.
    • この分布は座席に座る客(単語トークン)によってテーブルで定義され, これらはそれぞれメニューから一つの皿を給仕する.
    • 同じテーブル \(t\) のすべてのトークン \(x\) は同じ語彙素 \(\ell_t\) に割り当てられる.
    • 推論(5章)は語彙素性ではなくテーブルの点から定義される.
    • 多数のテーブルが同じ皿を \(G_L\) から引く場合、これらのテーブルのトークンは同じ語彙素性を共有します.
(20)\[G_L ∼ DP (\alpha_l , H_L )\]
(21)\[G_k ∼ DP (\alpha_k , G_L )\]
  • TLD モデルでは,トークンは状況のなかに出現し,これらの状況は話題に対する分布 \(\theta_h\) を持っている.
  • それぞれのトークン \(x_{hi}\)\(\theta_h\) から描かれる, 共通するインデックスが付与されたトピックの割り当て変数 \(z_{hi}\) を持っている.
  • \(w_{hij}\) に対するフォルマントの値は LD モデルの場合と同様の方法で描かれ, \(x_{hi}\) に割り当てられた語彙素性を与えられる.
(22)\[z_{hi} ∼ \mathrm{Mult}(\theta_h )\]
(23)\[x_{hi} = t|z_{hi} = k ∼ G_k\]
(24)\[w_{hij} | x_{hi} = t, \nu_{\ell_t j} = c ∼ N (\mu_c , \Sigma_c )\]
_images/fig25.png

左から右へ TLD モデル, IGMMコンポーネント, LD 語彙コンポーネント, 話題に限定的な語彙, 最後に文章 \(h\) に現れるトークン \(x_{hi}\) と 観察された母音フォルマント \(w_{hij}\) と フレーム \(f_{hi}\)

語彙素性割り当て \(x_{hi}\) と, 話題割り当て \(z_{hi}\) は推定され, 後者には観察された文書-話題分布 \(\theta_h\) が使用される. 語彙素性割り当てが与えられれば \(f_i\) は決定論的に定まることに注意. 四角のノードはハイパーパラメータを表す. \(\lambda\) は語彙アイテムを生成する際に \(H_L\) によって使用されるハイパーパラメータのセットである( 3.2 章参照 ).
Abstract

Learning phonetic categories is one of the first steps to learning a language, yet is hard to do using only distributional phonetic information. Semantics could potentially be useful, since words with different meanings have distinct phonetics, but it is unclear how many word meanings are known to infants learning phonetic categories. We show that attending to a weaker source of semantics, in the form of a distribution over topics in the current context, can lead to improvements in phonetic category learning. In our model, an extension of a previous model of joint word-form and phonetic category inference, the probability of word-forms is topic-dependent, enabling the model to find significantly better phonetic vowel categories and word-forms than a model with no semantic knowledge.

注釈

音素カテゴリの学習は言語学習の最初のステップであるが, 音素の分布的情報のみを使用してこれを行うのは困難である. 異なる意味を持つ単語は異なる音素を持つため意味論は潜在的に有用であるが, 音素カテゴリを学習している乳児がどの程度の単語の意味を知っているかは不明確である. 我々は, 現在の文脈における話題の分布という形のより弱い意味論的情報源に注目することが, 音素カテゴリ学習の向上につながることを示す. 我々のモデルは, ワードフォームと音素カテゴリを同時推論する先行モデルを, ワードフォームの確率が話題に依存するように拡張したものであり, 意味論的知識を持たないモデルと比べて有意に優れた母音音素カテゴリとワードフォームを発見できる.

1 Introduction

Infants begin learning the phonetic categories of their native language in their first year (Kuhl et al., 1992; Polka and Werker, 1994; Werker and Tees, 1984). In theory, semantic information could offer a valuable cue for phoneme induction [1] by helping infants distinguish between minimal pairs, as linguists do (Trubetzkoy, 1939). However, due to a widespread assumption that infants do not know the meanings of many words at the age when they are learning phonetic categories (see Swingley, 2009 for a review), most recent models of early phonetic category acquisition have explored the phonetic learning problem in the absence of semantic information (de Boer and Kuhl, 2003; Dillon et al., 2013; Feldman et al., 2013a; McMurray et al., 2009; Vallabha et al., 2007).

注釈

乳児は生後1年のうちに母語の音素カテゴリーの学習を始める (Kuhl et al., 1992; Polka and Werker, 1994; Werker and Tees, 1984). 理論上は, 意味論的情報は, 言語学者が行うように, 乳児がミニマル・ペアを区別するのを助けることで, 音素の帰納 [1] のための価値ある手がかりを提供しうる (Trubetzkoy, 1939). しかし, 乳児は音素カテゴリを学習している時期には多くの単語の意味を知らないという広く知られた仮定 (see Swingley, 2009 for a review) のため, 初期音素カテゴリ獲得の最近のモデルの多くは, 意味論的な情報がない場合の音素学習問題を調査している (de Boer and Kuhl, 2003; Dillon et al., 2013; Feldman et al., 2013a; McMurray et al., 2009; Vallabha et al., 2007).

Models without any semantic information are likely to underestimate infants’ ability to learn phonetic categories. Infants learn language in the wild, and quickly attune to the fact that words have (possibly unknown) meanings. The extent of infants’ semantic knowledge is not yet known, but existing evidence shows that six-month-olds can associate some words with their referents (Bergelson and Swingley, 2012; Tincoff and Jusczyk, 1999, 2012), leverage non-acoustic contexts such as objects or articulations to distinguish similar sounds (Teinonen et al., 2008; Yeung and Werker, 2009), and map meaning (in the form of objects or images) to new word-forms in some laboratory settings (Friedrich and Friederici, 2011; Gogate and Bahrick, 2001; Shukla et al., 2011). These findings indicate that young infants are sensitive to co-occurrences between linguistic stimuli and at least some aspects of the world.

注釈

あらゆる意味論的情報を除いたモデルは乳児の音素カテゴリ学習能力を過小評価している可能性が高い. 乳児は自然な環境の中で言語を学び, 言葉は(おそらく未知の)意味を持っているという事実にすばやく同調する. 乳児の意味論的知識の範囲はまだ知られていないが, 六ヶ月児はいくつかの単語とその指示対象を関連付けることができ (Bergelson and Swingley, 2012; Tincoff and Jusczyk, 1999, 2012), 対象や調音などの非音響的コンテキストを利用してよく似た音を区別し (Teinonen et al., 2008; Yeung and Werker, 2009), いくつかの実験設定では新しいワードフォームに(オブジェクトや画像の形式で)意味をマッピングすることもできる (Friedrich and Friederici, 2011; Gogate and Bahrick, 2001; Shukla et al., 2011), という証拠が存在する. これらの知見は, 幼い乳児が言語刺激と世界の少なくともいくつかの側面との共起に敏感であることを示している.

In this paper we explore the potential contribution of semantic information to phonetic learning by formalizing a model in which learners attend to the word-level context in which phones appear (as in the lexical-phonetic learning model of Feldman et al., 2013a) and also to the situations in which word-forms are used. The modeled situations consist of combinations of categories of salient activities or objects, similar to the activity contexts explored by Roy et al. (2012), e.g., ‘getting dressed’ or ‘eating breakfast’. We assume that child learners are able to infer a representation of the situational context from their non-linguistic environment. However, in our simulations we approximate the environmental information by running a topic model (Blei et al., 2003) over a corpus of childdirected speech to infer a topic distribution for each situation. These topic distributions are then used as input to our model to represent situational contexts.

注釈

本稿では, 音素が出現する単語レベルの文脈 (Feldman et al., 2013a の語彙-音素学習モデルにあるような文脈) と, ワードフォームが使用される状況の両方に学習者が注目するモデルを定式化することで, 音素学習に対する意味論的情報の潜在的な寄与を調査した. モデル化された状況は, 例えば「着替え」や「朝食」のように, 顕著な活動や対象のカテゴリの組み合わせからなり, Roy et al. (2012) の調査したアクティビティコンテキストに似たものである. 我々は子供の学習者が非言語的環境から状況的文脈の表現を推測できると仮定した. ただし, 我々のシミュレーションでは, 対児童発話のコーパス上でトピックモデル (Blei et al., 2003) を実行して各状況の話題分布を推定することで環境の情報を近似した. これらの話題分布は, 状況のコンテキストを表現するために我々のモデルへの入力として使用される.

The situational information in our model is similar to that assumed by theories of cross-situational word learning (Frank et al., 2009; Smith and Yu, 2008; Yu and Smith, 2007), but our model does not require learners to map individual words to their referents. Even in the absence of word-meaning mappings, situational information is potentially useful because similar-sounding words uttered in similar situations are more likely to be tokens of the same lexeme (containing the same phones) than similarsounding words uttered in different situations.

注釈

我々のモデルにおける状況的情報は, クロス状況的単語学習の理論 (Frank et al., 2009; Smith and Yu, 2008; Yu and Smith, 2007) が想定しているものに似ているが, 我々のモデルでは個々の単語とその指示対象とのマッピングを学習者に要求しない. よく似た状況において発話されたよく似た音の単語は, 異なる状況で発話されたよく似た音の単語よりも (同じ音素を含む) 同じ語彙素のトークンである可能性が高いため, 単語と意味のマッピングがない場合でさえ, 状況的な情報は有用でありうる.

In simulations of vowel learning, inspired by Vallabha et al. (2007) and Feldman et al. (2013a), we show a clear improvement over previous models in both phonetic and lexical (word-form) categorization when situational context is used as an additional source of information. This improvement is especially noticeable when the word-level context is providing less information, arguably the more realistic setting. These results demonstrate that relying on situational co-occurrence can improve phonetic learning, even if learners do not yet know the meanings of individual words.

注釈

Vallabha et al. (2007) や Feldman et al. (2013a) に触発された母音学習のシミュレーションでは, 状況の文脈を追加の情報源として使用した場合に, 音素と語彙 (ワードフォーム) の両方のカテゴリゼーションで先行モデルを明らかに上回ることを示す. この改善は, 単語レベルの文脈が与える情報が少ない, おそらくはより現実的な設定において特に顕著である. これらの結果は, 学習者がまだ個々の単語の意味を知らなくても, 状況的な共起に頼ることで音声的な学習を向上させられることを示している.

注釈

[1]

The models in this paper do not distinguish between phonetic and phonemic categories, since they do not capture phonological processes (and there are also none present in our synthetic data). We thus use the terms interchangeably.

本稿のモデルは音韻プロセスを捉えない (我々の人工データにも音韻プロセスは存在しない) ため, 音声カテゴリと音素カテゴリを区別せず, 両者の用語を互換的に用いる.

2 Background and overview of models

Infants attend to distributional characteristics of their input (Maye et al., 2002, 2008), leading to the hypothesis that phonetic categories could be acquired on the basis of bottom-up distributional learning alone (de Boer and Kuhl, 2003; Vallabha et al., 2007; McMurray et al., 2009). However, this would require sound categories to be well separated, which often is not the case—for example, see Figure 1, which shows the English vowel space that is the focus of this paper.

注釈

乳児はそれらの入力の分布的な特徴に注目をする(Maye et al., 2002, 2008). これは, 音素カテゴリはボトムアップに分布の学習のみに基づいて獲得できるという仮説につながった(de Boer and Kuhl, 2003; Vallabha et al., 2007; McMurray et al., 2009).

しかし, これには音カテゴリが十分に分離されていることが必要であり, しばしばそうではない. 例えば 図 1 を見てほしい. ここには本稿で注目する英語母音空間を示す.

_images/fig15.png

英語母音空間(Hillenbrand ら(1995) の6.2節より引用).第一,第二フォルマントを図示する.

Recent work has investigated whether infants could overcome such distributional ambiguity by incorporating top-down information, in particular, the fact that phones appear within words. At six months, infants begin to recognize word-forms such as their name and other frequently occurring words (Mandel et al., 1995; Jusczyk and Hohne, 1997), without necessarily linking a meaning to these forms. This ‘protolexicon’ can help differentiate phonetic categories by adding word contexts in which certain sound categories appear (Swingley, 2009; Feldman et al., 2013b). To explore this idea further, Feldman et al. (2013a) implemented the Lexical-Distributional (LD) model, which jointly learns a set of phonetic vowel categories and a set of word-forms containing those categories. Simulations showed that the use of lexical context greatly improved phonetic learning.

注釈

最近の研究では, 乳児がこのような分布の曖昧さをトップダウンな情報, とくに音素は単語の中に現れるという情報, を組み込むことで克服できるか否かが研究されてきた. 乳児は六ヶ月で, 彼らの名前やその他の頻出する語のようなワードフォームを, これらのフォームに意味を結び付ける必要なしに認識し始める (Mandel et al., 1995; Jusczyk and Hohne, 1997). この ‘源語彙 (protolexicon)’ は, ある音カテゴリが現れる単語のコンテキストを追加することで, 音素カテゴリの区別を手助けする (Swingley, 2009; Feldman et al., 2013b). この考えをさらに深めるため, Feldman et al. (2013a) では “語彙-分布モデル ( LD モデル )” を実装した. このモデルは母音音素カテゴリのセットとこれらのカテゴリを含むワードフォームのセットを同時に学習するものである. シミュレーションでは, 語彙的文脈の使用が音素学習を大幅に改善することが示された.

Our own Topic-Lexical-Distributional (TLD) model extends the LD model to include an additional type of context: the situations in which words appear. To motivate this extension and clarify the differences between the models, we now provide a high-level overview of both models; details are given in Sections 3 and 4.

注釈

我々は LDモデル を拡張し, 新たにコンテキストの種類を一つ追加した “話題-語彙-分布 モデル (TLD)” を作成した. ここで追加したものは 単語が出現する状況である. この拡張の動機とモデル間の違いを明確にするためにここでは,両方のモデルの大まかな概要を提示する. なお,モデルの詳細は 3,4章 で述べる.

2.1 Overview of LD model

Both the LD and TLD models are computational-level models of phonetic (specifically, vowel) categorization where phones (vowels) are presented to the model in the context of words. [2] The task is to infer a set of phonetic categories and a set of lexical items on the basis of the data observed for each word token \(x_i\) . In the original LD model, the observations for token \(x_i\) are its frame \(f_i\) , which consists of a list of consonants and slots for vowels, and the list of vowel tokens \(w_i\). (The TLD model includes additional observations, described below.) A single vowel token, \(w_{ij}\) , is a two dimensional vector representing the first two formants (peaks in the frequency spectrum, ordered from lowest to highest). For example, a token of the word kitty would have the frame \(f_i = \text{k\_t\_}\) , containing two consonant phones, /k/ and /t/, with two vowel phone slots in between, and two vowel formant vectors, \(w_{i0} = [464, 2294] \text{ and } w_{i1} = [412, 2760]\). [3]

注釈

LD, TLD 両モデルとも, 音素 (母音) が単語という文脈の中でモデルに提示され, 音素 (特に母音) のカテゴリ分類を行う計算レベルのモデルである. [2] ここでの課題は, 観察されたそれぞれの単語トークン \(x_i\) のデータに基づいて, 音素カテゴリのセットと語彙要素のセットを推測することである. 元々の LD モデルにおいては, トークン \(x_i\) に対する観察対象はそのフレーム \(f_i\) と母音トークンのリスト \(w_i\) である. フレームは子音のリストと母音のスロットで構成されている ( TLD モデルでは, 以下に記載する追加の観察対象も含む ). 一つの母音トークン \(w_{ij}\) は第一, 第二フォルマント (スペクトル中のピークを低い方から順に並べたもの) で表現される2次元のベクトルである. 例えば, “kitty” という語のトークンは, 2つの子音音素 /k/, /t/ とその間に2つの母音スロットを持つフレーム \(f_i = \text{k\_t\_}\) と, 2つの母音フォルマントベクトル \(w_{i0} = [464, 2294] \text{ and } w_{i1} = [412, 2760]\) を持つ. [3]
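
この「フレーム + 母音フォルマントベクトル」というデータ表現を具体的に確認するための小さなスケッチを以下に示す (クラス名やヘルパー関数は説明用に仮に置いたもので, 論文の実装そのものではない):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class WordToken:
    """1つの単語トークン x_i : 子音フレームと母音フォルマントベクトルの組 (説明用の仮の構造)."""
    frame: str                          # 例: "k_t_" (母音スロットを "_" で表す)
    vowels: List[Tuple[float, float]]   # 各母音スロットの (F1, F2)

# 本文の "kitty" の例をそのまま表現したもの
kitty = WordToken(frame="k_t_", vowels=[(464.0, 2294.0), (412.0, 2760.0)])

@dataclass
class Lexeme:
    """語彙素 l : 同じフレームと, フォルマント値の代わりに母音カテゴリ ID のリストを持つ."""
    frame: str
    vowel_categories: List[int]         # nu_{l,j} : 各スロットの母音カテゴリ割り当て

def frames_match(token: WordToken, lexeme: Lexeme) -> bool:
    """単語トークンを語彙素に割り当てられるのはフレームが一致する場合のみ (f_i = f_l)."""
    return token.frame == lexeme.frame

print(frames_match(kitty, Lexeme(frame="k_t_", vowel_categories=[0, 1])))  # True
```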

Given the data, the model must assign each vowel token to a vowel category, \(w_{ij} = c\). Both the LD and the TLD models do this using intermediate lexemes, \(\ell\) , which contain vowel category assignments, \(\nu_{\ell j} = c\), as well as a frame \(f_{\ell}\) . If a word token is assigned to a lexeme, \(x_i = \ell\), the vowels within the word are assigned to that lexeme’s vowel categories, \(w_{ij} = \nu_{\ell j} = c\) . [4] The word and lexeme frames must match, \(f_i = f_{\ell}\) .

注釈

データを与えると, モデルはそれぞれの母音トークンを母音カテゴリ \(w_{ij} =c\) に割り当てなければならない. LD, TLD モデルの両方で, 中間的な語彙素 \(\ell\) を使用してこれを行う. 語彙素はフレーム \(f_{\ell}\) と母音カテゴリの割り当て \(\nu_{\ell j} = c\) を含む. 単語トークンがある語彙素に割り当てられた場合 \(x_i = \ell\), その語に含まれる母音はその語彙素の母音カテゴリに割り当てられる, \(w_{ij} = \nu_{\ell j} =c\) [4] . 単語と語彙素のフレームは一致しなければならない, \(f_i = f_{\ell}\) .

Lexical information helps with phonetic categorization because it can disambiguate highly overlapping categories, such as the \(ae\) and \(eh\) categories in Figure 1. A purely distributional learner who observes a cluster of data points in the \(ae-eh\) region is likely to assume all these points belong to a single category because the distributions of the categories are so similar. However, a learner who attends to lexical context will notice a difference: contexts that only occur with \(ae\) will be observed in one part of the \(ae-eh\) region, while contexts that only occur with \(eh\) will be observed in a different (though partially overlapping) space. The learner then has evidence of two different categories occurring in different sets of lexemes.

注釈

語彙情報は例えば, 図 1 にある \(ae\) と \(eh\) のような カテゴリの重度な重複の曖昧さをなくすため,音素のカテゴリゼーションを手助けする. \(ae-eh\) 領域のデータポイントの中心を観察する純粋な分布学習者はカテゴリの分布が非常に似ているため, これらすべてのポイントは単一のカテゴリであると割り当てるだろう. しかし, 語彙コンテキストに注目する学習者は \(ae\) は \(ae-eh\) 領域の一部で 観察されるというコンテキストと, 一方 \(eh\) は異なる(ただし部分的には重複する) 空間でのみ観察されると いうコンテキストがあるため, 違いに気がつくことができる. そのため, 学習者は2つの異なるカテゴリが異なる語彙のセットにおいて生じるという証拠を持っている.

Simulations with the LD model show that using lexical information to constrain phonetic learning can greatly improve categorization accuracy (Feldman et al., 2013a), but it can also introduce errors. When two word tokens contain the same consonant frame but different vowels (i.e., minimal pairs), the model is more likely to categorize those two vowels together. Thus, the model has trouble distinguishing minimal pairs. Although young children also have trouble with minimal pairs (Stager and Werker, 1997; Thiessen, 2007), the LD model may overestimate the degree of the problem. We hypothesize that if a learner is able to associate words with the contexts of their use (as children likely are), this could provide a weak source of information for disambiguating minimal pairs even without knowing their exact meanings. That is, if the learner hears \(k V_1 t\) and \(k V_2 t\) in different situational contexts, they are likely to be different lexical items (and \(V_1\) and \(V_2\) different phones), despite the lexical similarity between them.

注釈

LDモデルのシミュレーションは音素学習を制約する語彙情報を用いることで大幅に分類精度を向上させることができることを示しており(Feldman et al., 2013a), それはまた、エラーを導くことにもなりうる. 異なる母音で同じ子音のフレームを含む二語のトークン(すなわち、ミニマルペア)では, モデルは、これら二つの母音を同じものとして分類する可能性が高い. したがって, このモデルではミニマルペアの区別は難しい. 一方, 乳児もまたミニマルペアに関しては問題があり (Stager and Werker, 1997; Thiessen, 2007), LD モデルは問題の程度を過大評価できる. 我々は学習者が(子供がそうであるように)その使用の文脈で言葉を関連付けすることができる場合には, その正確な意味を知らなくても ミニマルペアの曖昧性に対する弱い情報源を提供することができると仮定した. これは,学習者が異なる状況のコンテキストで \(k V_1 t\)\(k V_2 t\) という語を聞いた場合, それらの間の語彙的類似性にもかかわらず, それらは異なる語彙項目である可能性が高く (そして \(V_1\)\(V_2\) は異なる音素である) と判断できる.

脚注

[2]

For a related model that also tackles the word segmentation problem, see Elsner et al. (2013). In a model of phonological learning, Fourtassi and Dupoux (submitted) show that semantic context information similar to that used here remains useful despite segmentation errors.

単語分割問題に取り組む関連モデルについては, Elsner et al. (2013) を参照. 音素学習モデルでは Fourtassiと Dupoux (submitted) が本稿で使用したのと同様のセマンティックコンテキスト情報を利用してセグメンテーションエラーに対しても有効なことを示しています.

[3]

In simulations we also experiment with frames in which consonants are not represented perfectly.

シミュレーションでは,子音が完全に表現されていないフレームを使用する実験も行った.

[4]

The notation is overloaded: wij refers both to the vowel formants and the vowel category assignments, and xi refers to both the token identity and its assignment to a lexeme.

表記は多重定義されている: \(w_{ij}\) は母音フォルマントと母音カテゴリの割り当てを言及しており, \(x_i\) はトークンID と その語彙に対する割り当てを言及している.

2.2 Overview of TLD model

To demonstrate the benefit of situational information, we develop the Topic-Lexical-Distributional (TLD) model, which extends the LD model by assuming that words appear in situations analogous to documents in a topic model. Each situation \(h\) is associated with a mixture of topics \(\theta_h\) , which is assumed to be observed. Thus, for the \(i\) th token in situation \(h\), denoted \(x_{hi}\) , the observed data will be its frame \(f_{hi}\) , vowels \(w_{hi}\) , and topic vector \(\theta_h\) .

注釈

状況の情報を利用する利点を実証するために, 我々は 話題-単語-分布モデルを開発した. これはその語が 類似した話題の文章で生じるという Topic モデルを仮定することにより LD モデルを拡張したものである. それぞれの状況 \(h\) は観察された話題 \(\theta_h\) の混合と関連している. したがって, ある状況 \(h\) における \(i\) 番目のトークン, 以後 \(x_{hi}\) と表記, のために 観察されたデータは そのフレーム \(f_{hi}\), 母音 \(w_{hi}\), そして Topic ベクトル \(\theta_h\) となる.

From an acquisition perspective, the observed topic distribution represents the child’s knowledge of the context of the interaction: she can distinguish bathtime from dinnertime, and is able to recognize that some topics appear in certain contexts (e.g. animals on walks, vegetables at dinnertime) and not in others (few vegetables appear at bathtime). We assume that the child would learn these topics from observing the world around her and the co-occurrences of entities and activities in the world. Within any given situation, there might be a mixture of different (actual or possible) topics that are salient to the child. We assume further that as the child learns the language, she will begin to associate specific words with each topic as well.

注釈

獲得の視点からは, 観察された話題分布はインタラクションのコンテキストに対する子供の知識を表現している. 子供は夕食時とバスタイムを区別することができるし, いくつかの話題は特定の状況に出現して他の状況には出現しないことを認識できる (例えば, 散歩時には動物が, 夕食時には野菜が現れるが, お風呂に野菜はほとんど現れない). 我々は, 子供が周囲の世界と, 世界における物や活動の共起を観察することでこれらの話題を学習すると仮定した. 与えられたどの状況においても, 子供にとって顕著な (実際の, あるいは可能な) 異なる話題の混合があるかもしれない. 我々はさらに, 子供が言語を学習するにつれて, 各話題に特定の単語を関連付けることも始めると仮定した.

Thus, in the TLD model, the words used in a situation are topic-dependent, implying meaning, but without pinpointing specific referents. Although the model observes the distribution of topics in each situation (corresponding to the child observing her non-linguistic environment), it must learn to associate each (phonetically and lexically ambiguous) word token with a particular topic from that distribution. The occurrence of similar-sounding words in different situations with mostly non-overlapping topics will provide evidence that those words belong to different topics and that they are therefore different lexemes. Conversely, potential minimal pairs that occur in situations with similar topic distributions are more likely to belong to the same topic and thus the same lexeme.

注釈

したがって TLD モデルでは, ある状況で使用される単語は話題に依存しており, 特定の指示対象を特定することなく意味を暗示する. モデルは各状況における話題の分布を観察するが (これは子供が非言語的な環境を観察することに相当する), その分布の中の特定の話題に, (音声的にも語彙的にも曖昧な) 各単語トークンを関連付けることを学習しなければならない. 大部分が重複しない話題を持つ異なる状況でよく似た音の単語が出現することは, これらの単語が別の話題に属しており, したがって異なる語彙素であることを示す証拠になる. 逆に, 似たような話題分布を持つ状況で出現する潜在的なミニマルペアは, 同じ話題に属している可能性が高く, したがって同じ語彙素である可能性が高い.

Although we assume that children infer topic distributions from the non-linguistic environment, we will use transcripts from CHILDES to create the word/phone learning input for our model. These transcripts are not annotated with environmental context, but Roy et al. (2012) found that topics learned from similar transcript data using a topic model were strongly correlated with immediate activities and contexts. We therefore obtain the topic distributions used as input to the TLD model by training an LDA topic model (Blei et al., 2003) on a superset of the child-directed transcript data we use for lexical-phonetic learning, dividing the transcripts into small sections (the ‘documents’ in LDA) that serve as our distinct situations \(h\) . As noted above, the learned document-topic distributions \(\theta\) are treated as observed variables in the TLD model to represent the situational context. The topic-word distributions learned by LDA are discarded, since these are based on the (correct and unambiguous) words in the transcript, whereas the TLD model is presented with phonetically ambiguous versions of these word tokens and must learn to disambiguate them and associate them with topics.

注釈

我々は子供が話題の分布を非言語的な環境から推測していると仮定しているが, 本稿では CHILDES [5] の転記を我々のモデルの語/音素学習用の入力作成に使用した. これらの転記には環境コンテキストがアノテーションされていないが, Roy et al. (2012) は, トピックモデルを使用して類似の転記データから学習された話題が, 直近の活動や文脈と強い相関を持つことを発見している. そのため我々は, 語彙-音素学習に使用する対児童発話転記データのスーパーセット上で LDA トピックモデル [6] を訓練し, 転記を個別の状況 \(h\) として機能する小さなセクションに分割することで, TLD モデルへの入力として使用する話題分布を得た. 上述した通り, 学習された文書-話題分布 \(\theta\) は, 状況的文脈を表現するために TLD モデルでは観測変数として扱われる. LDA によって学習されたトピック-単語分布は破棄される. これは, その分布が転記中の (正しく曖昧さのない) 単語に基づいているのに対し, TLD モデルには音声的に曖昧なバージョンの単語トークンが提示され, それらの曖昧さを解消して話題に関連付けることを学習しなければならないためである.

訳者注

[5]

CHILDES(チャイルズ、CHild Language Data Exchange System ) : 第一言語獲得研究用のデータベース

[6]

LDA Topic Model : 最近の Topic Model の代表的な実装方法

3 Lexical-Distributional Model

In this section we describe more formally the generative process for the LD model (Feldman et al., 2013a), a joint Bayesian model over phonetic categories and a lexicon, before describing the TLD extension in the following section.

注釈

本章では, 次章で TLD 拡張を記述する前に, 音素カテゴリと語彙の同時ベイジアンモデルである LD モデル (Feldman et al., 2013a) の生成過程をより形式的に記述する.

The set of phonetic categories and the lexicon are both modeled using non-parametric Dirichlet Process priors, which return a potentially infinite number of categories or lexemes. A DP is parametrized as \(DP (\alpha, H)\), where \(\alpha\) is a real-valued hyperparameter and \(H\) is a base distribution. \(H\) may be continuous, as when it generates phonetic categories in formant space, or discrete, as when it generates lexemes as a list of phonetic categories.

注釈

音素カテゴリのセットと語彙の両方はノンパラメトリックなディリクレ過程 (DP) 事前分布でモデリングされている. これは, 潜在的には無限個のカテゴリや語彙を返す. 一つの DP はパラメタ \(DP(\alpha, H)\) によって規定される. \(\alpha\) は実数値のハイパーパラメータであり, \(H\) は基本分布である. \(H\) は, フォルマント空間で音素カテゴリを生成する場合のように連続のこともあれば, 音素カテゴリのリストとして語彙素を生成する場合のように離散のこともある.

A draw from a \(DP, G ∼ DP (\alpha, H)\) , returns a distribution over a set of draws from \(H\), i.e., a discrete distribution over a set of categories or lexemes generated by \(H\). In the mixture model setting, the category assignments are then generated from \(G\), with the datapoints themselves generated by the corresponding components from \(H\). If \(H\) is infinite, the support of the \(DP\) is likewise infinite. During inference, we marginalize over \(G\).

注釈

\(G ∼ DP(\alpha, H)\) からの描画は, \(H\) からの描画のセット上の分布, すなわち \(H\) によって生成されたカテゴリや語彙素のセットに対する離散分布を返す. 混合モデルの設定では, カテゴリの割り当ては \(G\) から生成され, データポイント自体は \(H\) 由来の対応するコンポーネントによって生成される. \(H\) が無限であれば, \(DP\) のサポートもまた無限である. 推論の間は \(G\) を周辺化して消去する.
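
\(G\) を周辺化すると, DP からのカテゴリ割り当ては中華料理店過程 (CRP) としてシミュレートできる. 以下は, その挙動を確認するためだけの仮のスケッチである (関数名・引数は説明用で, 論文の実装ではない):

```python
import random

def crp_assignments(n_items, alpha, seed=0):
    """DP(alpha, H) の G を周辺化したときのカテゴリ割り当てを CRP でシミュレートする."""
    rng = random.Random(seed)
    counts = []              # counts[c] = カテゴリ c に割り当て済みのデータ点数
    assignments = []
    for i in range(n_items):
        total = i + alpha    # 既存データ点数 + 濃度パラメータ
        r = rng.random() * total
        acc = 0.0
        new_cat = True
        for c, n_c in enumerate(counts):
            acc += n_c
            if r < acc:                  # 確率 n_c / (i + alpha) で既存カテゴリ c
                counts[c] += 1
                assignments.append(c)
                new_cat = False
                break
        if new_cat:                      # 確率 alpha / (i + alpha) で新カテゴリ
            counts.append(1)
            assignments.append(len(counts) - 1)
    return assignments

print(crp_assignments(10, alpha=1.0))    # 例: [0, 0, 1, 0, 2, ...]
```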

3.1 Phonetic Categories: IGMM

Following previous models of vowel learning (de Boer and Kuhl, 2003; Vallabha et al., 2007; McMurray et al., 2009; Dillon et al., 2013) we assume that vowel tokens are drawn from a Gaussian mixture model. The Infinite Gaussian Mixture Model (IGMM) (Rasmussen, 2000) includes a DP prior, as described above, in which the base distribution \(H_C\) generates multivariate Gaussians drawn from a Normal Inverse-Wishart prior. [7] Each observation, a formant vector \(w_{ij}\) , is drawn from the Gaussian corresponding to its category assignment \(c_{ij}\) :

(25)\[\begin{split} \mu_c , \Sigma_c ∼ H_C &=& NIW(\mu_0 , \Sigma_0 , \nu_0 )\end{split}\]
(26)\[\begin{split} G_C &∼& DP (\alpha_c , H_C )\end{split}\]
(27)\[\begin{split} c_{ij} &∼& G_C\end{split}\]
(28)\[\begin{split} w_{ij} | c_{ij} &=& c ∼ N (\mu_c , \Sigma_c )\end{split}\]

注釈

母音学習の先行研究 (de Boer and Kuhl, 2003; Vallabha et al., 2007; McMurray et al., 2009; Dillon et al., 2013) にしたがって, 我々は, 母音トークンはガウス混合分布モデルから生成されると仮定した. Infinite Gaussian Mixture Model (IGMM) (Rasmussen, 2000) は上述の DP を含み, 基本分布 \(H_C\) は Normal Inverse-Wishart 事前分布から描かれる多変量ガウス分布を生成する. 各観測, すなわちフォルマントベクトル \(w_{ij}\) は, そのカテゴリ割り当て \(c_{ij}\) に対応するガウス分布から描かれる.

The above model generates a category assignment \(c_{ij}\) for each vowel token \(w_{ij}\) . This is the baseline IGMM model, which clusters vowel tokens using bottom-up distributional information only; the LD model adds top-down information by assigning categories in the lexicon, rather than on the token level.

注釈

上記のモデルはそれぞれの母音トークン \(w_{ij}\) に対するカテゴリ割り当て \(c_{ij}\) を生成する. これが, ボトムアップな分布情報のみを使用して母音トークンをクラスタ化するベースラインの IGMM モデルである. LD モデルは, トークンレベルではなく語彙の中でカテゴリを割り当てることで, トップダウンの情報を追加する.
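
式 (25)-(28) の生成過程のイメージを掴むための仮のスケッチを以下に示す. NIW のハイパーパラメータの値や scipy を使う点はこちらで置いた仮定であり, 論文の実装とは異なる:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)

# NIW 事前分布のハイパーパラメータ (値は説明用の仮のもの)
mu0 = np.array([500.0, 1500.0])   # F1, F2 の事前平均
S0 = np.diag([1e4, 1e5])          # スケール行列
nu0 = 4                           # 自由度
kappa0 = 1.0                      # 平均のスケーリング
alpha_c = 1.0                     # IGMM の DP 濃度パラメータ

def draw_category_params():
    """H_C = NIW からカテゴリの (mu_c, Sigma_c) を 1 組描く."""
    Sigma = invwishart.rvs(df=nu0, scale=S0)
    mu = rng.multivariate_normal(mu0, Sigma / kappa0)
    return mu, Sigma

def generate_igmm_tokens(n_tokens):
    """CRP でカテゴリを選び, そのガウス分布からフォルマントを描く生成過程のスケッチ."""
    params, counts, tokens, cats = [], [], [], []
    for _ in range(n_tokens):
        probs = np.array(counts + [alpha_c], dtype=float)
        c = rng.choice(len(probs), p=probs / probs.sum())
        if c == len(params):                      # 新しいカテゴリを開く
            params.append(draw_category_params())
            counts.append(0)
        counts[c] += 1
        mu_c, Sigma_c = params[c]
        tokens.append(rng.multivariate_normal(mu_c, Sigma_c))  # w_ij ~ N(mu_c, Sigma_c)
        cats.append(c)
    return np.array(tokens), cats

w, c = generate_igmm_tokens(20)
print(w.shape, len(set(c)))
```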

脚注

[7] This compound distribution is equivalent to \(\Sigma_c ∼ IW(\Sigma_0 , \nu_0 ), \mu_c | \Sigma_c ∼ N (\mu_0 , \frac{\Sigma_c}{\nu_0})\)
3.2 Lexicon

In the LD model, vowel phones appear within words drawn from the lexicon. Each such lexeme is represented as a frame plus a list of vowel categories \(\nu_{\ell}\) . Lexeme assignments for each token are drawn from a DP with a lexicon-generating base distribution \(H_L\) . The category for each vowel token in the word is determined by the lexeme; the formant values are drawn from the corresponding Gaussian as in the IGMM:

(29)\[G_L ∼ DP (\alpha_l , H_L )\]
(30)\[x_i = \ell ∼ G_L\]
(31)\[w_{ij} | \nu_{\ell j} = c ∼ N (\nu_c , \Sigma_c )\]

注釈

LDモデルでは, 母音音素は語彙から引き出された単語の中に現れる. それぞれの語彙素はフレームと母音カテゴリ \(\nu_{\ell}\) のリストとして表現される. 各トークンに対する語彙素の割り当ては, 語彙生成基本分布 \(H_L\) を持つ DP から描かれる. 単語に含まれるそれぞれの母音トークンのカテゴリは語彙素によって決定される. IGMM と同様に, フォルマントの値は対応するガウス分布から描かれる.

\(H_L\) generates lexemes by first drawing the number of phones from a geometric distribution and the number of consonant phones from a binomial distribution. The consonants are then generated from a DP with a uniform base distribution (but note they are fixed at inference time, i.e., are observed categorically), while the vowel phones \(\nu_{\ell}\) are generated by the IGMM DP above, \(\nu_{\ell j} ∼ GC\) .

注釈

\(H_L\) は, まず幾何分布から音素の数を, 二項分布から子音音素の数を描くことによって語彙素を生成する. 上記の IGMM DP から母音音素 \(\nu_{\ell j} ∼ G_C\) が生成される一方, 子音は一様な基本分布を持つ DP から生成される (ただし, これらは推論時には固定される, すなわちカテゴリとして観察されることに注意).
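
\(H_L\) による語彙素生成の流れを確認するための仮のスケッチ. 幾何分布・二項分布のパラメータや子音セットは説明用の仮の値であり, 子音の DP は簡単のため一様サンプリングで代用している:

```python
import random

rng = random.Random(0)

CONSONANTS = list("ptkbdgmnszrl")   # 一様基本分布の台として仮に置いた子音セット
p_geom, p_binom = 0.33, 0.5         # 幾何分布・二項分布のパラメータ (仮の値)
alpha_c = 1.0
vowel_cat_counts = []               # IGMM DP G_C の CRP カウント (全語彙素で共有)

def draw_vowel_category():
    """nu_{l j} ~ G_C : 母音カテゴリを CRP で描く."""
    total = sum(vowel_cat_counts) + alpha_c
    r = rng.random() * total
    acc = 0.0
    for c, n in enumerate(vowel_cat_counts):
        acc += n
        if r < acc:
            vowel_cat_counts[c] += 1
            return c
    vowel_cat_counts.append(1)
    return len(vowel_cat_counts) - 1

def draw_lexeme():
    """H_L のスケッチ: 音素数 ~ 幾何分布, 子音数 ~ 二項分布,
    子音は一様分布から, 母音カテゴリは IGMM の DP から描く."""
    n_phones = 1
    while rng.random() > p_geom:           # 幾何分布
        n_phones += 1
    n_cons = sum(rng.random() < p_binom for _ in range(n_phones))  # 二項分布
    consonants = [rng.choice(CONSONANTS) for _ in range(n_cons)]
    vowel_cats = [draw_vowel_category() for _ in range(n_phones - n_cons)]
    return consonants, vowel_cats

print(draw_lexeme())
```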

Note that two draws from \(H_L\) may result in identical lexemes; these are nonetheless considered to be separate (homophone) lexemes.

注釈

\(H_L\) からの2つの描画が同一の語彙素になる場合があることに注意. それでもこれらは別々の (同音異義の) 語彙素とみなされる.

4 Topic-Lexical-Distributional Model

The TLD model retains the IGMM vowel phone component, but extends the lexicon of the LD model by adding topic-specific lexicons, which capture the notion that lexeme probabilities are topic-dependent. Specifically, the TLD model replaces the Dirichlet Process lexicon with a Hierarchical Dirichlet Process (HDP; Teh (2006)). In the HDP lexicon, a top-level global lexicon is generated as in the LD model. Topic-specific lexicons are then drawn from the global lexicon, containing a subset of the global lexicon (but since the size of the global lexicon is unbounded, so are the topic-specific lexicons). These topic-specific lexicons are used to generate the tokens in a similar manner to the LD model. There are a fixed number of lower level topic-lexicons; these are matched to the number of topics in the LDA model used to infer the topic distributions (see Section 6.4).

注釈

TLD モデルは IGMM の母音コンポーネントを保ちながら, 語彙素の確率は話題に依存するという考えを捉えるために, 話題に限定的な語彙を追加して LD モデルの語彙を拡張したものである. 具体的には, TLD モデルはディリクレ過程の語彙を階層的ディリクレ過程 (HDP; Teh (2006)) に置き換える. HDP 語彙では, LD モデルの場合のように, トップレベルでグローバルな語彙が生成される. その後, 話題に限定的な語彙がグローバルな語彙の部分集合として取り出される (ただし, グローバルな語彙のサイズに上限がないため, 話題に限定的な語彙にも上限はない). これらの話題に限定的な語彙は, LD モデルと似た方法でトークンを生成するために使用される. 低レベルの話題語彙の数は固定されており, 話題分布を推論するために使用される LDA モデルのトピック数に一致させている (6.4章参照).

More formally, the global lexicon is generated as a top-level \(DP: G_L ∼ DP (\alpha_{\ell} , H_L )\) (see Section 3.2; remember \(H_L\) includes draws from the IGMM over vowel categories). \(G_L\) is in turn used as the base distribution in the topic-level DPs, \(G_k ∼ DP (\alpha_k , G_L )\). In the Chinese Restaurant Franchise metaphor often used to describe HDPs, \(G_L\) is a global menu of dishes (lexemes). The topic-specific lexicons are restaurants, each with its own distribution over dishes; this distribution is defined by seating customers (word tokens) at tables, each of which serves a single dish from the menu: all tokens \(x\) at the same table \(t\) are assigned to the same lexeme \(\ell_t\) . Inference (Section 5) is defined in terms of tables rather than lexemes; if multiple tables draw the same dish from \(G_L\) , tokens at these tables share a lexeme.

注釈

より正式に言えば, グローバルな語彙はトップレベルの DP \(G_L ∼ DP (\alpha_{\ell} , H_L )\) として生成される (3.2節参照: \(H_L\) は母音カテゴリに対する IGMM からの描画を含むことに注意). \(G_L\) は話題レベルの DP \(G_k ∼ DP (\alpha_k , G_L )\) において基本分布として使用される. HDP を記述するためにフランチャイズの中華料理店の比喩がよく使われる. \(G_L\) は皿 (語彙素) のグローバルなメニューである. 話題に限定された語彙はレストランであり, それぞれが皿に対する独自の分布を持っている. この分布は客 (単語トークン) をテーブルに着席させることで定義され, 各テーブルはメニューから一つの皿を給仕する. 同じテーブル \(t\) のすべてのトークン \(x\) は同じ語彙素 \(\ell_t\) に割り当てられる. 推論 (5章) は語彙素ではなくテーブルの観点から定義される. 複数のテーブルが同じ皿を \(G_L\) から引いた場合, これらのテーブルのトークンは同じ語彙素を共有する.
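
この中華料理店フランチャイズ (CRF) による「テーブルへの着席 → 皿 (語彙素) の選択」の流れを確認するための仮のスケッチ. 基底分布 \(H_L\) からの語彙素生成は ID の発行だけに簡略化しており, 変数名はすべて説明用である:

```python
import random

rng = random.Random(1)
alpha_k, alpha_l = 1.0, 1.0

global_dishes = []    # G_L: 皿 (語彙素) ごとのテーブル数 m_l (全話題の合計)
restaurants = {}      # 話題 k -> [(dish_id, n_customers), ...]

def new_dish():
    """基底分布 H_L から新しい皿 (語彙素) を描く代わりに ID を発行するだけの仮の処理."""
    global_dishes.append(0)
    return len(global_dishes) - 1

def seat_customer(k):
    """話題 k のレストランに単語トークン (客) を 1 人着席させる CRF のスケッチ."""
    tables = restaurants.setdefault(k, [])
    n_k = sum(n for _, n in tables)
    r = rng.random() * (n_k + alpha_k)
    acc = 0.0
    for t, (dish, n) in enumerate(tables):     # 既存テーブル: 確率 n_kt / (n_k + alpha_k)
        acc += n
        if r < acc:
            tables[t] = (dish, n + 1)
            return dish
    # 新しいテーブル: 皿はグローバルメニュー (G_L) から CRP で選ぶ
    m = sum(global_dishes)
    r2 = rng.random() * (m + alpha_l)
    acc, dish = 0.0, None
    for d, m_d in enumerate(global_dishes):    # 既存の皿: 確率 m_l / (m + alpha_l)
        acc += m_d
        if r2 < acc:
            dish = d
            break
    if dish is None:                           # 新しい皿: 確率 alpha_l / (m + alpha_l)
        dish = new_dish()
    global_dishes[dish] += 1
    tables.append((dish, 1))
    return dish

print(seat_customer(k=0))   # 話題 0 のトークン 1 つが割り当てられた語彙素 ID
```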

In the TLD model, tokens appear within situations, each of which has a distribution over topics \(\theta_h\) . Each token \(x_{hi}\) has a co-indexed topic assignment variable, \(z_{hi}\) , drawn from \(\theta_h\) , designating the topic-lexicon from which the table for \(x_{hi}\) is to be drawn. The formant values for \(w_{hij}\) are drawn in the same way as in the LD model, given the lexeme assignment at \(x_{hi}\) . This results in the following model, shown in Figure 2:

(32)\[G_L ∼ DP (\alpha_l , H_L )\]
(33)\[G_k ∼ DP (\alpha_k , G_L )\]
(34)\[z_{hi} ∼ \mathrm{Mult}(\theta_h )\]
(35)\[x_{hi} = t|z_{hi} = k ∼ G_k\]
(36)\[w_{hij} | x_{hi} = t, \upsilon_{\ell_t j} = c ∼ N (\mu_c , \Sigma_c )\]

注釈

TLD モデルでは, トークンは状況のなかに出現し, これらの状況は話題に対する分布 \(\theta_h\) を持っている. それぞれのトークン \(x_{hi}\) は, \(\theta_h\) から描かれる, 同じ添字を持つ話題割り当て変数 \(z_{hi}\) を持ち, これは \(x_{hi}\) のテーブルを描くべき話題語彙を指定する. \(w_{hij}\) に対するフォルマントの値は, \(x_{hi}\) に割り当てられた語彙素を所与として, LD モデルの場合と同様の方法で描かれる. 結果として得られるモデルを 図 2 に示す.
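
式 (32)-(36) の生成の流れ (話題 → 語彙素 → 母音フォルマント) を有限近似で確認するための仮のスケッチ. \(\theta_h\), \(G_k\), 語彙素・カテゴリのパラメータはすべて説明用の仮の値である:

```python
import numpy as np

rng = np.random.default_rng(0)

theta_h = np.array([0.7, 0.2, 0.1])   # 状況 h の観測された話題分布 (仮の値)

# 話題ごとの語彙素分布 G_k (有限近似), 語彙素ごとの母音カテゴリ, カテゴリごとのガウス分布
G_k = {0: [0.5, 0.5, 0.0], 1: [0.1, 0.1, 0.8], 2: [0.3, 0.3, 0.4]}
lexeme_vowel_cats = {0: [0], 1: [1], 2: [0, 1]}
category_params = {0: ([460.0, 2300.0], np.diag([900.0, 10000.0])),
                   1: ([580.0, 1800.0], np.diag([900.0, 10000.0]))}

def generate_token():
    """話題 -> 語彙素 -> 母音フォルマントの順に 1 トークンを描く."""
    z = rng.choice(len(theta_h), p=theta_h)                   # z_hi ~ Mult(theta_h)
    lex = rng.choice(len(G_k[z]), p=G_k[z])                   # x_hi ~ G_k (有限近似)
    vowels = [rng.multivariate_normal(*category_params[c])    # w_hij ~ N(mu_c, Sigma_c)
              for c in lexeme_vowel_cats[lex]]
    return z, lex, vowels

print(generate_token())
```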

_images/fig25.png

左から右へ TLD モデル, IGMMコンポーネント, LD 語彙コンポーネント, 話題に限定的な語彙, 最後に文章 \(h\) に現れるトークン \(x_{hi}\) と 観察された母音フォルマント \(w_{hij}\) と フレーム \(f_{hi}\)

The lexeme assignment \(x_{hi}\) and the topic assignment \(z_{hi}\) are inferred, the latter using the observed document-topic distribution \(\theta_h\) . Note that \(f_i\) is deterministic given the lexeme assignment. Squared nodes depict hyperparameters. \(\lambda\) is the set of hyperparameters used by \(H_L\) when generating lexical items (see Section 3.2).

注釈

語彙素性割り当て \(x_{hi}\) と, 話題割り当て \(z_{hi}\) は推定され, 後者には観察された文書-話題分布 \(\theta_h\) が使用される. 語彙素性割り当てが与えられれば \(f_i\) は決定論的に定まることに注意. 四角のノードはハイパーパラメータを表す. \(\lambda\) は語彙アイテムを生成する際に \(H_L\) によって使用されるハイパーパラメータのセットである( 3.2 章参照 ).

5 Inference: Gibbs Sampling

We use Gibbs sampling to infer three sets of variables in the TLD model: assignments to vowel categories in the lexemes, assignments of tokens to topics, and assignments of tokens to tables (from which the assignment to lexemes can be read off).

注釈

我々は TLD モデルの3つの変数のセットを推定するために Gibbs sampling を使用した: 語彙素に含まれる母音カテゴリへの割り当て, トークンの話題への割り当て, そしてトークンのテーブルへの割り当てである (テーブルへの割り当てから語彙素への割り当てを読み取ることができる).

5.1 Sampling lexeme vowel categories

Each vowel in the lexicon must be assigned to a category in the IGMM. The posterior probability of a category assignment is composed of the DP prior over categories and the likelihood of the observed vowels belonging to that category. We use \(w_{\ell j}\) to denote the set of vowel formants at position \(j\) in words that have been assigned to lexeme \(\ell\). Then,

注釈

語彙に含まれるそれぞれの母音は IGMM において一つのカテゴリへ割り当てられることになる. カテゴリ割り当ての事後確率は, カテゴリに対する DP 事前分布と, そのカテゴリに属する観察された母音の尤度から構成されている. 語彙素 \(\ell\) に割り当てられた単語の位置 \(j\) にある母音フォルマントのセットを表すために \(w_{\ell j}\) を使用する. すなわち,

(37)\[P (\upsilon_{\ell j} = c | w, x, \upsilon^{\backslash \ell j} ) ∝ P (\upsilon_{\ell j} = c | \upsilon^{\backslash \ell j})p(w_{\ell j} | \upsilon_{\ell j} = c, w^{\backslash\ell j} )\]

The first (DP prior) factor is defined as:

注釈

第一 ( DP 事前分布 ) 因子は以下のように定義した:

(38)\[\begin{split} P (\upsilon_{\ell j} = c | \upsilon^{\backslash \ell j} ) = \left\{ \begin{array}{ll} \frac{n_c}{\Sigma_c n_c + \alpha_c} & \text{if } c \text{ exists} \\ \frac{\alpha_c}{\Sigma_c n_c + \alpha_c} & \text{if } c \text{ new} \end{array} \right.\end{split}\]

where \(n_c\) is the number of other vowels in the lexicon, \(\upsilon^{\backslash \ell j}\) , assigned to category \(c\). Note that there is always positive probability of creating a new category.

注釈

\(n_c\) は, 語彙中の他の母音 \(\upsilon^{\backslash \ell j}\) のうちカテゴリ \(c\) に割り当てられているものの数である. 新しいカテゴリを作る確率が常に正であることに注意してほしい.
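
式 (38) の DP 事前分布因子は, 既存カテゴリと新カテゴリの確率を数え上げから計算するだけなので, 次のような短い関数で確認できる (関数名・数値は説明用の仮のもの):

```python
def category_prior(counts, alpha_c):
    """式 (38) の DP 事前分布因子:
    既存カテゴリ c は n_c / (sum(n) + alpha_c), 新カテゴリは alpha_c / (sum(n) + alpha_c)."""
    total = sum(counts.values()) + alpha_c
    probs = {c: n_c / total for c, n_c in counts.items()}
    probs["new"] = alpha_c / total
    return probs

# 使用例: 他の語彙素母音がカテゴリ 0 に 5 個, カテゴリ 1 に 2 個割り当てられている場合
print(category_prior({0: 5, 1: 2}, alpha_c=1.0))
# -> {0: 0.625, 1: 0.25, 'new': 0.125}
```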

The likelihood of the vowels is calculated by marginalizing over all possible means and variances of the Gaussian category parameters, given the NIW prior. For a single point \((\text{if} | w_{\ell j} | = 1)\), this predictive posterior is in the form of a Student-t distribution; for the more general case see Feldman et al. (2013a), Eq. B3.

注釈

母音の尤度は, NIW 事前分布のもとで, ガウスカテゴリパラメータのありうるすべての平均と分散について周辺化することによって計算される. 一つの点 \((|w_{\ell j}| = 1)\) に対しては, この予測事後分布はスチューデントの t 分布の形になる. より一般的な場合については Feldman et al. (2013a), Eq. B3 を参照してほしい.

5.2 Sampling table & topic assignments

We jointly sample \(x\) and \(z\), the variables assigning tokens to tables and topics. Resampling the table assignment includes the possibility of changing to a table with a different lexeme or drawing a new table with a previously seen or novel lexeme. The joint conditional probability of a table and topic assignment, given all other current token assignments, is:

注釈

我々は \(x\)\(z\), テーブルや話題へ割り当てられるトークンの変数を 共にサンプリングした. テーブルの割り当てをリサンプリングすることは 異なる語彙素性を使用してテーブルに変更したり、 以前に見られたまたは新規語彙を持つ新しいテーブルを描画する可能性を含んでいる. すべての他のトークン割り当てが与えられた,テーブルと話題割り当ての同時条件付き確率は 以下の通り.

(39)\[\begin{split}P (x_{hi} = t, z_{hi} = k | w_{hi} , \theta_h , t^{\backslash hi} , \ell, w^{\backslash hi} ) \\ = P (k | \theta_h ) P ( t | k, \ell_t , t^{\backslash hi} ) \\ \prod_{c \in C} p (w_{hi \cdot} | \upsilon_{\ell_t \cdot} = c, w^{\backslash hi} )\end{split}\]

The first factor, the prior probability of topic \(k\) in document \(h\), is given by \(\theta_{hk}\) obtained from the LDA. The second factor is the prior probability of assigning word \(x_i\) to table \(t\) with lexeme \(\ell\) given topic \(k\). It is given by the HDP, and depends on whether the table \(t\) exists in the HDP topic-lexicon for \(k\) and, likewise, whether any table in the topic-lexicon has the lexeme \(\ell\):

注釈

第一因子, すなわち文書 \(h\) における話題 \(k\) の事前確率は, LDA から得られた \(\theta_{hk}\) によって与えられる. 第二因子は, 話題 \(k\) が与えられたとき, 単語 \(x_i\) を語彙素 \(\ell\) を持つテーブル \(t\) へ割り当てる事前確率である. これは HDP によって与えられ, 話題 \(k\) の HDP 話題語彙にテーブル \(t\) が存在するかどうか, また同様に, 話題語彙のいずれかのテーブルが語彙素 \(\ell\) を持つかどうかに依存する.

(40)\[\begin{split}P (t | k, \ell, t^{\backslash hi}) \propto \begin{cases} \frac{ n_{kt} }{ n_k + \alpha_k} & \text{if $t$ in $k$} \\ \frac{\alpha_k}{n_k+\alpha_k} \frac{m_{\ell}}{m+\alpha_l} & \text{if $t$ new, $\ell$ known} \\ \frac{\alpha_k}{n_k+\alpha_k} \frac{\alpha_{\ell}}{m+\alpha_l} & \text{if $t$ and $\ell$ new} \end{cases}\end{split}\]

Here \(n_{kt}\) is the number of other tokens at table \(t\), \(n_k\) are the total number of tokens in topic \(k\), \(m_{\ell}\) is the number of tables across all topics with the lexeme \(\ell\) , and \(m\) is the total number of tables.

注釈

ここで \(n_{kt}\) はテーブル \(t\) にいる他のトークンの数であり, \(n_k\) は話題 \(k\) に含まれるトークンの総数である. また, \(m_{\ell}\) は語彙素 \(\ell\) を持つテーブルの全話題にわたる数であり, \(m\) はテーブルの総数である.
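
式 (40) の 3 つの場合分けをそのまま書き下した仮のスケッチを示す. 数値は動作確認用の仮の値である:

```python
def table_prior(n_kt, n_k, m_l, m, alpha_k, alpha_l, case):
    """式 (40) の (非正規化) 事前確率のスケッチ.
    case は 't_exists' / 'new_table_known_lexeme' / 'new_table_new_lexeme'."""
    if case == "t_exists":
        return n_kt / (n_k + alpha_k)
    if case == "new_table_known_lexeme":
        return alpha_k / (n_k + alpha_k) * m_l / (m + alpha_l)
    if case == "new_table_new_lexeme":
        return alpha_k / (n_k + alpha_k) * alpha_l / (m + alpha_l)
    raise ValueError(case)

# 使用例: 話題 k に 20 トークン, 対象テーブルに 3 トークン, 語彙素 l のテーブルが全体 50 個中 4 個の場合
print(table_prior(n_kt=3, n_k=20, m_l=4, m=50, alpha_k=1.0, alpha_l=1.0, case="t_exists"))
print(table_prior(n_kt=0, n_k=20, m_l=4, m=50, alpha_k=1.0, alpha_l=1.0, case="new_table_known_lexeme"))
```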

The third factor, the likelihood of the vowel formants \(w_{hi}\) in the categories given by the lexeme \(\upsilon_{\ell}\) , is of the same form as the likelihood of vowel categories when resampling lexeme vowel assignments. However, here it is calculated over the set of vowels in the token assigned to each vowel category (i.e., the vowels at indices where \(\upsilon_{\ell_t \cdot} = c\) ). For a new lexeme, we approximate the likelihood using 100 samples drawn from the prior, each weighted by \(\alpha/100\) (Neal, 2000).

注釈

第三因子は, 語彙素 \(\upsilon_{\ell}\) によって与えられたカテゴリに含まれる母音フォルマント \(w_{hi}\) の尤度であり, 語彙素の母音割り当てをリサンプリングするときの母音カテゴリの尤度と同じ形である. ただしここでは, トークン中の母音のうち各母音カテゴリに割り当てられたもの (すなわち \(\upsilon_{\ell_t \cdot} = c\) となる位置の母音) の集合に対して計算される. 新しい語彙素に対しては, 事前分布から描いた 100 サンプル (それぞれ \(\alpha/100\) で重み付け) を使用して尤度を近似する (Neal, 2000).

5.3 Hyperparameters

The three hyperparameters governing the HDP over the lexicon, \(\alpha_{\ell}\) and \(\alpha_{k}\) , and the DP over vowel categories, \(\alpha_c\) , are estimated using a slice sampler. The remaining hyperparameters for the vowel category and lexeme priors are set to the same values used by Feldman et al. (2013a).

注釈

語彙 \(\alpha_{\ell}\)\(\alpha_{k}\) に対する HDP , そして母音カテゴリに対する DP を決める 3つのハイパラメータ は スライスサンプラー を使用して推定された. 語彙素性と母音カテゴリの事前分布に対するハイパラメータは Feldman et al. (2013a) で 使用された値と同じ値のセットである.

6 Experiments
6.1 Corpus

We test our model on situated child directed speech, taken from the C1 section of the Brent corpus in CHILDES (Brent and Siskind, 2001; MacWhinney, 2000). This corpus consists of transcripts of speech directed at infants between the ages of 9 and 15 months, captured in a naturalistic setting as parent and child went about their day. This ensures variability of situations.

注釈

我々は対児童音声の上でモデルのテストを行った. これは CHILDES に含まれる Brent コーパスのセクション C1 から取ったものである (Brent and Siskind, 2001; MacWhinney, 2000). このコーパスは 9-15ヶ月の乳児に対して話しかけられた発話の転記テキストで構成されており, 親子が日常を過ごす自然な環境で収録されたものである. これは状況の多様性を保証する.

Utterances with unintelligible words or quotes are removed. We restrict the corpus to content words by retaining only words tagged as adj, n, part and v (adjectives, nouns, particles, and verbs). This is in line with evidence that infants distinguish content and function words on the basis of acoustic signals (Shi and Werker, 2003). Vowel categorization improves when attending only to more prosodically and phonologically salient tokens (Adriaans and Swingley, 2012), which generally appear within content, not function words. The final corpus consists of 13138 tokens and 1497 word types.

注釈

理解できない単語や引用を含む発話は削除した. 我々は, adj, n, part, v (形容詞, 名詞, 不変化詞, 動詞) としてタグ付けられた単語だけを保持することで, コーパスを内容語のみに制限した. これは乳児が音響信号に基づいて内容語と機能語を区別しているという証拠に沿ったものである (Shi and Werker, 2003). 母音のカテゴリ化は, より韻律的・音韻的に顕著なトークンのみに注目したときに向上し (Adriaans and Swingley, 2012), そうしたトークンは一般に機能語ではなく内容語に現れる. 最終的なコーパスは 13138 トークン, 1497 の単語タイプからなる.

6.2 Hillenbrand Vowels

The transcripts do not include phonetic information, so, following Feldman et al. (2013a), we synthesize the formant values using data from Hillenbrand et al. (1995). This dataset consists of a set of 1669 manually gathered formant values from 139 American English speakers (men, women and children) for 12 vowels. For each vowel category, we construct a Gaussian from the mean and covariance of the datapoints belonging to that category, using the first and second formant values measured at steady state. We also construct a second dataset using only datapoints from adult female speakers.

注釈

転記テキストは音声情報を含んでいない. そのため, Feldman et al. (2013a) にしたがって, Hillenbrand et al. (1995) のデータを使用してフォルマントの値を合成した. このデータセットは, 12 種類の母音について 139 名のアメリカ英語話者 (男性, 女性, 子供) から手動で収集された 1669 個のフォルマント値からなる. それぞれの母音カテゴリに対して, 定常状態で測定された第一, 第二フォルマントの値を使用して, そのカテゴリに属するデータポイントの平均と共分散からガウス分布を構築した. 我々は, 成人女性話者のデータポイントのみを使用した第二のデータセットも構築した.

Each word in the dataset is converted to a phonemic representation using the CMU pronunciation dictionary, which returns a sequence of Arpabet phoneme symbols. If there are multiple possible pronunciations, the first one is used. Each vowel phoneme in the word is then replaced by formant values drawn from the corresponding Hillenbrand Gaussian for that vowel.

注釈

データセットに含まれる各単語は, Arpabet 音素記号の列を返す CMU 発音辞書を使用して音素表現に変換された. 複数の発音候補がある場合には最初のものを使用した. 続いて, 単語に含まれる各母音音素は, その母音に対応する Hillenbrand のガウス分布から描かれたフォルマント値に置き換えられた.
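
「母音カテゴリごとにガウス分布を構築し, Arpabet 音素列の母音をそこからサンプリングしたフォルマント値に置き換える」という流れの仮のスケッチを以下に示す. Hillenbrand のデータそのものではなく, 代わりに置いた仮の測定値を使っている:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hillenbrand 形式の測定値 (母音カテゴリごとの F1, F2) の代わりに置いた仮のデータ
measurements = {
    "IY": np.array([[342.0, 2322.0], [320.0, 2450.0], [360.0, 2280.0]]),
    "AE": np.array([[588.0, 1952.0], [610.0, 1900.0], [570.0, 2010.0]]),
}

# 各母音カテゴリについて F1, F2 の平均と共分散からガウス分布を構築する
vowel_gaussians = {v: (x.mean(axis=0), np.cov(x, rowvar=False)) for v, x in measurements.items()}

def synthesize(arpabet_phones):
    """Arpabet 音素列の母音を, 対応するガウス分布からサンプリングしたフォルマント値に置き換える."""
    frame, vowels = [], []
    for p in arpabet_phones:
        base = p.rstrip("012")               # ストレス記号を除去
        if base in vowel_gaussians:
            mu, Sigma = vowel_gaussians[base]
            vowels.append(rng.multivariate_normal(mu, Sigma))
            frame.append("_")
        else:
            frame.append(base.lower())
    return "".join(frame), vowels

print(synthesize(["K", "AE1", "T"]))   # "cat" の発音 -> フレーム "k_t" と母音ベクトル 1 つ
```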

6.3 Merging Consonant Categories

The Arpabet encoding used in the phonemic representation includes 24 consonants. We construct datasets both using the full set of consonants — the ‘C24’ dataset — and with less fine-grained consonant categories. Distinguishing all consonant categories assumes perfect learning of consonants prior to vowel categorization and is thus somewhat unrealistic (Polka and Werker, 1994), but provides an upper limit on the information that word-contexts can give.

注釈

音素表現で使用された Arpabet エンコーディングには 24 個の子音が含まれている. 我々は, 子音のフルセットを使用した ‘C24’ データセットと, より粗い子音カテゴリを使用したデータセットの両方を構築した. すべての子音カテゴリを区別することは, 母音の分類の前に子音の学習が完璧に済んでいることを前提としており, したがってやや非現実的である (Polka and Werker, 1994) が, 単語コンテキストが与えうる情報の上限を提供する.

In the ‘C15’ dataset, the voicing distinction is collapsed, leaving 15 consonant categories. The collapsed categories are B/P, G/K, D/T, CH/JH, V/F, TH/DH, S/Z, SH/ZH, R/L while HH, M, NG, N, W, Y remain separate phonemes. This dataset mirrors the finding in Mani and Plunkett (2010) that 12 month old infants are not sensitive to voicing mispronunciations.

注釈

‘C15’ データセットでは有声・無声の区別が統合され, 15 個の子音カテゴリが残る. 統合されたカテゴリは B/P, G/K, D/T, CH/JH, V/F, TH/DH, S/Z, SH/ZH, R/L であり, 一方 HH, M, NG, N, W, Y は別々の音素のままである. このデータセットは, 12ヶ月児は有声性の誤発音に敏感ではないという Mani and Plunkett (2010) の知見を反映している.

The ‘C6’ dataset distinguishes between only 6 coarse consonant phonemes, corresponding to stops (B,P,G,K,D,T), affricates (CH,JH), fricatives (V, F, TH, DH, S, Z, SH, ZH, HH), nasals (M, NG, N), liquids (R, L), and semivowels/glides (W, Y). This dataset makes minimal assumptions about the categories that infants could use in this learning setting.

注釈

‘C6’ データセットは閉鎖音(B、P、G、K、D、T), 破擦音(CH、JH), 摩擦音(V、F、TH、DH、S、Z、SH、ZH、HH), 鼻音(M、NG、N), 流音(R、L), 半母音/グライド(W、Y) に対応する6つの大まかな子音音素のみを区別する. このデータセットはこの学習セットで乳児が使用することができるカテゴリについての最小限の仮定を作る.
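
C15 / C6 の子音統合は単純な対応表で表現できる. 以下は本文の記述に基づく仮のスケッチである (辞書名・関数名は説明用):

```python
# C15: 有声/無声の区別を統合するマッピング (本文の記述どおり)
C15_MERGE = {"P": "B", "K": "G", "T": "D", "JH": "CH", "F": "V",
             "DH": "TH", "Z": "S", "ZH": "SH", "L": "R"}

# C6: 6 つの大まかな調音クラスへのマッピング
C6_CLASSES = {
    "STOP":      ["B", "P", "G", "K", "D", "T"],
    "AFFRICATE": ["CH", "JH"],
    "FRICATIVE": ["V", "F", "TH", "DH", "S", "Z", "SH", "ZH", "HH"],
    "NASAL":     ["M", "NG", "N"],
    "LIQUID":    ["R", "L"],
    "GLIDE":     ["W", "Y"],
}
C6_MERGE = {c: cls for cls, cs in C6_CLASSES.items() for c in cs}

def merge_frame(consonants, merge_map):
    """フレーム中の子音を指定の粒度に写像する."""
    return [merge_map.get(c, c) for c in consonants]

print(merge_frame(["B", "T"], C15_MERGE))   # ['B', 'D'] -> "bat" が "pad/bad/put" と同じフレームになる
print(merge_frame(["B", "T"], C6_MERGE))    # ['STOP', 'STOP'] -> "dog" や "kite" とも一致する
```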

Decreasing the number of consonants increases the ambiguity in the corpus: bat not only shares a frame (b t) with boat and bite, but also, in the C15 dataset, with put, pad and bad (b/p d/t), and in the C6 dataset, with dog and kite, among many others (STOP STOP). Table 1 shows the percentage of types and tokens that are ambiguous in each dataset, that is, words in frames that match multiple word types. Note that we always evaluate against the gold word identities, even when these are not distinguished in the model’s input. These datasets are intended to evaluate the degree of reliance on consonant information in the LD and TLD models, and to what extent the topics in the TLD model can replace this information.

注釈

子音の数を減らすと, コーパス中の曖昧さが増加する. /bat/ は /boat/ や /bite/ とフレーム /b t/ を共有するだけでなく, C15 データセットでは /put/, /pad/, /bad/ と (b/p d/t), C6 データセットでは /dog/ や /kite/ その他多数と (STOP STOP) フレームを共有する. 表1に各データセットにおける曖昧なタイプとトークン, すなわち複数の単語タイプと一致するフレームを持つ単語の割合を示す. なお, モデルの入力で区別されない場合でも, 評価は常に正解の単語に対して行う. これらのデータセットは, LD および TLD モデルが子音情報にどの程度依存しているか, また TLD モデルの話題がこの情報をどの程度置き換えられるかを評価することを意図している.

Table 1 : 曖昧性(マージされた子音カテゴリ) の増加を示したコーパス統計
Dataset C24 C15 C6
Input Types 1487 1426 1203
Frames 1259 1078 702
Ambig Types % 27.2 42.0 80.4
Ambig Tokens % 41.3 56.9 77.2
6.4 Topics

The input to the TLD model includes a distribution over topics for each situation, which we infer in advance from the full Brent corpus (not only the C1 subset) using LDA. Each transcript in the Brent corpus captures about 75 minutes of parent-child interaction, and thus multiple situations will be included in each file. The transcripts do not delimit situations, so we do this somewhat arbitrarily by splitting each transcript after 50 CDS utterances, resulting in 203 situations for the Brent C1 dataset. As well as function words, we also remove the five most frequent content words (be, go, get, want, come). On average, situations are only 59 words long, reflecting the relative lack of content words in CDS utterances.

注釈

TLD モデルへの入力は各状況に対する話題の分布を含んでいる. この話題分布は, LDA を使用して Brent コーパス全体 (C1 サブセットのみではなく) から事前に推定した. Brent コーパスの各転記は約 75 分の親子のインタラクションを収録しており, そのため各ファイルには複数の状況が含まれる. 転記は状況を区切っていないため, 我々はやや恣意的に, 各転記を 50 CDS 発話ごとに分割し, Brent C1 データセットに対して 203 の状況を得た. 機能語に加えて, 最も頻出する 5 つの内容語 (be, go, get, want, come) も除外した. CDS 発話中の内容語の相対的な少なさを反映して, 状況は平均わずか 59 単語分の長さであった.

Input types are the number of word types with distinct input representations (as opposed to gold orthographic word types, of which there are 1497). Ambiguous types and tokens are those with frames that match multiple (orthographic) word types.

注釈

入力タイプとは, 異なる入力表現を持つ単語タイプの数である (正書法上の正解単語タイプが 1497 個あるのと対照的に). 曖昧なタイプとトークンとは, 複数の (正書法上の) 単語タイプと一致するフレームを持つものを指す.

We infer 50 topics for this set of situations using the mallet toolkit (McCallum, 2002). Hyperparameters are inferred, which leads to a dominant topic that includes mainly light verbs (have, let, see, do). The other topics are less frequent but capture stronger semantic meaning (e.g. yummy, peach, cookie, daddy, bib in one topic, shoe, let, put, hat, pants in another). The word-topic assignments are used to calculate unsmoothed situation-topic distributions \(theta\) used by the TLD model.

注釈

我々は mallet ツールキットを使用して, これらの状況のセットに対する50のトピックを推測した (McCallum, 2002). ハイパーパラメータも推定しており, その結果, 主に軽動詞 (have, let, see, do) を含む支配的な話題が1つ生じた. その他の話題は頻度こそ低いが, より強い意味的なまとまりを捉えている (例えば, ある話題では yummy, peach, cookie, daddy, bib, 別の話題では shoe, let, put, hat, pants など). 単語-話題の割り当ては, TLD モデルが使用する平滑化されていない状況-話題分布 \(\theta\) を計算するために用いられる.
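
参考までに, 状況ごとの発話を一つの文書とみなして LDA を学習し, 状況-話題分布 \(\theta\) を得る流れを scikit-learn で置き換えた小さなスケッチを示す. 本文では mallet を使用しており, また本文の \(\theta\) は単語-話題割り当てから平滑化なしで計算されている点が異なる. situations の中身や話題数はここでは仮のものである::

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # 各「状況」 = CDS 発話 50 個分をまとめた文書 (内容は説明用の仮のもの)
    situations = [
        "yummy peach cookie daddy bib eat spoon",
        "shoe hat pants put dress sock",
        "doggie ball throw catch play fetch",
    ]

    vec = CountVectorizer()          # 機能語や高頻度の内容語は除去済みという想定
    X = vec.fit_transform(situations)

    lda = LatentDirichletAllocation(n_components=5, random_state=0)  # 本文では 50 話題
    lda.fit(X)

    theta = lda.transform(X)         # 各行が状況ごとの話題比率 (正規化済み)
    print(theta.shape)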

6.5 Evaluation

We evaluate against adult categories, i.e., the ‘gold standard’, since all learners of a language eventually converge on similar categories. (Since our model is not a model of the learning process, we do not compare the infant learning process to the learning algorithm.) We evaluate both the inferred phonetic categories and words using the clustering evaluation measure V-Measure (VM; Rosenberg and Hirschberg, 2007). [8] VM is the harmonic mean of two components, similar to F-score, where the components (VC and VH) are measures of cross entropy between the gold and model categorization.

注釈

言語のすべての学習者は最終的には類似したカテゴリに収束するので, 私たちは大人のカテゴリ, つまり「ゴールドスタンダード」に対して評価を行った. (我々のモデルは学習プロセスのモデルではないので, 幼児の学習プロセスと学習アルゴリズムの比較は行わない.) 我々は推定された音素カテゴリと単語の両方をクラスタリングの評価尺度 V-Measure によって評価した (VM; Rosenberg and Hirschberg, 2007). [8] VM は F値と同様に2つの要素の調和平均であり, その要素 (VC と VH) は正解とモデルの分類の間の交差エントロピーに基づく尺度である.

For vowels, VM measures how well the inferred phonetic categorizations match the gold categories; for lexemes, it measures whether tokens have been assigned to the same lexemes both by the model and the gold standard. Words are evaluated against gold orthography, so homophones, e.g. hole and whole, are distinct gold words.

注釈

母音については, 推定された音素カテゴリが正解カテゴリとどのくらいよく一致するかを VM で計測している. 語彙素については, トークンがモデルと正解の両方で同じ語彙素に割り当てられたか否かで計測した. 単語は正解の正書法 (表記) に対して評価されるため, 同音異義語 (hole と whole など) は別々の正解単語として扱われる.
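
V-Measure とその2つの成分は scikit-learn で直接計算できる. 以下はトークンごとの正解ラベルとモデルの割り当てから VM を求めるだけの例であり, ラベル列は仮のものである. なお本文の「VH = precision 的な成分, VC = recall 的な成分」という言い方に合わせ, ここでは VH を homogeneity, VC を completeness に対応づけている::

    from sklearn.metrics import homogeneity_completeness_v_measure

    # 正解 (gold) の単語 ID とモデルが割り当てた語彙素 ID (いずれも仮の例)
    gold  = [0, 0, 1, 2, 2, 3]
    model = [0, 0, 0, 1, 1, 2]

    vh, vc, vm = homogeneity_completeness_v_measure(gold, model)
    print("VH=%.3f VC=%.3f VM=%.3f" % (vh, vc, vm))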

脚注

[8]

Other clustering measures, such as 1-1 matching and pairwise precision and recall (accuracy and completeness) showed the same trends, but VM has been demonstrated to be the most stable measure when comparing solutions with varying numbers of clusters (Christodoulopoulos et al., 2010).

他のクラスタリング尺度,例えば 1-1 マッチング や ペアごとの recall, precision ( accuracy や completeness ) でも同じ傾向を示すが, VM はクラスターの数が異なるソリューションを比較する際に 最も安定性のある尺度であることが示されている(Christodoulopoulos et al., 2010).

6.6 Results

We compare all three models—TLD, LD, and IGMM—on the vowel categorization task, and TLD and LD on the lexical categorization task (since IGMM does not infer a lexicon). The datasets correspond to two sets of conditions: firstly, either using vowel categories synthesized from all speakers or only adult female speakers, and secondly, varying the coarseness of the observed consonant categories. Each condition (model, vowel speakers, consonant set) is run five times, using 1500 iterations of Gibbs sampling with hyperparameter sampling. Overall, we find that TLD outperforms the other models in both tasks, across all conditions.

注釈

我々は 3つのモデル —TLD, LD, IGMM— すべてを母音分類タスクで比較し, TLD モデルと LD モデルを語彙の分類タスクで比較した (IGMM は語彙を推定しないため). データセットは2組の条件に対応している. まず, すべての話者から合成した母音カテゴリを使用するか, 成人女性話者のみから合成した母音カテゴリを使用するかという条件. つづいて, 観察された子音カテゴリの粗さを変化させる条件である. 各条件 (モデル, 母音話者, 子音セット) は五回ずつ実行し, ハイパーパラメータのサンプリングを伴う Gibbs サンプリングを 1500 回反復した. 総括すると, TLD モデルはすべての条件, 両方のタスクで他のモデルよりも良い性能であることが分かった.

Vowel categorization results are shown in :num:`Figure #fig3`. IGMM performs substantially worse than both TLD and LD, with scores more than 30 points lower than the best results for these models, clearly showing the value of the protolexicon and replicating the results found by Feldman et al. (2013a) on this dataset. Furthermore, TLD consistently outperforms the LD model, finding better phonetic categories, both for vowels generated from the combined categories of all speakers (‘all’) and vowels generated from adult female speakers only (‘w’), although the latter are clearly much easier for both models to learn. Both models perform less well when the consonant frames provide less information, but the TLD model performance degrades less than the LD performance.

注釈

母音のカテゴリ化の結果を :num:`図 #fig3` に示す. IGMM は TLD, LD の双方より大幅に悪い成績であり, これらのモデルの最良の結果よりも 30 ポイント以上スコアが低い. これはプロト語彙 (protolexicon) の価値をはっきりと示しており, このデータセットにおける Feldman et al. (2013a) の結果を再現するものである. 更に, TLD モデルは常に LD モデルよりも性能がよく, 全ての話者のカテゴリを合わせて生成した母音 (‘all’) と成人女性の発話のみから生成した母音 (‘w’) の両方で, より良い音素カテゴリを発見できた (後者の方が両モデルにとって明らかに学習が容易ではある). 子音フレームの提供する情報が少ない場合, 両モデルの性能は低下するが, その低下は TLD モデルの方が LD モデルよりも小さい.

_images/fig33.png

母音の評価

  • ‘all’ : すべての話者からの合成した母音セットを使用
  • ‘w’ : 成人女性の母音から合成した母音セットを使用

バーは五回の実行を基準に95%信頼区間を示す. ここには示していないが, IGMMの結果は以下の通りである.

  • IGMM-all : VM score of 53.9 (CI=0.5)
  • IGMM-w : VM score of 65.0 (CI=0.2)

The gold-standard vowels are shown in gold in the background but are mostly overlapped by the inferred categories.

注釈

ゴールドスタンダードの母音は背景に金色で示されているが, そのほとんどは推定されたカテゴリによって覆われている.

Both the TLD and the LD models find ‘supervowel’ categories, which cover multiple vowel categories and are used to merge minimal pairs into a single lexical item. :num:`Figure #fig4` shows example vowel categories inferred by the TLD model, including two supervowels. The TLD supervowels are used much less frequently than the supervowels found by the LD model, containing, on average, only two-thirds as many tokens.

注釈

TLD, LD モデルは両方とも「上位母音 (supervowel)」カテゴリを発見している. これは複数の母音カテゴリにまたがるもので, ミニマルペアを単一の語彙項目に統合するために使用されている. :num:`図 #fig4` に TLD モデルによって推定された母音カテゴリの例を, 2つの上位母音を含めて示す. TLD モデルの上位母音は LD モデルが発見した上位母音よりもはるかに使用頻度が低く, 含まれるトークン数は平均して \(\frac{2}{3}\) 程度にとどまっている.

_images/fig43.png

TLD モデルによって発見された母音:

上位母音を赤く示す. 正解の母音は背景に黄色で示すが, そのほとんどは推定されたカテゴリによって覆われている.

:num:`Figure #fig5` shows that TLD also outperforms LD on the lexeme/word categorization task. Again performance decreases as the consonant categories become coarser, but the additional semantic information in the TLD model compensates for the lack of consonant information. In the individual components of VM, TLD and LD have similar VC (‘recall’), but TLD has higher VH (‘precision’), demonstrating that the semantic information given by the topics can separate potentially ambiguous words, as hypothesized.

注釈

:num:`図 #fig5` に示すように, TLD モデルは語彙素/単語のカテゴリ化タスクでも LD モデルよりも優れている. ここでも子音カテゴリが粗くなるほど性能の低下が見られるが, TLD モデルに追加された意味情報が子音情報の不足を補償している. VM の個々の成分を確認すると, TLD と LD は類似した VC (‘recall’) を持つが, TLD はより高い VH (‘precision’) を有している. この結果は, 話題によって与えられた意味情報が潜在的にあいまいな単語を分離できるという仮説を実証するものである.

_images/fig52.png

語彙素の評価

  • ‘all’ : すべての話者から合成された母音を含むデータセットを使用
  • ‘w’ : 成人女性の母音から合成された母音を含むデータセットを使用

Overall, the contextual semantic information added in the TLD model leads to both better phonetic categorization and to a better protolexicon, especially when the input is noisier, using degraded consonants. Since infants are not likely to have perfect knowledge of phonetic categories at this stage, semantic information is a potentially rich source of information that could be drawn upon to offset noise from other domains. The form of the semantic information added in the TLD model is itself quite weak, so the improvements shown here are in line with what infant learners could achieve.

注釈

結論として, TLD モデルで追加された文脈の意味情報は, 音素カテゴリ化とプロト語彙の両方をより良いものにする. これは子音情報が劣化してノイズが多いときに特に顕著になる. 乳児はこの段階で音素カテゴリの完全な知識を持っているとは考えにくいため, 意味情報は他のドメインからのノイズを相殺するために利用できる, 潜在的に豊かな情報源である. TLD モデルで追加された意味情報の形式はそれ自体は非常に弱いものであり, ここに示した改善は幼児の学習者が達成しうる範囲と一致している.

7 Conclusion

Language acquisition is a complex task, in which many heterogeneous sources of information may be useful. In this paper, we investigated whether contextual semantic information could be of help when learning phonetic categories. We found that this contextual information can improve phonetic learning performance considerably, especially in situations where there is a high degree of phonetic ambiguity in the word-forms that learners hear. This suggests that previous models that have ignored semantic information may have underestimated the information that is available to infants. Our model illustrates one way in which language learners might harness the rich information that is present in the world without first needing to acquire a full inventory of word meanings.

注釈

言語獲得は複雑な課題であり, そこでは多くの異質な情報源が有用となりうる. 本稿では, 文脈の意味的な情報が音素のカテゴリ学習を行う際に有益なのか否かを調査した. われわれは, この文脈情報が音素学習のパフォーマンスをかなり向上させうること, 特に学習者が耳にする語形の中に音素的な曖昧性が多い場合にそうであることを発見した. この結果は, 意味情報を無視してきた先行モデルが, 乳児の利用できる情報を過小評価していたかもしれないことを示唆する. 我々のモデルは, 言語の学習者が単語の意味の完全なインベントリを先に獲得することなく, 世界に存在する豊かな情報を活用しうる方法の一つを示した.

The contextual semantic information that the TLD model tracks is similar to that potentially used in other linguistic learning tasks. Theories of cross-situational word learning (Smith and Yu, 2008; Yu and Smith, 2007) assume that sensitivity to situational co-occurrences between words and non-linguistic contexts is a precursor to learning the meanings of individual words. Under this view, contextual semantics is available to infants well before they have acquired large numbers of semantic minimal pairs. However, recent experimental evidence indicates that learners do not always retain detailed information about the referents that are present in a scene when they hear a word (Medina et al., 2011; Trueswell et al., 2013). This evidence poses a direct challenge to theories of cross-situational word learning. Our account does not necessarily require learners to track co-occurrences between words and individual objects, but instead focuses on more abstract information about salient events and topics in the environment; it will be important to investigate to what extent infants encode this information and use it in phonetic learning.

注釈

TLD モデルが追跡する文脈の意味的な情報は, 他の言語学習タスクで潜在的に使用される情報と類似している. クロス状況的単語学習の理論 (Smith and Yu, 2008; Yu and Smith, 2007) は, 単語と非言語的コンテキストとの状況的な共起への感受性が, 個々の単語の意味を学習するための前段階であると想定している. この視点に立つと, 文脈的な意味情報は, 意味的なミニマルペアを大量に獲得するよりずっと前から乳児に利用可能であることになる. しかし, 最近の実験的証拠は, 学習者は単語を聞いたとき, その場面に存在する指示対象についての詳細な情報を常に保持しているわけではないことを示している (Medina et al., 2011; Trueswell et al., 2013). この証拠はクロス状況的単語学習の理論への直接的な課題を提起する. 我々の説明は, 学習者が単語と個々の対象との共起を追跡することを必ずしも必要とせず, 代わりに環境中の顕著なイベントや話題についてのより抽象的な情報に焦点を当てている. 乳児がどの程度この情報をエンコードし, 音素学習に使用するのかを調査することが今後重要になるだろう.

Regardless of the specific way in which infants encode semantic information, our method of adding this information by using LDA topics from transcript data was shown to be effective. This method is practical because it can approximate semantic information without relying on extensive manual annotation.

注釈

乳児が意味的情報をコード化する特定の方法がどのようなものであるにしろ, 転記情報から LDA トピックモデルを使用することによって 意味的情報を付加する手法は有効であることが示された. この方法は大規模な手動による注釈に依存せず, 意味情報を近似することができるため実用的な方法である.

The LD model extended the phonetic categorization task by adding word contexts; the TLD model presented here goes even further, adding larger situational contexts. Both forms of top-down information help the low-level task of classifying acoustic signals into phonetic categories, furthering a holistic view of language learning with interaction across multiple levels.

注釈

LD モデルは単語コンテキストを追加することで音素カテゴリ化タスクを拡張し, 本稿で紹介した TLD モデルはより大きな状況コンテキストを追加することでさらに先に進んだ. どちらの形式のトップダウン情報も, 音響信号を音素カテゴリに分類するという低レベルのタスクを助け, 複数のレベルが相互作用する言語学習の全体論的な見方を推し進めるものである.

Acknowledgments

This work was supported by EPSRC grant EP/H050442/1 and a James S. McDonnell Foundation Scholar Award to the final author.

References
  • Frans Adriaans and Daniel Swingley. Distributional learning of vowel categories is supported by prosody in infant-directed speech. In Proceedings of the 34th Annual Conference of the Cognitive Science Society (CogSci), 2012.
  • E. Bergelson and D. Swingley. At 6-9 months, human infants know the meanings of many common nouns. Proceedings of the National Academy of Sciences, 109(9):3253-3258, Feb 2012.
  • David M. Blei, Thomas L. Griffiths, Michael I. Jordan, and Joshua B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems 16, 2003.
  • Michael R. Brent and Jeffrey M. Siskind. The role of exposure to isolated words in early vocabulary development. Cognition, 81(2):B33–B44, 2001.
  • Christos Christodoulopoulos, Sharon Goldwater, and Mark Steedman. Two decades of unsupervised POS induction: How far have we come? In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 575–584, Cambridge, MA, October 2010.
  • Bart de Boer and Patricia K. Kuhl. Investigating the role of infant-directed speech with a computer model. Acoustics Research Letters Online, 4(4):129, 2003.
  • Brian Dillon, Ewan Dunbar, and William Idsardi. A single-stage approach to learning phonological categories: Insights from Inuktitut. Cognitive Science, 37(2):344–377, Mar 2013.
  • Micha Elsner, Sharon Goldwater, Naomi Feldman, and Frank Wood. A cognitive model of early lexical acquisition with phonetic variability. In Proceedings of the 18th Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013.
  • Naomi H. Feldman, Thomas L. Griffiths, Sharon Goldwater, and James L. Morgan. A role for the developing lexicon in phonetic category acquisition. Psychological Review, 2013a.
  • Naomi H. Feldman, Emily B. Myers, Katherine S. White, Thomas L. Griffiths, and James L. Morgan. Word-level information influences phonetic learning in adults and infants. Cognition, 127(3):427–438, 2013b.
  • Abdellah Fourtassi and Emmanuel Dupoux. A rudimentary lexicon and semantics help bootstrap phoneme acquisition. Submitted.
  • Michael C. Frank, Noah D. Goodman, and Joshua B. Tenenbaum. Using speakers’ referential intentions to model early cross-situational word learning. Psychological Science, 20(5):578–585, 2009.
  • Manuela Friedrich and Angela D. Friederici. Word learning in 6-month-olds: Fast encoding—weak retention. Journal of Cognitive Neuroscience, 23(11):3228–3240, Nov 2011.
  • Lakshmi J. Gogate and Lorraine E. Bahrick. Intersensory redundancy and 7-month-old infants’ memory for arbitrary syllable-object relations. Infancy, 2(2):219–231, Apr 2001.
  • J. Hillenbrand, L. A. Getty, M. J. Clark, and K. Wheeler. Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America, 97(5 Pt 1):3099–3111, May 1995.
  • P. W. Jusczyk and Elizabeth A. Hohne. Infants’ memory for spoken words. Science, 277(5334):1984–1986, Sep 1997.
  • Patricia K. Kuhl, Karen A. Williams, Francisco Lacerda, Kenneth N. Stevens, and Bjorn Lindblom. Linguistic experience alters phonetic perception in infants by 6 months of age. Science, 255(5044):606–608, 1992.
  • Brian MacWhinney. The CHILDES Project: Tools for Analyzing Talk. Lawrence Erlbaum Associates, 2000.
  • D. R. Mandel, P. W. Jusczyk, and D. B. Pisoni. Infants’ recognition of the sound patterns of their own names. Psychological Science, 6(5):314–317, Sep 1995.
  • Nivedita Mani and Kim Plunkett. Twelve-month-olds know their cups from their keps and tups. Infancy, 15(5):445–470, Sep 2010.
  • Jessica Maye, Daniel J. Weiss, and Richard N. Aslin. Statistical phonetic learning in infants: facilitation and feature generalization. Developmental Science, 11(1):122–134, Jan 2008.
  • Jessica Maye, Janet F. Werker, and LouAnn Gerken. Infant sensitivity to distributional information can affect phonetic discrimination. Cognition, 82(3):B101–B111, Jan 2002.
  • Andrew McCallum. MALLET: A machine learning for language toolkit, 2002.
  • Bob McMurray, Richard N. Aslin, and Joseph C. Toscano. Statistical learning of phonetic categories: insights from a computational approach. Developmental Science, 12(3):369–378, May 2009.
  • Tamara Nicol Medina, Jesse Snedeker, John C. Trueswell, and Lila R. Gleitman. How words can and cannot be learned by observation. Proceedings of the National Academy of Sciences, 108(22):9014–9019, 2011.
  • Radford Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249–265, 2000.
  • Linda Polka and Janet F. Werker. Developmental changes in perception of nonnative vowel contrasts. Journal of Experimental Psychology: Human Perception and Performance, 20(2):421–435, 1994.
  • Carl Rasmussen. The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems 13, 2000.
  • Andrew Rosenberg and Julia Hirschberg. V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 12th Conference on Empirical Methods in Natural Language Processing (EMNLP), 2007.
  • Brandon C. Roy, Michael C. Frank, and Deb Roy. Relating activity contexts to early word learning in dense longitudinal data. In Proceedings of the 34th Annual Conference of the Cognitive Science Society (CogSci), 2012.
  • Rushen Shi and Janet F. Werker. The basis of preference for lexical words in 6-month-old infants. Developmental Science, 6(5):484–488, 2003.
  • M. Shukla, K. S. White, and R. N. Aslin. Prosody guides the rapid mapping of auditory word forms onto visual objects in 6-mo-old infants. Proceedings of the National Academy of Sciences, 108(15):6038–6043, Apr 2011.
  • Linda B. Smith and Chen Yu. Infants rapidly learn word-referent mappings via cross-situational statistics. Cognition, 106(3):1558–1568, 2008.
  • Christine L. Stager and Janet F. Werker. Infants listen for more phonetic detail in speech perception than in word-learning tasks. Nature, 388:381–382, 1997.
  • D. Swingley. Contributions of infant word learning to language development. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1536):3617–3632, Nov 2009.
  • Yee Whye Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics (ACL), pages 985–992, Sydney, 2006.
  • Tuomas Teinonen, Richard N. Aslin, Paavo Alku, and Gergely Csibra. Visual speech contributes to phonetic learning in 6-month-old infants. Cognition, 108:850–855, 2008.
  • Erik D. Thiessen. The effect of distributional information on children’s use of phonemic contrasts. Journal of Memory and Language, 56(1):16–34, Jan 2007.
  • R. Tincoff and P. W. Jusczyk. Some beginnings of word comprehension in 6-month-olds. Psychological Science, 10(2):172–175, Mar 1999.
  • Ruth Tincoff and Peter W. Jusczyk. Six-month-olds comprehend words that refer to parts of the body. Infancy, 17(4):432–444, Jul 2012.
  • N. S. Trubetzkoy. Grundzüge der Phonologie. Vandenhoeck und Ruprecht, Göttingen, 1939.
  • John C. Trueswell, Tamara Nicol Medina, Alon Hafri, and Lila R. Gleitman. Propose but verify: Fast mapping meets cross-situational word learning. Cognitive Psychology, 66:126–156, 2013.
  • G. K. Vallabha, J. L. McClelland, F. Pons, J. F. Werker, and S. Amano. Unsupervised learning of vowel categories from infant-directed speech. Proceedings of the National Academy of Sciences, 104(33):13273–13278, Aug 2007.
  • Janet F. Werker and Richard C. Tees. Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7:49–63, 1984.
  • H. Henny Yeung and Janet F. Werker. Learning words’ sounds before learning how words sound: 9-month-olds use distinct objects as cues to categorize speech information. Cognition, 113(2):234–243, Nov 2009.
  • Chen Yu and Linda B. Smith. Rapid word learning under uncertainty via cross-situational statistics. Psychological Science, 18(5):414–420, 2007.

警告

Citation for published version:

Frank, S, Feldman, N & Goldwater, S 2014, ‘Weak semantic context helps phonetic learning in a model of infant language acquisition’. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.


注釈

Published In

Proceedings of the 52nd Annual Meeting of the Association of Computational Linguistics


Phonetic Feature Encoding in Human Superior Temporal Gyrus

  • Nima Mesgarani
  • Connie Cheung
  • Keith Johnson
  • Edward F. Chang

During speech perception, linguistic elements such as consonants and vowels are extracted from a complex acoustic speech signal. The superior temporal gyrus (STG) participates in high-order auditory processing of speech, but how it encodes phonetic information is poorly understood. We used high-density direct cortical surface recordings in humans while they listened to natural, continuous speech to reveal the STG representation of the entire English phonetic inventory. At single electrodes, we found response selectivity to distinct phonetic features. Encoding of acoustic properties was mediated by a distributed population response. Phonetic features could be directly related to tuning for spectrotemporal acoustic cues, some of which were encoded in a nonlinear fashion or by integration of multiple cues. These findings demonstrate the acoustic-phonetic representation of speech in human STG.

Phonemes—and the distinctive features composing them—are hypothesized to be the smallest contrastive units that change a word’s meaning (e.g., /b/ and /d/ as in bad versus dad) (1). The superior temporal gyrus (Brodmann area 22, STG) has a key role in acoustic-phonetic processing because it responds to speech over other sounds (2) and focal electrical stimulation there selectively interrupts speech discrimination (3). These findings raise fundamental questions about the representation of speech sounds, such as whether local neural encoding is specific for phonemes, acoustic-phonetic features, or low-level spectrotemporal parameters. A major challenge in addressing this in natural speech is that cortical processing of individual speech sounds is extraordinarily spatially discrete and rapid (4–7).

We recorded direct cortical activity from six human participants implanted with high-density multielectrode arrays as part of their clinical evaluation for epilepsy surgery (8). These recordings provide simultaneous high spatial and temporal resolution while sampling population neural activity from temporal lobe auditory speech cortex. We analyzed high gamma (75 to 150 Hz) cortical surface field potentials (9, 10), which correlate with neuronal spiking (11, 12).

Participants listened to natural speech samples featuring a wide range of American English speakers (500 sentences spoken by 400 people) (13). Most speech-responsive sites were found in posterior and middle STG (Fig. 1A, 37 to 102 sites per participant, comparing speech versus silence, P < 0.01, t test). Neural responses demonstrated a distributed spatiotemporal pattern evoked during listening (Fig. 1, B and C, and figs. S1 and S2).

We segmented the sentences into time-aligned sequences of phonemes to investigate whether STG sites show preferential responses. We estimated the mean neural response at each electrode to every phoneme and found distinct selectivity. For example, electrode e1 (Fig. 1D) showed large evoked responses to plosive phonemes /p/, /t/, /k/, /b/, /d/, and /g/. Electrode e2 showed selective responses to sibilant fricatives: /s/, /ʃ/, and /z/. The next two electrodes showed selective responses to subsets of vowels: low-back (electrode e3, e.g., /a/ and /aʊ/), high-front vowels and glides (electrode e4, e.g., /i/ and /j/). Last, neural activity recorded at electrode e5 was selective for nasals (/n/, /m/, and /ŋ/).

Fig. 1. Human STG cortical selectivity to speech sounds. (A) Magnetic resonance image surface reconstruction of one participant’s cerebrum. Electrodes (red) are plotted with opacity signifying the t test value when comparing responses to silence and speech (P < 0.01, t test). (B) Example sentence and its acoustic waveform, spectrogram, and phonetic transcription. (C) Neural responses evoked by the sentence at selected electrodes. z score indicates normalized response. (D) Average responses at five example electrodes to all English phonemes and their PSI vectors.

To quantify selectivity at single electrodes, we derived a metric indicating the number of phonemes with cortical responses statistically distinguishable from the response to a particular phoneme. The phoneme selectivity index (PSI) is a vector with one dimension per each of 33 English phonemes; PSI = 0 is nonselective and PSI = 32 is extremely selective (Wilcoxon rank-sum test, P < 0.01, Fig. 1D; methods shown in fig. S3). We determined an optimal analysis time window of 50 ms, centered 150 ms after the phoneme onset by using a phoneme separability analysis (f-statistic, fig. S4A). The average PSI over all phonemes summarizes an electrode’s overall selectivity. The average PSI was highly correlated to a site’s response magnitude to speech over silence (r = 0.77, P < 0.001, t test; fig. S5A) and the degree to which the response could be predicted with a linear spectrotemporal receptive field [STRF, r = 0.88, P < 0.001, t test; fig. S5B (14)]. Therefore, the majority of speech-responsive sites in STG are selective to specific phoneme groups.
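
PSI の考え方は「ある電極のこの音素への応答が, 他の各音素への応答と統計的に区別できるか」を音素ごとに数えるというものなので, 次のように簡単に書ける. データは乱数による仮のものであり, 時間窓の選択などは省略した簡略版のスケッチである::

    import numpy as np
    from scipy.stats import ranksums

    rng = np.random.default_rng(0)

    # 仮のデータ: 音素ごとに, ある電極の応答 (試行ごとの high-gamma 振幅) を集めたもの
    phonemes = ["p", "t", "k", "s", "sh", "a", "i", "n"]
    responses = {ph: rng.normal(loc=(1.0 if ph in ("p", "t", "k") else 0.0), size=40)
                 for ph in phonemes}

    def psi(target, responses, alpha=0.01):
        """target への応答と統計的に区別できる音素の数 (0 ... 音素数-1) を返す."""
        count = 0
        for ph, resp in responses.items():
            if ph == target:
                continue
            stat, p = ranksums(responses[target], resp)
            if p < alpha:
                count += 1
        return count

    for ph in phonemes:
        print(ph, psi(ph, responses))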

Fig. 2. Hierarchical clustering of single-electrode and population responses. (A) PSI vectors of selective electrodes across all participants. Rows correspond to phonemes, and columns correspond to electrodes. (B) Clustering across population PSIs (rows). (C) Clustering across single electrodes (columns). (D) Alternative PSI vectors using rows now corresponding to phonetic features, not phonemes. (E) Weighted average STRFs of main electrode clusters. (F) Average acoustic spectrograms for phonemes in each population cluster. Correlation between average STRFs and average spectrograms: r = 0.67, P < 0.01, t test (r = 0.50, 0.78, 0.55, 0.86, 0.86, and 0.47 for the plosive, fricative, low-back vowel, low-front vowel, high-front vowel, and nasal clusters, respectively; P < 0.01, t test).

To investigate the organization of selectivity across the neural population, we constructed an array containing PSI vectors for electrodes across all participants (Fig. 2A). In this array, each column corresponds to a single electrode, and each row corresponds to a single phoneme. Most STG electrodes are selective not to individual but to specific groups of phonemes. To determine selectivity patterns across electrodes and phonemes, we used unsupervised hierarchical clustering analyses. Clustering across rows revealed groupings of phonemes on the basis of similarity of PSI values in the population response (Fig. 2B). Clustering across columns revealed single electrodes with similar PSI patterns (Fig. 2C). These two analyses revealed complementary local- and global-level organizational selectivity patterns. We also replotted the array by using 14 phonetic features defined in linguistics to contrast distinctive articulatory and acoustic properties (Fig. 2D; phoneme-feature mapping provided in fig. S7) (1, 15).
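
電極 x 音素の PSI 行列に対する教師なしの階層的クラスタリングは, 例えば scipy で次のように行える. 行列は乱数による仮データであり, クラスタ数 6 は本文の主要クラスタ数に合わせたこちらの仮定である::

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(1)
    psi_matrix = rng.random((33, 60))   # 行 = 音素 33, 列 = 電極 60 (仮データ)

    # 列 (電極) 方向: PSI パターンの似た電極をまとめる
    Z_elec = linkage(psi_matrix.T, method="ward")
    electrode_clusters = fcluster(Z_elec, t=6, criterion="maxclust")

    # 行 (音素) 方向: 集団応答の中で似た扱いを受ける音素をまとめる
    Z_phon = linkage(psi_matrix, method="ward")
    phoneme_clusters = fcluster(Z_phon, t=6, criterion="maxclust")

    print(electrode_clusters[:10], phoneme_clusters[:10])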

Fig. 3. Neural encoding of vowels. (A) Formant frequencies, F1 and F2, for English vowels (F2-F1, dashed line, first principal component). (B) F1 and F2 partial correlations for each electrode’s response (**P < 0.01, t test). Dots (electrodes) are color-coded by their cluster membership. (C) Neural population decoding of fundamental and formant frequencies. Error bars indicate SEM. (D) Multidimensional scaling (MDS) of acoustic and neural space (***P < 0.001, t test).

The first tier of the single-electrode hierarchy analysis (Fig. 2C) divides STG sites into two distinct groups: obstruent- and sonorant-selective electrodes. The obstruent-selective group is divided into two subgroups: plosive and fricative electrodes (similar to electrodes e1 and e2 in Fig. 1D) (16). Among plosive electrodes (blue), some were responsive to all plosives, whereas others were selective to place of articulation (dorsal /g/ and /k/ versus coronal /d/ and /t/ versus labial /p/ and /b/, labeled in Fig. 2D) and voicing (separating voiced /b/, /d/, and /g/ from unvoiced /p/, /t/, and /k/; labeled voiced in Fig. 2D). Fricative-selective electrodes (purple) showed weak, overlapping selectivity to coronal plosives (/d/ and /t/). Sonorant-selective cortical sites, in contrast, were partitioned into four partially overlapping groups: low-back vowels (red), low-front vowels (orange), high-front vowels (green), and nasals (magenta) (labeled in Fig. 2D, similar to e3 to e5 in Fig. 1D).

Both clustering schemes (Fig. 2, B and C) revealed similar phoneme grouping based on shared phonetic features, suggesting that a substantial portion of the population-based organization can be accounted for by local tuning to features at single electrodes (similarity of average PSI values for the local and population subgroups of both clustering analyses is shown in fig. S8; overall r = 0.73, P < 0.001). Furthermore, selectivity is organized primarily by manner of articulation distinctions and secondarily by place of articulation, corresponding to the degree and the location of constriction in the vocal tract, respectively (16). This systematic organization of speech sounds is consistent with auditory perceptual models positing that distinctions are most affected by manner contrasts (17, 18) compared with other feature hierarchies (articulatory or gestural theories) (19).

We next determined what spectrotemporal tuning properties accounted for phonetic feature selectivity. We first determined the weighted average STRFs of the six main electrode clusters identified above, weighting them proportionally by their degree of selectivity (average PSI). These STRFs show well-defined spectrotemporal tuning (Fig. 2E) highly similar to average acoustic spectrograms of phonemes in corresponding population clusters (Fig. 2F; average correlation = 0.67, P < 0.01, t test). For example, the first STRF in Fig. 2E shows tuning for broadband excitation followed by inhibition, similar to the acoustic spectrogram of plosives. The second STRF is tuned to a high frequency, which is a defining feature of sibilant fricatives. STRFs of vowel electrodes show tuning for characteristic formants that define low-back, low-front, and high-front vowels. Last, STRF of nasal-selective electrodes is tuned primarily to low acoustic frequencies generated from heavy voicing and damping of higher frequencies (16). The average spectrogram analysis requires a priori phonemic segmentation of speech but is model-independent. The STRF analysis assumes a linear relationship between spectrograms and neural responses but is estimated without segmentation. Despite these differing assumptions, the strong match between these confirms that phonetic feature selectivity results from tuning to signature spectrotemporal cues.

We have thus far focused on local feature selectivity to discrete phonetic feature categories. We next wanted to address the encoding of continuous acoustic parameters that specify phonemes within vowel, plosive, and fricative groups. For vowels, we measured fundamental (F0) and formant (F1 to F4) frequencies (16). The first two formants (F1 and F2) play a major perceptual role in distinguishing different English vowels (16), despite tremendous variability within and across vowels (Fig. 3A) (20). The optimal projection of vowels in formant space was the difference of F2 and F1 (first principal component, dashed line, Fig. 3A), which is consistent with vowel perceptual studies (21, 22). By using partial correlation analysis, we quantified the relationship between electrode response amplitudes and F0 to F4. On average, we observed no correlation between the sensitivity of an electrode to F0 with its sensitivity to F1 or F2. However, sensitivity to F1 and F2 was negatively correlated across all vowel-selective sites (Fig. 3B; r = –0.49, P < 0.01, t test), meaning that single STG sites show an integrated response to both F1 and F2. Furthermore, electrodes selective to low-back and high-front vowels (labeled in Fig. 2D) showed an opposite differential tuning to formants, thereby maximizing vowel discriminability in the neural domain. This complex sound encoding matches the optimal projection in Fig. 3A, suggesting a specialized higher-order encoding of acoustic formant parameters (23, 24) and contrasts with studies of speech sounds in nonhuman species (25, 26).

To examine population representation of vowel parameters, we used linear regression to decode F0 to F4 from neural responses. To ensure unbiased estimation, we first removed correlations between F0 to F4 by using linear prediction and decoded the residuals. Relatively high decoding accuracies are shown in Fig. 3C (P < 0.001, t test), suggesting fundamental and formant variability is well represented in population STG responses (interaction between decoder weights with electrode STRFs shown in fig. S9). By using multidimensional scaling, we found that the relational organization between vowel centroids in the acoustic domain is well preserved in neural space (Fig. 3D; r = 0.88, P < 0.001).
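
集団応答からフォルマントを線形回帰で復号するという部分は, 本質的には次のような回帰問題である. ここでは乱数の仮データと Ridge 回帰を使った簡略版のみ示す (本文のように F0-F4 間の相関を線形予測で除去してから残差を復号する手続きは省略している)::

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(2)

    # 仮のデータ: 母音トークン 500 個 x 電極 60 の応答と, 各トークンの F1 (Hz)
    n, n_elec = 500, 60
    X = rng.normal(size=(n, n_elec))                      # 神経応答 (仮)
    f1 = 500 + 80 * X[:, :5].sum(axis=1) + rng.normal(scale=30, size=n)

    # 集団応答から F1 を復号し, 交差検証で決定係数を確認する
    r2 = cross_val_score(Ridge(alpha=1.0), X, f1, cv=5, scoring="r2")
    print("decoding R^2: %.2f +/- %.2f" % (r2.mean(), r2.std()))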

For plosives, we measured three perceptually important acoustic cues (fig. S10): voice-onset time (VOT), which distinguishes voiced (/b/, /d/, and /g/) from unvoiced plosives (/p/, /t/, and /k/); spectral peak (differentiating labials /p/ and /b/ versus coronal /t/ and /d/ versus dorsal /k/ and /g/); and F2 of the following vowel (16). These acoustic parameters could be decoded from population STG responses (Fig. 4A; P < 0.001, t test). VOTs in particular are temporal cues that are perceived categorically, which suggests a nonlinear encoding (27). Figure 4B shows neural responses for three example electrodes plotted for all plosive instances (total of 1200), aligned to their release time and sorted by VOT. The first electrode responds to all plosives with same approximate latency and amplitude, irrespective of VOT. The second electrode responds only to plosive phonemes with short VOT (voiced), and the third electrode responds primarily to plosives with long VOT (unvoiced).

Fig. 4. Neural encoding of plosive and fricative phonemes. (A) Prediction accuracy of plosive and fricative acoustic parameters from neural population responses. Error bars indicate SEM. (B) Response of three example electrodes to all plosive phonemes sorted by VOT. (C) Nonlinearity of VOT-response transformation and (D) distributions of nonlinearity for all plosive-selective electrodes identified in Fig.2D. Voiced plosive-selective electrodes are shown in pink, and the rest in gray. (E) Partial correlation values between response of electrodes and acoustic parameters shared between plosives and fricatives (**P < 0.01, t test). Dots (electrodes) are color-coded by their cluster grouping from Fig. 2C.

To examine the nonlinear relationship between VOT and response amplitude for voiced-plosive electrodes (labeled voiced in Fig. 2D) compared with plosive electrodes with no sensitivity to voicing feature (labeled coronal, labial and dorsal in Fig. 2D), we fitted a linear and exponential function to VOT-response pairs (fig. S11B). The difference between these two fits specifies the nonlinearity of this transformation, shown for all plosive electrodes in Fig. 4C. Voiced-plosive electrodes (pink) all show strong nonlinear bias for short VOTs compared with all other plosive electrodes (gray). We quantified the degree and direction of this nonlinear bias for these two groups of plosive electrodes by measuring the average second-derivative of the curves in Fig. 4C. This measure maps electrodes with nonlinear preference for short VOTs (e.g., electrode e2 in Fig. 4B) to negative values and electrodes with nonlinear preference for long VOTs (e.g., electrode e3 in Fig. 4B) to positive values. The distribution of this measure for voiced-plosive electrodes (Fig. 4D, red distribution) shows significantly greater nonlinear bias compared with the remaining plosive electrodes (Fig. 4D, gray distribution) (P < 0.001, Wilcoxon rank-sum test). This suggests a specialized mechanism for spatially distributed, nonlinear rate encoding of VOT and contrasts with previously described temporal encoding mechanisms (26, 28).
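
VOT と応答振幅の非線形な関係を調べる部分は, 線形フィットと指数フィットの差を取り, その平均二次微分で偏りを要約するという手続きである. 以下は乱数による仮データを使った簡略版のスケッチであり, 符号の約束 (短い VOT 選好を負にするかどうか) は差の取り方に依存するため, 本文の定義に合わせて調整する必要がある::

    import numpy as np
    from scipy.optimize import curve_fit

    rng = np.random.default_rng(3)

    # 仮データ: VOT (秒) と, 短い VOT に強く反応する電極を模した応答振幅
    vot = rng.uniform(0.0, 0.1, size=200)
    resp = np.exp(-vot / 0.02) + rng.normal(scale=0.05, size=vot.size)

    def linear(x, a, b):
        return a * x + b

    def expdecay(x, a, tau, c):
        return a * np.exp(-x / tau) + c

    p_lin, _ = curve_fit(linear, vot, resp)
    p_exp, _ = curve_fit(expdecay, vot, resp, p0=[1.0, 0.02, 0.0])

    # 2 つのフィットの差が変換の非線形性を表す. その平均二次微分で偏りを要約する
    x = np.linspace(0.0, 0.1, 101)
    nonlinearity = expdecay(x, *p_exp) - linear(x, *p_lin)
    curvature = np.gradient(np.gradient(nonlinearity, x), x).mean()
    print("mean second derivative: %.2f" % curvature)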

We performed a similar analysis for fricatives, measuring duration, which aids the distinction between voiced (/z/ and /v/) and unvoiced fricatives (/s/, /ʃ/, /θ/, /f/); spectral peak, which differentiates /f/ and /v/ versus coronal /s/ and /z/ versus dorsal /ʃ/; and F2 of the following vowel (16) (fig. S12). These parameters can be decoded reliably from population responses (Fig. 4A; P < 0.001, t test).

Because plosives and fricatives can be subspecified by using similar acoustic parameters, we determined whether the response of electrodes to these parameters depends on their phonetic category (i.e., fricative or plosive). We compared the partial correlation values of neural responses with spectral peak, duration, and F2 onset of fricative and plosive phonemes (Fig. 4E), where each point corresponds to an electrode color-coded by its cluster grouping in Fig. 2D. High correlation values (r = 0.70, 0.87, and 0.79; P < 0.001; t test) suggest that electrodes respond to these acoustic parameters independent of their phonetic context. The similarity of responses to these isolated acoustic parameters suggests that electrode selectivity to specific phonetic features (shown with colors in Fig. 4E) emerges from combined tuning to multiple acoustic parameters that define phonetic contrasts (24, 25).

We have characterized the STG representation of the entire American English phonetic inventory. We used direct cortical recordings with high spatial and temporal resolution to determine how selectivity for phonetic features is correlated to acoustic spectrotemporal receptive field properties in STG. We found evidence for both spatially local and distributed selectivity to perceptually relevant aspects of speech sounds, which together appear to give rise to our internal representation of a phoneme.

We found selectivity for some higher-order acoustic parameters, such as examples of nonlinear, spatial encoding of VOT, which could have important implications for the categorical representation of this temporal cue. Furthermore, we observed a joint differential encoding of F1 and F2 at single cortical sites, suggesting evidence of spectral integration previously speculated in theories of combination-sensitive neurons for vowels (23–25, 29).

Our results are consistent with previous single-unit recordings in human STG, which have not demonstrated invariant, local selectivity to single phonemes (30, 31). Instead, our findings suggest a multidimensional feature space for encoding the acoustic parameters of speech sounds (25). Phonetic features defined by distinct acoustic cues for manner of articulation were the strongest determinants of selectivity, whereas place-of-articulation cues were less discriminable. This might explain some patterns of perceptual confusability between phonemes (32) and is consistent with feature hierarchies organized around acoustic cues (17), where phoneme similarity space in STG is driven more by auditory-acoustic properties than articulatory ones (33). A featural representation has greater universality across languages, minimizes the need for precise unit boundaries, and can account for coarticulation and temporal overlap over phoneme-based models for speech perception (17).

References and Notes
  1. N. Chomsky, M. Halle, The Sound Pattern of English (Harper and Row, New York, 1968).
  2. J. R. Binder et al., Cereb. Cortex 10, 512–528 (2000).
  3. D. Boatman, C. Hall, M. H. Goldstein, R. Lesser, B. Gordon, Cortex 33, 83–98 (1997).
  4. E. F. Chang et al., Nat. Neurosci. 13, 1428–1432 (2010).
  5. E. Formisano, F. De Martino, M. Bonte, R. Goebel, Science 322, 970–973 (2008).
  6. J. Obleser, A. M. Leaver, J. Vanmeter, J. P. Rauschecker, Front. Psychol. 1, 232 (2010).
  7. M. Steinschneider et al., Cereb. Cortex 21, 2332–2347 (2011).
  8. Materials and methods are available as supplementary materials on Science Online.
  9. N. E. Crone, D. Boatman, B. Gordon, L. Hao, Clin. Neurophysiol. 112, 565–582 (2001).
  10. E. Edwards et al., J. Neurophysiol. 102, 377–386 (2009).
  11. M. Steinschneider, Y. I. Fishman, J. C. Arezzo, Cereb. Cortex 18, 610–625 (2008).
  12. S. Ray, J. H. R. Maunsell, PLOS Biol. 9, e1000610 (2011).
  13. J. S. Garofolo, TIMIT: Acoustic-Phonetic Continuous Speech Corpus (Linguistic Data Consortium, Philadelphia, 1993).
  14. F. E. Theunissen et al., Network 12, 289–316 (2001).
  15. M. Halle, K. Stevens, in Music, Language, Speech, and Brain, J. Sundberg, L. Nord, R. Carlson, Eds. (Wenner-Gren International Symposium Series vol. 59, Macmillan, Basingstoke, UK, 1991).
  16. P. Ladefoged, K. Johnson, A Course in Phonetics (Cengage Learning, Stamford, CT, 2010).
  17. K. N. Stevens, J. Acoust. Soc. Am. 111, 1872–1891 (2002).
  18. G. N. Clements, Phonol. Yearb. 2, 225–252 (1985).
  19. C. A. Fowler, J. Phonetics 14, 3–28 (1986).
  20. G. E. Peterson, H. L. Barney, J. Acoust. Soc. Am. 24, 175 (1952).
  21. J. D. Miller, J. Acoust. Soc. Am. 85, 2114–2134 (1989).
  22. A. K. Syrdal, H. S. Gopal, J. Acoust. Soc. Am. 79, 1086–1100 (1986).
  23. H. M. Sussman, Brain Lang. 28, 12–23 (1986).
  24. I. Nelken, Curr. Opin. Neurobiol. 18, 413–417 (2008).
  25. N. Mesgarani, S. V. David, J. B. Fritz, S. A. Shamma, J. Acoust. Soc. Am. 123, 899–909 (2008).
  26. C. T. Engineer et al., Nat. Neurosci. 11, 603–608 (2008).
  27. L. Lisker, A. S. Abramson, Lang. Speech 10, 1–28 (1967).
  28. M. Steinschneider et al., Cereb. Cortex 15, 170–186 (2005).
  29. G. Chechik, I. Nelken, Proc. Natl. Acad. Sci. U.S.A. 109, 18968–18973 (2012).
  30. A. M. Chan et al., Cereb. Cortex, published online 16 May 2013 (10.1093/cercor/bht127).
  31. O. Creutzfeldt, G. Ojemann, E. Lettich, Exp. Brain Res. 77, 451–475 (1989).
  32. G. A. Miller, P. E. Nicely, J. Acoust. Soc. Am. 27, 338 (1955).
  33. A. M. Liberman, Speech: A Special Code (MIT Press, Cambridge, MA, 1996).
Acknowledgments:

We thank A. Ren for technical help with data collection and preprocessing. S. Shamma, C. Espy-Wilson, E. Cibelli, K. Bouchard, and I. Garner provided helpful comments on the manuscript. E.F.C. was funded by NIH grants R01-DC012379, R00-NS065120, and DP2-OD00862 and the Ester A. and Joseph Klingenstein Foundation. E.F.C., C.C., and N.M. collected the data. N.M. and C.C. performed the analysis. N.M. and E.F.C. wrote the manuscript. K.J. provided phonetic consultation. E.F.C. supervised the project.

Supplementary Materials
  • www.sciencemag.org/content/343/6174/1006/suppl/DC1
  • Materials and Methods
  • Figs. S1 to S12
  • Reference (34)
  • 16 September 2013; accepted 17 January 2014
  • Published online 30 January 2014;
  • 10.1126/science.1245994

非言語的音響特徴

COMPARING FEATURE SETS FOR ACTED AND SPONTANEOUS SPEECH IN VIEW OF AUTOMATIC EMOTION RECOGNITION

Author:Thurid Vogt (Augsburg University, Germany Multimedia concepts and applications), Elisabeth Andre (Bielefeld University, Germany Applied Computer Science)
ABSTRACT

我々は音響的感情認識の特徴量選択におけるデータマイニング実験を示す。 ピッチ、エネルギー, MFCC 時間系列 に由来する 1000 個以上の特徴量から始め、 相関の高い特徴量を排除することで,このセットの中からデータに対し関連の高いものを選択した。 特徴量は演技音声、あるいは実際の感情を含む音声別に解析され、有意差が確認された。 全ての特徴量は自動的に計算され、自動で解析したものと手動で解析したものを比較した。 自動化の程度が高いものでも、認識精度の観点からでは、特に不利になることは無かった。

  • This work was partially funded by a grant from the DFG in the graduate program 256 and by the EU Network of Excellence Humaine.
1.INTRODUCTION

音声から感情を認識するための多くの特徴量が発見されている。 しかし、一般に認められる一定の特徴量セットは未だ決まっていない。 我われはデータマイニングを行い、データのピッチ、エネルギー、MFCC 時系列における異なった視点を提供する音響特徴量の大規模なセットを計算した。 続いて、与えられたデータセットから最もよいサブセットを自動的に選択した。 このアプローチは音声感情認識の領域では一般的なものである [1] [2] [3] 。 しかし、既存研究は数百程度の特徴量を使用しているのに対し、我われは 1000 個以上の特徴量から試行を開始した。

将来のオンライン感情認識の観点から、以下の疑問に対する解答を考察する。

  • 特徴量選択に対し、大規模な特徴量を与えることは選択される特徴量を良いものにすることができるのか?

  • どの程度の自動化が可能なのか?
    • つまり、オンラインシステム上でどの解析ユニットと特徴量が、自動的に計算可能で良い結果を残すのか ?
  • 演技あるいは実際の感情を比較した実験はあるが [2] [3] 特徴量セットに対してのものではない。
    • そのため、両方の性質が異なる際にどのような特徴量セットが最適となるのかが分からない。

次章では音声信号からの特徴量抽出の段階を説明する。 続いて、実験を行ったデータベースに関して説明し、最後に実験結果を示す。

2.FEATURE EXTRACTION

音声感情認識に対する言語学の領域で一般に使用される韻律的な特徴量はピッチ、エネルギー、 MFCC (Mel Frequency Cepstral Coefficients), ポーズ、 持続時間、 そして話速とフォルマント、声質である(e.g. [1] , [2] , [3] )。 特徴量は与えられた時間的セグメントにおけるこれらの計測値から求められた。 我々のアプローチでは多数の特徴量を計算し、特定のアプリケーションに最も関係するものを選択する。 このコンセプトは他の研究でもよく使われるものだが、本研究ではより網羅的に行う。 100 - 200 の特徴量から選択を行うのではなく、 約 1300 個の特徴量から試行を開始した。

特徴量抽出のプロセスは 3 つのステップに分けられる. セグメントの長さを選択し、それらのセグメントにおける特徴量を計算し、その後最も相関の高い特徴量セットを削減していく。 これらのステップに関して、以後詳細を述べる。

2.1. Segment length

信号のピッチあるいはエネルギーの値そのものは感情に対して意味を持たず、むしろある時間区間におけるこれらの特徴量の振舞いが意味を持つ。そのため、これらの観測値の一般的な統計量、例えば時間軸上の平均、最小値、最大値を計算した。従って、統計量を計算するためには、実測値の時系列に対して有効なセグメンテーションが施されている必要がある。これらの時間セグメントは、以下の相反する二つの条件を満たす必要があるため、注意深く選択した。

  1. 感情の変化はとても早く起きるが、セグメント長が認識の変化の時間分解能を規定する
  2. 意味のある統計量を得るには、しばしば長いセグメントが必要になる。

最適なトレードオフを発見するため、我々はいくつかの種類のセグメントを試した。一つの可能性は、例えば 500 ms に固定したセグメント長を利用することである。一方、言語学的に動機付けされたセグメント、すなわちポーズや発話から規定される語や文脈付きの語などを利用することもできる。発話全体は通常、感情状態に対して非常に特徴的な輪郭を示すものの、発話の途中でのユーザーの感情の変化を認識できないため、自然発話でのオンラインな感情認識にとっては実用的ではない。そのため多くの言語学的単位で十分であるが、発話セグメンテーションを使用するためには追加の言語処理プログラムが必要になる。しかし演技音声ではこのセグメンテーションは通常あらかじめ与えられているため発話全体を単位として利用でき、これにより何が可能であるのかの上限としての認識精度を検討することができる。単語はしばしば非常に短く、ピッチを確実に推定するのに充分な長さを持たないことがある。そのため我々は、自発感情に対しては文脈付きの語と長いポーズによって区切られたセグメントを、演技感情に対しては発話全体、語、文脈付きの語、500 ms のセグメントをテストした。文脈付きの語とは、ある語とその前後の単語で構成されるものである。

2.2. Feature calculation

特徴量計算の基本系列としてピッチ、エネルギー、MFCC の時系列を使用した。

ピッチは [4] に記述されているアルゴリズムを使用し、75-600 Hz の範囲の値を 10 ms ごとに 80 ms 分の重なりをもって計算した。エネルギーと 12 次元の MFCC は音声認識のための ESMERALDA 環境 [5] を利用して計算した。各値は 10 ms ごとに 16 ms のフレーム長で計算している。また、エネルギーと MFCC については一次及び二次導関数も使用した。

これらの基本系列から我われは以下の特徴量系列を抽出した。

  • ピッチ: 時間軸に対する最大値、最小値、時間的距離、大きさ、最大値-最小値 間の傾き、最小値-最大値 間の傾き
  • エネルギー: 時間軸に対する最大値、最小値、時間的距離、大きさ、最大値-最小値 間の傾き、最小値-最大値 間の傾き
  • エネルギー係数: 時間軸に対する最大値、最小値
  • MFCC: MFCC 時間軸に対する 12次元全ての平均値と、第一、第二次元における平均値

これらの特徴量系列のそれぞれについて、セグメントごとに平均、最大値、最小値、最大-最小値間のレンジ、分散、メディアン、第一四分位数、第三四分位数、四分位範囲を計算した ( [1] )。これらの値が特徴量ベクトルを構成する。

更に以下の特徴量を特徴量ベクトルに加えた。性差の影響を少なくするために、ピッチの平均値は、各セグメントの最小/最大ピッチを用いて以下の式で正規化した。メディアンと四分位数についても同様である。

\[mean_{norm} = \frac{mean − min}{max − min}\]

更に、以下の特徴量を加えた。

  • 言語学的に動機づけられたセグメントのアクセント核を近似するために全体のピッチの最大値の位置
  • ピッチやエネルギーの輪郭に対する指標として、セグメント当たりのピッチとエネルギーの最大/最小値の数
  • ポーズに対する大まかな尺度として, あるセグメントの全てのフレーム数に対する音声フレーム数の割合

話速は特徴量ベクトル内で明示的に示してはいないが、エネルギーの最小-最大値間の時間的な距離がそれに対する近似値となる。

特徴量の幾つかは近似的なものに過ぎないが、高速に計算できるという利点がある。これはオンラインでの特徴量抽出への応用という側面では重要である。

最終的に、特徴量は合計 1280 個まで集まった。
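
セグメントごとの統計量と上の式による正規化は、例えば次のように計算できる。これは説明用の最小限のスケッチであり、実際の 1280 個の特徴量 (導関数系列や極値間の時間的距離など) は含んでいない。関数名や入力値はすべて仮のものである::

    import numpy as np

    def segment_features(pitch, energy):
        """1 セグメント分のピッチ・エネルギー系列から基本統計量を計算する."""
        feats = {}
        for name, series in [("pitch", pitch), ("energy", energy)]:
            s = np.asarray(series, dtype=float)
            s = s[~np.isnan(s)]                    # 無声区間などの欠損は除く
            q1, q3 = np.percentile(s, [25, 75])
            feats.update({
                name + "_mean": s.mean(),  name + "_median": np.median(s),
                name + "_min": s.min(),    name + "_max": s.max(),
                name + "_range": s.max() - s.min(), name + "_var": s.var(),
                name + "_q1": q1, name + "_q3": q3, name + "_iqr": q3 - q1,
            })
        # 本文の式に対応する正規化: (mean - min) / (max - min)
        p = np.asarray(pitch, dtype=float)
        p = p[~np.isnan(p)]
        feats["pitch_mean_norm"] = (p.mean() - p.min()) / (p.max() - p.min())
        return feats

    # 使用例 (値は仮のもの)
    print(segment_features([210, 220, np.nan, 240, 230],
                           [0.2, 0.5, 0.4, 0.6, 0.3])["pitch_mean_norm"])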

2.3. Feature selection

前節で記述した特徴量ベクトルは多くの特徴量を含んでおり、それらの多くは冗長であるか無関係なものである。しかし、そもそも過剰なほどの特徴量を計算する目的は、どれが最も重要な特徴量であるかをデータ自身に決めさせることにある。

データマイニングソフトウェア Weka [6] を使用し、最適な特徴量のサブセットを探索した。サブセットの探索には最良優先探索を、特徴量の評価には相関に基づく特徴量選択 (CFS, [7] ) を用いた。

ナイーブベイズは特徴量 (4章を参照) に高い相関がある場合パフォーマンスが低下するが、CFS はまさにそのような属性を排除するためナイーブベイズとの相性がよい。一般に、特徴量選択によってもともとの 1280 個の特徴量は大体 90-160 個まで削減された。特徴量選択はアプリケーションごとに一度だけ実行すればよく、その後の多数の分類を高速にするため、この削減は特に重要である。
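
Weka の CFS と最良優先探索の代わりに、同じ発想 (クラスとの相関が高く、互いに冗長でない特徴量の集合を選ぶ) を貪欲な前向き選択で近似すると、次のようなスケッチになる。実際の実験は Weka を用いている点に注意されたい。データと関数名は説明用の仮のものである::

    import numpy as np

    def cfs_merit(X, y, subset):
        """CFS の評価値 (簡略版): クラスとの相関が高く特徴量同士の相関が低いほど高い."""
        k = len(subset)
        r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
        if k == 1:
            r_ff = 0.0
        else:
            r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                            for i, a in enumerate(subset) for b in subset[i + 1:]])
        return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

    def greedy_cfs(X, y, max_features=20):
        """最良優先探索の代わりに貪欲な前向き選択で近似するスケッチ."""
        selected, remaining, best = [], list(range(X.shape[1])), -np.inf
        while remaining and len(selected) < max_features:
            merit, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
            if merit <= best:
                break                              # 評価値が改善しなくなったら打ち切り
            best = merit
            selected.append(j)
            remaining.remove(j)
        return selected

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 50))
    y = (X[:, 0] + X[:, 1] > 0).astype(float)      # 仮のクラスラベル
    print(greedy_cfs(X, y, max_features=5))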

3.DATABASES
3.1. Actors database

このデータベースは Technical University, Berlin [8] で収録されたものである。10 人のプロの俳優 (男性: 5人, 女性: 5人) が 6 つの異なる感情 (anger, joy, sadness, fear, disgust and boredom) をできるだけ自然に演じた発話を、それぞれ 10 発話ずつ収録している。発話内容は感情的に中立なものである。被験者に不自然と認識された収録発話は削除し、最終的に合計 493 発話が収録されている (女性: 286, 男性: 207)。

元々感情音声の合成のために使用されることを目的としているため、収録は非常に高品質である。 このデータベースは感情発話認識のための比較的単純な仕事であるが、現実的な設定からはかなり遠いものである。

3.2. Wizard-of-Oz database

Wizard-of-Oz (WOZ) 研究由来のデータは、台本に従わず被験者が自然に振る舞うため、現実の生活データにとても近くなる。実際の感情に対して我々の特徴量を評価するため、SmartKom コーパスも評価した。この WOZ データベースは Munich 大学で SmartKom プロジェクト [9] の一環として収録されたものである。対象者はマルチモーダルな対話システムとやり取りしており、自分の感情状態が観察されていることを知らない。そのためこれらの感情は非常に現実的なものであると仮定できるが、残念ながら発話の大部分は感情的に中立である。また、感情のラベルづけが音声及び画像情報を考慮して付与されていることも問題である。しばしば、ラベルづけされた感情は音声信号のみから特定することが困難である。その結果、このコーパスでの感情推定は演技感情のコーパスよりもはるかに困難になっている。

以下の感情 (SmartKom では「ユーザー状態」と呼ばれる) がラベルづけされている。

  • strong joy, weak joy, surprise, helplessness, weak anger, strong anger, neutral

感情は非常に不均等な分布をしている。 実際のアプリケーションでも有り得ることだが、自然発話の内 90% はニュートラルな音声であった。

4.EVALUATION
4.1. Classification

分類のためのツールボックスとして、ここでも Weka データマイニングソフトウェアを利用した。全ての実験では Naive Bayes を学習スキームとして利用している。他のスキームも試したが結果に大きな差はなく、Naive Bayes は高次元データを扱う際に特に高速である。また、SmartKom コーパスのように単一のクラスのインスタンスが大多数を占める場合でも満足のいくパフォーマンスを示した。これは、分類器を一定にしたまま特徴量抽出をテストしたいという我々の要求を満たすものである。

4.2. Results
4.2.1. Acted emotions:

演技感情は以下の 4 つの異なる方法で評価した。

  • 感情: 7種類 (anger vs. joy vs. sadness vs. fear vs. disgust vs. boredom vs. neutral)
  • 評価値: 3種類 (anger/sadness/fear/disgust/boredom vs. neutral vs. joy)
  • 活性度: 3種類 (anger/joy/fear/disgust vs. neutral vs. boredom/sadness)
  • 感情を含んでいるか: 2種類 (anger/joy/sadness/fear/disgust/boredom vs. neutral)

与えられた全ての発話に対し 10 分割交差検証を行い、クラスごとの認識精度を算出した。
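
Weka の Naive Bayes と 10 分割交差検証に相当する処理を scikit-learn で書くと次のようになる。あくまで代替の例であり、実際の実験は Weka で行われている。クラスごとの認識精度はここではマクロ平均の再現率として計算し、特徴量とラベルは乱数による仮のものである::

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import recall_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(490, 120))                # 縮小後の特徴量ベクトル (仮)
    y = rng.integers(0, 7, size=490)               # 7 クラスの感情ラベル (仮)

    pred = cross_val_predict(GaussianNB(), X, y, cv=10)
    print("class-wise accuracy: %.3f" % recall_score(y, pred, average="macro"))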

表1 に、セグメントとして発話全体を使用した場合の 4 種類の条件について、フル特徴量セットと縮小特徴量セットを比較した認識結果を示す。縮小特徴量セットにより精度は平均して 6.4 % 改善した。加えて、縮小特徴量セットでの分類はより早く終了する。

表2 に7つ全ての感情に対する縮小特徴量セットを利用した異なるセグメントの長さの結果を示す。 セグメントの長さが短くなるとき認識精度が大きく減少することが観測された。 全ての結果はチャンスレベルを上回るものであったが、 アプリケーションの利便性の観点から結果をみると、 文脈を含む語の結果のみが有効なものであった。

Table 1. Comparing the full feature set with the reduced feature set.
  7 emotions Evaluation Activation Emo./Non-Emo.
Full set 69.1% 67.1% 85.4% 81.9%
Reduced set 77.4% 72.5% 88.6% 85.3%
Table 2. Comparing segment lengths (reduced feature sets)
Segment length Recognition accuracy
Whole utterance 77.4%
Word in context 53.2%
500 ms 44.5%
Word 34.1 %
4.2.2. WOZ emotions:

我われは、SmartKom コーパスで感情認識システムを作成した [10] と同じ方法で評価を行った。彼らは異なった特徴量を使用しているが、感情ラベルの扱いはコーパスを通して一貫しており、使用したデータ量も同じであるため、結果は比較可能である。

我われの結果 (表3 を参照) は彼らの結果と似ているが、彼らの特徴量が部分的に手動で求められたもの (韻律の特殊性) であり品詞情報も使用しているのに対し、我われの特徴量セットは完全に自動的に計算されている。

明らかに、自動化の程度が高くても結果が不利になることはない。これは大規模な特徴量セットがそれを補償しているためであると考えられる。

二つの解析ユニットを比較すると、ここでも長いユニット (ポーズにより分割されたセグメント) の方が良い結果になるが、その差は大きいものではない。これは自発音声ではフレーズや語の輪郭があまり明確ではないためであると考えられる。縮小特徴量セットとフル特徴量セットの結果を比較しても、その差は大きくない。いくつかの場合には、縮小特徴量セットのパフォーマンスはフルセットと同程度か、むしろ悪い結果となった。しかし、特徴量選択によって分類を高速にすることはできる。

Table 3. Recognition results in % for natural emotions using segments delimited by pauses and words with context as units.
Granularity of user states | Pauses as borders (Reduced / Full) | Word with context (Reduced / Full)
joyful strong, joyful weak, surprised, neutral, helpless, angry weak, angry strong | 26 / 25.6 | 28.4 / 28
joyful, surprised, neutral, helpless, angry | 37.5 / 38.7 | 31.2 / 35.7
joyful, neutral, helpless, angry | 39 / 40.6 | 39.5 / 36.1
joyful, neutral, problem | 48.3 / 51.6 | 44.2 / 42.4
no problem, helpless, angry | 50.3 / 51.9 | 45.9 / 45.4
no problem, problem | 68.3 / 73.3 | 59.3 / 59.4
not angry, angry | 59.9 / 61.1 | 59.1 / 50.5
4.3. Selected features

[1] では、選択された特徴量の中に疑わしいものは存在しなかった。一般に、クラス数が多いほど多くの特徴量が必要になるといえる。演技音声ではピッチに関連する特徴量が主要な役割を果たした。自発感情では注目が MFCC に移り、低次の係数、特に第一係数が選択された。ピッチやエネルギーについては、基本系列そのものよりも極値の方が重要であった。ポーズは演技感情、特にポーズの割合が多い悲しみに対しては非常に重要な特徴量である。しかしこの点は、ポーズがほとんど現れない実際の感情には一般化できない。

5.CONCLUSIONS

結果として、演技音声と実際の音声では要求されるものが全く異なることが示された。先行研究と比較して、演技音声に対する特徴量選択の影響は大きく、また演技音声は実際の感情音声よりも認識が容易であることを発見した。本稿の新規な貢献は、演技された感情と自発的な感情それぞれに対して選択された特徴量セットの違いを詳細に観察したことである。演技感情に対する良い特徴量と自発感情に対する良い特徴量の重複は少ないことが示された。演技感情ではピッチに関連する特徴量が中心であったのに対し、自発感情では MFCC (特に低次の係数) に関連する特徴量が選択された。これらの違いは、自然な感情の認識を意図する場合には、ある手法の最初のテストであっても演技音声を使用することにあまり意味がないことを示唆する。

最終的に、特徴量計算やユニットのセグメント化の自動化の程度が高いことが不利になるということは示されなかった。これは、選択プロセスに与えた特徴量セットが大きく、それが自動化による損失を補償しているためであると我われは考える。

6.REFERENCES
[1]
  P.-Y. Oudeyer, “The production and recognition of emotions in speech: features and algorithms,” Int. Journal of Human-Computer Studies, vol. 59, no. 1-2, pp. 157-183, 2003.
[2]
  A. Batliner, K. Fischer, R. Huber, J. Spilker, and E. Nöth, “How to find trouble in communication,” Speech Communication, vol. 40, pp. 117-143, 2003.
[3]
  Kustner, R. Tato, T. Kemp, and B. Meffert, “Towards real life applications in emotion recognition,” in ADS Workshop 04, Kloster Irsee, Germany, 2004, pp. 25-35.
[4]
  P. Boersma, “Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound,” in Proc. of the Institute of Phonetic Sciences, U. of Amsterdam, 1993, pp. 97-110.
[5]
  G. A. Fink, “Developing HMM-based recognizers with ESMERALDA,” in Lecture Notes in Artificial Intelligence, V. Matoušek et al., Eds., vol. 1962, pp. 229-234, Springer, Berlin, Heidelberg, 1999.
[6]
  I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools with Java Implementations, Morgan Kaufmann, San Francisco, 2000.
[7]
  M. A. Hall, “Correlation-based feature subset selection for machine learning,” M.S. thesis, U. of Waikato, New Zealand, 1998.
[8]
  F. Burkhardt, Simulation emotionaler Sprechweise mit Sprachsynthesesystemen, Ph.D. thesis, TU Berlin, Germany, 2001.
[9]
  S. Steininger, F. Schiel, O. Dioubina, and S. Raubold, “Development of user-state conventions for the multimodal corpus in SmartKom,” in Proc. Workshop ‘Multimodal Resources and Multimodal Systems Evaluation’, Las Palmas, 2002, pp. 33-37.
[10]
  A. Batliner, V. Zeißler, C. Frank, J. Adelhardt, R. P. Shi, and E. Nöth, “We are not amused - but how do you know? User states in a multi-modal dialogue system,” in Proc. EUROSPEECH 2003, Geneva, 2003, pp. 733-736.

Exploratory study of some acoustic and articulatory characteristics of sad speech

Authors: Donna Erickson, Kenji Yoshida, Caroline Menezes, Akinori Fujino, Takemi Mochida, and Yoshiho Shibuya
Journal: Phonetica, vol. 63, no. 1, pp. 1-25, 2006.
Tags: プロポーザル; 感情音声

注釈

本研究は、二人の女性 (日本人とアメリカ人) の気軽な電話会話において自発的に発話された感情音声について、音響データと EMA による調音データを調査したものである。統制データとして、発話者自身がオリジナルの感情発話を模倣したり読み上げたりした音声も収録した。アメリカ人話者についてはイントネーションパターンのみを模倣させた音声も収録している。結果は以下の三点を示唆している。

  1. 自発的悲しみ音声は音響的、調音的特徴に読み上げ音声や感情模倣音声とは異なる。
  2. 自発的悲しみ音声と悲しみ模倣音声は音響的には同じような性質をしている(高い F0 や声質としての F1 の変化)が、調音構造として唇や顎、舌の位置が異なる。
  3. 高い F0 や声質の変化は悲しみとして聞き手が判断する音声と相関がある。
Introduction

発話者が伝えるものは、音素や語の連なりといった言語的ユニットだけではなく、F0 や持続時間、強度、声のトーン、リズムやフレージングを含む、発話者の声の音響的変化でもある。Eldred and Price [1958] が述べたように、コミュニケーションには「何」を言っているのかだけではなく、「どのように」言っているのかも含まれている。Fujisaki [2004] が「パラ言語情報」と呼ぶこの「どのように」に関する情報は、書き言葉によって伝達される離散的でカテゴリカルな「言語的情報」とも、発話者の性別、年齢、感情など基本的には発話者の意図的な制御によらない「非言語情報」とも区別されることが示唆されている。感情的、指標的、社会的、文化的、そして言語的情報を含む様々な種類の情報が音声信号によって並行して伝達されるという考えは、Bühler [1934] 以来の言語学の伝統的な命題である。表情豊かな音声の多重的・複合的な豊かさに関する様々な研究のレビューとしては、Schroeder [2004], Gobl and Ní Chasaide [2003], Douglas-Cowie et al. [2003] などを参照されたい。

聞き手の知覚の評価を含め、表情豊かな音声の複雑な性質を研究するための一つのアプローチは、感情/情動を静的なものとしてではなく、連続的に時間変化するものとして扱うことである。感情/情動を形容詞や尺度で記述するのではなく、Cowie et al. [2000] や Schroeder [2004] に記述されているように、例えば「Feeltrace」のようなグラフィカルインターフェースを使い、感情の大きさや活性度 (能動的-受動的)、評価 (ネガティブ-ポジティブ) などを聞き手に評定するよう求める。場合によってはその後、主成分分析 (PCA) などの統計解析を行い、ある感情次元の知覚がどの音響的 (または調音的) 特性と関連しているのかを決定する。この方針に沿って、被験者が自由にラベルを選ぶ方法を使うこともできる [see, e.g., Campbell and Erickson, 2004]。この研究では、一人の話者から発話された「えー」(日本語のバックチャンネル発話) という音声を被験者に提示し、コンピュータスクリーン上のボックスに音声を並べることを求めた。その後、被験者は音声をどう知覚したかに応じてボックスにラベルを付けた。ついで、応答の基礎となる関係を決定するために PCA を行った。

その他、表情豊かな音声を研究するためによく使われる方法は、聞き手や実験者により付与される「幸せ」「悲しみ」「怒り」などの感情形容詞ラベルを用いて、特定の感情表現に伴う音響的・調音的特性を調査することである。特定の感情/情動に関連する音響的変化については多くの研究が報告している [see e.g., review articles by Scherer, 2003; Gobl and Ní Chasaide, 2003; Erickson, 2005]。

調音もまた感情/情動の影響を受ける。例えば、Maekawa et al. [1999] や Erickson et al. [2000] は、疑い、賞賛、怒りなど異なったパラ言語的状況で同じ発話を発話してもらった場合、舌や顎の位置が変化することを明らかにした。

特定の感情を要求されて話し手が発話するような演技音声と、自発的な感情表現を区別することは重要である。演技された感情では、声優が特定の情動を聞き手に伝えようと音声表現を制御しているため、発話者が実際にその感情を経験している場合とは別のカテゴリになることが多い。例えば、深い悲しみを表現している話者は発話中に泣き始めるかもしれない。泣くことは言語的メッセージの一部ではない。一般に発話者は泣くことを制御できない。これはしばしば発話者の意図に関わらず起き、感情によって誘発された脳内の生理的変化の結果である。Brown et al. [1993] のような研究は、悲しみや高揚の結果としてのホルモン変化を示している。悲しみや高揚など特定の感情による発話者の生物学的状態の変化が、喉頭および喉頭より上の調音器官の運動を変化させることもありうる。Erickson et al. [2003a, b, 2004a-c] による予備調査では、演技音声と自発的で表情豊かな音声とでは舌や顎の調音が異なるパターンで産出されることが示された。Schroeder et al. [1998] は、面白がっている音声の聴覚-視覚的特性を調査し、被験者が模倣された感情 (驚きの表情) と本物の感情を区別して知覚できることを発見している。

「泣く」のように発話と同時に生じざるを得ない種類の強い感情表現は、感情と発話の間にありうる生理的なつながりを考えるための窓口を開く。そのため、感情と音声がどの程度関連しているのか、また強い感情が言語的タスクをどの程度妨害あるいは促進しうるのかを研究することは、音声の産出と知覚の情報という観点から興味深い。

Sad Speech

Scherer [1979] によると,悲しみ音声には最低でも2つの種類が存在する.

  1. 悲しい, 物静かな, 受動的音声
  2. 喪中を経験するようなタイプのアクティブな音声

前者はパラ言語情報として分類しなくても良いかもしれない. 後者は明確な非言語的感情の例である. 感情同期的な喚起は涙のような生理的変化に関連付けられている. アクティブな悲しみの感情はおそらく声帯振動パターンの制御に関与する運動協調だけでなく、声門上部の調音運動に影響を与える.

悲しく物静かな音声の音響的特徴については多くの研究が存在する。そこでは F0 の値が低く [e.g., Iida, 2000; Eldred and Price, 1958]、強度が低く [e.g., Eldred and Price, 1958]、持続時間が長く [e.g., Iida, 2000; Eldred and Price, 1958]、スペクトルエネルギーの分布が変化するような声質の変化があり [e.g., Scherer et al., 1991]、気息性が増加することが報告されている。

気息性の増加は声門の開大と関連しており、高い声門振幅比 (AQ) [Mokhtari and Campbell, 2002]、第一倍音の他の倍音に対する相対的な振幅の増加 [e.g., Hanson et al., 2001]、急峻なスペクトル傾斜 [e.g., Childers and Lee, 1991] として反映される。

この種の感情の収録が困難であることが原因と思われるが、動的な悲嘆・喪失を含むタイプの悲しみの音響的特徴はあまり研究が進んでいない。しかし、ロシア語の嘆き(lament)音声の音響的研究として Mazo et al. [1995] がある。嘆きはいくつかのロシアの村で悲しみを表現する際に用いられる発話の一種であり、歌、すすり泣き、興奮した叫び、音声の中断、ため息、呼吸などが交じる。音響的特徴としては、高い F0、持続時間の変動、1500-4500 Hz の帯域でのエネルギーの増加、ボーカルフライや二重声のような声帯振動のゆらぎ、振幅と周波数の同時変調が挙げられる。

非言語的感情の調音についてはまだほとんど調査されていない。一方、パラ言語情報の調音はある程度検討されている。例えば、アメリカ英語では苛立ちの際に顎が低くなる [Menezes, 2003; Mitchell et al., 2000; Erickson et al., 1998]。また、アメリカ英語 [Erickson et al., 2000] と日本語 [Maekawa et al., 1999] の両方で、疑いのときには舌背が前方に、称賛のときには後方に移動する。舌背位置の音響的帰結として、疑いの際には F1 と F2 の双方が上昇し、称賛や失望のときには低下する [e.g., Maekawa et al., 1999]。

ここでの我々の目的は、自発的な感情の音響的・調音的特徴を調査することである。これまでの研究の延長として [an extension of earlier work by Erickson et al., 2003a, b and 2004a-c]、人間が強い自然な感情を表出している収録音声について調音指標(EMA)を比較することは新しいアプローチである。各話者が同じ語を発話した以下の条件(話者により3〜4条件)の間で、自発的悲しみ音声の音響的・調音的特徴を比較した。

  • 自発感情
  • 感情模倣
  • フレージング/イントネーションの模倣
  • 読み上げ音声

悲しみ音声を対象とした理由は、これが強い感情であり、発話者が収録時にたまたまこの感情を強く経験していたためである。特に、我々は音響的特徴量として F0、フォルマント、持続時間、声質を調査した。一方、調音に関しては、唇、舌、顎の位置を調べた。

我々は以下の問いを立てた。

  • (1) 同じ言語内容の読み上げ音声と比較したとき、自発感情音声の表現上の特徴は何か
  • (2) 自発音声と模倣音声の音響的・調音的特徴には、どのような違いが存在するのか

「模倣された」悲しみは「演技した」悲しみに似ているが、模倣悲しみでは、発話者に自発音声の録音を聞かせながら、その感情的な発話動作をできるだけ正確に真似るように依頼している。模倣音声に関する先行研究 [Erickson et al., 1990] は、発話者が声質は真似られないものの、例えば自発音声における自分自身の F0 パターンは真似られることを示している。感情模倣を誘発することは、発話者がどの特徴に敏感であるかを評価する一つの方法である。しかし、それらは実際の自発発話の音響信号に現れているものとは異なる可能性もある。したがって、このアプローチは、感情音声の音響的特徴のいくつかを確認するための代替手段を提供する。演技音声は恐らく、特定の感情を伝達するための人間のレパートリーをステレオタイプ的に表現したものであり、感情模倣音声とも自発的感情音声とも異なるかもしれない。

A third question (3) we ask is whether there are acoustic and articulatory differences when the speaker imitates the spontaneous sad utterance vs. when she imitates only the phrasing and intonational patterns of the spontaneous sad utterance. Also, we ask (4) what are the acoustic and articulatory characteristics of speech that is rated by listeners as very sad? Is this the same or different from spontaneous sad speech? In addition, we ask (5) whether there are common characteristics in the production of emotional speech across different languages, e.g., American English and Japanese. On one hand, we might expect similarities because human beings experience basic emotions; however, we also would expect differences because of the complex interplay between the experience of emotion and the socio-linguistic constraints on the expression of emotion.

A sixth question we ask is (6) what are the similarities/differences between perception of sad speech in diverse languages, such as American English and Japanese? Do listeners of different languages pay attention to different acoustic characteristics in perceiving whether an utterance is sad or not? The strengths of the study lie particularly in the joint acoustic and articulatory treatment of the speech data and in the comparison of spontaneous and imitated emotion together with prosodic imitation without emotional expression. This type of study has not been done previously and as such, is a pioneering piece of work, presented here as a pilot study, or 'proof of method'. The weaknesses are those inherent in the non-laboratory design of the experiment: the linguistic content in terms of words/vowels analyzed is not strictly controlled nor is the timing of the collection of the spontaneous emotion for the speakers exactly the same, nor is the language - one speaker is American, one is Japanese, albeit both are female speakers. We acknowledge the tentative nature of the results, which we hope can be used as a baseline or extension for further, more extensive work.

Methods
Data Recording

2D EMA システムを使い、アメリカ人女性(中西部方言)と日本人女性(広島方言)の音響及び調音データを収録した。この研究の主な目的は自発感情音声の収録・解析の方法を確立することにあるため、二人の話し手の条件は厳密には同じものではない。表 1 に、発話者/言語、データセット、セッション、セッションのタイミング、発話条件、対象となった母音、および調音/音響指標の観点から、実験方法の違いを要約している。

Table 1. Summary of experimental method: speakers/languages, data sets, sessions, timing of sessions, utterance conditions, vowels examined and articulatory/acoustic measurements

American English

  • Sets of data: 2 sets (Set 1 - spontaneous conversation; Set 2 - control data)
  • Sessions: 2
  • Session timing: 1 month prior to mother's death (on day of operation for pancreatic cancer) and 5 months later
  • Conditions: 4 (spontaneous emotion, imitated emotion, imitated intonation, read)
  • Word/vowel: leave /i/ (2 utterances)
  • Articulatory measures: Jx, Jy, ULx, ULy, LLx, LLy, T1x, T1y, T2x, T2y, T3x, T3y
  • Acoustic measures: duration, F0, F1, F2 and amplitude quotient of glottal opening (AQ)

Japanese

  • Sets of data: 2 sets (Set 1 - informal spontaneous conversation; Set 2 - control data)
  • Sessions: 1
  • Session timing: 4 months after mother's sudden death due to brain aneurysm
  • Conditions: 3 (spontaneous emotion, imitated emotion, read)
  • Word/vowel: kara /a/ (2 utterances)
  • Articulatory measures: Jx, Jy, ULx, ULy, LLx, LLy, T1x, T1y
  • Acoustic measures: duration, F0, F1, F2, and spectral tilt

それぞれの発話者ごとに、二つのデータセットを収録した。セット1は、別の話者との気楽な自発的電話会話を収録したものであり、音声はイヤフォンとマイクロフォンを通してやり取りされた。会話相手は別の部屋に座っている。アメリカ人の場合、相手は第三者(友人/同僚)である。日本人の場合、会話相手は六人いた。会話相手には、被験者の私生活に関連した話題のリストに基づいて、幸福感や悲しさ、または怒りを呼び起こすような様々な質問をリハーサルなしで行うように依頼した。悲しんでいる感情(発話中に嗚咽することを含む)を収集できた実験のタイミングは、アメリカ人の場合、偶然にも被験者が母親が致命的な病気と診断されたことを知った直後であり、日本人の場合は、被験者が母親を脳動脈瘤のため最近亡くしたばかりの時期であった。会話相手は最近起きた悲しむべき状況に気づいており、インタビューの多くはこの話題を中心に行われた。EMA 収録は、フレーム間に約 3 秒の休止を挟む 20 秒のウインドウで行ったが、対話は自然なままに保たれた。音響収録も行った。ビデオ収録も行っているが、本文の解析には使用していない。

セット2の発話は、最初のデータ収録に対する統制データである。発話中に発話者が泣いてしまうことがあったため、統制実験用には特定の発話が選択された。ただし、表1に示した通り、アメリカ英語話者の統制収録は5ヶ月後に行われている。一方、日本人話者については、別の実験プロトコルの文章リストを読み上げたあと、同じセッション内の約一時間後に統制収録が行われている。セッションを二つに分けることには、感情発話の中から統制データとして最適な発話を選ぶ時間を十分に取れるという利点があった。一方で、調音指標を得るための EMA の座標系に、二つのセッション間で5度の差が生じた点は不利であった。この差については、一方のセッションの座標を数学的に変換して他方に合わせることで補正している(後述の Articulatory Analysis Method を参照)。今後は、この種の統制条件とは別のセッションで自発感情音声を収集することが推奨される。

両方の話者は、表1に示した条件に従って、オリジナル発話に対応する統制発話を行うよう依頼された。アメリカ人話者には4つの条件が存在するのに対し、日本人話者には3つの条件しか存在しないことに注意してほしい。この理由は以下の統制条件2 (Imitated Intonation) の項で説明する。

Control Condition 1 (Imitated Emotion).

シャドウイングのように、テープに録音されたオリジナル発話をヘッドフォンで聞きながら、また台本(アメリカ人の場合、オリジナル発話のイントネーションパターンとフレージングがマークされている)を見ながら、単語、フレーズ、イントネーション、感情を模倣した。感情模倣発話 (IE) はアメリカ人で 3 回、日本人で 6 回繰り返した。解析の目的上、最初の三回の繰り返しに含まれる対象語のみを調査した。発話は 20 秒ほどと長いため、発話者は模倣発話の間、テキストのコピーを見ていた。

Control Condition 2 (Imitated Intonation).

アメリカ人話者のみ、台本を見たりオリジナルの音源を聞いたりしながら(シャドウイングのように)、感情を含めずに、単語、フレーズ、イントネーションのみを模倣した。イントネーション模倣発話 (II) は三回繰り返した。アメリカ人話者は音声学の訓練を受けており、この課題を容易にこなすことができた。しかし、日本人話者の場合はセッションが一回のみで発話者をコーチする時間もなかったため、この条件は実施していない。

Control Condition 3 (Read).

音声の書き起こし文章によるオリジナル発話の読み上げも行った。日本人の場合、書き起こし文章は日本語の表記体系を使用している。読み上げ発話 (R) はアメリカ英語話者では二回、日本人話者では六回行った。最初の三回の繰り返しに含まれる対象語のみを解析の対象としている。

Utterances

理想的には、二つの言語で同じ母音文脈を維持したかった。しかし、実際には leave という語と「から」(「だから」の意)を選択した。これは、両被験者が最も強く悲しみの感情を示した発話にこれらの語が含まれていたためであり、これ以外に同様の母音文脈で強い悲しみを示した例が存在しなかったためである。

アメリカ英語の悲しみ音声に関しては、leave という語の例を解析した。この語は 20 秒の収録中の一つの発話内で二回発話されており、以下の 4 条件で解析を行った。

  1. (E) 発話者が母親の重い病気を知った最初の収録セッションにおける発話(実際に嗚咽している)
  2. (IE) オリジナル発話の録音を聞きながらのフレーズ、イントネーション、感情の模倣
  3. (II) オリジナル発話の録音を聞きながらのフレーズとイントネーションのみの模倣
  4. (R) オリジナル発話と同じフレーズ、イントネーションを保持した読み上げ音声

E 発話 * 2, IE 発話 * 6, II 発話 * 6, R 発話 * 2 の合計で、 16 個の収録が存在する。

日本語の悲しみ発話では、E(「から」の発話時、頬に涙を流していた。脚注参照)、IE、R の 3 つの発話条件のもとで、一つの発話内に二回現れた「から」という語を解析の対象にしている。E 発話で 2 回、IE 発話で 6 回、R 発話で 6 回、合計 14 個の「から」の収録があった。

Articulatory Analysis Method

We examined the movement of the EMA receiver coils attached to the (1) lower incisor (mandible), (2) upper lip, (3) lower lip, and (4) receiver coils (T1, T2, T3, T4) attached along the longitudinal sulcus of the speaker's tongue. The positions of the transmitter coils determine the coordinate system [Kaburagi and Honda, 1997], with the origin positioned slightly in front of and below the chin. All EMA values are positive, with increasingly positive y-values indicating increasingly raised jaw or tongue position, and increasingly positive x-values, increasingly retracted jaw or tongue position. The coordinates of the American English speaker's first recording were transformed by 5 degrees, which was the measured difference between the coordinates in the two sessions [Erickson et al., 2003b].

Articulatory measurements were made for the x-y coil positions for the upper and lower lip (UL, LL), for the mandible (J), and the tongue (T1, T2, T3, T4) at the time of maximum jaw opening for the utterance, using a MATLAB-based analysis program. Note that T4 recordings were not reliable for either of the speakers, and T3 was not reliable for the Japanese speaker because of coil-tracking problems. For the American English utterances, articulatory as well as acoustic measurements were made at the time of maximum jaw opening for leave. For the Japanese utterances, articulatory measurements were made at the time of maximum jaw opening during the second mora in the utterance kara.
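(補足)上記の「顎の最大開口時点(Jy が最小となるフレーム)で各コイルの x-y 位置を測る」という処理を、仮想的な EMA 軌道を仮定して書いた最小スケッチ。チャネル名やサンプリング周波数は説明用の仮定であり、論文で使われた MATLAB プログラムそのものではない。

```python
# 顎の最大開口時点(Jy が最小のフレーム)で各コイル位置を取り出す最小スケッチ
import numpy as np

def measure_at_max_jaw_opening(ema: dict[str, np.ndarray], fs: float) -> dict:
    jy = ema["Jy"]
    idx = int(np.argmin(jy))                 # Jy が最も低い = 顎が最も開いたフレーム
    values = {name: float(track[idx]) for name, track in ema.items()}
    return {"time_s": idx / fs, "positions": values}

if __name__ == "__main__":
    # 使用例: 乱数による仮の EMA 軌道(サンプリング周波数 250 Hz と仮定)
    rng = np.random.default_rng(1)
    fs, n = 250.0, 5000
    ema = {ch: rng.normal(size=n).cumsum() * 0.01 + 12.0
           for ch in ["Jx", "Jy", "ULx", "ULy", "LLx", "LLy", "T1x", "T1y"]}
    print(measure_at_max_jaw_opening(ema, fs))
```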

  • As seen from the video images, for the Japanese speaker, the manifestation of crying was tears running down the cheeks; for the American speaker, the face was contorted in a sobbing posture with jaw open and eyebrows knit together, as well as tears.

A sample of a data file is shown for Japanese in figure 1. Notice there are two distinct jaw openings, one for each mora. However, for two of the utterances (one IE and one R), there was only one jaw opening, which occurred at the plosion release for /k/.

_images/fig16.png

Fig. 1. Sample articulatory data of kara from utterance 81 (E). The bottom trace is the acoustic waveform. Vertical lines indicate lowest Jy position in each mora. Articulatory measurements made during the second mora were used in the analysis. The x-axis shows time in seconds. The values of the tick markings for the y-axes are as follows: Jx 7.85-7.80 mm; Jy 12.8-12.6 mm; ULx 5.9-5.85 mm; ULy 15.2-15.15 mm; LLx 6.45-6.3 mm; LLy 13.6-13.3 mm; T1x 9.2-8.8 mm; T1y 14.8-14.0 mm; T2x 10.2-9.8 mm; T2y 20-14.5 mm.

Acoustic Analysis Method

アメリカ英語及び日本語データの双方で、我々は持続時間、F0、F1、及び声質の計測を行った。アメリカ英語に関しては、Parham Mokhtari との先行する共同研究 [see Erickson et al., 2003b] により、以下に記述するように、声門流の推定波形から求めた声門開大の AQ を調査することができた。しかし、日本語の解析ではこの手法を利用できなかったため、以下に記述するように、スペクトル傾斜を調べる単純な手法を使用した。

American English.

単語レベルでの音響的解析(持続時間、F0、F1、声門の AQ)を、Mokhtari and Campbell [2003] に基づく逆フィルタ法と発声様式の推定法を使って行った。

我々の研究では、F0、F1、AQ の結果に注目した。具体的には、EMA の結果から特定したシラブル核のうち、最も顎が開いた時点の値を観察している。

線形予測ケプストラムを計算し、Broad and Clermont [1986] で提案されたケプストラムからフォルマントへの線形マッピング法を使用して、最初の4つのフォルマント周波数と帯域幅の初期推定値を得た。 Although the mapping had been trained on a subset of carefully measured formants of a female speaker of Japanese [Mokhtari et al., 2001], it was found to yield remarkably reasonable results for the American female speaker, as judged by visual inspection of the formant estimates superimposed on spectrograms. These formant estimates were then refined at each frame independently, by an automatic method of analysis-by-resynthesis whereby all 8 parameters (4 formant frequencies and bandwidths) are iteratively adjusted in order to minimize a distance between a formant-generated (or simplified) spectrum and the original FFT spectrum of the same analysis frame. The optimized formants centered around the point of maximum jaw opening in each utterance of leave were then used to construct time-varying inverse filters with the aim of eliminating the effects of the vocal-tract resonances (or formants). The speech signal was first high-pass filtered to eliminate low-frequency rumble (below 70 Hz), then low-pass filtered to eliminate information above the fourth formant, and finally inverse filtered to eliminate the effect of the first 4 vocal-tract resonances. The resulting signal (52 ms) was integrated to obtain an estimate of the glottal flow waveform.
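(補足)上記の処理の後半(70 Hz 以下のハイパス → 第4フォルマントより上のローパス → 第1〜第4フォルマントの逆フィルタ → 積分による声門流推定)の流れだけを、フォルマント周波数と帯域幅を既知と仮定して書いた最小スケッチ。カットオフの取り方などの細部は仮置きであり、論文の実装の再現ではない。

```python
# 既知のフォルマント周波数・帯域幅を仮定した簡易逆フィルタリングのスケッチ
import numpy as np
from scipy.signal import butter, lfilter

def inverse_filter(speech, fs, formants, bandwidths, f_hp=70.0):
    # (1) 低域の雑音 (< 70 Hz) を除去
    b, a = butter(2, f_hp / (fs / 2), btype="high")
    x = lfilter(b, a, speech)
    # (2) 第4フォルマントより上の成分を除去(カットオフは仮に F4 + 500 Hz)
    b, a = butter(4, min((formants[-1] + 500.0) / (fs / 2), 0.99), btype="low")
    x = lfilter(b, a, x)
    # (3) 各フォルマント共振(2次の全極)の逆フィルタ(FIR)を順に適用
    for f, bw in zip(formants, bandwidths):
        r = np.exp(-np.pi * bw / fs)
        w = 2.0 * np.pi * f / fs
        x = lfilter([1.0, -2.0 * r * np.cos(w), r * r], [1.0], x)
    # (4) 積分して声門流の推定波形を得る
    return np.cumsum(x)

if __name__ == "__main__":
    fs = 16000
    speech = np.random.default_rng(2).normal(size=fs)      # 1 秒のダミー信号
    flow = inverse_filter(speech, fs,
                          formants=[300, 2300, 3000, 4000],
                          bandwidths=[80, 120, 150, 200])
    print(flow.shape)
```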

The glottal AQ, proposed independently by Fant et al. [1994] and Alku and Vilkman [1996], is defined as the peak-to-peak amplitude of the estimated glottal flow waveform divided by the amplitude of the minimum of the derivative of the estimated flow. It is a relatively robust, amplitude-based measure which gives an indication of the effective duration of the closing phase of the glottal cycle. As discussed by Alku et al. [2002], AQ quantifies the type of phonation which is auditorily perceived along the continuum from a more pressed (i.e., creaky) to a more breathy voice quality.
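(補足)上の定義(声門流推定波形のピーク・トゥ・ピーク振幅を、流量微分波形の最小値の絶対値で割る)に沿った AQ 計算の最小スケッチ。入力には上記のような逆フィルタ出力(声門流の推定波形)を仮定している。

```python
# 声門振幅比 AQ: peak-to-peak(flow) / |min(d flow / dt)|
import numpy as np

def amplitude_quotient(glottal_flow: np.ndarray, fs: float) -> float:
    flow_pp = glottal_flow.max() - glottal_flow.min()   # ピーク・トゥ・ピーク振幅
    d_flow = np.diff(glottal_flow) * fs                 # 時間微分(流量微分波形)
    return float(flow_pp / abs(d_flow.min()))           # 値は時間(秒)の次元を持つ

if __name__ == "__main__":
    # 使用例: 52 ms の解析区間を仮定し、半波整流した正弦波を声門流の代用にする
    fs = 16000
    t = np.arange(int(0.052 * fs)) / fs
    flow = np.maximum(0.0, np.sin(2 * np.pi * 200 * t))
    print(amplitude_quotient(flow, fs))
```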

Japanese.

Acoustic analyses of the /ra/ mora - duration, spectral tilt, F0, and F1 (around the mid-portion of the /ra/ mora in kara) - were done using Kay Multispeech 3700 Workstation. We tried ordinary methods to characterize the creakiness of the voices, e.g., comparison of H1 with H2 or H3 [Ní Chasaide and Gobl, 1997]. However, because of the irregular glottal pulsing in nonmodal phonation an extra set of harmonics appears parallel to F0 and its harmonics [Gerratt and Kreiman, 2001], and it was not easy to reliably resolve each harmonic. However, visual inspection of the FFT results reveals that spectral differences of the three categories of speech reside in the strength of the energy around F1, i.e., 600-1,000 Hz, with R having the strongest energy and IE, the weakest. Therefore we devised a method to capture the gross characteristics of the harmonics. A regression analysis was made for each FFT result (512 points) from H1 to 1,000 Hz, with the slope of the regression line taken as an index of spectral tilt. Fitting a single regression line to the spectrum was explored in Jackson et al. [1985].
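(補足)上で述べられている「FFT(512 点)の H1 から 1,000 Hz までに回帰直線を当てはめ、その傾きをスペクトル傾斜の指標とする」方法を、F0 を既知と仮定して書いた最小スケッチ。窓や FFT 長などの細部は仮置きである。

```python
# H1(≒F0)から 1,000 Hz までの FFT 振幅 (dB) への回帰直線の傾きを
# スペクトル傾斜の指標とするスケッチ
import numpy as np

def spectral_tilt(frame: np.ndarray, fs: float, f0: float, f_max: float = 1000.0) -> float:
    spec = np.fft.rfft(frame * np.hanning(len(frame)), n=512)
    freqs = np.fft.rfftfreq(512, d=1.0 / fs)
    mag_db = 20.0 * np.log10(np.abs(spec) + 1e-12)
    band = (freqs >= f0) & (freqs <= f_max)          # H1 から 1,000 Hz までの帯域
    slope, _intercept = np.polyfit(freqs[band], mag_db[band], 1)
    return float(slope)                               # 単位は dB/Hz

if __name__ == "__main__":
    # 使用例: 倍音構造を持つ仮の定常信号(F0 = 190 Hz と仮定)
    fs, f0 = 16000, 190.0
    t = np.arange(512) / fs
    frame = sum(np.sin(2 * np.pi * f0 * k * t) / k for k in range(1, 8))
    print(spectral_tilt(frame, fs, f0))
```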

Perceptual Analysis

The words used in the acoustic/articulatory analyses (described in 'Utterances' above) were presented for auditory judgment in randomized order 4 times with HDA200 Sennheiser headphones in a quiet room, using a G3-Macintosh computer and Psyscope software. The task was to rate each word according to the perceived degree of sadness (5-point scale; 5, saddest). A practice test of 6 utterances preceded the test. For the American English test, the 16 instances of leave were presented auditorily to 11 American college students (4 males, 7 females) at The Ohio State University; for the Japanese test, the 14 instances of kara were presented auditorily to 10 Japanese female college students at Gifu City Women's College.

Results
Perception Test Results

For both American English and Japanese, listeners ranked emotional and imitated emotional speech as sadder than read speech, and in the case of American English, emotional and imitated emotional speech as sadder than imitated intonation speech, as can be seen in figure 2.

One-way ANOVA of the perceptual ratings (mean for all the listeners) with the utterance conditions (4 levels: E, IE, II, R for American English and 3 levels: E, IE, R for Japanese) as an explanatory variable found significant main effects for both languages (American English, F(3,12) = 351.328, p < 0.000; Japanese, F(2,11) = 14.704, p < 0.001). The Bonferroni pairwise comparisons revealed that perceptual rating of sadness is significantly higher for E and IE compared to II and R for American English (p < 0.000), and for IE compared to R for Japanese (p = 0.001). The finding that imitated intonation in American English was not rated as sad as spontaneous or imitated sadness suggests that intonation itself is not sufficient to convey sadness; as discussed in 'Results of Pearson Correlation with Ratings of Sadness and Acoustic/Articulatory Measures' below, voice quality and F0 height are salient characteristics of sadness.
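(補足)ここで行われている統計処理(発話条件を要因とする一元配置 ANOVA と Bonferroni 法による対比較)の流れを、表 1A の平均評定値を目安にした乱数データで再現した最小スケッチ。数値は論文の生データではない。

```python
# 発話条件 (E, IE, II, R) を要因とした一元配置 ANOVA と Bonferroni 補正つき対比較のスケッチ
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
ratings = {                          # 各条件の発話ごとの平均評定値(仮データ、平均は表 1A を目安)
    "E":  rng.normal(4.6, 0.3, 2),
    "IE": rng.normal(4.2, 0.3, 6),
    "II": rng.normal(1.5, 0.3, 6),
    "R":  rng.normal(1.3, 0.3, 2),
}

f, p = stats.f_oneway(*ratings.values())
print(f"one-way ANOVA: F = {f:.2f}, p = {p:.4f}")

pairs = list(itertools.combinations(ratings, 2))
alpha_corrected = 0.05 / len(pairs)  # Bonferroni 補正後の有意水準
for a, b in pairs:
    t, p_pair = stats.ttest_ind(ratings[a], ratings[b])
    sig = "sig." if p_pair < alpha_corrected else "n.s."
    print(f"{a} vs {b}: p = {p_pair:.4f} ({sig} at corrected alpha)")
```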

_images/fig26.png

Fig. 2. Ratings by American listeners of the perceived degree of sadness of 16 utterances of leave (left side) and ratings by Japanese listeners of perceived degree of sadness of 14 utterances of kara (right side). The error bars indicate standard deviation of p = 0.68.

Imitated expressions of sadness (IE) were actually rated as slightly sadder than spontaneous expressions of sadness (E) for both American English and Japanese. However, the differences were not significant. A larger sample size is necessary to further explore this.

Acoustic Analysis Results
_images/fig34.png

Fig.3. Results of acoustic measurements, both for American English leave (left side) and Japanese kara (right side) for each of the utterance conditions. From top to bottom, duration, F0, F1, AQ (for American English) and spectral slope (for Japanese). The error bars indicate standard deviation of p = 0.68.

Figure 3 shows the results of the acoustic analysis of American English and Japanese speech. The averaged acoustic values are shown in tables 1A and 2A in the Appendix.

Figure 3 shows that F0, F1 and voice quality (but not duration) of imitated sad speech are similar to those for spontaneous sad speech. One-way ANOVA of the acoustic measures (F0, F1, voice quality, and duration) with the utterance conditions (E, IE, II, R for American English and E, IE, R for Japanese) as an explanatory variable found main effects for duration, F0 and AQ for American English and for duration and F0 for Japanese (table 3A in the Appendix). Bonferroni pairwise comparisons found no significant differences between the E and IE conditions for F0, duration and voice quality for either the American English or Japanese. Both speakers showed a higher F0 for sad and imitated sad speech than for read speech (and imitated intonation speech). Bonferroni pairwise comparisons found F0 was significantly higher for E and IE compared to II and R for American English (p < 0.000), and for E and IE compared to R (p = 0.013 and p = 0.038, respectively) for Japanese. The finding of high F0 (which was clearly audible) is different from that usually reported for sad quiet speech, but similar to what was found for the active grieving seen in Russian laments.

F1 values were similar for spontaneous and imitated sad speech; compared to those for read (and imitated intonation) speech, for the American English speaker (for the vowel /i/), F1 values for sad and imitated sad speech were higher, and for the Japanese speaker (for the vowel /a/), they were lower. These differences were not significant. However, the graphs suggest a tendency for the /i/ and /a/ vowels to centralize in sad speech; this needs further investigation. The finding of low F1 for sad and imitated sad utterances for the Japanese speaker is reminiscent of the finding of lowered F1 values for disappointment (also for the vowel /a/) reported by Maekawa et al. [1999].

In addition, both speakers showed changes in voice quality for the sad and imitated sad speech, compared with the read (and imitated intonation) speech. For the American English speaker, both spontaneous and imitated sad speech have a low AQ. Bonferroni pairwise comparisons found AQ significantly lower for E and IE compared to II (E < II: p = 0.018; IE < II: p < 0.000), and for the Japanese speaker, a steep spectral tilt (but no significant differences).

The finding of low AQ for the American English spontaneous sad utterances is different from the high AQ previously reported for sad, quiet breathy speech. It may be that low AQ is seen in active grieving because crying would probably involve muscular tension, including tension of the vocal folds. When vocal folds are tense/pressed, the duration of vocal fold closure is large, and this would lower AQ.

_images/fig44.png

Fig. 4. Sample acoustic wave forms for Japanese kara. The top panel is sad speech (E), next is imitated sad speech (IE), and the bottom is read (R) speech. Each tick marking on the time axis indicates 100 ms.

As for the Japanese, spontaneous sad speech was perceived by the authors to be breathy-voiced and the imitated sad, creaky-voiced, and this can be seen in the acoustic waveforms in figure 4. The spontaneous sad speech in the top panel of the figure shows relatively smooth phonation for breathy voice [similar to that shown for breathy voice by Ishii, 2004]; the imitated sad speech in the middle panel shows irregular and sporadic phonation, typical of creaky voice [see also, e.g., Ishii, 2004, as well as e.g., Redi and Shattuck-Hufnagel, 2001], and the read speech in the bottom panel shows regular pulsing with clear vowel-formant characteristics, though some creakiness towards the end.

A comment about spectral tilt and breathiness/creaky voice: That breathy utterances have steep spectral slopes is relatively well known [e.g., Klatt and Klatt, 1990; Johnson, 2003], as is also that sad utterances tend to be breathy [i.e., Mokhtari and Campbell, 2002]. Creaky utterances at low F0 are not characterized by steep spectral tilt [e.g., Klatt and Klatt, 1990; Johnson, 2003]. However, in the Japanese data we obtained, we see a steep spectral tilt associated with the creaky voice of the imitated sadness. It may be that depending on the value of F0, the spectral slope of the creaky utterances changes, so that at high F0, creaky utterances have very steep spectral slopes. As Gerratt and Kreiman [2001] argue, characterization of variation in nonmodal phonation is not straightforward due to complications of both taxonomy and methods. Phonetic characterization of voice quality changes associated with emotions in speech has not yet been done; however, see e.g., Gobl and Ní Chasaide [2003] for their summary of voice quality analysis of expressive speech.

The averaged durations of the imitated sad speech for the American English speaker were generally longer than those of the spontaneous sad speech, as well as the other conditions. However, the only significant difference was for IE compared to II (p = 0.007). Perhaps the reason the imitated sad speech was longer than the spontaneous sad speech was because the speaker was expecting sad speech to be slow, and may have inadvertently allowed this expectation to influence her production.

For the Japanese speaker, the spontaneous and imitated sad utterances tended to be longer than the read utterances and significantly longer for E compared to R (p = 0.023), which is consistent with what is reported in the literature for sad speech.

Articulatory Analysis Results

Figure 5 shows the averaged horizontal and vertical coil positions for the American English and Japanese speakers. Whereas the acoustic characteristics of spontaneous and imitated sadness were fairly similar, the articulatory characteristics of these two conditions are different: imitated sad speech tends to be similar to read speech, or imitated intonation, rather than to spontaneous sadness. The averaged articulatory values are shown in tables 4A and 5A in the Appendix. Table 2 summarizes the significantly different characteristics (p < 0.05 based on Bonferroni pairwise comparisons) between sad and imitated sad speech.

Table 2. Articulatory characteristics of sad speech compared to imitated sad speech

Spontaneous sad speech

  • Upper lip: retracted for Am. Eng.; retracted, lowered for Japanese
  • Lower lip: retracted, raised for Am. Eng.; retracted, lowered for Japanese
  • Jaw: retracted, lowered for Am. Eng.; protruded, lowered for Japanese
  • Tongue tip: fronted for Am. Eng.; fronted for Japanese
  • Tongue blade: fronted for Am. Eng.
  • Tongue dorsum: fronted, raised for Am. Eng.

Imitated sad speech

  • Upper lip: protruded for Am. Eng.; protruded, raised for Japanese
  • Lower lip: protruded, lowered for Am. Eng.; protruded, raised for Japanese
  • Jaw: protruded, raised for Am. Eng.; retracted, raised for Japanese
  • Tongue tip: backed for Am. Eng.; backed for Japanese
  • Tongue blade: backed for Am. Eng.
  • Tongue dorsum: backed, lowered for Am. Eng.

For the American English speaker for spontaneous sadness compared with the other conditions (fig. 5 left side), the upper lip and lower lip are retracted, the lower lip is raised, the jaw is lowered, the tongue tip is fronted, the tongue blade is fronted and the tongue dorsum is fronted and raised. One-way ANOVA of the articulatory measures (UL, LL, J, T1, T2 and T3) with the utterance conditions (E, IE, II, R) as factors found significant main effects (table 6A in the Appendix). Bonferroni pairwise comparisons found ULx more retracted for E compared to IE, II, and R (p < 0.000), LLx more retracted for E compared to IE, II (p < 0.000), and R (p = 0.031), LLy more raised for E compared with IE, II, and R (p < 0.01), Jx more retracted for E compared to IE (p = 0.005) (while IE is more protruded than II, p = 0.004, or R, p < 0.000), Jy lower for E compared to IE (p = 0.018), T1x more fronted for E compared to IE, II and R (p < 0.000), T1y lower for E compared to II (p = 0.025), T2x more fronted for E compared to IE and II (p < 0.000) and R (p = 0.002), T3x more fronted for E compared to IE, II and R (p < 0.000) and T3y for E more raised compared to IE, II, and R (p < 0.000).

_images/fig53.png

Fig. 5. Results of articulatory x-y measurements in millimeters for UL, LL, J and T1, T2, and T3. American English leave is on the left side, and Japanese kara is on the right side. Smallest y-axis values indicate lowest coil positions, and smallest x-axis values indicate most forward coil positions, so that the lower left corner of the graphs indicates the lowest, most forward position of the articulators (as if the speaker were facing to the left of the page). E indicates sad speech, IE, imitated sad speech, II, imitated intonational speech, and R, read speech.

That the American English speaker showed tongue blade/dorsum raising and fronting for sad speech may be associated with the more open jaw used by this speaker, and the necessity to produce a phonologically recognizable high /i/ vowel when the jaw is open [see e.g., Erickson, 2002].

For the Japanese speaker for sad speech compared to the other conditions (fig. 5 right side), the upper and lower lips are retracted and lowered, the jaw is protruded and lowered (but lowered only compared with imitated sad speech, not read speech), and the tongue tip is fronted. One-way ANOVA of the articulatory measures (UL, LL, J, T1, T2 and T3) with the utterance conditions (E, IE, R) as factors found significant main effects (table 6A in the Appendix). Bonferroni pairwise comparisons found ULx more retracted for E compared to IE and R (p < 0.000), ULy lower for E compared to IE and R (p < 0.000), LLx more retracted for E compared to IE (p < 0.000) and R (p = 0.009), LLy lower for E compared with IE (p = 0.009), Jx more protruded for E compared to IE and R (p < 0.000) and Jy lower for E compared to IE (p = 0.010), and T1x more fronted for E compared to IE (p = 0.003) and R (p = 0.012).

It is interesting that in imitating sadness, both speakers showed protruded lips, configuring a lip-pouting position. It is also interesting that this gesture is different from the one used for spontaneous sad speech, in which both speakers retracted their lips. It may be that sadness with strong crying is likely to produce lip retraction whereas imitated sadness without crying need not.

Another interesting characteristic of sad speech by the American English speaker was the raised upper (for one of the utterances) and lower lips (see second graph left side of fig. 5), which matches Darwin's [1872, pp. 162-164] description of the typical square-shaped open-mouth crying face. According to Darwin, crying is initiated by knitting the eyebrows together. This maneuvre supposedly is necessary in order to protect the eye during violent expiration in order to limit the dilation of the blood vessels [p. 162]. Contraction of the orbicular muscles causes the upper lip to be pulled up, and if the mouth is open, then the result is the characteristic square-shaped mouth of crying. The video image of the American English speaker while crying and saying the word 'leave' was exactly this. The raised upper and lower lips follow from the knitted eyebrows of this speaker when crying.

Results of Pearson Correlation with Ratings of Sadness and Acoustic/Articulatory Measures

Scatter plots of sadness ratings as a function of F0, voice quality (AQ for American and spectral slope for Japanese), F1, and duration are shown in figure 6. High ratings of sadness are associated with high F0 (both American English and Japanese), low F1 (only for Japanese), steep spectral slope (only for Japanese), high AQ (for American English), and increased duration (only for American English). A Pearson correlation analysis for Japanese (using the numerical results listed in tables 2A and 5A) showed a significant linear correlation (p < 0.01) between sadness judgments and F0 (r = 0.81), spectral slope (r = -0.69) and F1 (r = -0.68); however, for American English raters, no Pearson correlation could be done since there was a bimodal distribution. It is not clear why the American English speaker showed a bimodal distribution. The small sample size may have contributed to the bimodal patterns.

Figure 7 shows sadness ratings and articulatory measures (LLy, T1y, and Jx for American English, and LLy, LLx, and Jy for Japanese). Japanese listeners displayed a significant linear correlation between ratings of sadness and raised lower lip (r = 0.76, p < 0.01), whereas American listeners showed a bimodal distribution, with low sadness ratings associated with lowered lower lip, raised tongue tip, and retracted jaw, and high sadness ratings associated with raised lower lip, lowered tongue tip and protruded jaw, which also may be related to this speaker's bimodal production of F0 values. A potential relation between supraglottal articulation and laryngeal tension is discussed at the end of this section. For Japanese, sadness judgments showed a significant correlation (p < 0.01) with protruded lower lip (r = -0.65) and raised jaw (r = 0.76).

Table 3 summarizes the pertinent acoustic and articulatory characteristics of well-perceived sad speech. The articulatory and acoustic characteristics of well-perceived sad American English speech are included here, since it is not clear whether the pattern of bimodal distribution is an artifact of the lack of a continuous range of F0 values and/or small sample size.

We interpret the results of the correlation between listener ratings of sadness and acoustic/articulatory measures as follows. In imitating sad speech, the American English and the Japanese speaker successfully imitated the high F0 and changed voice quality characteristics of the spontaneous sad speech.
However, it seems that there is a difference between the way a speaker articulates spontaneous sad speech, and what articulatory characteristics 'convey' sadness to a listener. For both American English and Japanese, speech that was given a high rating for sadness by listeners involved lip-pouting with raised and/or protruded jaw/lower lip.
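(補足)ここで述べられている Pearson 相関分析を、付録の表 2A にある日本語話者の条件別平均値(評定と F0)を例に計算してみる最小スケッチ。論文の相関係数は発話ごとの値に基づくため、この例の値とは一致しない。

```python
# 悲しさ評定と F0 の Pearson 相関を計算するスケッチ(値は表 2A の条件別平均)
from scipy.stats import pearsonr

sadness_rating = [3.2, 3.7, 2.4]   # E, IE, R の平均評定 (表 2A)
f0_hz          = [190, 184, 150]   # 同じ条件の平均 F0 (表 2A)

r, p = pearsonr(sadness_rating, f0_hz)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")
```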

_images/fig6.png

Fig. 6. Scatter plots of acoustic measurements and perception ratings for American English (left side) and Japanese (right side). E indicates sad speech, IE, imitated sad speech, II, imitated intonational speech, and R, read speech.

_images/fig7.png

Fig. 7. Scatter plots of articulatory measurements and perception ratings for American English (left side) and Japanese (right side). E indicates sad speech, IE, imitated sad speech, II, imitated intonational speech, and R, read speech. For the vertical coil positions, the lowest value indicates lowest coil position; for the horizontal coil positions, the lowest value indicates the more forward (advanced) coil position.

Table 3. Acoustic and articulatory characteristics of well-perceived sad speech

American English

  • Lower lip: raised
  • Jaw: protruded
  • Tongue tip: lowered
  • F0: raised
  • Duration: long
  • Voice quality: low AQ

Japanese

  • Lower lip: raised, protruded
  • Jaw: raised
  • F0: raised
  • F1: lowered
  • Voice quality: sharp spectral slope
_images/fig8.png

Fig. 8. Scatter plots of acoustic/articulatory measurements and voice quality measurements for American English (left side) and Japanese (right side). E indicates sad speech, IE, imitated sad speech, II, imitated intonational speech, and R, read speech.

Table 4. Acoustic and articulatory characteristics associated with voice quality

American English (low AQ)

  • Tongue tip: lowered
  • F0: raised

Japanese (sharp spectral slope)

  • Lower lip: raised
  • Jaw: raised
  • F0: raised

We also examined correlations between voice quality and acoustic/articulatory measures as shown in figure 8. Table 4 summarizes the acoustic and articulatory characteristics associated with voice quality. For the Japanese speaker, there was a significant correlation (p < 0.01) between voice quality and F0 (r = -0.64), with spectral slope becoming steeper as F0 becomes higher. There was also a significant correlation between spectral slope and lower LLy (r = 0.61, p < 0.05), and Jy (r = -0.63, p < 0.01), with spectral slope becoming steeper as the jaw and lower lip were more raised.

For the American English speaker, F0 appears to be negatively correlated with AQ in the high F0 range (300 Hz and above), but not in the modal F0 range (around 200 Hz), which does not show any correlation. There could be a connection between cricothyroid activity and AQ which could account for this distribution pattern. At normal F0, presumably when the cricothyroid is less active, this speaker tends to have a breathy voice with AQ greater than 1.0, but at high F0, when the vocal folds are more tense due to increased cricothyroid activity, we see lower AQ values. The relationship between speaking ranges (high, modal, low) and AQ is an interesting topic for further investigation.

In addition, there was a significant correlation between AQ and T1y (r = 0.64, p < 0.01). In the mid range of tongue tip height, the scatter plot shows no relationship with AQ; only at the extremes of tongue tip height do we see a correlation such that low AQ is associated with low tongue tip and high AQ with high tongue tip. This needs to be investigated further.

The results suggest a relation for American English between low AQ and raised F0, as well as lowered tongue tip, and for Japanese, between steep spectral slope and raised F0, raised jaw, lowered lip. Perhaps the maneuvers during imitated sadness of lowering the tongue tip (in the case of American English) or raising of the lips (in the case of Japanese) pull the larynx forward, change the vocal fold dynamics, and bring about irregular phonation, and in this way, contribute to either steepness of spectral slope or creakiness of voice, and consequently, to the listener's perception of sadness. This hypothesis needs to be explored further.

Conclusion

本研究では音響/調音データを収録する複製可能で新しい自発感情音声の収録方法を報告した。 この研究にはいくつかの欠点が存在する。

発見は、二人の発話者の、異なる母音核を持つ二つの発話タイプに基づいている。ここには、いくつかの疑問が残る。自発的な悲しみ音声は、演技された悲しみ音声や読み上げ音声とどのように比較されるのか。聞き手は本当に強い悲しみを感じ取るのか。自発的悲しみ音声や模倣悲しみ音声を、日本語と英語のように言語横断的にどのように比較するのか。

この実験は、実験室実験のようにデザインを統制できる性質のものではない。このタイプの初期的・探索的アプローチは、音声における自発的感情というテーマをより良く調査するために、今後どのように研究を進めるべきかを知る目的で行われたものである。

結果は、アメリカ英語と日本語双方において、悲しい音声が高い F0 を特徴としていることを示している。 また、F1 だけではなく、声質(例えば声門閉鎖サイクルの特徴/スペクトルチルトなど)も変化する傾向にあることを示している。

However, we see differences in terms of articulation. For the American speaker, the upper lip was retracted, the lower lip was retracted and raised while the jaw was retracted and lowered, the tongue tip fronted and lowered, the tongue blade fronted, and the tongue dorsum fronted and raised for spontaneous sad speech; for imitated sad speech, the upper and lower lips were protruded and lowered, the jaw was protruded and raised, the tongue tip, backed and raised, the tongue blade backed and the tongue dorsum, backed and lowered. For the Japanese speaker, both upper and lower lips were retracted and lowered, the jaw was protruded and lowered, and the tongue tip fronted for spontaneous sad speech; for imitated sad speech both upper and lower lips were protruded and raised, the jaw was retracted and raised, and the tongue tip, backed.

The results suggest that articulation of strong emotions by a speaker, such as crying while at the same time forcing oneself to speak, is different from acted/imitated emotion. Further exploration is needed into possible physiological connections between emotion and speech in terms of production and perception, as well as into biological underpinnings of sad speech.

Imitated sad speech showed a tendency (although no significant differences were found) for even more changed voice quality characteristics, i.e., more change in glottal opening characteristics or steeper spectral tilt, which perhaps contributed to listeners rating imitated sadness as more sad than spontaneous sadness. One reason for this may be that for imitated sadness, the speaker is trying to 'convey' sadness, but real sadness is the result of the speaker's internal state, something that the speaker does not control, that happens in spite of the speaker's intentions. That spontaneous sadness has a distinctly different pattern of articulation involving the jaw and lips (and tongue for American English) may be not for the purpose of conveying sadness, but be the by-product of experiencing sadness.

The fact that both the Japanese and the American English speaker, in imitating sadness, pouted lips for imitated sadness, i.e., the upper and lower lips were protruded, furthermore suggests there may be 'universal' postures for imitating sadness that are not necessarily the same as those for producing spontaneous sadness. In acted emotion, the speaker is volitionally changing the acoustic signal to impart to the listener a mental or emotional state (paralanguage), while in spontaneous emotion the speaker is working at maintaining the acoustic signal to convey the intended message even through emotional interruptions (nonlanguage). From the results of this study, we see it is important to clearly identify what a researcher of emotion intends to study and to choose the right testing paradigm. Using actors in emotion studies will give results on paralanguage and not results on emotion as such. Therefore, to study emotional characteristics of speech (articulatory phonetics, etc.) researchers need to move away from using actors.

An interesting result with imitated emotion is that the speakers seem to have been sensitive to imitating certain parts of their spontaneous sad speech, but not to others. For instance, the American English speaker (a phonetician) imitated F0, F1 and voice quality of the sad speech rather successfully, but not duration. It was as if duration was 'recalled' from a set of stereotypes of what constitutes sad speech.

With regard to imitated intonation (in which the American English speaker imitated the intonational and phrasing patterns, but not F0 range), imitated intonation speech is (a) different from spontaneous sad speech/imitated sad speech in terms of acoustics and articulation, (b) not rated highly by listeners as sad, and (c) shows low correlation of its acoustic characteristics with speech that was highly rated by listeners as sad. The acoustic characteristics that correlated highly with listeners' ratings of sadness were high F0 and low AQ (both characteristics of spontaneous sad or imitated sad speech, but not imitated intonation speech). These results suggest that intonation and phrasing patterns may not be sufficient to convey emotional information to listeners, a finding which is reminiscent of that reported by Menezes [2004], who showed that phrasing patterns in themselves were not salient characteristics of irritated utterances (vs. non-irritated utterances). More research needs to be done in this area in order to better understand the relationship between intonation, phrasing, F0 range, and voice quality.

With regard to the rating of sadness by listeners, we see a difference between the way a speaker articulates spontaneous sad speech and those articulatory characteristics which 'convey' sadness to a listener. For both American English and Japanese, speech that was given a high rating for sadness by listeners involved lip-pouting with raised and/or protruded jaw/lower lip. However, in actual articulation of spontaneous sad speech, we see the opposite: lowered and retracted jaw and retracted lower lip for American English and lowered jaw and lowered retracted lower lip for Japanese. The pattern of articulation for well-perceived sad speech seems to be similar to that for imitated sad speech, rather than spontaneous sad speech. This is a very interesting finding and needs to be explored further.

With regard to whether there are common characteristics in the acoustics and articulation of emotional speech in such different languages as American English and Japanese, the tentative results from this study (based on one speaker each) suggest there are common characteristics. In terms of acoustics, both speakers raised F0 and tended to change F1 and voice quality. In terms of articulation, both speakers retracted their upper and lower lips and lowered their jaw.

With regard to whether there are similarities/differences between perception of sad speech as a function of the language: the tentative results of this study suggest that listeners use similar cues to rate sadness, i.e., high F0, changed F1, and changed voice quality.

To briefly summarize our findings, we can say that (1) the acoustic and articulatory characteristics of spontaneous sad speech differ from those of read speech, or imitated intonation speech, (2) spontaneous sad speech and imitated sad speech seem to have similar acoustic characteristics (high F0, changed F1 as well as voice quality for both the American English and Japanese speaker) but different articulation, (3) there are similarities in the way they imitate sadness, i.e., lip protrusion, and (4) the articulatory characteristics of imitated sad speech tend to show better correlation with ratings of sadness by listeners than do those of spontaneous sad speech.
The method and tentative results of this study are reported here in order to serve as guidelines for investigating various acoustic and articulatory characteristics of expressive speech.

Acknowledgements

We thank NTT Communication Science Labs for allowing us to use the EMA facilities, and Parham Mokhtari for his help in analyzing the acoustic American English data, specifically voice quality. I also wish to thank two anonymous reviewers for their extremely helpful comments, and especially, Klaus Kohler for his encouragement to revise the manuscript in order that this proof of method research could be published. This study is supported by a Grant-in-Aid for Scientific Research, Japanese Ministry of Education, Culture, Sports, Science and Technology (2002-5): 14510636 to the first author.

Appendix
Table 1A. Averaged acoustic values of E, IE, II, and R speech (American English)
Cat. Dur., ms F0, Hz F1, Hz F2, Hz AQ Percept.
E 250 327 355 2381 0.97 4.6
IE 370 335 342 2077 0.94 4.2
II 240 187 338 2579 1.42 1.5
R 300 199 335 2623 1.21 1.3
Table 2A. Averaged acoustic values of E, IE, II, and R speech (Japanese)
Cat. Dur., ms F0, Hz F1, Hz F2, Hz S. slope Percept.
E 210 190 796 1496 -0.019 3.2
IE 175 184 785 1648 -0.023 3.7
R 157 150 861 1682 -0.013 2.4
Table 3A. ANOVA results of acoustic measures with utterance conditions
American English (d.f. = 3,12): duration F = 6.284, p < 0.01; F0 F = 70.601, p < 0.000; F1 F = 0.338, p = 0.798; AQ F = 13.67, p < 0.000
Japanese (d.f. = 2,11): duration F = 5.43, p < 0.05; F0 F = 8.017, p < 0.000; F1 F = 1.43, p = 0.280; spectral slope F = 3.365, p = 0.072
Table 4A. Averaged articulatory values of E, IE, II, and R speech (American English)
Cat. Jx Jy ULx ULy LLx LLy T1x T1y T2x T2y T3x T3y
E 7.24 12.07 6.64 14.66 6.45 13.06 7.85 13.73 8.87 15.09 9.951 15.47
IE 7.03 12.21 6.03 14.38 5.84 12.70 8.39 13.93 9.46 14.82 1.751 14.19
II 7.18 12.17 6.01 14.50 5.93 12.58 8.48 14.11 9.52 14.87 1.771 14.21
R 7.38 12.15 6.12 14.41 6.15 12.52 8.53 13.99 9.44 14.99 1.81 14.36
Table 5A. Averaged articulatory values of E, IE, II, and R speech (Japanese)
Cat. Jx Jy ULx ULy LLx LLy T1x T1y T2x T2y T3x T3y
E 7.78 12.63 5.92 15.12 6.51 13.13 8.81 14.12 9.96 14.62    
IE 8.18 12.81 5.37 15.42 6.19 13.42 9.34 14.14 10.04 14.79    
R 8.25 12.64 5.44 15.31 6.39 13.08 9.24 14.05 10.00 14.89    
Table 6A. ANOVA results of articulatory measures with utterance conditions
American English (d.f. = 3,12): ULx F = 63.087, p < 0.000; ULy F = 2.257, p = 0.134; LLx F = 26.652, p < 0.000; LLy F = 14.137, p < 0.000; Jx F = 21.922, p < 0.000; Jy F = 4.676, p < 0.05; T1x F = 43.776, p < 0.000; T1y F = 4.608, p < 0.05; T2x F = 17.189, p < 0.000; T2y F = 3.724, p = 0.042; T3x F = 107.417, p < 0.000; T3y F = 393.784, p < 0.000
Japanese (d.f. = 2,11): ULx F = 725.195, p < 0.000; ULy F = 200.83, p < 0.000; LLx F = 59.89, p < 0.000; LLy F = 20.835, p < 0.000; Jx F = 62.78, p < 0.000; Jy F = 14.138, p < 0.01; T1x F = 9.862, p < 0.01; T1y F = 0.237, p = 0.793; T2x F = 0.101, p = 0.904; T2y F = 0.871, p = 0.446
References
[1]Alku, P.; Backstrom, T.; Vilkman, E.: Normalized amplitude quotient for parameterization of the glottal flow. J. acoust. Soc. Am. 112: 701-710 (2002).
[2]Alku, P.; Vilkman, E.: Amplitude domain quotient for characterization of the glottal volume velocity waveform estimated by inverse filtering. Speech Com. 18: 131-138 (1996).
[3]Broad, D.J.; Clermont, F.: Formant estimation by linear transformation of the LPC cepstrum. J. acoust. Soc. Am. 86: 2013-2017 (1986).
[4]Brown, W.A.; Sirota, A.D.; Niaura, R.; Engebretson, T.O.: Endocrine correlates of sadness and elation. Psychosom. Med. 55: 458-467 (1993).
[5]Bühler, K.: Sprachtheorie; 2nd ed. (Fischer, Stuttgart 1965).
[6]Campbell, N.; Erickson, D.: What do people hear? A study of the perception of non-verbal affective information in conversational speech. J. phonet. Soc. Japan 8: 9-28 (2004).
[7]Childers, D.G.; Lee, C.K.: Vocal quality factors: analysis, synthesis, perception. J. acoust. Soc. Am. 90: 2394-2410(1991).
[8]Cowie, R.; Douglas-Cowie, E.; Schroeder, M. (eds.): Proceedings of the ISCA Workshop on Speech and Emotion: A Conceptual Framework for Research, Belfast (2000).
[9]Darwin, C.: The expression of the emotions in man and animals; 3rd ed. (Oxford University Press, Oxford 1998).
[10]Douglas-Cowie, E.; Campbell, N.; Cowie, R.; Roach, P.: Emotional speech: towards a new generation of databases. Speech Commun. 40: spec. issue Speech and Emotion, pp. 33-60 (2003).
[11]Eldred, S.H.; Price, D.B.: A linguistic evaluation of feeling states in psychotherapy. Psychiatry 21: 115-121 (1958).
[12]Erickson, D.: Articulation of extreme formant patterns for emphasized vowels. Phonetica 59: 134-149 (2002).
[13]Erickson, D.: Expressive speech: production, perception and application to speech synthesis. J. acoust. Soc. Japan 26: 4 (2005).
[14]Erickson, D.; Abramson, A.; Maekawa, K.; Kaburagi, T.: Articulatory characteristics of emotional utterances in spoken English. Proc. Int. Conf. Spoken Lang. Processing, vol 2, pp. 365-368 (2000).
[15]Erickson, D.; Bauer, H.; Fujimura, O.: Non-F0 correlates of prosody in free conversation. J. acoust. Soc. Am. 88: S128 (1990).
[16]Erickson, D.; Fujimura, O.; Pardo, B.: Articulatory correlates of prosodic control: emotion versus emphasis. Lang. Speech 41: spec. issue Prosody and Conversation, pp. 399-417 (1998).
[17]Erickson, D.; Fujino, A.; Mochida, T.; Menezes, C.; Yoshida, K.; Shibuya, Y: Articulation of sad speech: comparison of American English and Japanese. Acoust. Soc. Japan, Fall Meeting, 2004a.
[18]Erickson, D.; Menezes, C.; Fujino, A.: Sad speech: some acoustic and articulatory characteristics. Proc. 6th Int. Semin. Speech Prod., Sydney, Dec. 2003a.
[19]Erickson, D.; Menezes, C.; Fujino, A.: Some articulatory measurements of real sadness (ThA3101p.3) Proc. Int. Conf. Spoken Lang. Processing, Jeju, Oct 2004b.
[20]Erickson, D.; Mokhtari, P.; Menezes, C.; Fujino, A.: Voice quality and other acoustic changes in sad speech (grief). IEICE Tech. Rep. SP2003 June. ATR: 43-48 (2003b).
[21]Erickson, D.; Yoshida, K.; Mochida, T.; Shibuya, Y.: Acoustic and articulatory analysis of sad Japanese speech. Phonet. Soc. Japan, Fall Meeting, 2004c.
[22]Fant, G.; Kruckenberg, A.; Liljencrants, J.; Bavergard, M.: Voice source parameters in continuous speech. Transformation of LF parameters. Proc. Int. Conf. Spoken Lang. Processing, 1994, pp. 1451-1454.
[23]Fujisaki, H.: Information, prosody, and modelling - with emphasis on tonal features of speech. Proc. Speech Prosody 2004, Nara 2004, pp. 1-10.
[24]Gerratt, B.; Kreiman, J.: Toward a taxonomy of nonmodal phonation, J. Phonet. 29: 365-381 (2001).
[25]Gobl, C.; Ní Chasaide, A.: The role of voice quality in communicating emotion, mood, and attitude. Speech Commun. 40: 189-212 (2003).
[26]Hanson, H.M.; Stevens, K.N.; Kuo, H.-K.J.; Chen, M.Y.; Slifka, J.: Towards a model of phonation. J. Phonet. 29: 451-480 (2001).
[27]Iida, A.: A study on corpus-based speech synthesis with emotion; thesis, Keio (2000).
[28]Ishii, C.T.: A new acoustic measure for aspiration noise detection. (WeA501p.10) Proc. Int. Conf. Spoken Lang. Processing, Jeju, Oct. 2004.
[29]Jackson, M.; Ladefoged, P.; Huffman, M.K.; Antoñanzas-Barroso, N.: Measures of spectral tilt. UCLA Working Papers Phonet 61: 72-78 (1985).
[30]Johnson, K.: Acoustic and auditory phonetics (Blackwell Publishing, Malden 2003).
[31]Kaburagi, T.; Honda, M.: Calibration methods of voltage-to-distance function for an electromagnetic articulometer (EMA) system J. acoust. Soc. Am. 111: 1414-1421 (1997).
[32]Klatt, D.; Klatt, L.: Analysis, synthesis and perception of voice quality variations among female and male talkers. J. acoust. Soc. Am. 87: 820-857 (1990).
[33]Maekawa, K.; Kagomiya, T.; Honda, M.; Kaburagi, T.; Okadome, T.: Production of paralinguistic information: from an articulatory point of view. Acoust. Soc. Japan: 257-258 (1999).
[34]Mazo, M.; Erickson, D.; Harvey, T.: Emotion and expression, temporal date on voice quality in Russian lament. 8th Vocal Fold Physiology Conference, Kurume 1995, pp. 173-187.
[35]Menezes, C.: Rhythmic pattern of American English: An articulatory and acoustic study; PhD thesis, Columbus (2003).
[36]Menezes, C.: Changes in phrasing in semi-spontaneous emotional speech: articulatory evidences. J. Phonet. Soc. Japan 8: 45-59 (2004).
[37]Mitchell, C.J.; Menezes, C.; Williams, J.D.; Pardo, B.; Erickson, D.; Fujimura, O.: Changes in syllable and bound- ary strengths due to irritation. ISCA Workshop on Speech and Emotion, Belfast 2000.
[38]Mokhtari, P.; Campbell, N.: Perceptual validation of a voice quality parameter AQ automatically measured in acoustic islands of reliability. Acoust. Soc. Japan: 401-402 (2002).
[39]Mokhtari, P.; Campbell, N.: Automatic measurement of pressed/breathy phonation at acoustic centres of reliability in continuous speech. IEICE Trans. Information Syst. E86-D: Spec. Issue on Speech Information, pp. 574-582 (2003).
[40]Mokhtari, P.; Iida, A.; Campbell, N.: Some articulatory correlates of emotion variability in speech: a preliminary study on spoken Japanese vowels. Proc. Int. Conf. Speech Processing, Taejon 2001, pp. 431-436.
[41]Ní Chasaide, A.; Gobl, C.: Voice source variation; in Hardcastle, Laver, The handbook of phonetic sciences (Blackwell, Oxford 1997), pp. 427-461.
[42]Redi, L.; Shattuck-Hufnagel, S.: Variation in the realization of glottalization in normal speakers. J. Phonet. 29: 407-429 (2001).
[43]Sadanobu, T.: A natural history of Japanese pressed voice. J. Phonet. Soc. Japan 8: 29-44 (2004).
[44]Scherer, K.R.: Vocal communication of emotion: A review of research paradigms. Speech Commun. 40: 227-256 (2003).
[45]Scherer, K.R.: Nonlinguistic vocal indicators of emotion and psychopathology; in Izard, Emotions in personality and psychopathology, pp. 493-529 (Plenum Press, New York 1979).
[46]Scherer, K.R.; Banse, R.; Wallbott, H.G.; Goldbeck, T.: Vocal cues in emotion encoding and decoding. Motivation Emotion 15: 123-148 (1991).
[47]Schroeder, M.: Speech and emotion research: an overview of research frameworks and a dimensional approach to emotional speech synthesis; thesis 7, Saarbrücken (2004).
[48]Schroeder, M.; Aubergé, V.; Cathiard, M.A.: Can we hear smile? Proc. 5th Int. Conf. Spoken Lang. Processing, Sydney 1998.


The role of voice quality in communicating emotion, mood and attitude

Authors:Christer Gobl, Ailbhe Ni Chasaide
Journal:Speech Communication vol. 40, no. 1-2, pp. 189-212, 2003
Tags:プロポーザル; 感情音声; レビュー
Keywords:Voice quality; Affect; Emotion; Mood; Attitude; Voice source; Inverse filtering; Fundamental frequency; Synthesis; Perception

注釈

Abstract

This paper explores the role of voice quality in the communication of emotions, moods and attitudes. Listeners' reactions to an utterance synthesised with seven different voice qualities were elicited in terms of pairs of opposing affective attributes. The voice qualities included harsh voice, tense voice, modal voice, breathy voice, whispery voice, creaky voice and lax-creaky voice. These were synthesised using a formant synthesiser, and the voice source parameter settings were guided by prior analytic studies as well as auditory judgements. Results offer support for some past observations on the association of voice quality and affect, and suggest a number of refinements in some cases. Listeners' ratings further suggest that these qualities are considerably more effective in signalling milder affective states than the strong emotions. It is clear that there is no one-to-one mapping between voice quality and affect: rather a given quality tends to be associated with a cluster of affective attributes.

目次

Introduction

The present paper focuses on the role that voice quality plays in the signalling of speaker affect, broadly defined to include aspects of speaker attitude, mood, emotion, etc. The experiments described are very exploratory in nature and part of ongoing research on voice source variation in speech and on its function in communicating paralinguistic, linguistic and extralinguistic information. As part of this endeavour, we have been working towards the provision of acoustic descriptions of individual voice qualities (e.g., Gobl, 1989; Gobl and Ni Chasaide, 1992; Ni Chasaide and Gobl, 1995). Although the work has been mainly analytic, synthesis has been used to test and fine-tune our descriptions, and further to explore how individual source parameters or combinations of them may cue particular voice qualities (e.g., Gobl and Ni Chasaide, 1999a).

Growing out of this work, in the present study, listeners’ responses were elicited for the affective colouring of synthetic stimuli differing in terms of voice quality. This allows us to demonstrate in the first instance some of the kinds of affective colouring that can be achieved through synthesis. Insofar as our synthetic stimuli approximate to the human qualities they were meant to capture, we hope ultimately to shed light on the role of different voice qualities in the human communication of affect. By focussing on voice quality in this experiment, the aims were: firstly, to demonstrate whether and to what extent voice quality differences such as these can alone evoke distinct affective colourings, as has traditionally been assumed by phoneticians; secondly, to see to what extent results can lend support to past assumptions concerning the affective mapping of individual qualities and help clarify where rather contradictory claims have been made. A third objective is to provide a framework for subsequent exploration of how voice quality combines with f0, and ultimately with the other known acoustic and temporal features that are involved in the expression of affect.

To date, research on the vocal expression of emotion has demonstrated that many features may be involved. Whereas there has tended to be an overwhelming focus on pitch variables (especially f0 level, and range, but also the pitch contour and the amount of jitter) many studies have included the investigation of speech rate and intensity differences (Scherer, 1981, 1986, 1989; Mozziconacci, 1995, 1998; Stibbard, 2000; Williams and Stevens, 1972; Carlson et al., 1992). Other features may play a role, such as pausing structure (see, for example, Cahn, 1990a,b), segmental features, particularly those that relate to the precision of supraglottal articulation (Kienast et al., 1999; Laukkanen et al., 1996; Scherer, 1986; Carlson et al., 1992) or even rather fine grained durational effects such as the duration of accented and unaccented syllables (Mozziconacci, 1998). When present, extralinguistic interjections such as sighs, cries, inhalations (Scherer, 1994; Schröder, 2000) can provide powerful indications of the speaker's emotion. Comprehensive studies dealing particularly with f0 variation, intensity, timing and spectral information have been carried out by Scherer and coresearchers over more than two decades. Useful overviews of empirical studies in this area can be found in (Scherer, 1986, 1989; Kappas et al., 1991; Frick, 1985; Murray and Arnott, 1993).

Although many researchers tend to stress its fundamental importance, relatively little is known about the role of voice quality in communicating affect. As pointed out by Scherer (1986) the tendency has been to concentrate on those parameters that are relatively easy to measure, such as f0, intensity and timing, whereas voice quality has been neglected, relatively speaking, because of the methodological and conceptualisation difficulties involved (see Section 2).

Scherer (1986) further asserts that "although fundamental frequency parameters (related to pitch) are undoubtedly important in the vocal expression of emotion, the key to the vocal differentiation of discrete emotions seems to be voice quality". Experimental support for the basic importance of voice quality can be found in experiments by Scherer et al. (1984), where different degradation or masking procedures were applied to spoken utterances as a way of masking features of intonation, voice quality and verbal content. Listeners' evaluations of affect appeared to be primarily determined by voice quality cues, relatively independent of distortions in f0 cues or presence/absence of identifiable verbal content. Although there have been source analyses of different voice qualities in the literature (see, for example, Alku and Vilkman, 1996; Childers and Lee, 1991; Gobl, 1989; Gobl and Ni Chasaide, 1992; Lee and Childers, 1991; Price, 1989), very few empirical studies have focussed on the voice source correlates of affective speech. Laukkanen et al. (1996) studied variations in source parameters, sound pressure level (SPL) and intraoral pressure related to stress and emotional state. Their source data were obtained using the IAIF iterative technique of inverse filtering (Alku, 1992), and they found significant variation in the glottal wave, independent of f0 and SPL, for different emotional states. Angry speech was included in the study by Cummings and Clements (1995) on styles of speech, which employed an inverse filtering technique based on that of Wong et al. (1979). Some further source data for different emotions, obtained by inverse filtering based on closed-phase covariance LPC, are reported by Klasmeyer and Sendlmeier (1995). Johnstone and Scherer (1999) present electroglottographic data on glottal parameters, including irregularities in fundamental period (jitter), for seven emotional states. Alter et al. (1999) present examples of estimates of the noise component, in terms of measures of the harmonic-to-noise ratio, for different emotional states.

In spite of these contributions, no clear picture emerges and our understanding of the voice source correlates of affect remains limited. Much of what we know about the mapping of voice quality to affect has come in the form of received wisdom, based on impressionistic phonetic observations. Some of these are summarised by Laver (1980): breathy voice has been associated with intimacy, whispery voice with confidentiality, harsh voice with anger and creaky voice with boredom, for speakers of English at any rate. From the way such traditional observations are put, one would infer that a given voice quality is associated with a particular affect. On the basis of predictions from hypothesised physiological correlates of specific emotions, and of observations in a wide range of studies (mostly based on the relative strength of high versus low frequency energy in the spectrum), Scherer (1986) suggests that tense voice is associated with anger, joy and fear, and that lax voice (at the phonatory level essentially the same as breathy voice) is associated with sadness. In a similar vein, Laukkanen et al. (1996) have reported that in their data anger was characterised by low open quotient values of the glottal flow (suggesting a rather tense setting) and that sadness, surprise and enthusiasm tended to have high open quotient values and low glottal skew values, which would indicate a more breathy setting. Not all researchers agree, however, on the mapping between affect and voice quality. On the basis of a wide review of literature sources, Murray and Arnott (1993) suggest very different associations: in their Table 1, breathy voice is associated with both anger and happiness; sadness is associated with a 'resonant' voice quality, which we would here interpret as a quality somewhere along the modal to tense voice continuum. It is hoped that the present study would shed some light on the nature of these associations, providing possible support for traditional assumptions, or clarifying where there is clear disagreement in the literature. However scant the production literature, there is even less information on perception aspects.


In experiments by Laukkanen et al. (1995, 1997), the role of glottal parameters in the perception of emotion was studied by manipulating a vocalic interval (recorded with different emotions) so as to neutralise the effects of f0, SPL and duration. They concluded that the glottal source contributes to the perception of valence as well as vocal effort. They note, however, that the type of f0 manipulations used in their experiments may lead to certain artefacts, and suggest that synthesis would be a useful tool for further research in this area. Synthesis offers in principle an ideal tool for examining how individual features of the signal contribute to the perception of affect, as demonstrated by experiments on f0 and temporal parameters (e.g., Carlson et al., 1992; Mozziconacci, 1998). The lack of empirical voice source information presents a problem in the case of voice quality. Nonetheless, there have been a number of attempts to generate emotive synthetic speech through manipulation of a large number of parameters, including voice quality. The work by Cahn (1990a,b) and by Murray and Arnott (1995) utilised the capabilities of DECtalk: in these cases problems arise from the inherent limitations of the synthesis system, in that it did not always provide adequate control of the desired parameters. The GLOVE system, described by Carlson et al. (1991), offers a potentially high level of control, and in a study by Meurlinger (1997) the source parameters of this system were exploited in an attempt at generating synthetic speech with emotional overtones. Burkhardt and Sendlmeier (2000) describe a synthesis system for the generation of emotional speech, which uses the KLSYN88 synthesiser, which also allows direct control of many voice source parameters. They report experiments involving manipulations of f0, tempo, voice quality and segmental features. As regards the voice quality aspects of this work, they found that falsetto voice yielded a very good response for fear, tense voice was associated with anger, and falsetto and breathy voice were weakly associated with sadness. Results for boredom appeared uncertain: one experiment indicated some association with creaky or with breathy voice, but a second experiment concluded that these voice qualities reduced rather than enhanced the percept.


Unfortunately, details are not included concerning the source parameters used, nor how they were controlled to generate the different qualities. Attempts to generate emotive speech have also been made using concatenative synthesis, e.g., by Murray et al. (2000). As these systems use prerecorded speech units, and permit very limited control of source parameters other than f0, they are less relevant to this study. Note, however, the approach adopted by Iida et al. (2000), whereby recording multiple corpora with different emotional colourings provides an expanded database from which the concatenative units are drawn. Most past research carried out in the field has tended to focus on a small set of rather strong emotions, such as anger, joy, sadness and fear. Voice quality contributes greatly to the expressiveness of human speech, and signals to the listener not only information about such strong emotions, but also about milder states, which we might characterise as feelings, moods and general states of being. Furthermore, in an obviously related way, voice quality signals information concerning the speaker's attitude to the interlocutor, the subject matter and the situation. In this study we have tried to allow for as broad as possible a set of affective states, and therefore the range of possible affective attributes for which responses were elicited included not only emotions (e.g., afraid, happy, angry, sad) but also attributes that relate to speaker state and mood (e.g., relaxed, stressed, bored) or speaker attitude (e.g., formal, interested, friendly). It is worth noting that a broad approach is also potentially more useful for downstream technology applications. A major area of application of this type of research is the provision of expressive voice in speech synthesis. If one wants to aspire to a synthesis that approximates how humans employ their capacity to vary the tone of voice, it makes little sense to begin by excluding much of the subject of interest. The voice modifications most frequently sought in specific synthesis applications tend to be ones pertaining to state, mood and attitude (e.g., relaxed, friendly, polite, etc.), rather than to the 'strong' emotions.

2. Voice quality and emotion: conceptual and methodological problems

This area of research presents numerous difficulties. Some of these are general ones, and pertain also to any research on the vocal features of affect communication. A fundamental problem is the lack of a widely accepted system for categorising affective states, and the potential inadequacy of English language terms, such as angry, to represent emotional states (see discussion in Scherer, 1986). Another major difficulty has been that of obtaining emotionally coloured speech data. These aspects have been widely aired in the literature, and will not be discussed further here. The paucity of information on the role of voice quality in communicating affect reflects the very specific additional difficulties that arise both at a conceptual level, in terms of defining voice qualities, and at the methodological and technical level, in obtaining reliable measures of the voice source. Firstly, most work on voice quality depends on the use of impressionistic auditory labels such as breathy, harsh, etc., which are rarely defined. The problem with impressionistic labels such as 'harsh voice' is that they can mean different things to different researchers. Thus, a given label may refer to different phenomena while different labels may be used to describe very similar phenomena, depending simply on the users' understanding of the term. The potential uncertainty can be illustrated in terms of the discussion above on voice quality correlates of emotion: where different researchers attribute very different voice qualities to an emotion (e.g., anger is associated with tense voice in Scherer, 1986 and with breathy voice in Murray and Arnott, 1993), or the same voice quality to very different emotions, it begs the question as to whether the implied differences/similarities actually relate to voice quality phenomena or arise spuriously out of a different understanding of the descriptive terms. And whereas one might expect some degree of cross-researcher consensus on how "breathy voice" or "tense voice" might be interpreted, this is unlikely for many other terms (e.g., "blaring" and "grumbled" in Murray and Arnott, 1993, Table 1).

This is a problem that besets all research in the area of voice quality, whether in normal or pathological speech (see for example the discussion in Hammarberg, 1986), and whether it is based on simple auditory/impressionistic or empirical methods. Measurements of voice source parameters in different emotions (as presented in some of the studies mentioned below) can be very difficult to interpret meaningfully if they cannot be related to the auditory impression as well as to the underlying production correlates and their spectral consequences. Laver (1980) has proposed a classification system, backed by physiological and acoustic data where available, which provides, in the words of Scherer (1986), "a coherent conceptual system" for voice quality research. In our earlier analyses of voice quality, as in the present perceptual study, we have attempted to situate our descriptions within Laver's frame of reference, pointing out where we deviate from, or extend, Laver's usage (see descriptions in Section 3). The other major problem in this area of research is a methodological one, pertaining to the difficulty of obtaining appropriate measures of the glottal source. Direct, high-fidelity recordings of the source signal would be very desirable. A technique involving miniature pressure transducers (Cranen and Boves, 1985; Kitzing and Löfqvist, 1975) inserted between the vocal folds could in principle provide this. However, the procedure involved is not only highly invasive, requiring a local anaesthetic, but may also encounter problems in transducer stability, as well as possibly interfering with the vocal production. Given the practical difficulties involved, it is not surprising that very little source data have been obtained with this technique. Inverse filtering of the oral airflow or of the speech pressure waveform offers a non-invasive alternative. Speech production may be modelled as the convolution of the source signal and the vocal tract filter response. Inverse filtering the speech signal separates source and filter by cancelling the effects of the vocal tract, and the resulting signal is an estimate of the source. However, inverse filtering of the speech signal in order to separate source and filter is inherently difficult, as it is fundamentally an ill-posed problem.


In decomposing the speech signal there are three basic elements: source, filter and speech signal. As only one of these is known (the speech signal), determining the other two is in principle not possible. Only by exploiting knowledge about the characteristics and constraints of the source and, in particular, of the filter is it possible to identify the likely contribution of each to the speech signal, and thus to separate the two. Numerous fully automatic inverse filtering algorithms have been developed, most of which are based on some form of linear predictive analysis (e.g., Alku, 1992; Alku and Vilkman, 1994; Chan and Brookes, 1989; Ding et al., 1994; Fröhlich et al., 2001; Kasuya et al., 1999; Lee and Childers, 1991; Ljungqvist and Fujisaki, 1985; McKenna and Isard, 1999; Strik et al., 1992; Talkin and Rowley, 1990; Wong et al., 1979). These techniques have provided some useful information on source behaviour (e.g., Alku and Vilkman, 1996; Cummings and Clements, 1995; Laukkanen et al., 1996, 1997; Olivera, 1997; Palmer and House, 1992; Strik and Boves, 1992). However, automatic techniques tend to perform least well when there is no true closed phase to the glottal cycle and where automatic estimation of formant peaks is least reliable, as is the case for many non-modal voice qualities. A further problem concerns how to effectively measure parameters from the glottal signal. There is no single set of clearly defined source parameters that has been generally adopted, which makes comparisons difficult. Furthermore, estimating values for salient parameters from the inverse filtered signal typically involves some level of compromise, as critical timing and amplitude events of the glottal pulses are not always clear-cut. How to get optimal measures from the inverse filtered signal is therefore often not self-evident. In some techniques source and filter parameters are estimated simultaneously (e.g., Fröhlich et al., 2001; Kasuya et al., 1999; Ljungqvist and Fujisaki, 1985), but often the parameters are measured from the estimated source signal. This can be done directly from the waveform, thus using only time domain information, but more common is perhaps the technique of adjusting a parametric source model in order to capture the characteristics of the glottal pulses obtained from the inverse filtering.


The model matching technique has the advantage of allowing for both time and frequency domain optimisation of the parameters, as well as providing suitable data for synthesis. However, parameterising data in this way will to some extent depend on the model used. Numerous source models have been proposed in the literature (e.g., Ananthapadmanabha, 1984; Fant, 1979a,b, 1982; Fant et al., 1985; Fujisaki and Ljungqvist, 1986; Hedelin, 1984; Klatt and Klatt, 1990; Price, 1989; Qi and Bi, 1994; Rosenberg, 1971; Rothenberg et al., 1975; Schoentgen, 1993; Veldhuis, 1998). However, the four-parameter LF model of differentiated glottal flow (Fant et al., 1985) seems to be emerging as the main model employed in analytic studies. This model also benefits from being incorporated within available synthesisers, such as the KLSYN88 (Klatt and Klatt, 1990). It is clearly an advantage if the same source model can be used in both analysis and synthesis. In the present study, this is the model used, and it is also the model we have hitherto used in our analyses of voice source variation. Several automatic procedures for model matching exist. Some of them optimise the fit in the time domain (e.g., Jansen, 1990; Strik et al., 1993; Strik and Boves, 1994) and others employ frequency domain optimisation (e.g., Olivera, 1993). Some of the techniques have been evaluated on synthesised speech, where they seem to perform reasonably well. Nevertheless, obtaining robust and fully reliable source estimates from natural speech still seems to be a problem (Fröhlich et al., 2001). As with the automatic inverse filtering techniques, the problems are likely to be worse again when dealing with non-modal voice qualities, particularly those with glottal pulse shapes substantially different from what can be generated by the source model. Given the potential for producing large amounts of data, the problems of robustness may, at least in part, be an explanation for the surprisingly small body of source data on different voice qualities reported in the literature using these automatic techniques.

Interactive manual techniques for inverse filtering and parameterisation offer a way of overcoming the problem of robustness, but have their own limitations (Carlson et al., 1991; Hunt et al., 1978; Ni Chasaide et al., 1992; Gobl and Ni Chasaide, 1999b). Given that subjective judgements are involved, considerable expertise and knowledge are required on the part of the experimenter if results are not to be spurious. Across highly experienced experimenters, it seems that a high degree of consistency can be achieved (Scully, 1994). Similar findings have also been reported by Hunt (1987). The main limitation of this technique, however, is that it is extremely time-consuming, and it is thus only suitable for the analysis of limited amounts of data. Notwithstanding, micro-studies involving such manual techniques have afforded useful insights into inter- and intra-speaker voice source variation (e.g., Fant, 1995; Gobl, 1988, 1989; Gobl et al., 1995; Gobl and Ni Chasaide, 1992; Hertegård and Gauffin, 1991; Kane and Ni Chasaide, 1992; Karlsson, 1990, 1992; Karlsson and Liljencrants, 1996; Ni Chasaide and Gobl, 1993; Pierrehumbert, 1989; Scully et al., 1995). Indirect techniques such as electroglottography (EGG) have also been used by, e.g., Johnstone and Scherer (1999) and Laukkanen et al. (1996) and can offer many useful insights. But insofar as the technique registers contact across the vocal folds, data are difficult to interpret when the vocal folds do not meet or have reduced contact during the 'closed' phase (see Laukkanen et al., 1996, for a discussion of the difficulties with EGG in analysing source parameters, and for a comparison with inverse filtering). Measures from the speech output spectrum can provide useful insights into aspects of voice quality. For instance, the comparison of the amplitude levels of H1 and F1, or of H1 and H2, has frequently been used in the phonetics and linguistics literature to make inferences about source behaviour. Johnstone and Scherer (1999) have used these types of measures specifically for the analysis of voice quality and emotion. Note, however, that the levels of the output spectrum reflect filter as well as source characteristics, and thus such measures are potentially problematic (for further discussion on this, see Ni Chasaide and Gobl, 1997).

The relative balance of higher versus lower frequencies measured in the long-term average spectrum can also be useful, particularly for differentiating voice quality variation in the tense-lax dimension (see the observations of Scherer, 1986, also discussed above). Although these measures are in themselves useful, they provide only a gross indication of what is a multifaceted phenomenon. Furthermore, with regard to the synthesis of voice quality variation, they are not likely to be readily incorporated into current synthesis systems.
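To make the output-spectrum measures just mentioned concrete, the following sketch (not from the paper) estimates the H1-H2 level difference, a commonly used gross correlate of the open quotient and of breathiness, from a sustained vowel. The filename, pitch-search range and bandwidths are illustrative assumptions, and the crude autocorrelation f0 estimate is only a stand-in for a proper pitch tracker.

```python
# Rough sketch of an output-spectrum voice-quality measure: the level
# difference between the first and second harmonics (H1-H2) of a sustained
# vowel. "vowel.wav" (mono) and all settings are illustrative assumptions.
import numpy as np
from scipy.io import wavfile
from scipy.signal import get_window

sr, x = wavfile.read("vowel.wav")
x = x.astype(float)
x = x / (np.max(np.abs(x)) + 1e-12)

# Crude f0 estimate from a short frame in the middle of the file,
# via the autocorrelation peak within a plausible pitch range (60-400 Hz)
frame = x[len(x) // 2: len(x) // 2 + 2048]
ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
lag_min, lag_max = int(sr / 400), int(sr / 60)
f0 = sr / (lag_min + np.argmax(ac[lag_min:lag_max]))

# Long-term spectrum of the windowed signal, in dB
n_fft = 1 << int(np.ceil(np.log2(len(x))))
spec = np.abs(np.fft.rfft(x * get_window("hann", len(x)), n_fft))
freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
spec_db = 20.0 * np.log10(spec + 1e-12)

def harmonic_level(h):
    """Peak level (dB) in a narrow band around the h-th harmonic."""
    band = np.abs(freqs - h * f0) < 0.2 * f0
    return spec_db[band].max()

h1_h2 = harmonic_level(1) - harmonic_level(2)
print(f"f0 ~ {f0:.1f} Hz, H1-H2 ~ {h1_h2:.1f} dB")
```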

3. Experimental procedure

As mentioned in Section 1, the purpose of the experiment was to explore the role of voice quality in the communication of emotions, moods and attitudes, by testing listeners' reactions to an utterance synthesised with different voice qualities. The basic procedure involved the recording of a natural utterance, which was analysed and parameterised in order to facilitate the resynthesis of an utterance with modal voice quality. The parameter settings for this synthetic stimulus were then modified to generate the six non-modal voice quality stimuli. The seven stimuli were then used in a set of perception tests to elicit listeners' responses to the affective content of the stimuli.

3.1. Voice qualities

In this pilot experiment on the perceived affective correlates of a selection of stimuli synthesised with different voice qualities, we tried as far as possible to capture the characteristics of particular targeted voice qualities. These included five qualities for which earlier analyses had been carried out, namely modal (neutral) voice, tense voice, breathy voice, whispery voice and creaky voice, and two additional qualities, harsh voice and lax-creaky voice. The physiological correlates of voice quality are described by Laver (1980) in terms of three parameters of muscular tension: adductive tension (the action of the interarytenoid muscles adducting the arytenoids), medial compression (the adductive force on the vocal processes adducting the ligamental glottis) and longitudinal tension (the tension of the vocal folds themselves).


In Laver's system, modal voice is characterised as having overall moderate laryngeal tension. Vocal fold vibration is efficient, and the ligamental and the cartilaginous parts of the glottis vibrate as a single unit. Tense voice is described as having a higher degree of tension in the entire vocal tract as compared to a neutral setting. At the laryngeal level, adductive tension and medial compression are thought to be particularly implicated. Breathy voice involves minimal laryngeal tension. Vocal fold vibration is inefficient and the folds do not come fully together, resulting in audible frication noise. Whispery voice is characterised by low tension in the interarytenoid muscles, but a fairly high medial compression, resulting in a triangular opening of the cartilaginous glottis. Laryngeal vibration is very inefficient and is accompanied by a high degree of audible frication noise. Harsh voice involves very high tension settings. To this extent it is essentially a variety of tense voice, but may have more extreme settings. A defining characteristic is that harsh voice tends to have additional aperiodicity due to the very high glottal tension. In the present experiment, as we were interested in focusing on the specific role of the aperiodic component, we manipulated only this parameter and retained the remaining source parameter settings of tense voice. Creaky voice is described as having high medial compression and adductive tension, but low longitudinal tension. Because of the high adductive tension, only the ligamental part of the glottis is vibrating. The quality which is here termed 'lax-creaky' voice is not included in the system presented by Laver (1980), where creaky voice is described as having rather high glottal tension (medial compression and adductive tension). In our descriptive work referred to earlier, it was indeed found that creaky voice has source parameter values tending towards the tense. It was also our auditory impression that creaky voice, as produced by the informant in question, did have a rather tense quality. Yet we are aware that creaky voice can often sound quite lax in auditory impressionistic terms. It is for this reason that a lax-creaky quality was included, which is essentially based on breathy voice source settings but with reduced aspiration noise and with added creakiness.


Although this lax-creaky voice quality to some extent runs counter to the general thrust of Laver's description of creaky voice, it is worth noting that some of the sources he cites imply a rather lax glottal setting (e.g., Monsen and Engebretson, 1977). Clearly, more descriptive work on creaky voice is required at both the physiological and acoustic levels.

3.2. Speech material

The starting point for generating the synthetic voice quality stimuli was a high quality recording of a Swedish utterance, "ja adjö", where f0 peaks were located on the two stressed vowels. This utterance should be semantically neutral to our subjects, native speakers of Irish English who do not speak Swedish. The male speaker's voice was judged by the authors to be in reasonable conformity with modal voice as described by Laver (1980). The recording was carried out in an anechoic chamber, using a Brüel & Kjær condenser microphone at a distance of approximately 30 cm from the speaker. The utterance was recorded on a SONY F1 digital tape recorder, and no filters were employed, so as to avoid introducing phase distortion. The recording was subsequently transferred to computer, digitised at 16 kHz sampling frequency and 16 bit sample resolution. At this point, the recording was high-pass filtered in order to remove any DC offset of the zero-pressure line, due to the inevitable intrusion of some inaudible low frequency pressure fluctuations into the anechoic chamber. The filter used was a third order digital Butterworth filter with a cutoff frequency of 20 Hz, and to ensure phase linearity, the speech signal was passed through this filter twice, the second pass being time-reversed (i.e. starting with the last sample and finishing with the first).
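The forward-and-backward filtering described here is what scipy's zero-phase filtering routines implement. The sketch below (an illustration, not the authors' code; the input filename is an assumption) reproduces the 20 Hz third-order Butterworth high-pass step on a 16 kHz recording.

```python
# Zero-phase high-pass filtering of a 16 kHz recording, mirroring the
# procedure in the text: a 3rd-order Butterworth at 20 Hz applied forwards
# and then backwards so that the phase distortion of the two passes cancels.
# "recording.wav" is an assumed placeholder filename.
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

sr, x = wavfile.read("recording.wav")      # expects sr == 16000
x = x.astype(float)

# 3rd-order high-pass Butterworth, 20 Hz cutoff, in numerically stable SOS form
sos = butter(3, 20.0, btype="highpass", fs=sr, output="sos")

# sosfiltfilt runs the filter forwards and then backwards (zero-phase),
# equivalent to the time-reversed second pass described in the text
y = sosfiltfilt(sos, x)
```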

3.3. Analysis

The analysis technique involved source-filter decomposition and source model matching using the software system described in Ni Chasaide et al. (1992). This system incorporates automatic or semi-automatic inverse filtering based on closed-phase covariance LPC. Further, optional, manual interactive analysis can subsequently be carried out if deemed necessary. As the amount of data here was limited to one short utterance, all 106 pulses of the utterance were inverse filtered using the interactive technique. For this speaker there were 9 formants present in the output, within the 8 kHz frequency range determined by the sampling rate. Thus 9 antiresonances were used in the inverse filter to cancel the filtering effect of the vocal tract. The output of the inverse filter yields an estimate of the differentiated glottal flow. From this signal, data on salient source parameters were obtained by matching a parametric voice source model to the differentiated glottal flow signal. As mentioned in Section 2, the model we use is the four-parameter LF model of differentiated glottal flow (Fant et al., 1985). For similar reasons as for the inverse filtering, the fitting of the LF model to the 106 glottal pulses was done manually, using an interactive technique which facilitates parameter optimisation in terms of both time and frequency domain aspects of the glottal pulse. As the objective here was to generate good copy synthesis of the utterance, the disadvantages of the manual technique were of minor importance. On the basis of the modelled waveform, the principal parameters measured were EE, RA, RG and RK, which are briefly glossed here (for a fuller description see, e.g., Fant and Lin, 1991; Ni Chasaide and Gobl, 1997). EE is the excitation strength, measured as the amplitude of the differentiated glottal flow at the main discontinuity of the pulse. The RA value is a measure that corresponds to the amount of residual airflow after the main excitation, prior to maximum glottal closure. RG is a measure of the 'glottal frequency', as determined by the opening branch of the glottal pulse, normalised to the fundamental frequency. RK is a measure of glottal pulse skew, defined by the relative durations of the opening and closing branches of the glottal pulse.
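As a reading aid (not part of the paper), the sketch below converts the dimensionless R-parameters into the timing landmarks of a single glottal pulse, using the standard definitions RG = T0/(2*Tp), RK = (Te - Tp)/Tp and RA = Ta/T0. The numerical values in the example are invented for illustration.

```python
# Illustrative conversion of LF R-parameters (RA, RG, RK) into the timing
# landmarks of one glottal pulse, following the standard definitions
#   RG = T0 / (2 * Tp),  RK = (Te - Tp) / Tp,  RA = Ta / T0.
# The parameter values below are made up for the example.

def lf_timing(f0_hz: float, ra: float, rg: float, rk: float) -> dict:
    t0 = 1.0 / f0_hz       # fundamental period (s)
    tp = t0 / (2.0 * rg)   # time of peak glottal flow
    te = tp * (1.0 + rk)   # time of the main excitation (maximum negative flow derivative)
    ta = ra * t0           # effective duration of the return phase
    oq = te / t0           # open quotient implied by RG and RK
    return {"T0": t0, "Tp": tp, "Te": te, "Ta": ta, "OQ": oq}

# Example: a modal-like male pulse (values purely illustrative)
print(lf_timing(f0_hz=110.0, ra=0.05, rg=1.0, rk=0.35))
```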

3.4. Synthesis of the modal voice stimulus

The KLSYN88a synthesiser (Sensimetrics Corporation, Boston, MA; see also Klatt and Klatt, 1990) was chosen for the generation of the voice quality stimuli. This is a well-established formant synthesiser which allows for direct control of both source and filter parameters, and it has been shown to have the capability of producing high quality copy synthesis (Klatt and Klatt, 1990). As mentioned earlier, it also incorporates the LF voice source model (as an option), albeit in a somewhat modified implementation. To generate the modal stimulus, copy synthesis of the natural utterance was carried out using the data from the analysis. In the synthesiser, the modified LF model was selected for the voice source generation. In order to carry out the synthesis, the LF parameters of the analyses were transformed into the corresponding source parameters of KLSYN88a. It should be noted that care has to be taken when transforming parameters derived from the LF model (in this case EE, RA, RG and RK) into the corresponding parameters for the modified LF model of KLSYN88a: AV (amplitude of voicing, derived from EE), TL (spectral tilt, derived from RA and f0), OQ (open quotient, derived from RG and RK), SQ (speed quotient, derived from RK). See Mahshie and Gobl (1999) for details on the differences between the LF model and the version of the model in KLSYN88a. As there was no practical way of entering the data for all 106 pulses into the synthesiser, the input data were reduced by selecting values at specific timepoints for each parameter (the number of values ranging between 7 and 15, depending on the parameter). The timepoints were chosen so that the linear interpolation generated by the synthesiser between selected points would capture the natural dynamics as closely as possible. The stylisation is somewhat similar to that carried out by Carlson et al. (1991), who used the GLOVE synthesiser for the copy synthesis of a female utterance. However, they did not extract the data from a pulse-by-pulse analysis, but rather used data from a small number of analysed pulses, selected on the basis of the segmental structure.
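The sketch below illustrates one common way of deriving KLSYN88-style quotients from the LF R-parameters. The OQ and SQ relations follow directly from the timing definitions in the previous sketch, while the AV and TL mappings shown here are rough approximations of ours (the dB reference and the exact tilt formula differ between implementations) and are not necessarily the transform used by the authors.

```python
# Illustrative LF -> KLSYN88-style parameter mapping (an approximation, not
# necessarily the transform used in the paper).
#   OQ = Te/T0 = (1 + RK) / (2 * RG)    open quotient
#   SQ = Tp/(Te - Tp) = 1 / RK          speed quotient (pulse skew)
#   Fa = 1 / (2 * pi * Ta)              return-phase corner frequency
# TL (extra spectral tilt in dB) is approximated here as the attenuation of a
# first-order low-pass with cutoff Fa evaluated at 3 kHz -- an assumption.
import math

def lf_to_klsyn_like(f0_hz, ee, ra, rg, rk, ee_ref=1.0):
    t0 = 1.0 / f0_hz
    oq = (1.0 + rk) / (2.0 * rg)
    sq = 1.0 / rk
    fa = 1.0 / (2.0 * math.pi * ra * t0)
    tl_db = 10.0 * math.log10(1.0 + (3000.0 / fa) ** 2)
    av_db = 20.0 * math.log10(ee / ee_ref)   # excitation level re. an arbitrary reference
    return {"AV_dB": av_db, "OQ": oq, "SQ": sq, "TL_dB": tl_db, "Fa_Hz": fa}

print(lf_to_klsyn_like(f0_hz=110.0, ee=1.0, ra=0.05, rg=1.0, rk=0.35))
```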


Initial attempts to synthesise at a sampling rate of 16 kHz were unsuccessful, due to unpredictable behaviour of the synthesiser. Thus, the synthesiser's default sampling rate of 10 kHz was opted for, which seemed to ensure a reliable output. The default setting of 5 ms for the update interval of parameter values was also used. In the natural utterance, there were 6 formants present in the output spectrum below 5 kHz, and thus 6 formant resonators were used in the synthesis. 14 synthesis parameters were varied dynamically. The vocal tract parameters varied included the first five formant frequencies (F1, F2, F3, F4, F5) and the first and second formant bandwidths (B1, B2). Seven source parameters were varied: fundamental frequency, AV, TL, OQ, SQ, AH (aspiration noise) and DI ('diplophonia', used for the generation of creakiness). The AH parameter controls the level of the aspiration noise source. This aperiodic source is produced by a pseudo-random number generator, with an even amplitude distribution within the range of 16 bit amplitude representation. The amplitude spectrum (when combined with the filter modelling the radiation characteristics at the lips) is essentially flat above 1 kHz. Below 1 kHz the amplitudes gradually drop off, so that the level is approximately 12 dB lower at 100 Hz relative to the level above 1 kHz. When AV is nonzero (i.e. when there is voicing and aspiration simultaneously) the amplitude of the aspiration noise is modulated: for the second half of the period from one glottal opening to the next, the amplitudes of all noise samples are reduced by 50%. This modulation is always the same regardless of the particular glottal pulse shape, but the result is generally that stronger aspiration is produced in the open portion of the glottal cycle relative to the closed portion (Klatt, 1980; Klatt, unpublished chapter; Klatt and Klatt, 1990). The DI parameter alters every second pulse by shifting the pulse towards the preceding pulse and at the same time reducing its amplitude. The shift, as well as the amount of amplitude reduction, is determined by the DI value. Thus, the fundamental period with respect to the preceding pulse is reduced, which results in an equivalent increase in the fundamental period with respect to the following pulse (Klatt and Klatt, 1990).


The resulting synthesis of the natural utterance is a very close replica of the original, but it is of course not indistinguishable from it, given the data reduction procedure that was carried out. More importantly, however, the voice quality of the original was retained, and thus this synthesised utterance was used as our modal voice stimulus.

3.5. Synthesis of non-modal stimuli

On the basis of the modal voice stimulus, six further stimuli were generated with non-modal voice qualities by manipulating eight parameters: the seven source parameters mentioned above and the first formant bandwidth, B1. The transforms from modal to a non-modal quality were typically not constant for any given parameter, but allowed for dynamic variation partly prompted by earlier analytic studies, e.g., allowing for differences that relate to stress variation and voice onset/offset effects (Gobl, 1988; Ni Chasaide and Gobl, 1993). Parameter values for the different voice qualities were guided by prior analytic studies (e.g., Gobl, 1989; Gobl and Ni Chasaide, 1992; Ni Chasaide and Gobl, 1995). However, as the auditory quality was the main goal here, settings were ultimately determined by auditory judgement of the effect. This was particularly the case for the settings of the parameters AH and DI, for which quantitative data were not available. Fundamental frequency was varied only to the extent deemed required as part of the intrinsic, voice-quality-determined characteristics. The main changes made to the control parameters for the different stimuli are summarised below, and full details on the parameter dynamics can be found in Fig. 1. Compared to modal voice, tense voice involved lower OQ, higher SQ, lower TL, narrower B1 and slightly higher f0 values (5 Hz). Breathy voice, again relative to modal, involved lower AV, higher OQ, lower SQ, higher TL and wider B1 settings. The level of AH was set on the basis of auditory judgement.

Creaky voice was based on modal voice, with a basic f0 lowering of 30 Hz, but for the first f0 peak this lowering was gradually reduced to 20 Hz. The baseline value for the DI parameter was set to 25%, changing gradually to 5% to coincide with the f0 peaks of the stressed vowels. The lax-creaky voice quality involved modifications to the source settings for the breathy voice stimulus. As mentioned above, this quality departs from the definitions presented in Laver (1980). However, to maintain some link with the physiological adjustments he proposes for creaky voice, the source settings for lax-creaky voice were modified from the breathy voice ones by changing the OQ values to those of creaky voice. Further changes involved lowering f0 by 30 Hz and reducing AH by 20 dB. The baseline value for the DI parameter was set to 25%, changing gradually to 15% to coincide with the f0 peaks of the stressed vowels. The resulting stimulus was judged auditorily by the authors to be a realistic reproduction of the type of lax-creaky voice discussed above. To synthesise harsh voice, the same basic source settings as for tense voice were adopted. Aperiodicity was added by using the DI parameter, although it is not clear whether this form of aperiodicity is optimal for synthesising harsh voice. However, using a baseline value of 10%, gradually changing to 20% to coincide with the f0 peaks of the stressed vowels, seemed to result in a reasonably convincing harsh voice quality. Whispery voice turned out to be the most problematic quality to synthesise. The first attempt was based on breathy voice settings, modified so that AV was relatively lowered, AH increased, OQ slightly lowered and SQ slightly increased. Although these transformations are in keeping with analytic data, they resulted in a very unconvincing voice quality, where the aspiration noise was unnaturally high-pitched, with a "whistling" quality. Widening the higher formant bandwidths only marginally improved the quality. In order to achieve an acceptable whispery voice quality, it was necessary to reduce the number of formants from six to five. By thus reducing the amplitude of the aspiration noise in the higher end of the spectrum, the whistling quality was avoided. The DI parameter was set to 5% throughout.
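Purely as a compact restatement of the manipulations just described (the directions of change and the few quantitative values given in the text; everything else is in Fig. 1), the dictionary below is one way such voice-quality "recipes" could be organised. It is an illustrative summary, not the authors' actual KLSYN88a parameter files.

```python
# Directions of change relative to the modal stimulus, as described in
# Section 3.5; "by ear" marks settings the authors chose auditorily.
# Illustrative summary only, not the actual synthesiser input data.
VOICE_QUALITY_RECIPES = {
    "tense": {"OQ": "lower", "SQ": "higher", "TL": "lower",
              "B1": "narrower", "f0": "+5 Hz"},
    "breathy": {"AV": "lower", "OQ": "higher", "SQ": "lower",
                "TL": "higher", "B1": "wider", "AH": "by ear"},
    "creaky": {"base": "modal", "f0": "-30 Hz (-20 Hz at first f0 peak)",
               "DI": "25% baseline, 5% at stressed-vowel f0 peaks"},
    "lax_creaky": {"base": "breathy", "OQ": "as creaky", "f0": "-30 Hz",
                   "AH": "-20 dB", "DI": "25% baseline, 15% at f0 peaks"},
    "harsh": {"base": "tense",
              "DI": "10% baseline, 20% at stressed-vowel f0 peaks"},
    "whispery": {"base": "breathy", "AV": "lower", "AH": "higher",
                 "OQ": "slightly lower", "SQ": "slightly higher",
                 "formants": "reduced from six to five", "DI": "5%"},
}
```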


Fig. 1. Parameter variation for the synthetic stimuli. Note that for the modal, tense, harsh and creaky stimuli, there was no aspiration noise (AH).

3.6. Perception test

The perception experiment consisted of 8 short sub-tests. For each sub-test, 10 randomisations of the seven stimuli (modal, tense, breathy, whispery, creaky, harsh and lax-creaky voice) were presented. The interval between each set of stimuli was 7 s, and the onset of each group was signalled by a specific earcon.

Within each set of stimuli, the interstimulus interval was 4 s, and a short tone was presented 1 s before each stimulus to ensure that the listener was in a state of readiness. For each individual sub-test, responses were elicited only for one particular pair of opposite affective attributes (such as bored/interested), in a way that was loosely modelled on Uldall (1964). Response sheets were arranged with the opposite terms placed on either side, with seven boxes in between, the central one of which was shaded in for visual prominence.


Listeners were instructed that they would hear a speaker repeat the same utterance in different ways, and they were asked to judge for each repetition whether the speaker sounded more bored or interested, etc. In cases where an utterance was not considered to be marked for either of the pair of attributes, they were instructed to choose the centre box. Ticking a box to the left or right of the central box indicated the presence of a particular attribute and the strength with which it was deemed present, with the most extreme ratings being furthest from the centre box. The full set of attribute pairs tested comprised relaxed/stressed, content/angry, friendly/hostile, sad/happy, bored/interested, intimate/formal, timid/confident and afraid/unafraid. The test was administered to 12 subjects, 6 male and 6 female. All were speakers of Southern Irish English living in Dublin, and their ages ranged from the early 20s to the late 40s. Most of the subjects were university staff or students, and the remainder were professional people. Whereas a few subjects had a knowledge of phonetics, none had previously been involved in a perception experiment involving voice quality. The test was presented in a soundproofed studio, over high-quality studio loudspeakers set at a comfortable listening level. A short break was given between each sub-test.
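One natural way to code such seven-box responses for analysis is as signed integers around the neutral centre box; this coding is an assumption on our part, since the paper itself only reports perceived strengths on a 0-3 scale per attribute (as in Figs. 2-4).

```python
# Hypothetical coding of the seven-box response sheet: box 4 (the shaded
# centre) maps to 0, boxes towards the left-hand attribute to -1..-3 and
# boxes towards the right-hand attribute to +1..+3. This coding is our
# assumption, not a scheme described in the paper.
def code_response(box: int) -> int:
    if not 1 <= box <= 7:
        raise ValueError("box must be between 1 and 7")
    return box - 4

# Example: for the pair bored/interested, box 6 -> +2 (moderately interested)
print(code_response(6))
```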

4. Results

A 2-way ANOVA was carried out on the listeners' scores for each of the 8 sub-tests, with voice quality and subject as the factors. The results show that the voice quality and subject variables were statistically highly significant and that there was a voice quality/subject interaction. For the majority of attribute pairs tested, the differences between the individual voice qualities were statistically significant, and the significance levels for each pairwise comparison in each sub-test are shown in Table 1. The multiple comparison technique used was Tukey's Honestly Significant Difference; this was implemented in MINITAB (Minitab, 2001).
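For readers who want to reproduce this kind of analysis, a minimal sketch of a two-way ANOVA (voice quality x subject) followed by Tukey's HSD is shown below, using Python's statsmodels rather than MINITAB; the data file and its column names are assumed placeholders for one sub-test.

```python
# Minimal sketch of the statistical analysis described above, done with
# statsmodels rather than MINITAB. "ratings.csv" and its columns
# (rating, quality, subject) are assumed placeholders for one sub-test.
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("ratings.csv")   # one row per presentation

# Two-way ANOVA with voice quality, subject and their interaction
model = ols("rating ~ C(quality) * C(subject)", data=df).fit()
print(anova_lm(model, typ=2))

# Tukey's Honestly Significant Difference for pairwise quality comparisons
print(pairwise_tukeyhsd(df["rating"], df["quality"]))
```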

The overall mean ratings obtained for the different affective attributes with each of the stimulus types are shown in Fig. 2, along with median values. To provide an indication of the cross-subject variability, the interquartile range of subjects' means and the extreme values are also plotted. To make for easier broad comparisons across voice qualities and across affective attributes, the mean scores only are shown in Fig. 3. In both Figs. 2 and 3 the distance from 0 (no affective content) indicates the strength with which any attribute was perceived. The use of positive and negative values on the y-axis of the figure is not in itself important: results have simply been arranged in these figures so that the positive (or negative) sign groups together somewhat related attributes. Although there is no necessary connection between individual affective attributes, rating values across attributes are joined by lines in Fig. 3 for each of the voice qualities, to make it easier to relate individual voice qualities to their affective correlates. A subset of this information is shown in a slightly different format in Fig. 4, where the maximum strength (the highest mean score) with which each of the attributes was detected across all voice qualities is shown as deviations from 0 (= no perceived affect) to 3 (= maximally perceived). The estimated standard error of the mean is also shown. Clearly, not all affective attributes are equally well signalled by these stimuli. From Fig. 4 we see that the most readily perceived ones are relaxed and stressed, and high ratings are found for angry, bored, intimate, content, formal and confident. The least readily perceived are unafraid, afraid, friendly, happy and sad. By and large, those affective attributes which got high scores in this test are more aptly described as states, moods or attitudes (the exception being angry), whereas those least well detected tend to be emotions. As can be seen in Figs. 2 and 3, the individual stimuli are not associated with a single affective attribute: rather, they are associated with a constellation of attributes. Thus, tense/harsh voice gets high ratings not only for stressed, but also for angry, formal, confident and hostile. The broad picture to emerge is of two groups of voice qualities, which signal polar opposite clusters of attributes (see Fig. 3).


Table 1. Significance levels of the differences in ratings between each pair of stimuli, shown for each of the eight sub-tests (rows: the modal, tense, breathy, whispery, harsh and creaky stimuli, each crossed with the attribute pairs relaxed-stressed, content-angry, friendly-hostile, sad-happy, bored-interested, intimate-formal, timid-confident and afraid-unafraid; columns: the tense, breathy, whispery, harsh, creaky and lax-creaky stimuli; significant pairwise differences are marked at p < 0.05, p < 0.01 and p < 0.001).

The stimuli for tense/harsh voice are associated with the cluster of features just mentioned, which we might broadly characterise as involving high activation/arousal and/or high control. On the other hand, the stimuli for breathy voice, whispery voice, creaky voice and lax-creaky voice are by and large associated with the opposite, low activation characteristics, shown with negative values in Fig. 3. The modal stimulus, used as the starting point for the other qualities, does not turn out to be fully neutral: as can be observed in Figs. 2 and 3, responses veer somewhat in the direction of tense voice for a number of attributes, namely confident, formal and stressed, although not to any great degree. Distinct responses were not obtained for all synthesised qualities. Results for the tense and harsh stimuli are very similar, with the tense stimulus eliciting in all cases slightly more extreme ratings. The difference is very small, and not significant for any of the attribute rating sub-tests (see Table 1). Furthermore, what difference there is runs counter to initial expectations, which were that the addition of aperiodicity to tense voice should heighten the effects of tense voice rather than attenuate them. Caution is needed, however, in interpreting this result for harsh voice, as it may be more a reflection on the synthetic stimulus than a reliable indication of how listeners judge harsh voice per se (see further discussion on this below). The breathy and whispery stimuli also yield very similar response patterns, and the difference between them is only significant for the attributes afraid and timid (Table 1), where whispery voice achieves stronger ratings (Fig. 3). In the case of whispery voice, results also need to be interpreted with some caution for the reasons mentioned earlier, concerning the difficulty of synthesising this quality.

Furthermore, it may be that whispery voice needs to be more distinctly different from breathy voice than was achieved in the present stimulus. Ratings for the creaky voice stimulus tend on the whole to be close to those of the breathy and whispery stimuli, although the differences are generally significant (see Table 1). The most striking divergence is found for the attributes afraid and timid (Fig. 3). Responses to lax-creaky voice follow the same trends as creaky voice, but are more extreme: as can be observed in Fig. 3, the trend of responses is very similar but is shifted towards the non-aroused, low activation end of the scale. The differences between responses for the creaky and lax-creaky stimuli are highly significant (Table 1) for all attributes except afraid-unafraid, where neither yields a strong response. Broadly speaking, it would appear that the addition of more lax settings to the creaky voice stimulus results in a considerable enhancement of its intrinsic colouring. It is rather striking in this experiment that the highest ratings for most of the affective attributes tested were obtained by just two of the range of stimuli presented. The tense stimulus accounted for the highest ratings for attributes with high arousal/activation and high power/control, whereas the lax-creaky stimulus obtained generally the highest ratings for attributes with low arousal/activation. A third stimulus, whispery voice, produced the highest ratings for the attributes timid and afraid, but note that responses for afraid in particular are not very high, and show considerable cross-subject variability. It furthermore appears to be the case that as one moves from the high activation to the low activation group of stimuli, there is an increase in cross-subject variability (Fig. 2).


Fig. 2. Subjects' mean responses for each voice quality stimulus, in each of the eight sub-tests, showing interquartile range (box); mean (filled circle); median (horizontal line in box) and extreme values (whiskers).


Fig. 3. Mean ratings for 12 listeners of the perceived strength of pairs of attributes for seven voice qualities. 0 = no affective content and 3 = maximally perceived.

Fig. 4. Maximum mean ratings for 12 listeners of the perceived strength of each affective attribute, shown by the bars as deviations from 0 (no affective content) to 3 (maximally perceived). The lines through the bars indicate ± the estimated standard error of the mean.

5. Discussion

The results demonstrate that voice quality changes alone can evoke differences in speaker affect. They also show that, unlike the one-to-one mapping often implied by traditional impressionistic observations, a specific voice quality is multicoloured in terms of affect, being associated with a cluster of mostly, though not necessarily, related attributes.

It has been suggested (see, for example, Laukkanen et al., 1996, 1997) that voice quality may serve more to communicate the valence of an emotion than its activation, which would depend rather on pitch, loudness and duration. In the case of the qualities represented by the present stimuli, the differentiation appears to be not in terms of valence but rather of activation and, to a lesser extent, power. The attributes associated with the tense/harsh stimuli have high activation and/or high power, but include affects with positive (confident, interested, happy) and negative (angry, stressed) valence. The other, non-modal group of stimuli (the breathy, whispery, creaky and especially the lax-creaky voiced stimuli) are by and large associated with attributes which have low activation but both positive (relaxed, content, intimate, friendly) and negative (sad, bored) valence. As a preface to the following discussion we would stress certain limitations of this study. Firstly, the reader should bear in mind that the results tell us about voice quality in the human communication of affect only insofar as the synthesised stimuli are good approximations of the intended voice qualities. Secondly, voice qualities vary in a continuous, not a categorical, fashion: there can be differing degrees of, say, breathy voice or tense voice, and by choosing single points on these continua we are only exploring to a limited extent what the role of a particular quality such as tense voice may be in affect signalling. A question for future research will be to look at how more gradient changes in source parameters relate to the associated affect. For example, if the parameters associated with tense voice are varied in a more continuous fashion, will this yield correspondingly different degrees of anger? Alternatively, it is not inconceivable that one might find different affective correlates, such as happy and angry, for different parts of the continuum. Finally, we would point out that the qualities investigated here are only a partial sampling of the voice quality types that speakers may use in affect signalling. In all these senses, this study must be viewed as an exploratory exercise. We look now at whether the associations of voice quality and affective states traditionally assumed, or mentioned in the literature, are supported by the results for the range of synthesised stimuli in this study.

Breathy voice has traditionally been thought to have connotations of intimacy (Laver, 1980). The present results suggest that although the breathy stimulus did have some such colouring, the percept was much more effectively signalled by the lax-creaky stimulus. In his review of earlier studies, Scherer (1986) has suggested that lax voice (i.e., breathy voice at the phonatory level) would be associated with sadness. The results of Laukkanen et al. (1996) also point to such an association. Whereas the breathy stimuli did achieve a somewhat sad response in this study, the effect was not very strong, and ratings for this attribute were also considerably higher for the lax-creaky stimulus. Note, however, in Fig. 2 that there is more cross-subject variability in ratings for the latter quality. The large difference in the total range of responses for the lax-creaky case reflects the fact that one of the twelve subjects responded very differently from the others, and perceived this stimulus as moderately happy. Very different suggestions linking breathy voice with anger and happiness are presented in the literature summary by Murray and Arnott (1993, Table 1). These associations are not supported by the present results: in both cases, listeners rated the breathy stimulus as being associated rather with the opposite attributes. To sum up on the results for the breathy voiced stimulus in this experiment: there is some support for past suggestions linking breathy voice with intimacy and sadness, and none for a link with anger or happiness. Even in the case of intimacy and sadness, the response rates obtained here were not particularly high, and not at all as high as for the lax-creaky stimulus. Furthermore, for both these stimuli, response rates for sad or intimate were at about the same levels as for other attributes, such as content and relaxed. In his review of the literature, Scherer (1986) associates tense voice with anger, and also with joy and with fear. Laukkanen et al. (1996) also found an association between anger and source parameter values that would indicate tense voice, and a similar association is also indicated by Burkhardt and Sendlmeier (2000). The association of tense voice with anger is strongly supported in the present results.


As can be seen in Fig. 2, responses are high and show little variability across subjects. The association of tense voice with joy finds some support in that there is a moderate colouring towards happy in responses to the tense stimulus, which is nonetheless significant, as a comparison of the tense versus modal stimuli in Table 1 and Fig. 2 indicates. A comparison of the happy and angry responses for the tense stimulus in Fig. 2 shows not only that mean ratings for the former are lower, but also that they vary more across subjects. Nevertheless, the tense stimulus was the only one of the present set that yields a happy connotation (except for harsh voice, which is not here differentiated from tense). The association of tense voice with fear, as suggested by Scherer (1986), is not supported here, as mean responses for the tense stimulus are close to zero for this attribute. Furthermore, there was very high cross-listener variability in the fear ratings for the tense stimulus: compare, for example, with the responses for the modal stimulus in Fig. 2, where the mean is also close to zero but there is more agreement across subjects. The whispery voice stimulus provides the strongest responses for fear, but note that fear is nevertheless one of the least well signalled attributes in this experiment. Furthermore, as can be observed in Fig. 2, not all listeners necessarily associate whispery voice with fear. Burkhardt and Sendlmeier (2000) report that falsetto voice (not included in this study) is a successful voice quality for portraying fear. One might conjecture that some type of whispery falsetto voice with appropriate aperiodicity would be an effective quality for portrayals of fear. Laver (1980) has suggested that harsh voice (a variety of tense voice) is associated with anger. As mentioned earlier, the fact that listeners did not differentiate between the harsh and tense stimuli in this test probably reflects the similarity between them, the only difference being the addition of aperiodicity (as controlled by the DI parameter in KLSYN88a) in the former. In order to produce a well-differentiated harsh voice, it may be the case that a greater degree of aperiodicity would be required and/or a different type of aperiodicity.


Furthermore, harsh voice may also require more extreme settings of those parameters that reflect glottal tension. Although more extreme tension settings were not adopted in this test, this is something we would hope to include in further tests. In Murray and Arnott's summary, sadness is associated with a resonant voice quality (Murray and Arnott, 1993, Table 1). We would interpret the term resonant as a quality somewhere on the modal-tense continuum. As can be seen in Fig. 2, neither the tense nor the modal stimulus elicited a sad response: as mentioned above, the shift from modal to tense enhanced the happy rather than the sad overtones. To sum up on tense voice: the present study provides strong support for the association of anger with tense voice. The linkage between tense voice and anger is hardly surprising, being intuitively to be expected, and it is probably the most widely suggested association of voice quality and affect one finds in the literature. There is also some support for some degree of association of tense voice with joy, as suggested by Scherer (1986). Other previously suggested associations of tense voice with fear or with sadness are not supported here. Note, however, that in the present study a number of further strong associations with tense voice are suggested. The fact that these have not been previously reported may simply relate to the fact that the overwhelming focus of past studies in this field has been on the 'strong' emotions: anger, joy, fear and sadness. The tense stimulus in this experiment yielded very high ratings for stressed, formal, confident, hostile and interested. And whereas the attributes stressed and hostile are clearly very related to angry, others such as confident, formal and interested appear to be rather different in terms of valence, power and even degree of activation. It tends to be taken as axiomatic that creaky voice signals boredom for speakers of English (see Laver, 1980). In the present experiment, high response rates were achieved by the lax-creaky stimulus, which combines features of creaky and breathy voice. It is worth noting (Figs. 2 and 3) that this stimulus is considerably more potent in signalling boredom than the creaky voice stimulus, which was modelled on Laver's (1980) specification of creaky voice and on our own earlier analyses of creaky voice (e.g., Gobl, 1989; Gobl and Ni Chasaide, 1992).

lyses of creaky voice (e.g., Gobl, 1989; Gobl and Ni Chasaide, 1992). And as pointed out for other qualities, there is not a one-to-one mapping to affective attributes: lax - creaky voice also gets high ratings for relaxed, intimate and content, and moderately high ratings (the highest in this test) for sad and friendly. In the responses for intimate, and particularly for sad and friendly, there would appear to be greater cross-subject variability (fig. 2). Note however, that the very extended total range of values here results from atypical responses of a single subject in each case. In contrast to the rather high ratings for boredom obtained with the lax - creaky stimulus here, Burkhardt and Sendlmeier (2000) report that this association is not clearly indicated, and may even be counter-indicated. Two factors might be responsible for these differences in results. firstly, the stimuli presented to subjects may have been very different, but as there is little detail in that study on source parameter settings for the generation of the different voice qualities, a direct comparison is not possible here. As our results indicate that not all types of creaky voice are necessarily highly rated for boredom, a difference in the stimuli could be highly relevant. A further factor may be cross-language differences. Creaky voice is often mentioned as related to the expression of boredom for speakers of English, and this is not necessarily assumed to be universal. Burkhardt and Sendlmeieros subjects were German, and differences in results could be influenced by this difference in subjects. When assessing the strength of ratings achieved for individual attributes by the present stimuli, it must be borne in mind that, however important, voice quality is only one of a number of vocal features that speakers may exploit to communicate affect. When we find a strong and consistent association of affect with a particular stimulus (e.g., the tense stimulus and angry) in this experiment, we can be fairly confident that this type of quality can alone evoke the affect, even though in real discourse features other than voice quality may further enhance its perception. In cases where we find a moderate association between a stimulus and a particular affect (e.g., the tense stimulus and happy) it is less obvious what this might be telling

us. It could mean that the quality approximated by the tense stimulus is not quite appropriate for the communication of happiness. Or it might indicate that although appropriate, the voice quality is not a dominant cue to this affect, and that some other critical features (such as tempo or specific f0 variations) are lacking without which happiness is not effectively conveyed. It is striking that for the range of voice qualities that were synthesised for this experiment, milder states were better signalled than the strong emotions (the exception being anger). It may well be the case, that to communicate strong emotions one would need, at the very least, to incorporate those large f0 dynamic changes described in the literature on the vocal expression of emotion, e.g., by Scherer (1986) or Mozziconacci (1995, 1998). In the present stimuli, only relatively small f0 differences were included such as were deemed intrinsic correlates of individual voice qualities. A possible hypothesis at this stage is that voice quality and pitch variables may have at least partially different functions in affect signalling, with voice quality playing a critical role in the general communication of milder affective distinctions (general speaker states, moods and attitudes), especially those that have no necessary physiological component, and pitch variables, such as major differences in f0 level and range being more critical in the signalling of strong emotions where physiologically determined glottal changes are more likely. This type of hypothesis would we feel be compatible with arguments and findings of other researchers, e.g., Scherer (1986) who has suggested that whereas large f0 shifts signal gross changes in activation/arousal levels, voice quality variables may be required to differentiate between subtle differences in affect. Support for this viewpoint can be construed from the results of the experiments of Scherer et al. (1984), which are unusual in that the typically studied strong emotions are excluded. Voice quality emerged in that study as the overwhelmingly important variable that correlated with listeneros judgements of affect. The possibility of voice quality and f0 serving different and potentially independent functions in affect signalling have also been raised by Laukkanen et al. (1997), Murray

207

and Arnott (1993), Scherer et al. (1984) and Ladd et al. (1985). An alternative hypothesis that should also be borne in mind is that voice quality differences, but of a much more extreme nature than those simulated in the present study, would be required for the signalling of strong emotions. This would imply that both voice quality and f0 variables function in a similar and essentially gradient fashion in the signalling of strong emotions. This would not necessarily entail that the communication of mild affective states might not rely more heavily on voice quality differentiation. The way in which voice quality variables combine with pitch variables is the focus of some of our current ongoing research. To test the first hypothesis mentioned above, we are looking at whether large pitch excursions, as described by Mozziconacci (1995), with and without voice quality manipulations, would achieve a better signalling of the strong emotions. Some preliminary results are included by Ni Chasaide and Gobl (2002). We also plan to test the extent to which the relatively smaller f0 differences included in the present stimuli (deemed intrinsic to these voice qualities) may have contributed to the perception of these affects. Of course, f0 itself is a source parameter and an intrinsic part of voice quality. The fact that these have to date been studied as separate entities is at least partially a reflection of the methodological constraints that pertain to voice source analysis.

The broader linguistic phonetic literature, dealing with languages which have register (voice quality) and tonal contrasts, highlights two things: firstly, f0 and voice quality can operate in a largely independent way, and secondly, there are broad tendencies for them to covary, so that for a number of register contrasts there are salient pitch correlates, whereas for a number of tonal contrasts there may be striking voice quality correlates (see discussion of this point in Ni Chasaide and Gobl, 1997). Even within modal voice in the mid-pitch range, there are some interactions between f0 and other source parameters. To the extent that these have been studied, results appear to be sometimes contradictory and suggest that they may depend on rather complex factors (see, for instance, Fant, 1997;
Koreman, 1995; Pierrehumbert, 1989; Strik and Boves, 1992; Swerts and Veldhuis, 2001). For the very large differences in pitch level and range, described in the literature on the vocal expression of (strong) emotions, it seems very unlikely that these would occur without major adjustments to voice quality. If this is the case, the absence of the voice quality domain in analyses is a serious deficit, and likely to lead to unsatisfactory results in synthesis. This could provide one explanation as to why perception tests of large f0 excursions to cue emotions can sometimes yield disappointing results (see, for example, Mozziconacci, 1995, 1998).

6. Conclusions

In this study we have focussed on voice quality, which is of course one of a variety of features used in the communication of affect. Results illustrate that differences in voice quality alone can evoke quite different colourings in an otherwise neutral utterance. They further suggest that there is no one-to-one mapping between voice quality and affect: individual qualities appear rather to be associated with a constellation of affective states, sometimes related, sometimes less obviously related. Some previously reported associations between voice quality and affect (e.g., anger and tense voice) are supported by the present results, whereas others are clearly not (e.g., tense voice and fear). In certain cases (e.g., the association of creaky voice with boredom, or breathy voice with sadness or intimacy) refinements would be suggested. For these affects, the lax–creaky stimulus (which combined features of breathy and creaky voice) yielded considerably higher responses. Furthermore, the 'broad palette' approach adopted here, whereby listeners rated a rather wide variety of affective states, rather than the smaller selection of strong emotions more typically included, allowed other strong associations to emerge, such as the formal, confident and interested colouring of tense voice. Results also permit us to see at a glance which of the synthetic utterances presented here were rated as the most friendly, stressed, relaxed, etc.

The voice qualities presented in this experiment were considerably more effective in signalling the relatively milder affective states and generally ineffective in signalling strong emotions (excepting anger). This raises the question as to whether the roles of pitch dynamics and voice quality may be somewhat different in the communication of affect: voice quality may be critical in the differentiation of subtle variations in affective states, whereas large pitch excursions, such as those described in the emotion literature, may be critical to the signalling of strong emotions. The findings are based on synthetic stimuli and tell us how voice quality functions in human speech communication only insofar as the targeted voice qualities were successfully synthesised. Specific difficulties were encountered in the synthesis of whispery voice, and the similarity in responses to the whispery and breathy stimuli suggests that further work would be required at the level of generating a better simulation of the former quality in particular. Similarly, in the case of harsh voice, results also indicate that this stimulus may not have been optimal. For both the whispery and harsh stimuli, caution is required in interpreting results. This highlights the need for further work on both the production and perception correlates of these two qualities in particular, but also, more generally, on all qualities. One other aspect that we would hope to explore concerns how more gradual changes in source parameters along a given voice quality continuum relate to changes in the perception of associated affect(s).

While results demonstrate that voice quality differences alone can impart very different affective overtones to a message, this does not imply that speakers use this feature in isolation. As discussed in Section 1, there are other source features (pitch dynamics), vocal tract features (segmental differences) and temporal features which speakers can and do exploit for such paralinguistic communication. As a step towards understanding how these may combine, we are currently extending the present study to look at how voice quality combines with f0 variables in signalling emotions. It is hoped that these efforts will contribute to the bigger picture, which concerns not only how voice quality combines with the other known vocal correlates of affective speech, but also how the precise meaning of an utterance results from an interaction of these vocal cues with its verbal content.

Acknowledgements We are grateful to Elizabeth Heron of the Department of Statistics, TCD, for assistance with the statistical analysis.

References

Alku, P., 1992. Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering. Speech Communication 11, 109–118.
Alku, P., Vilkman, E., 1994. Estimation of the glottal pulseform based on discrete all-pole modeling. In: Proceedings of the International Conference on Spoken Language Processing, Yokohama, pp. 1619–1622.
Alku, P., Vilkman, E., 1996. A comparison of glottal voice source quantification parameters in breathy, normal and pressed phonation of female and male speakers. Folia Phoniatrica et Logopaedica 48, 240–254.
Alter, K., Rank, E., Kotz, S.A., Pfeifer, E., Besson, M., Friederici, A.D., Matiasek, J., 1999. On the relations of semantic and acoustic properties of emotions. In: Proceedings of the XIVth International Congress of Phonetic Sciences, San Francisco, pp. 2121–2124.
Ananthapadmanabha, T.V., 1984. Acoustic analysis of voice source dynamics. STL-QPSR 2–3, Speech, Music and Hearing, Royal Institute of Technology, Stockholm, pp. 1–24.
Burkhardt, F., Sendlmeier, W.F., 2000. Verification of acoustical correlates of emotional speech using formant-synthesis. In: Cowie, R., Douglas-Cowie, E., Schröder, M. (Eds.), Proceedings of the ISCA Workshop on Speech and Emotion: A Conceptual Framework for Research. Queen's University, Belfast, pp. 151–156.
Cahn, J., 1990a. The generation of affect in synthesized speech. Journal of the American Voice I/O Society 8, 1–19.
Cahn, J., 1990b. Generating expression in synthesized speech. Technical report, MIT Media Laboratory, Boston.
Carlson, R., Granström, B., Karlsson, I., 1991. Experiments with voice modelling in speech synthesis. Speech Communication 10, 481–489.
Carlson, R., Granström, B., Nord, L., 1992. Experiments with emotive speech, acted utterances and synthesized replicas. Speech Communication 2, 347–355.
Chan, D.S.F., Brookes, D.M., 1989. Variability of excitation parameters derived from robust closed phase glottal inverse filtering. In: Proceedings of Eurospeech '89, Paris, paper 33.1.
Childers, D.G., Lee, C.K., 1991. Vocal quality factors: Analysis, synthesis, and perception. Journal of the Acoustical Society of America 90, 2394–2410.
Cranen, B., Boves, L., 1985. Pressure measurements during speech production using semiconductor miniature pressure transducers: impact on models for speech production. Journal of the Acoustical Society of America 77, 1543–1551.
Cummings, K.E., Clements, M.A., 1995. Analysis of the glottal excitation of emotionally styled and stressed speech. Journal of the Acoustical Society of America 98, 88–98.
Ding, W., Kasuya, H., Adachi, S., 1994. Simultaneous estimation of vocal tract and voice source parameters with application to speech synthesis. In: Proceedings of the International Conference on Spoken Language Processing, Yokohama, pp. 159–162.
Fant, G., 1979a. Glottal source and excitation analysis. STL-QPSR 1, Speech, Music and Hearing, Royal Institute of Technology, Stockholm, pp. 85–107.
Fant, G., 1979b. Vocal source analysis – a progress report. STL-QPSR 3–4, Speech, Music and Hearing, Royal Institute of Technology, Stockholm, pp. 31–54.
Fant, G., 1982. The voice source – acoustic modeling. STL-QPSR 4, Speech, Music and Hearing, Royal Institute of Technology, Stockholm, pp. 28–48.
Fant, G., 1995. The LF-model revisited. Transformations and frequency domain analysis. STL-QPSR 2–3, Speech, Music and Hearing, Royal Institute of Technology, Stockholm, pp. 119–156.
Fant, G., 1997. The voice source in connected speech. Speech Communication 22, 125–139.
Fant, G., Lin, Q., 1991. Comments on glottal flow modelling and analysis. In: Gauffin, J., Hammarberg, B. (Eds.), Vocal Fold Physiology: Acoustic, Perceptual, and Physiological Aspects of Voice Mechanisms. Singular Publishing Group, San Diego, pp. 47–56.
Fant, G., Liljencrants, J., Lin, Q., 1985. A four-parameter model of glottal flow. STL-QPSR 4, Speech, Music and Hearing, Royal Institute of Technology, Stockholm, pp. 1–13.
Frick, R.W., 1985. Communicating emotion: the role of prosodic features. Psychological Bulletin 97, 412–429.
Fröhlich, M., Michaelis, D., Strube, H.W., 2001. SIM – simultaneous inverse filtering and matching of a glottal flow model for acoustic speech signals. Journal of the Acoustical Society of America 110, 479–488.
Fujisaki, H., Ljungqvist, M., 1986. Proposal and evaluation of models for the glottal source waveform. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Tokyo, pp. 31.2.1–31.2.4.
Gobl, C., 1988. Voice source dynamics in connected speech. STL-QPSR 1, Speech, Music and Hearing, Royal Institute of Technology, Stockholm, pp. 123–159.
Gobl, C., 1989. A preliminary study of acoustic voice quality correlates. STL-QPSR 4, Speech, Music and Hearing, Royal Institute of Technology, Stockholm, pp. 9–21.
Gobl, C., Ni Chasaide, A., 1992. Acoustic characteristics of voice quality. Speech Communication 11, 481–490.
Gobl, C., Ni Chasaide, A., 1999a. Perceptual correlates of source parameters in breathy voice. In: Proceedings of the XIVth International Congress of Phonetic Sciences, San Francisco, pp. 2437–2440.
Gobl, C., Ni Chasaide, A., 1999b. Techniques for analysing the voice source. In: Hardcastle, W.J., Hewlett, N. (Eds.), Coarticulation: Theory, Data and Techniques. Cambridge University Press, Cambridge, pp. 300–321.
Gobl, C., Monahan, P., Ni Chasaide, A., 1995. Intrinsic voice source characteristics of selected consonants. In: Proceedings of the XIIIth International Congress of Phonetic Sciences, Stockholm, Vol. 1, pp. 74–77.
Hammarberg, B., 1986. Perceptual and acoustic analysis of dysphonia. Studies in Logopedics and Phoniatrics 1, Huddinge University Hospital, Stockholm, Sweden.
Hedelin, P., 1984. A glottal LPC-vocoder. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, San Diego, pp. 1.6.1–1.6.4.
Hertegård, S., Gauffin, J., 1991. Insufficient vocal fold closure as studied by inverse filtering. In: Gauffin, J., Hammarberg, B. (Eds.), Vocal Fold Physiology: Acoustic, Perceptual, and Physiological Aspects of Voice Mechanisms. Singular Publishing Group, San Diego, pp. 243–250.
Hunt, M.J., 1987. Studies of glottal excitation using inverse filtering and an electroglottograph. In: Proceedings of the XIth International Congress of Phonetic Sciences, Tallinn, Vol. 3, pp. 23–26.
Hunt, M.J., Bridle, J.S., Holmes, J.N., 1978. Interactive digital inverse filtering and its relation to linear prediction methods. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Tulsa, OK, pp. 15–18.
Iida, A., Campbell, N., Iga, S., Higuchi, H., Yasumura, M., 2000. A speech synthesis system with emotion for assisting communication. In: Cowie, R., Douglas-Cowie, E., Schröder, M. (Eds.), Proceedings of the ISCA Workshop on Speech and Emotion: A Conceptual Framework for Research. Queen's University, Belfast, pp. 167–172.
Jansen, J., 1990. Automatische extractie van parameters voor het stembron-model van Liljencrants & Fant. Unpublished master's thesis, Nijmegen University.
Johnstone, T., Scherer, K.R., 1999. The effects of emotions on voice quality. In: Proceedings of the XIVth International Congress of Phonetic Sciences, San Francisco, pp. 2029–2032.
Kane, P., Ni Chasaide, A., 1992. A comparison of the dysphonic and normal voice source. Journal of Clinical Speech and Language Studies, Dublin 1, 17–29.
Kappas, A., Hess, U., Scherer, K.R., 1991. Voice and emotion. In: Feldman, R.S., Rime, B. (Eds.), Fundamentals of Nonverbal Behavior. Cambridge University Press, Cambridge, pp. 200–238.
Karlsson, I., 1990. Voice source dynamics for female speakers. In: Proceedings of the International Conference on Spoken Language Processing, Kobe, Japan, pp. 225–231.
Karlsson, I., 1992. Modelling voice source variations in female speech. Speech Communication 11, 1–5.

Karlsson, I., Liljencrants, J., 1996. Diverse voice qualities: models and data. SMH-QPSR 2, Speech, Music and Hearing, Royal Institute of Technology, Stockholm, pp. 143–146.
Kasuya, H., Maekawa, K., Kiritani, S., 1999. Joint estimation of voice source and vocal tract parameters as applied to the study of voice source dynamics. In: Proceedings of the XIVth International Congress of Phonetic Sciences, San Francisco, pp. 2505–2512.
Kienast, M., Paeschke, A., Sendlmeier, W.F., 1999. Articulatory reduction in emotional speech. In: Proceedings of Eurospeech '99, Budapest, pp. 117–120.
Kitzing, P., Löfqvist, A., 1975. Subglottal and oral pressure during phonation – preliminary investigation using a miniature transducer system. Medical and Biological Engineering 13, 644–648.
Klasmeyer, G., Sendlmeier, W.F., 1995. Objective voice parameters to characterize the emotional content in speech. In: Proceedings of the XIIIth International Congress of Phonetic Sciences, Stockholm, Vol. 1, pp. 182–185.
Klatt, D.H., 1980. Software for a cascade/parallel formant synthesizer. Journal of the Acoustical Society of America 67, 971–995.
Klatt, D.H., unpublished chapter. Description of the cascade/parallel formant synthesiser. Sensimetrics Corporation, Cambridge, MA, Chapter 3, 79 pp.
Klatt, D.H., Klatt, L.C., 1990. Analysis, synthesis, and perception of voice quality variations among female and male talkers. Journal of the Acoustical Society of America 87, 820–857.
Koreman, J., 1995. The effects of stress and F0 on the voice source. Phonus 1, University of Saarland, pp. 105–120.
Ladd, D.R., Silverman, K.E.A., Tolkmitt, F., Bergman, G., Scherer, K.R., 1985. Evidence for the independent function of intonation contour type, voice quality and F0 range in signaling speaker affect. Journal of the Acoustical Society of America 78, 435–444.
Laukkanen, A.-M., Vilkman, E., Alku, P., Oksanen, H., 1995. On the perception of emotional content in speech. In: Proceedings of the XIIIth International Congress of Phonetic Sciences, Stockholm, Vol. 1, pp. 246–249.
Laukkanen, A.-M., Vilkman, E., Alku, P., Oksanen, H., 1996. Physical variation related to stress and emotional state: a preliminary study. Journal of Phonetics 24, 313–335.
Laukkanen, A.-M., Vilkman, E., Alku, P., Oksanen, H., 1997. On the perception of emotions in speech: the role of voice quality. Scandinavian Journal of Logopedics, Phoniatrics and Vocology 22, 157–168.
Laver, J., 1980. The Phonetic Description of Voice Quality. Cambridge University Press, Cambridge.
Lee, C.K., Childers, D.G., 1991. Some acoustical, perceptual, and physiological aspects of vocal quality. In: Gauffin, J., Hammarberg, B. (Eds.), Vocal Fold Physiology: Acoustic, Perceptual, and Physiological Aspects of Voice Mechanisms. Singular Publishing Group, San Diego, pp. 233–242.

Ljungqvist, M., Fujisaki, H., 1985. A method for simultaneous estimation of voice source and vocal tract parameters based on linear predictive analysis. Transactions of the Committee on Speech Research, Acoustical Society of Japan S85-21, 153–160.
Mahshie, J., Gobl, C., 1999. Effects of varying LF parameters on KLSYN88 synthesis. In: Proceedings of the XIVth International Congress of Phonetic Sciences, San Francisco, pp. 1009–1012.
McKenna, J., Isard, S., 1999. Tailoring Kalman filtering toward speaker characterisation. In: Proceedings of Eurospeech '99, Budapest, pp. 2793–2796.
Meurlinger, C., 1997. Emotioner i syntetiskt tal. M.Sc. dissertation, Speech, Music and Hearing, Royal Institute of Technology, Stockholm.
Minitab Inc., 2001. MINITAB Statistical Software, Release 13. Minitab, State College, PA.
Monsen, R.B., Engebretson, A.M., 1977. Study of variations in the male and female glottal wave. Journal of the Acoustical Society of America 62, 981–993.
Mozziconacci, S., 1995. Pitch variations and emotions in speech. In: Proceedings of the XIIIth International Congress of Phonetic Sciences, Stockholm, Vol. 1, pp. 178–181.
Mozziconacci, S., 1998. Speech variability and emotion: production and perception. Ph.D. thesis, Technische Universiteit Eindhoven, Eindhoven.
Murray, I.R., Arnott, J.L., 1993. Towards the simulation of emotion in synthetic speech: A review of the literature on human vocal emotion. Journal of the Acoustical Society of America 93, 1097–1108.
Murray, I.R., Arnott, J.L., 1995. Implementation and testing of a system for producing emotion-by-rule in synthetic speech. Speech Communication 20, 85–91.
Murray, I.R., Edgington, M.D., Campion, D., Lynn, J., 2000. In: Cowie, R., Douglas-Cowie, E., Schröder, M. (Eds.), Proceedings of the ISCA Workshop on Speech and Emotion: A Conceptual Framework for Research. Queen's University, Belfast, pp. 173–177.
Ni Chasaide, A., Gobl, C., 1993. Contextual variation of the vowel voice source as a function of adjacent consonants. Language and Speech 36, 303–330.
Ni Chasaide, A., Gobl, C., 1995. Towards acoustic profiles of phonatory qualities. In: Proceedings of the XIIIth International Congress of Phonetic Sciences, Stockholm, Vol. 4, pp. 6–13.
Ni Chasaide, A., Gobl, C., 1997. Voice source variation. In: Hardcastle, W.J., Laver, J. (Eds.), The Handbook of Phonetic Sciences. Blackwell, Oxford, pp. 427–461.
Ni Chasaide, A., Gobl, C., 2002. Voice quality and the synthesis of affect. In: Keller, E., Bailly, G., Monaghan, A., Terken, J., Huckvale, M. (Eds.), Improvements in Speech Synthesis. Wiley and Sons, New York, pp. 252–263.
Ni Chasaide, A., Gobl, C., Monahan, P., 1992. A technique for analysing voice quality in pathological and normal speech. Journal of Clinical Speech and Language Studies, Dublin 1, 1–16.
Olivera, L.C., 1993. Estimation of source parameters by frequency analysis. In: Proceedings of Eurospeech '93, Berlin, pp. 99–102.
Olivera, L.C., 1997. Text-to-speech synthesis with dynamic control of source parameters. In: van Santen, J.P.H., Sproat, R.W., Olive, J.P., Hirschberg, J. (Eds.), Progress in Speech Synthesis. Springer-Verlag, New York, pp. 27–39.
Palmer, S.K., House, J., 1992. Dynamic voice source changes in natural and synthetic speech. In: Proceedings of the International Conference on Spoken Language Processing, Banff, pp. 129–132.
Pierrehumbert, J.B., 1989. A preliminary study of the consequences of intonation for the voice source. STL-QPSR 4, Speech, Music and Hearing, Royal Institute of Technology, Stockholm, pp. 23–36.
Price, P.J., 1989. Male and female voice source characteristics: inverse filtering results. Speech Communication 8, 261–277.
Qi, Y.Y., Bi, N., 1994. Simplified approximation of the 4-parameter LF model of voice source. Journal of the Acoustical Society of America 96, 1182–1185.
Rosenberg, A.E., 1971. Effect of glottal pulse shape on the quality of natural vowels. Journal of the Acoustical Society of America 49, 583–598.
Rothenberg, M., Carlson, R., Granström, B., Lindqvist-Gauffin, J., 1975. A three-parameter voice source for speech synthesis. In: Fant, G. (Ed.), Proceedings of the Speech Communication Seminar, Stockholm, 1974, Vol. 2. Almqvist and Wiksell, Stockholm, pp. 235–243.
Scherer, K.R., 1981. Speech and emotional states. In: Darby, J. (Ed.), The Evaluation of Speech in Psychiatry and Medicine. Grune and Stratton, New York, pp. 189–220.
Scherer, K.R., 1986. Vocal affect expression: A review and a model for future research. Psychological Bulletin 99, 143–165.
Scherer, K.R., 1989. Vocal measurement of emotion. In: Plutchik, R., Kellerman, H. (Eds.), Emotion: Theory, Research, and Experience, Vol. 4. Academic Press, San Diego, pp. 233–259.
Scherer, K.R., 1994. Affect bursts. In: van Goozen, S.H.M., van de Poll, N.E., Sergeant, J.A. (Eds.), Emotions. Lawrence Erlbaum, Hillsdale, NJ, pp. 161–193.
Scherer, K.R., Ladd, R.D., Silverman, K.E.A., 1984. Vocal cues to speaker affect: testing two models. Journal of the Acoustical Society of America 76, 1346–1356.
Schoentgen, J., 1993. Modelling the glottal pulse with a self-excited threshold autoregressive model. In: Proceedings of Eurospeech '93, Berlin, pp. 107–110.
Schröder, M., 2000. Experimental study of affect bursts. In: Cowie, R., Douglas-Cowie, E., Schröder, M. (Eds.), Proceedings of the ISCA Workshop on Speech and Emotion: A Conceptual Framework for Research. Queen's University, Belfast, pp. 132–137.
Scully, C., 1994. Data and methods for the recovery of sources. Deliverable 15 in the Report for the Speech Maps Workshop, Esprit/Basic Research Action no. 6975, Vol. 2, Institut de la Communication Parlée, Grenoble.
Scully, C., Stromberg, K., Horton, D., Monahan, P., Ni Chasaide, A., Gobl, C., 1995. Analysis and articulatory synthesis of different voicing types. In: Proceedings of the XIIIth International Congress of Phonetic Sciences, Stockholm, Vol. 2, pp. 482–485.
Stibbard, R., 2000. Automated extraction of ToBI annotation data from the Reading/Leeds emotional speech corpus. In: Cowie, R., Douglas-Cowie, E., Schröder, M. (Eds.), Proceedings of the ISCA Workshop on Speech and Emotion: A Conceptual Framework for Research. Queen's University, Belfast, pp. 60–65.
Strik, H., Boves, L., 1992. On the relation between voice source parameters and prosodic features in connected speech. Speech Communication 11, 167–174.
Strik, H., Boves, L., 1994. Automatic estimation of voice source parameters. In: Proceedings of the International Conference on Spoken Language Processing, Yokohama, pp. 155–158.
Strik, H., Jansen, J., Boves, L., 1992. Comparing methods for automatic extraction of voice source parameters from continuous speech. In: Proceedings of the International Conference on Spoken Language Processing, Banff, Vol. 1, pp. 121–124.

Strik, H., Cranen, B., Boves, L., 1993. Fitting a LF-model to inverse filter signals. In: Proceedings of Eurospeech '93, Berlin, pp. 103–106.
Swerts, M., Veldhuis, R., 2001. The effect of speech melody on voice quality. Speech Communication 33, 297–303.
Talkin, D., Rowley, J., 1990. Pitch-synchronous analysis and synthesis for TTS systems. In: Proceedings of the ESCA Workshop on Speech Synthesis, Autrans, France, pp. 55–58.
Uldall, E., 1964. Dimensions of meaning in intonation. In: Abercrombie, D., Fry, D.B., MacCarthy, P.A.D., Scott, N.C., Trim, J.L.M. (Eds.), In Honour of Daniel Jones. Longman, London, pp. 271–279.
Veldhuis, R., 1998. A computationally efficient alternative for the Liljencrants–Fant model and its perceptual evaluation. Journal of the Acoustical Society of America 103, 566–571.
Williams, C.E., Stevens, K.N., 1972. Emotions and speech: some acoustical correlates. Journal of the Acoustical Society of America 52, 1238–1250.
Wong, D., Markel, J., Gray, A.H., 1979. Least squares glottal inverse filtering from the acoustic speech waveform. IEEE Transactions on Acoustics, Speech and Signal Processing 24 (4), 350–355.

Breath Group Analysis for Reading and Spontaneous Speech in Healthy Adults

  • Original Paper
    • Folia Phoniatr Logop 2010;62:297–302
    • DOI: 10.1159/000316976
    • Published online: June 28, 2010
  • Yu-Tsai Wang

  • Jordan R. Green

  • Ignatius S.B. Nip

  • Ray D. Kent

  • Jane Finley Kent

Key Words: Breath group, Reading, Spontaneous speech
Abstract
Aims:

The breath group can serve as a functional unit to define temporal and fundamental frequency (f0) features in continuous speech. These features of the breath group are determined by the physiologic, linguistic, and cognitive demands of communication. Reading and spontaneous speech are two speaking tasks that vary in these demands and are commonly used to evaluate speech performance for research and clinical applications. The purpose of this study is to examine differences between reading and spontaneous speech in the temporal and f0 aspects of their breath groups.

Methods:

Sixteen participants read two passages and answered six questions while wearing a circumferentially vented mask connected to a pneumotach. The aerodynamic signal was used to identify inspiratory locations. The audio signal was used to analyze task differences in breath group structure, including temporal and f0 components.

Results:

The main findings were that the spontaneous speech task exhibited significantly more grammatically inappropriate breath group locations and longer breath group durations than did the passage reading task.

Conclusion:

The task differences in the percentage of grammatically inappropriate breath group locations and in breath group duration for healthy adult speakers are partly explained by the difference in cognitive-linguistic load between passage reading and spontaneous speech.

Introduction

The respiratory system provides an aerodynamic source of energy and maintains a roughly constant subglottal air pressure during speech production through fairly precise, ongoing control of the respiratory musculature [1] [2] . Speech is structured in terms of breath groups based on the patterns of airflow from the lungs [3] . The features of breath groups are governed not only by respiratory needs, but also by the varying demands of grammatical structure [4] . Because the location and durations of breath groups are determined by physiologic needs, linguistic accommodations, and cognitive demands, these features may differ across speaking tasks such as passage reading and spontaneous speech.

Characteristics of nonspeech and speech breathing for reading and spontaneous speech in healthy speakers across different age groups and genders have been reported [5] [6] [7] [8] [9] [10] [11] [12] [13], but none of these studies reported fundamental frequency (f0) features within breath groups. The breath group has been proposed as a useful functional unit of prosodic analysis, helping to define temporal and f0 features for connected speech [14], especially because these features are determined by the locations of inspiration. Inspiratory locations usually precede linguistic structural boundaries following grammatical rules; however, inspirations at grammatically inappropriate loci in utterances sometimes occur even for healthy speakers [5] [6] [12] [13] [14]. Bunton [5] reported a 19% occurrence of inappropriate breath locations in normal extemporaneous speech for 3 aged men and 3 aged women. Hammen and Yorkston [6] reported a 2.1% occurrence of inappropriate inspiratory locations in passage reading for 22 females and 2 males. Winkworth et al. [12] [13] reported 3.2 and 15.3% occurrences of inappropriate inspiratory locations for reading passages and spontaneous speech, respectively, for 6 healthy young women.

The temporal features of breath groups are mainly described in terms of breath group duration (BGD), interbreath-group pause (IBP), and inspiratory duration (ID). Statistics on these parameters portray the basic ventilatory pattern of speech. For example, average BGD values range from 3.36 [13] to 3.58 s [11] for reading and from 2.42 [5] to 3.84 s [12] for spontaneous speech. The ID value for reading is 0.59 s [11] . But the full understanding of how these temporal measures vary with speaking task awaits systematic investigation with suitably sensitive methods.

Determining how reading and spontaneous speech tasks differentially affect breath group organization is important because they are often an integral part of the clinical assessment battery used to evaluate dysarthria and other disorders of speech and voice. In addition, the understanding of breath group patterning is important to the improvement of naturalness in speech synthesis [15] . More generally, breath group organization contains a rich source of segmental and prosodic cues used by listeners to perceive and comprehend speech [16] . Proper intonational variations within the breath group provide listeners with cues about linguistic structure [17] .

The current investigation extends extant speech breathing studies by examining task differences in temporal and f0 parameters in a relatively large number of healthy adult talkers, based on aerodynamically determined inspiratory loci. The purposes of this study were (1) to compare the occurrence of inappropriate inspiratory locations between passage reading and spontaneous speech, and (2) to analyze temporal and f0 patterns of speech breathing, including BGD, IBP, ID, the mean of f0 within the breath group (mean f0), the maximum of f0 within the breath group (max f0), and the range of f0 within the breath group (range f0), between passage reading and spontaneous speech in normal adult speakers, based on actual inspiratory locations determined aerodynamically.

Methods
Participants

Participants were 16 healthy adults (6 males, 10 females), aged 20–64 years (mean: 40.3 years; standard deviation: 14.8 years). Participants were native speakers of North American English with no reported speech and language disorders. Participants had adequate auditory, visual, language and cognitive skills to read passages and answer questions.

Stimuli

Speech samples, including the “Bamboo” [18] and “Grandfather” passages [19] and spontaneous speech, were obtained from each participant. The first task involved reading the “Bamboo” and “Grandfather” passages at a comfortable speaking rate and loudness. The “Bamboo” passage was designed to maximize the number of voiced consonants at word and phrase boundaries so that pauses in speech could easily be identified. To obtain spontaneous speech samples, participants were then asked to talk about the following six topics in as much detail as possible: their family, activities in an average day, their favorite activities, what they do for enjoyment, and their plans for their future. Each answer was at least 1 min in length and consisted of at least 6 breath groups (as monitored by an airflow transducer). Participants were given time to familiarize themselves with the passages and to formulate answers to a question before the recording was initiated.

Experimental Protocol

Participants were seated and were instructed to hold a circumferentially vented mask (Glottal Enterprises MA-1L) tightly against their face. The mask was coupled to an airflow transducer (Biopac SS11lA), which was used to continuously record expiratory and inspiratory flows during the speaking tasks. The facemask was reported not to affect the breathing patterns [20] . A professional microphone (Sennheiser) was placed approximately 2–4 cm away from the vented mask. The speaking tasks were presented via PowerPoint on a large screen using an LCD projector (ViewSonic PJ501). Participants were video-recorded using a Canon XL-1s digital video recorder. Video was sampled using Microsoft Windows Movie Maker. Audio signals were recorded at 48 kHz, 16-bit signal with the video. For each video recording, Adobe Audition 1.5 was used to separate the audio signal from the video signal, so that the audio signal could be used for the analysis of breath group structure.

The audio signal and the output signals from the airflow transducer were recorded simultaneously using Biopac Student Lab 3.6.7. Airflow was sampled at 1,000 Hz and low-pass-filtered at 500 Hz. This signal was subsequently used for the identification of actual inspiratory loci. An experimenter marked the onset of each new breath on the airflow signal, as indicated by an easily identified peak in the trace (fig. 1). The total numbers of inspirations determined from the airflow signals were 273 and 1,106 for passage reading and spontaneous speech, respectively.
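In the study the inspiration onsets were marked by hand on the airflow trace. Purely as an illustrative sketch (not the authors' procedure), the following shows one way such onsets might be estimated automatically from an oral airflow signal sampled as described above; the sign convention (expiration positive, inspiration negative), the smoothing cutoff, and the minimum-gap value are all assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def find_inspiration_onsets(flow, fs=1000, smooth_hz=20.0, min_gap_s=1.0):
    """Rough estimate of inspiration onset times (s) from an oral airflow trace.

    Assumes expiratory flow is positive and inspiratory flow negative; an onset
    is taken as the point where the smoothed flow first dips below zero after
    having been non-negative.
    """
    # Smooth the trace so that small fluctuations do not create false crossings.
    b, a = butter(4, smooth_hz / (fs / 2.0), btype="low")
    smoothed = filtfilt(b, a, np.asarray(flow, dtype=float))

    # Downward zero crossings of the smoothed flow.
    crossings = np.where((smoothed[:-1] >= 0) & (smoothed[1:] < 0))[0] + 1

    # Keep only crossings separated by at least min_gap_s; successive breaths
    # in these tasks are separated by several seconds of speech.
    onsets, last = [], -np.inf
    for c in crossings:
        if (c - last) / fs >= min_gap_s:
            onsets.append(c)
            last = c
    return np.array(onsets, dtype=float) / fs
```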

[Figure: _images/fig17.png]

Fig. 1. A demonstration of measures of BGD, IBP and ID based on acoustic and aerodynamic signals. The arrows indicate the locations of inspiration for the Bamboo passage.

Appropriateness of Inspiratory Locations

The appropriateness of inspiratory locations for the passage reading and spontaneous speech samples was determined by a judge with training in linguistics, based on the rules given by Henderson et al. [21]. Inspirations located at the end of a sentence, at punctuation points such as commas or colons, or before noun, verb, adverbial, or other phrases were considered appropriate. Inspirations occurring within phrases or words were considered syntactically inappropriate. The percentage of appropriate breath group loci was calculated to compare the appropriateness of inspiratory locations between the passage reading and the spontaneous speech tasks.

Figure 1 shows inspiratory locations and measures of breath group structure based on acoustic and airflow signals. Top and bottom panels represent waveform and airflow signal from Biopac, respectively. For all tasks, the first BGD was not included in the analysis because the timing patterns associated with the first part of each utterance were expected to be variable and, therefore, nonrepresentative.

Temporal Components

As shown in figure 1, inspiratory locations were used to segment acoustic signals into BGD and IBP. BGD in this study was defined as the duration of groups of speech events produced on a single breath [3] , and was measured from the start to the end of the speech signal produced on a breath group based on the acoustic waveform. IBP was measured as the interval between successive BGDs. ID was measured manually between the nearest minima on both sides of each inspiration and indicates actual inspiratory behavior for each IBP.
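As a minimal sketch only (not the authors' analysis code), BGD and IBP can be derived directly from the marked speech onsets and offsets of successive breath groups; the variable names and example times below are hypothetical, and ID would come from the aerodynamic signal rather than from this function.

```python
import numpy as np

def breath_group_measures(speech_on, speech_off):
    """BGD and IBP (in seconds) from per-breath-group speech onsets/offsets.

    speech_on[i] and speech_off[i] are the acoustically determined start and
    end of the speech produced on breath group i, in temporal order.
    """
    on = np.asarray(speech_on, dtype=float)
    off = np.asarray(speech_off, dtype=float)

    bgd = off - on                # duration of speech on each breath
    ibp = on[1:] - off[:-1]       # pause between successive breath groups

    # Drop the first breath group, mirroring the convention above that the
    # first BGD of each task is not representative.
    return bgd[1:], ibp

# Hypothetical example (times in seconds):
bgd, ibp = breath_group_measures([0.0, 4.1, 8.9], [3.4, 7.9, 12.6])
```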

f0 Components

After the temporal breath group parameters had been measured, a pitch trace was generated with TF32 [22] for each breath group sample. When the pitch tracking algorithm generated errors, the raw f0 trace was corrected manually in TF32 [22]; corrections were most frequently required to delete erroneous f0 values occurring on stop bursts or noise and to add portions of the f0 trace where phonation occurred but no f0 had been tracked, as previously reported [14]. The manually corrected f0 traces within each breath group sample were used to obtain measures of mean f0, max f0, and range f0 (maximum f0 minus minimum f0).
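A small sketch of the three f0 summary measures per breath group, assuming the corrected trace is available as an array with unvoiced or deleted frames stored as NaN (this array layout is an assumption, not the TF32 export format):

```python
import numpy as np

def f0_summary(f0_trace):
    """Mean, maximum, and range of f0 (Hz) for one breath group."""
    f0 = np.asarray(f0_trace, dtype=float)
    voiced = f0[~np.isnan(f0)]          # keep only frames with a valid f0
    mean_f0 = float(voiced.mean())
    max_f0 = float(voiced.max())
    range_f0 = float(voiced.max() - voiced.min())
    return mean_f0, max_f0, range_f0
```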

Measurement Agreement

To estimate intra- and interanalyst measurement agreement, the first author and another individual with experience in acoustic measurement remeasured acoustic data produced by 2 randomly selected participants (12.5% of the entire data corpus). These measurements were taken for both the passage reading and spontaneous speech samples approximately 2 months after completion of the first measures. The Pearson correlation coefficient of BGD between the two measures was 0.99 for intra-analyst and 0.99 for interanalyst. The Pearson correlation coefficient of IBP between the two measures was 0.99 for intra-analyst and 0.99 for interanalyst. The mean absolute difference between the two measures was 11.9 ms for intra-analyst and 13.2 ms for interanalyst in BGD; 11.6 ms for intra-analyst and 12.7 ms for interanalyst in IBP, respectively.
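The agreement statistics used here (Pearson correlation and mean absolute difference between two measurement passes) are straightforward to reproduce; this is a generic sketch rather than the authors' script, and the argument names are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr

def measurement_agreement(first_pass, second_pass):
    """Pearson r and mean absolute difference between two sets of measures
    of the same items (e.g., BGD or IBP in seconds)."""
    a = np.asarray(first_pass, dtype=float)
    b = np.asarray(second_pass, dtype=float)
    r, _ = pearsonr(a, b)
    mean_abs_diff = float(np.mean(np.abs(a - b)))
    return r, mean_abs_diff
```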

Parameter       Passage reading   Spontaneous speech   t(15)    p
BGD, s          3.50 ± 0.62       4.35 ± 0.72          -3.85    0.002
IBP, s          0.65 ± 0.16       0.70 ± 0.12          -1.09    0.295
ID, s           0.55 ± 0.12       0.58 ± 0.08          -1.09    0.295
f0 mean, Hz
  Male          118 ± 12          112 ± 11              1.93    0.073
  Female        186 ± 24          184 ± 243
f0 range, Hz
  Male          169 ± 15          166 ± 17             -1.06    0.304
  Female        269 ± 37          277 ± 35
f0 max, Hz
  Male          99 ± 10           97 ± 14               0.31    0.758
  Female        197 ± 33          196 ± 35

Table 1. Means and standard deviations for BGD, IBP, ID, f0 mean, f0 max, and f0 range in the reading and spontaneous speech samples

Statistical Analysis

A χ2 test was used to analyze task differences in the appropriateness of inspiratory locations. Paired t tests were performed for task differences in temporal parameters (including BGD, IBP, and ID) and f0 parameters (including f0 mean, f0 max, and f0 range) of breath group structure at the α = 0.05 level.
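A sketch of how these two tests could be run with scipy. The chi-square table uses the counts reported in the Results below; the paired t test is shown with simulated per-participant means purely as a placeholder, since the raw data are not given here.

```python
import numpy as np
from scipy.stats import chi2_contingency, ttest_rel

# 2 x 2 contingency table of inappropriate vs. appropriate inspiratory loci
# (counts from the Results: 5/273 for reading, 143/1,106 for spontaneous speech).
table = np.array([[5, 273 - 5],
                  [143, 1106 - 143]])
chi2, p, dof, expected = chi2_contingency(table)   # dof = 1

# Paired t test on per-participant means (df = 15 for 16 participants).
# The two arrays below are simulated placeholders for, e.g., BGD per task.
rng = np.random.default_rng(0)
bgd_reading = rng.normal(3.50, 0.62, 16)
bgd_spontaneous = bgd_reading + rng.normal(0.85, 0.40, 16)
t_stat, p_paired = ttest_rel(bgd_reading, bgd_spontaneous)
```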

Results
Appropriateness of Inspiratory Loci

The number of inappropriate breathing locations was 5 out of 273 (1.8%) and 143 out of 1,106 (13%) for the passage reading and the spontaneous speech task, respectively. The number of inappropriate breathing locations was significantly larger for the spontaneous speech task than for the passage reading task [χ2(1) = 24, p = 0.0001].

Breath Group Structure

Summaries of BGD, IBP, ID, f0 mean, f0 max and f0 range data for the passage reading and the spontaneous speech tasks for each participant are shown in table 1.

Breath Group Duration.

For the passage reading task, the mean and standard deviation of the total 273 BGDs were 4.05 and 1.5 s, and the range was 8.43 s, from a minimum of 0.93 s to a maximum of 9.36 s. For the spontaneous speech task, the mean and standard deviation of the total 1,106 BGDs were 4.88 and 1.93 s, and the range was 13.12 s, from a minimum of 0.9 s to a maximum of 14.02 s. A paired t test was performed based on the mean values of BGD for different tasks for each participant. The spontaneous speech task had a significantly longer BGD than the passage task.

Inter-Breath-Group Pause.

For the passage task, the mean and standard deviation of the total 273 IBPs were 0.64 and 0.24 s, respectively, and the range was 1.55 s, from a minimum of 0.25 s to a maximum of 1.8 s. For the spontaneous speech task, the mean and standard deviation of the total 1,106 IBPs were 0.69 and 0.28 s, respectively, and the range was 3.16 s, from a minimum of 0.23 s to a maximum of 3.4 s. There was no significant difference for IBP between passage and spontaneous speech tasks.

Inspiratory Duration.

For the passage reading task, the mean and standard deviation of the total 273 IDs were 0.54 and 0.18 s, respectively, and the range was 1.02 s, from a minimum of 0.19 s to a maximum of 1.21 s. For the spontaneous speech task, the mean and standard deviation of the total 1,106 IDs were 0.57 and 0.18 s, respectively, and the range was 1.37 s, from a minimum of 0.19 s to a maximum of 1.56 s. There was no significant difference in ID between the passage reading and spontaneous speech tasks.

Mean f0.

The task difference in f0 mean was not significant.

Max f0.

The task difference in f0 max was not significant.

Range f0.

There was no significant difference in f0 range between passage and spontaneous speech tasks.

Discussion

The results of this study confirm and extend earlier reports on respiratory function in speech. The main result of the current study is that the spontaneous speech task exhibited significantly more grammatically inappropriate breath group locations and a longer BGD than did the passage reading task.

Appropriateness of Inspiratory Loci

The percentages of inappropriate inspiratory locations found in this study were similar to values found in previous studies for both reading [6] [13] and spontaneous speech [12] [14]. Some of the grammatically inappropriate inspiratory loci were due to the insertion of a filler, but none occurred within words. Therefore, the significantly greater number of inappropriate breathing locations for spontaneous speech than for reading was unlikely to be due to poor planning of the utterances, but rather to the greater effort required to coordinate inspiratory locations with a less predictable grammatical structure. Another possible reason is the heavier cognitive load required for spontaneous speech than for oral reading. Increased cognitive-linguistic demands have been reported to lead to a reduced number of syllables per breath group, a slower speaking rate, and a greater lung volume expended per syllable [10]. In the current study, inappropriate inspiratory locations probably had little or no impact on speech intelligibility, given that (1) none of them occurred within words, and (2) segmental and prosodic features within breath groups were intact.

Breath Group Structure

Compared to previous reports, the BGD values observed in the current study were longer in spontaneous speech [5] [12] , but comparable in reading [11] [13] ; moreover, the ID values were comparable to those in a previous report [11] . Differences among studies are probably due to variations in the methods used to elicit spontaneous speech samples. In this study, the significantly longer BGD in spontaneous speech than in reading for healthy adult speakers is probably due to the differences in cognitive-linguistic loading between these two tasks [23] . There were no significant task differences in IBP or ID, which indicates that the inspiratory control during speech was consistent between these different tasks for healthy adult talkers. The above results suggest that the overall speech breathing cycle (IBP + BGD) in the spontaneous speech task was longer than that in reading.

The noninspiratory pause, defined as IBP minus actual ID, might be an index of the efforts involved with coordinating speech production subsystems and cognitive load in the communicative task. That is, the portion of pause that is not accounted for by actual inspiration may be determined by other factors, including motor control and cognitive effort. Further studies compiling acoustic and aerodynamic measures are needed to test this hypothesis by recruiting participants with speech motor disorders or cognitive deficits. The absence of task differences in f0 mean, f0 max, or f0 range indicates: (1) f0 control is uniform for these speaking behaviors, which simplifies the programming of laryngeal behavior in connection with respiratory activity, and (2) either task is suitable for assessing f0 of healthy talkers during connected speech. However, because all the participants in this study had normal vocal function, additional studies are required to explore the possibility of f0 differences across tasks in participants with impaired vocal control.
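As a worked example of the proposed index (it is simply the IBP minus ID difference described above), using the spontaneous speech group means from table 1:

```python
import numpy as np

def noninspiratory_pause(ibp, inspiratory_duration):
    """Portion of each inter-breath-group pause not spent on actual inspiration."""
    return np.asarray(ibp, dtype=float) - np.asarray(inspiratory_duration, dtype=float)

# With the spontaneous speech means in table 1 (IBP = 0.70 s, ID = 0.58 s),
# the average noninspiratory pause comes out at roughly 0.12 s.
nip = noninspiratory_pause([0.70], [0.58])
```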

Implications for Speech Breathing

Speech respiration differs from resting respiration in having a shorter inspiratory duration with increased velocity of airflow, and a longer expiratory duration with a decrease in velocity. Conrad and Schonle [23] concluded that respiratory patterns for a variety of tasks fall along a continuum from those produced during rest to those produced during speech. They noted that the degree of activation of the respiratory pattern for speech is determined by the degree of internal verbalization and that respiratory patterns for different tasks become more speechlike as they increase in their cognitive-linguistic processing demands. For example, vocalized arithmetic showed a much stronger speech pattern than did reading. Increased internal verbalization (cognitive-linguistic processing) could also explain the longer BGD for spontaneous speaking in the present study. If spontaneous speaking is taken to represent a high cognitive-linguistic load task, then the respiratory pattern for relatively unconstrained speech has the following temporal profile: a BGD of about 4–5 s, an ID of about 0.6 s, and a breath group interval of about 0.7 s. The ratio of BGD to ID is about 8:1. These values may be useful for clinical application, including assessment of respiratory function for speech or as guidelines for intervention. The fact that global features of the f0 pattern are highly similar across reading and spontaneous speaking tasks is evidence of a simplifying regularity in the control of laryngeal function vis-à-vis respiratory patterns.

Acknowledgments

This work was supported in part by Research Grant number 5 R01 DC00319, R01 DC000822, and R01 DC006463 from the National Institute on Deafness and Other Communication Disorders (NIDCD-NIH), and NSC 94-2614-B-010-001 and NSC 952314-B-010-095 from National Science Council, Taiwan. Additional support was provided by the Barkley Trust, University of Nebraska-Lincoln, Department of Special Education and Communication Disorders. Some of the data were presented in a poster session at the 5th International Conference on Speech Motor Control, Nijmegen, 2006. We would like to acknowledge HsiuJung Lu and Yi-Chin Lu for data processing.

References
[1] Hixon TJ, Mead J, Goldman MD: Dynamics of the chest wall during speech production: function of the thorax, rib cage, diaphragm, and abdomen. J Speech Hear Res 1976;19:297–356.
[2] Hixon TJ, Goldman MD, Mead J: Kinematics of the chest wall during speech production: volume displacements of the rib cage, abdomen, and lung. J Speech Hear Res 1973;16:78–115.
[3] Kent RD, Read C: The Acoustic Analysis of Speech, ed 2. San Diego, Singular, 2002.
[4] Grosjean F, Collins M: Breathing, pausing and reading. Phonetica 1979;36:98–114.
[5] Bunton K: Patterns of lung volume use during an extemporaneous speech task in persons with Parkinson disease. J Commun Disord 2005;38:331–348.
[6] Hammen VL, Yorkston KM: Respiratory patterning and variability in dysarthric speech. J Med Speech Lang Pathol 1994;2:253–261.
[7] Hodge MM, Rochet AP: Characteristics of speech breathing in young women. J Speech Hear Res 1989;32:466–480.
[8] Hoit JD, Hixon TJ: Age and speech breathing. J Speech Hear Res 1987;30:351–366.
[9] Hoit JD, Hixon TJ, Altman ME, Morgan WJ: Speech breathing in women. J Speech Hear Res 1989;32:353–365.
[10] Mitchell HL, Hoit JD, Watson PJ: Cognitive-linguistic demands and speech breathing. J Speech Hear Res 1996;39:93–104.
[11] Solomon NP, Hixon TJ: Speech breathing in Parkinson’s disease. J Speech Hear Res 1993;36:294–310.
[12] Winkworth AL, Davis PJ, Adams RD, Ellis E: Breathing patterns during spontaneous speech. J Speech Hear Res 1995;38:124–144.
[13] Winkworth AL, Davis PJ, Ellis E, Adams RD: Variability and consistency in speech breathing during reading: lung volumes, speech intensity, and linguistic factors. J Speech Hear Res 1994;37:535–556.
[14] Wang YT, Kent RD, Duffy JR, Thomas JE: Dysarthria in traumatic brain injury: a breath group and intonational analysis. Folia Phoniatr Logop 2005;57:59–89.
[15] Keller E, Bailly G, Monaghan A, Terken J, Huckvale M (eds): Improvements in Speech Synthesis: COST 258: The Naturalness of Synthetic Speech. Chichester, Wiley & Sons, 2001.
[16] Lieberman P: Intonation, Perception, and Language. Cambridge, MIT Press, 1967.
[17] Lieberman P: Some acoustic and physiologic correlates of the breath group. J Acoust Soc Am 1966;39:1218.
[18] Green JR, Beukelman DR, Ball LJ: Algorithmic estimation of pauses in extended speech samples of dysarthric and typical speech. J Med Speech Lang Pathol 2004;12:149–154.
[19] Darley FL, Aronson AE, Brown JR: Motor Speech Disorders. Philadelphia, Saunders, 1975.
[20] Collyer S, Davis PJ: Effect of facemask use on respiratory patterns of women in speech and singing. J Speech Lang Hear Res 2006;49:412–423.
[21] Henderson A, Goldman-Eisler F, Skarbek A: Temporal patterns of cognitive activity and breath control in speech. Lang Speech 1965;8:236–242.
[22] Milenkovic P: Time-Frequency Analysis for 32-Bit Windows. Madison, 2001.
[23] Conrad B, Schonle P: Speech and respiration. Arch Psychiatr Nervenkr 1979;226:251–268.

R 言語周辺

書籍

一覧