Module 2 - Voice Onset Time (VOT)

Speech 2023. 8. 16. 21:58

https://speech.zone/courses/speech-processing/module-2-acoustics-of-consonants-and-vowels/videos-2/voice-onset-time-vot/

Voicing is the result of vocal fold vibration. In previous videos we have seen how voicing appears in waveform, spectrum and spectrogram, and we know that it is independent of place and manner in consonant production. We use the terms “voiced” and “voiceless” to indicate whether the vocal folds are vibrating, but these terms do not tell us much about when those vibrations occur, relative to other events.
Consider for a moment a phrase like “come and get it”. When we examine the waveform and spectrogram, we can see that voicing starts and ends at various places throughout the phrase. Sometimes, voicing persists throughout a number of phones, across both consonants and vowels, as we see here. Without even listening to the audio, we can see that voicing is present by observing the voice bar here at the bottom of the spectrogram, and also by observing the periodic structure of the waveform.
At other times, voicing alternates from one phone to the next, as we see at the start of this phrase where a voiceless stop is followed by voicing in a vowel. (The opposite pattern appears at the end of the phrase, where the vowel is followed by a voiceless stop.)
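The idea that voicing shows up as periodic structure in the waveform can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two signals are synthetic stand-ins (a 120 Hz sine wave and random noise), not real speech, and the sample rate, lag range, threshold, and function names are all hypothetical choices rather than anything from the course.

```python
import math
import random

SAMPLE_RATE = 8000  # Hz (assumed for this sketch)
FRAME_LEN = 400     # 50 ms of signal at 8 kHz

def periodicity_score(samples, min_lag, max_lag):
    """Return the highest normalized autocorrelation over the lag range."""
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        a = samples[:-lag]   # frame without its last `lag` samples
        b = samples[lag:]    # same frame shifted by `lag` samples
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        if den > 0:
            best = max(best, num / den)
    return best

# Search lags that correspond to pitch between 400 Hz and 80 Hz (assumed range).
MIN_LAG = SAMPLE_RATE // 400
MAX_LAG = SAMPLE_RATE // 80

def looks_voiced(samples, threshold=0.5):
    """Call a frame voiced if it strongly repeats at some plausible pitch lag."""
    return periodicity_score(samples, MIN_LAG, MAX_LAG) > threshold

# Synthetic stand-ins: a periodic signal and aperiodic noise.
voiced_like = [math.sin(2 * math.pi * 120 * n / SAMPLE_RATE) for n in range(FRAME_LEN)]
rng = random.Random(0)
noise_like = [rng.uniform(-1.0, 1.0) for _ in range(FRAME_LEN)]

print(looks_voiced(voiced_like))
print(looks_voiced(noise_like))
```

A periodic frame correlates strongly with a copy of itself shifted by one pitch period, while noise does not. That is roughly what the eye picks up when it sees repeating oscillations in the waveform.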
If you look closely, you might also notice that there is a very brief interval of time where voicing stops during the closure of the [g] sound. If we remember that in the IPA the [g] symbol stands for a voiced velar stop, this becomes very curious indeed. Why is a voiced sound produced without voicing?
As it turns out, the alignment of voicing with oral closure during stops varies across languages, and we describe this alignment by referring to Voice Onset Time.
Voice Onset Time is typically used only for stops/plosives, so let’s briefly take a moment to consider why this is.
First, recall that some sounds are typically voiced:
- Sonorants
While others come in pairs of voiced and voiceless sounds:
- Obstruents
Voicing depends on air being able to flow through the larynx in order to set the vocal folds in motion. Sonorants are typically voiced continuously throughout the entire duration because the vocal tract is open, allowing air to flow freely.
Try this: hum a bilabial nasal [m]; how long can you keep it going? Forever (or at least until you run out of breath)!
In obstruents, on the other hand, there is constriction in the vocal tract, which impedes or obstructs airflow. This will have implications for how and when voicing can occur.
In fricatives, this doesn’t really get in the way of voicing too much, although there are aerodynamic tradeoffs in order for both voicing and frication to happen at the same time. We won’t get into those here.
Try this: how long can you sustain a [v] sound? – quite a long time!
When we consider stops, things start to get a bit more interesting. Since stops involve complete closure of the vocal tract, there is a limit to how much air can flow in order to create voicing.
Try this: how long can you sustain voicing in a voiced bilabial stop [b]? (What happens when you try to keep voicing going longer?) Sidenote: since voiceless stops don’t involve voicing, you should be able to hold that closure as long as you can hold your breath.
So, we can now see that there are physical limitations to how long voicing can overlap with oral stop closures. While voicing *could* begin and end at any time during sonorants or fricatives, it tends to persist throughout those sounds. In stops, however, the timing of voicing relative to stop closure and release is variable.
Phoneticians describe voice onset time (VOT) in plosives relative to the release burst. This is analogous to a number line, where the burst is located at zero. Voicing before the burst is measured in negative numbers, while voicing that begins after the burst is measured in positive numbers. Note that VOT (like most durations in speech) is typically reported in milliseconds.
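The number-line idea can be written as a one-line calculation. The sketch below assumes we already have two time stamps for a stop, the release burst and the onset of voicing, both in seconds. The function name and the example time stamps are hypothetical, chosen so that the results match the durations in the three examples discussed later (158 ms of prevoicing, a 13 ms lag, and a 62 ms lag).

```python
def voice_onset_time_ms(burst_time_s: float, voicing_onset_s: float) -> float:
    """Return VOT in milliseconds, with the release burst as the zero point.

    Negative values mean voicing started before the burst (voicing lead);
    positive values mean voicing started after the burst (voicing lag).
    """
    return (voicing_onset_s - burst_time_s) * 1000.0

# Hypothetical time stamps (seconds) with the burst placed at 0.500 s.
print(round(voice_onset_time_ms(0.500, 0.342), 1))  # -158.0 (voicing lead)
print(round(voice_onset_time_ms(0.500, 0.513), 1))  # 13.0 (short lag)
print(round(voice_onset_time_ms(0.500, 0.562), 1))  # 62.0 (long lag)
```

Subtracting the burst time from the voicing onset, rather than the other way round, is what makes prevoicing come out negative.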

As a result of this style of measurement, there are three types of VOT:
- pre-voicing, or voicing lead (negative VOT)
- zero VOT, or short voicing lag
- post-voicing, or long voicing lag (positive VOT)
First let’s consider the case of prevoiced stops. Stops that are produced with prevoicing, or negative VOT, will show evidence of voicing during the oral closure, followed by a release burst.
This spectrogram shows an example of a voiced bilabial stop [b], produced between two vowels. Here we can see that voicing continues throughout the stop closure, which is shaded in gray. Voicing is evident in both the waveform, where periodic oscillations are present, as well as in the spectrogram where we can see a voice bar and vertical striations indicative of glottal pulses. The duration of voicing prior to the release of oral closure is 158 ms, which we report as a negative voice onset time of -158 ms.
Stops produced with zero voice onset time have voicing that begins simultaneously (or nearly simultaneously) with the release of oral closure.
This spectrogram shows an example of a voiceless bilabial stop [p], produced between two vowels. We can see the closure of the stop both in the waveform, where the signal is flat, and in the spectrogram, where there is no shading anywhere in the frequency range. The release burst is shaded in gray, and we can see that the burst duration is short and voicing begins immediately after that release. We often refer to this type of stop release as having “zero” VOT, but in fact it often involves a very short lag of a few milliseconds. In this case the lag lasts for 13 ms after the initial release burst.
The third type of VOT is post-voicing, also called long voicing lag or positive VOT.
Stops that are produced with positive VOT will typically have no evidence of voicing during oral closure, and the release burst will be followed by an interval of aspiration, or turbulent noise resembling frication.
This spectrogram shows an example of an aspirated voiceless bilabial stop [pʰ] produced between two vowels. Here again we can see that the stop closure is voiceless by examining the waveform and spectrogram, though you may notice that voicing does not end immediately when the closure begins. This is known as “residual voicing” and is quite common, even in voiceless stops. In this case, the burst release is followed by a bit of noise, which often appears in stops with long-lag VOT. We call this noise aspiration. If we look at the waveform and spectrogram here, we can see that this closely resembles a fricative, and indeed aspiration noise is a type of frication. Because voicing begins sometime after the release of the oral closure, we report the 62 ms of lag as a positive number.
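The three VOT types can be expressed as a small classifier over the measured value. This is a minimal sketch, not a standard procedure: the transcript does not give a boundary between short lag and long lag, so the 30 ms cutoff below is an assumed, illustrative value, and the function name and labels are hypothetical. The three inputs are the measurements from the examples above.

```python
def classify_vot(vot_ms: float, short_lag_max_ms: float = 30.0) -> str:
    """Label a VOT measurement as voicing lead, short lag, or long lag.

    short_lag_max_ms is an assumed cutoff for illustration only.
    """
    if vot_ms < 0:
        return "voicing lead (prevoiced)"
    if vot_ms <= short_lag_max_ms:
        return "short lag (zero VOT)"
    return "long lag (aspirated)"

# Measurements from the three spectrogram examples: [b], [p], [pʰ].
for symbol, vot_ms in [("b", -158.0), ("p", 13.0), ("pʰ", 62.0)]:
    print(symbol, vot_ms, classify_vot(vot_ms))
```

In practice the lag boundary would depend on the language, the place of articulation, and the speaker, so a fixed cutoff like this is only a teaching device.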
So far we have seen how voicing may align with the closure and release phases of stops. Now we will think a bit about how this aligns with the IPA. The IPA is a system of phonetic transcription based on articulatory parameters, but the precise alignment to articulatory (and acoustic) events is generally not specified. This is in part because one of the main goals of the IPA is to capture linguistically relevant contrasts in the sound system of a language – not to faithfully represent the particulars of any one production of speech.
In fact, studies have shown that not all voiceless sounds are voiceless in the same way. We might think, for example, that all voiceless unaspirated stops have the same VOT values. Perhaps we might expect them all to have zero (or small positive) VOT of roughly the same magnitude. However, place of articulation actually has an effect on VOT, with bilabial sounds having the shortest VOT, followed by alveolars, then by velars.
Languages may also differ as to how they maintain voicing contrasts in their sound systems, and linguists often use symbols in a confusing way when describing those contrasts. For example, both Spanish and English are said to have voiced and voiceless stops, which we transcribe using the appropriate IPA symbols for such sounds.
However, if we look at the acoustic productions of these sounds, we see that voiced stops in English have zero VOT, while voiceless stops have positive VOT (and aspiration). In Spanish, voiced stops are pre-voiced, while voiceless stops have zero VOT and are unaspirated. Nevertheless, we use the [b] symbol to represent both the English zero VOT ‘b’ sound and the negative VOT ‘b’ of Spanish.
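One way to see the mismatch between labels and acoustics is to write the two languages down as a lookup table. The sketch below encodes only what is stated above for English and Spanish; the dictionary layout, the label strings, and the function name are hypothetical.

```python
# Phonological category -> typical VOT type, as described for each language.
VOT_PATTERNS = {
    "English": {
        "voiced": "short lag (zero VOT)",
        "voiceless": "long lag (aspirated)",
    },
    "Spanish": {
        "voiced": "voicing lead (prevoiced)",
        "voiceless": "short lag (zero VOT)",
    },
}

def vot_pattern(language: str, category: str) -> str:
    """Look up the typical VOT type for a stop category in a language."""
    return VOT_PATTERNS[language][category]

# The same label, "voiced", names different VOT types in the two languages...
print(vot_pattern("English", "voiced"))
print(vot_pattern("Spanish", "voiced"))

# ...while English "voiced" and Spanish "voiceless" share a VOT type.
print(vot_pattern("English", "voiced") == vot_pattern("Spanish", "voiceless"))  # True
```

The last line shows that an English "voiced" stop and a Spanish "voiceless" stop fall into the same VOT type, even though they are transcribed with different symbols.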
Furthermore, some languages even have more than 2 voicing contrasts, adding complexity to the question of how to represent such productions with a phonetic transcription system.
For example, Thai maintains 3 voicing categories (voiced, voiceless unaspirated, and voiceless aspirated), while Hindi maintains 4 (voiced, voiced aspirated, voiceless unaspirated, and voiceless aspirated).
So, despite using the same terminology to identify voiced and voiceless sounds, languages can and do differ with respect to how they align voicing with stop closure, and these differences may not always be apparent from phonetic transcriptions alone.
