Sharatkumar Chilaka
Understanding Mispronunciation Detection Systems - Part 1
Updated: Jan 3
With growing globalization, the market for foreign-language learning has risen substantially, and English pronunciation learning is one part of it. Pronunciation teaching essentially involves one-to-one interaction between pupil and teacher, which many pupils cannot afford. Hence, automated pronunciation teaching has become a popular research domain.

Introduction
Research on the automatic identification of pronunciation errors and the measurement of pronunciation quality began in the 1990s, with a series of developments from the late 90s to the early 2000s. The commercialization of Computer-Assisted Pronunciation Training (CAPT) at the beginning of the 2000s proved problematic, and development activity slowed down. Interest picked up again about thirteen years ago with increased computing capacity, smart devices, and improved speech recognition. To follow this study, two basic questions need to be answered first: what is a pronunciation error, and what types of pronunciation errors exist?
What is a Pronunciation error?
Pronunciation is the way in which a word or phrase is spoken. It is hard to define the term 'pronunciation error' precisely because there is no fixed definition of correct and incorrect pronunciation; instead, there is a continuous spectrum ranging from native-sounding speech to unintelligible speech.
What are the types of Pronunciation errors?
Errors in pronunciation can be classified into phonemic and prosodic errors. Phonemic errors occur when a phoneme is substituted with another phoneme, omitted, or inserted. On the prosodic side, errors in non-native accents relate to rhythm, intonation, and stress.

Identifying and teaching pronunciation errors in their entirety is a difficult problem, and hence most previous studies have tackled only components of the phonemic and prosodic errors. A phoneme is the smallest unit of speech audio compared with a syllable, word, or phrase, and the shorter the unit, the greater the uncertainty in judging pronunciation accuracy.
A debate on the purpose of learning pronunciation concluded that training should focus on students' intelligibility rather than on sounding like native speakers. Although it is desirable for advanced learners to sound more like L1 English speakers, it is less essential than basic intelligibility.
There are two major industrial applications of pronunciation error detection:
as part of the measurement of pronunciation and
as part of the instruction of pronunciation.
Each implementation presents its own challenges, especially on the pronunciation training side.
What kind of knowledge will be shared?
Through this three-part series, the reader will gain a fair understanding of how mispronunciation detection systems are built. An end-to-end mispronunciation detection algorithm for L2 English speakers based on the hybrid CTC-ATT (CTC-Attention) approach will be explored, and models based on CNN-RNN-CTC and their variations will be discussed.
Questions that will be addressed
How can mispronunciation detection be improved using end-to-end model architectures for L2 English speakers?
How can the model-building process for mispronunciation-related systems be simplified?
Scope and Limitations
The focus of this article is limited to mispronunciation detection in adult read speech, though the approach could also be used to build mispronunciation detection systems for children's speech data. Such systems would be especially useful for pupils in rural areas, where a shortage of teachers is a major problem. The experiments are conducted on the L2-Arctic and TIMIT datasets, which were recorded in ideal conditions; when a CAPT system is used inside a mobile app, the recorded speech would also contain background noise. This issue is not addressed in this article.
Speech Processing Basics

Speech processing involves the study and use of concepts from NLP, deep learning, and digital signal processing.
Audio data analysis involves analyzing and processing audio signals captured by digital devices such as microphones, and it has several applications in Computer-Assisted Language Learning (CALL) systems (Purwins et al., 2019). A sound signal is described in terms of wavelength, bandwidth, decibels, and so on, and can be represented as amplitude over time.
Sampling Frequency
Audio sampling is the process of converting a continuous audio signal into a sequence of discrete values. The sampling rate, or sampling frequency, is the number of samples taken per second; for CD-quality audio it is normally 44.1 kHz, meaning 44,100 samples are taken per second. Each sample is the amplitude of the wave at a specific point in time, and the bit depth determines how detailed each sample is, also known as the signal's dynamic range (a 16-bit sample can take one of 65,536 possible values). A low sampling frequency loses more information but is cheap and easy to compute, while a high sampling frequency preserves more information at a higher computing cost.
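As a quick illustration (not part of the original study), the Python sketch below loads the same recording at its native rate and again resampled to 8 kHz using librosa; the file name speech.wav is a placeholder.

```python
# A minimal sketch of how the sampling rate affects an audio signal.
# "speech.wav" is a placeholder file name, not a file from the study.
import librosa

# Load at the file's native rate, then again resampled down to 8 kHz.
y_native, sr_native = librosa.load("speech.wav", sr=None)   # e.g. 44100 Hz for CD audio
y_8k, sr_8k = librosa.load("speech.wav", sr=8000)           # fewer samples per second

print(f"native: {sr_native} Hz, {len(y_native)} samples")
print(f"8 kHz:  {sr_8k} Hz, {len(y_8k)} samples")

# The duration is the same in both cases; only samples-per-second differs.
print(f"duration: {len(y_native) / sr_native:.2f} s vs {len(y_8k) / sr_8k:.2f} s")
```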

Audio Spectrogram
An audio spectrogram is a visual representation of an audio signal showing the amplitude of its frequency content over time. All words are made up of distinct phoneme sounds, each with different characteristic frequencies, so spectrograms can be used to phonetically identify words spoken by humans. Figure 3 shows a spectrogram of the spoken words “nineteenth century”: time is on the X-axis, frequency is on the Y-axis, and the legend on the right shows the color intensity, which is proportional to amplitude. Audio spectrograms are created by applying the short-time Fourier transform (computed with the Fast Fourier Transform) to the audio signal.

Figure 3 Spectrogram of spoken words “nineteenth century” (Wikipedia, 2021)
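To make this concrete, here is a minimal Python sketch, assuming librosa and matplotlib are available, that computes and plots a spectrogram in the same layout as Figure 3; the file name speech.wav and the frame settings are assumptions for illustration only.

```python
# A minimal sketch of computing a spectrogram with the short-time Fourier
# transform in librosa; "speech.wav" is a placeholder file name.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("speech.wav", sr=16000)

# STFT magnitude, converted to decibels for plotting.
stft = librosa.stft(y, n_fft=512, hop_length=160)
spec_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

# Time on the X-axis, frequency on the Y-axis, color proportional to amplitude,
# as in Figure 3.
librosa.display.specshow(spec_db, sr=sr, hop_length=160, x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram")
plt.tight_layout()
plt.show()
```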
MFCCs
Automatic speech recognition begins by extracting features from the audio signal that represent its linguistic content. Mel Frequency Cepstral Coefficients (MFCCs) have proven accurate in modeling the human auditory system's response (Chauhan and Desai, 2014). They represent the envelope of the short-time power spectrum of the speech signal, which is why MFCCs are widely used features in automatic speech and speaker recognition. The MFCCs of a signal are a small set of features (usually about 10–20) that concisely describe the overall shape of the spectral envelope.

Figure 4 Plot of Mel Frequency Cepstral Coefficients (MFCCs) Spectral Envelope
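As a small illustrative sketch (not the article's pipeline), the snippet below extracts 13 MFCCs per frame with librosa; the file name speech.wav and the frame settings are assumptions, not values from the study.

```python
# A minimal sketch of extracting MFCC features with librosa.
# "speech.wav" is a placeholder file name.
import librosa

y, sr = librosa.load("speech.wav", sr=16000)

# 13 coefficients per ~25 ms frame with a 10 ms hop, a common ASR setup.
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

print(mfccs.shape)  # (13, number_of_frames)
```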
ARPABET
ARPABET is a set of phonetic transcription codes created in the 1970s by the Advanced Research Projects Agency (ARPA) as part of its Speech Understanding Research project. It uses distinct sequences of ASCII characters to represent the phonemes and allophones of General American English. Table 1 lists the ARPAbet symbols, each with an example word and its transcription; a small dictionary-lookup sketch follows the table.
ARPABET | Example | Annotation |
---|---|---|
AA | odd | AA D |
AE | at | AE T |
AH | hut | HH AH T |
AO | ought | AO T |
AW | cow | K AW |
AY | hide | HH AY D |
B | be | B IY |
CH | cheese | CH IY Z |
D | dee | D IY |
DH | thee | DH IY |
EH | Ed | EH D |
ER | hurt | HH ER T |
EY | ate | EY T |
F | fee | F IY |
G | green | G R IY N |
HH | he | HH IY |
IH | it | IH T |
IY | eat | IY T |
JH | gee | JH IY |
K | key | K IY |
L | lee | L IY |
M | me | M IY |
N | knee | N IY |
NG | ping | P IH NG |
OW | oat | OW T |
OY | toy | T OY |
P | pee | P IY |
R | read | R IY D |
S | sea | S IY |
SH | she | SH IY |
T | tea | T IY |
TH | theta | TH EY T AH |
UH | hood | HH UH D |
UW | two | T UW |
V | vee | V IY |
W | we | W IY |
Y | yield | Y IY L D |
Z | zee | Z IY |
ZH | seizure | S IY ZH ER |
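As a small illustrative sketch (not used in the article), the CMU Pronouncing Dictionary bundled with NLTK can be used to look up ARPAbet transcriptions for English words:

```python
# A minimal sketch of looking up ARPAbet transcriptions in the
# CMU Pronouncing Dictionary shipped with NLTK.
import nltk
nltk.download("cmudict", quiet=True)
from nltk.corpus import cmudict

pron = cmudict.dict()  # maps each word to a list of ARPAbet pronunciations
for word in ["nineteenth", "century"]:
    # CMU entries carry lexical stress digits on vowels, e.g. AY1.
    print(word, pron[word][0])
```

Note that the CMU dictionary adds lexical stress digits (0, 1, 2) to the vowel symbols, a minor extension of the plain ARPAbet symbols shown in Table 1.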