Sharatkumar Chilaka
Understanding Mispronunciation Detection Systems - Part 1
Updated: Aug 28

Table of Contents
What kind of knowledge will be shared?
Which questions will be addressed?
What is a Pronunciation error?
What are the types of Pronunciation errors?
Automatic Speech Recognition Systems
Mispronunciation Detection Systems
Research studies using GOP based DNN-HMM systems
Research studies using End-to-end models
Abbreviations
The table below lists the full forms of some frequently used abbreviations in this article
Abbreviations | Full Forms |
---|---|
CALL | Computer Assisted Language Learning |
CAPT | Computer Assisted Pronunciation Training |
MDD | Mispronunciation Detection and Diagnosis |
ASR | Automatic Speech Recognition |
DNN | Deep Neural Network |
HMM | Hidden Markov Model |
GMM | Gaussian Mixture Model |
CNN | Convolutional Neural Network |
RNN | Recurrent Neural Network |
CTC | Connectionist Temporal Classification |
ATT | Attention architecture-based model |
GOP | Goodness of Pronunciation |
MFCCs | Mel-Frequency Cepstral Coefficients |

Speaking is the most natural way for humans to communicate with each other. As the world becomes more globalised, there is a growing demand for learning foreign languages, particularly English pronunciation. However, traditional pronunciation teaching, which involves one-on-one interaction between a student and a teacher, is often too expensive for many students. As a result, automated pronunciation teaching has become a popular area of research.
What kind of knowledge will be shared?
Through this three-part series, the reader will gain a fair understanding of how mispronunciation detection systems are built. An end-to-end mispronunciation detection algorithm for L2 English speakers based on the hybrid CTC-ATT approach will be explored, and models based on CNN-RNN-CTC and their variations will be discussed.
Which questions will be addressed?
How to improve mispronunciation detection techniques using End-to-End model architectures for L2 English speakers?
How to simplify the model-building process for mispronunciation-related systems?
This article discusses state-of-the-art research done in the field of Computer Assisted Language Learning (CALL). Mispronunciation detection is one of the core components of Computer Assisted Pronunciation Training (CAPT) systems which is a subset of CALL. Studies on automated pronunciation error detection began in the 1990s, but the development of full-fledged CAPTs has only accelerated in the last decade due to an increase in computing power and the availability of mobile devices for recording speech required for pronunciation analysis.

Early techniques were built around a posterior-likelihood measure called Goodness of Pronunciation (GOP), computed with GMM-HMM and DNN-HMM approaches. These methods are more difficult to implement than the newer ASR-based End-to-End mispronunciation detection systems. This article will explore models that use the End-to-End (E2E) approach with Connectionist Temporal Classification and an attention-based sequence decoder. Recently, these models have shown significant improvement in accurately detecting mispronunciations. The article will also compare different CNN-RNN-CTC models to help determine a better approach for developing an efficient mispronunciation detection system.
First, let's start by understanding some basic terms like what is a pronunciation error and what are the different types of pronunciation errors.
What is a Pronunciation error?
Pronunciation refers to how a phrase or word is spoken. It is difficult to determine what constitutes a 'pronunciation error' because there is no clear definition of correct or incorrect pronunciation. Instead, there is a wide range of speech styles, ranging from sounding like a native speaker to being completely unintelligible.
What are the types of Pronunciation errors?
Phonemic and Prosodic are the two major categories of pronunciation errors.
Phonemic Pronunciation Errors
Phonemic pronunciation errors involve mispronouncing phonemes, which can lead to misunderstandings or misinterpretations of words. These errors typically occur when a speaker substitutes one phoneme for another, omits a phoneme, or adds an extra phoneme. For example, confusing the sounds /b/ and /p/ in English can result in words like "bit" being pronounced as "pit" or vice versa. Phonemic errors involve mistakes related to individual speech sounds (phonemes) that can change the meaning of words.
Prosodic Pronunciation Errors
Prosodic pronunciation errors involve misinterpreting or misusing prosodic features such as stress, rhythm and intonation, which can affect the natural flow and meaning of speech. These errors might lead to sentences sounding robotic, monotone, or lacking the appropriate emphasis. For instance, if a speaker uses the wrong intonation pattern in a question, it can make the sentence sound like a statement instead. Prosodic errors involve mistakes related to the rhythmic and intonation patterns of speech, affecting the overall delivery and interpretation of spoken language.

Identifying and teaching pronunciation errors as a whole is a challenging problem. As a result, previous studies have mainly focused on phonemic and prosodic errors.
A phoneme is the smallest unit of speech audio compared with a syllable, word, or phrase. The shorter the unit, the more uncertain the judgment of pronunciation accuracy becomes.
A discussion about the purpose of learning pronunciation has reached the conclusion that the emphasis should be on students' ability to be understood rather than sounding exactly like native speakers. While it is important for advanced learners to sound more like native English speakers, it is not as crucial as being easily understood.
There are two main uses of pronunciation error detection in industries:
to measure pronunciation and
to teach pronunciation
Both applications have their own difficulties, particularly when it comes to pronunciation training.

The focus of this article is limited to mispronunciation error identification in adult read speech only. The study can also be used to create mispronunciation detection systems for children's speech data. This is especially useful to pupils belonging to rural areas, where the shortage of teachers is a major problem.
Also, the experiments are conducted on the L2-ARCTIC and TIMIT datasets, which are recorded in ideal conditions. When a CAPT system is used inside a mobile app, the recorded speech for evaluation would also contain background noise. This issue is not addressed in this article.


Speech Processing involves studying and using concepts like Signal Processing, Deep Learning, and Natural Language Processing. In simple terms, it means converting audio data from sound waves to digital format and extracting important sound features. This is done using Digital Signal Processing techniques. Then, these sound features are used to train machine learning models that can understand and predict the sounds and letters in the audio data. Deep Learning models have shown good results in this area. Finally, a Pronunciation and Language model is used to arrange the words in the correct order. The language model can predict the next word based on the previous words.
Basics of Signal Processing
Audio data analysis is the process of analyzing and working with sound signals that are recorded by digital devices like microphones. These signals are used in various applications, such as CALL systems. Here are some simple explanations of the fundamental ideas behind signal processing.
Sampling Frequency
Audio sampling is the process of converting a continuous audio signal into a series of separate values in signal processing. The sampling rate refers to how often sound waves are converted into digital form at specific intervals (for CD-quality audio, this is typically 44.1 kHz, meaning that 44,100 samples are taken per second). Each sample represents the amplitude of the wave at a particular moment in time, and the bit depth determines the level of detail in each sample, also known as the dynamic range of the signal. A 16-bit sample can take one of 65,536 possible values (for example, 0 to 65,535, or -32,768 to 32,767 for signed samples).

Sampling Frequency refers to the number of samples taken within a specific timeframe. When the sampling frequency is low, there is a greater loss of information, but it is cheaper and easier to calculate. On the other hand, a high sampling frequency results in less loss of information, but it requires more computing power and is more expensive.
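To make the sampling and quantization idea concrete, here is a minimal sketch in Python using NumPy; the sample rate, tone frequency, and duration are arbitrary choices for illustration only.

```python
# Minimal sketch of sampling and 16-bit quantization (illustrative values).
import numpy as np

sample_rate = 16_000          # samples per second (16 kHz, common for speech)
duration = 1.0                # seconds
frequency = 440.0             # Hz, an arbitrary test tone

# Sample the continuous sine wave at discrete time steps.
t = np.arange(int(sample_rate * duration)) / sample_rate
signal = 0.8 * np.sin(2 * np.pi * frequency * t)   # amplitude in [-1, 1]

# Quantize to 16-bit signed integers: 65,536 possible levels (-32,768..32,767).
quantized = np.round(signal * 32767).astype(np.int16)

print(f"{len(quantized)} samples, dtype={quantized.dtype}")
print(f"value range: {quantized.min()} to {quantized.max()}")
```

Halving the sample rate or the bit depth in this sketch shows the trade-off directly: fewer samples and coarser levels mean less data to store and process, but also more information lost.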
Audio Spectrogram
An Audio spectrogram is a visual representation of the audio signal in terms of the amplitude of frequencies over a time period. Audio spectrograms are created by applying the Fast Fourier Transform on the audio signals.
All words are made up of distinct phoneme sounds, each with its own characteristic frequency pattern (for vowels, the formant frequencies), hence spectrograms can be used to phonetically identify words spoken by humans. The figure below shows a spectrogram of the spoken words “nineteenth century”. In this figure, time is represented on the X-axis, frequency is represented on the Y-axis, and the legend on the right shows color intensity, which is proportional to amplitude intensity.

Figure 3 Spectrogram of spoken words “nineteenth century” (Wikipedia, 2021)
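As an illustration of how such a spectrogram is produced, the short sketch below applies the Short-Time Fourier Transform to an audio file. The use of the librosa library, the file name, and the window parameters are assumptions made for the example, not requirements from the article.

```python
# Minimal sketch: spectrogram via the Short-Time Fourier Transform (STFT).
import numpy as np
import librosa

# Hypothetical input file, resampled to 16 kHz.
audio, sr = librosa.load("speech.wav", sr=16000)

# Apply the FFT to short overlapping windows of the signal.
stft = librosa.stft(audio, n_fft=512, hop_length=160, win_length=400)

# Convert magnitudes to decibels, which is what spectrogram plots usually show.
spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

print(spectrogram_db.shape)   # (frequency bins, time frames)
```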
MFCCs
Mel-Frequency Cepstral Coefficients (MFCCs) are derived from the spectrogram and have proven to be more accurate in simulating the human auditory system's response. Hence MFCCs are features widely used in automatic speech and speaker recognition. They represent the envelope of the short-term power spectrum of the speech signal. The MFCCs of a signal are a small set of features (usually about 10–20) that concisely describe the overall shape of the spectral envelope.

Plot of Mel Frequency Cepstral Coefficients (MFCCs) Spectral Envelope
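A minimal sketch of extracting MFCC features is shown below; the librosa library and the specific parameter values are assumptions, with 13 coefficients chosen from the 10–20 range mentioned above.

```python
# Minimal sketch: MFCC extraction (assumed librosa API and parameter values).
import librosa

audio, sr = librosa.load("speech.wav", sr=16000)   # hypothetical input file

# 13 coefficients per frame, computed every 10 ms (hop of 160 samples at 16 kHz).
mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                             n_fft=512, hop_length=160)
print(mfccs.shape)   # (13, number of frames)
```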
Spectrograms vs MFCCs
Spectrograms provide a visual representation of frequency content over time, capturing detailed temporal dynamics and prosodic features. MFCCs, on the other hand, offer a more compact representation of the spectral characteristics of audio signals, particularly suited for tasks like speech recognition. While spectrograms are valuable for a wide range of audio analysis tasks, MFCCs are a powerful tool in speech and audio processing, especially when efficient feature extraction is crucial.
ARPABET
ARPABET is a series of phonetic transcription codes created in the 1970s by the Advanced Research Projects Agency (ARPA) as part of their Speech Understanding Research project. It uses distinct sequences of ASCII characters to reflect phonemes and allophones of General American English.
ARPABET transcriptions play a crucial role in building Automatic Speech Recognition (ASR) systems. ASR systems require large amounts of transcribed audio data for training Machine learning models. During data collection, human annotators often use ARPABET transcriptions to phonetically transcribe the spoken words in the training corpus. These transcriptions map the spoken words to their corresponding phonetic representations, capturing the pronunciation variations that occur in natural speech. During the decoding phase of ASR, acoustic models convert acoustic features from the input audio into a sequence of phonetic symbols.
Following are some examples of Phonetically transcribed words
Words | Phonetic Transcriptions |
---|---|
dog | D AO G |
cat | K AE T |
rain | R EY N |
tree | T R IY |
sun | S AH N |
elephant | EH L AH F AH N T |
language | L AE NG G W AH JH |
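To illustrate how such transcriptions are used programmatically, here is a small sketch that maps words to ARPABET phone sequences with a toy hand-built dictionary; a real system would rely on a full pronunciation lexicon such as the CMU Pronouncing Dictionary.

```python
# Minimal sketch: word-to-phone lookup with a toy ARPABET lexicon.
ARPABET_LEXICON = {
    "dog": ["D", "AO", "G"],
    "cat": ["K", "AE", "T"],
    "rain": ["R", "EY", "N"],
    "tree": ["T", "R", "IY"],
    "sun": ["S", "AH", "N"],
}

def transcribe(sentence: str) -> list[str]:
    """Convert a sentence into a flat ARPABET phone sequence."""
    phones = []
    for word in sentence.lower().split():
        # Words missing from the toy lexicon are marked as unknown.
        phones.extend(ARPABET_LEXICON.get(word, ["<unk>"]))
    return phones

print(transcribe("the dog in the rain"))
# ['<unk>', 'D', 'AO', 'G', '<unk>', '<unk>', 'R', 'EY', 'N']
```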
Phonetic transcription is not only used in training and evaluating automatic speech recognition (ASR) systems, but also for building text-to-speech (TTS) synthesis models, analyzing speech patterns, and investigating phonetic characteristics of spoken language. While ARPABET is specific to American English, similar phonetic transcription systems exist for other languages to represent their unique phonological features.
Automatic Speech Recognition Systems
A majority of the components used in speech recognition systems are also used in building mispronunciation detection systems. Hence it is essential to understand how an automatic speech recognition system works.
The task of an ASR system is to recognize the sequence of words present in the speech signal. It works by breaking down audio into individual sounds, converting these sounds into digital format, and using machine learning models to find the most probable word fit in a particular language. All phoneme sounds have different frequency patterns which can be learned by machine learning algorithms.
Speech Properties
Speech essentially has two types of properties: physical and linguistic. A speaker's age, gender, personality, and accent, along with background noise during recording, affect the way speech is produced. All these aspects combined form the physical properties of speech.
Now since there are so many variations and nuances in these physical properties of speech it is extremely hard to come up with all rules possible for speech recognition. Not only do we have to deal with the physical properties of speech, but we must deal with linguistic properties as well.
For example, consider two sentences: “I read the book last night” and “This is a red book”. Observe that the words “read” and “red” have the same pronunciation here but are interpreted differently based on their context. Language itself is complex, with nuances and variations of its own, so rules would need to cover these as well for effective speech recognition.
DNN-HMM based ASR
To properly deal with the variations and nuances that come with the physicality of speech, such as age, gender, microphone, and environmental conditions, an acoustic model is built. The acoustic model here is a Deep Neural Network and Hidden Markov Model (DNN-HMM) that takes speech features (MFCCs) as input and outputs probabilities over phonetic units, from which the transcribed text is derived. For the neural network to transcribe speech properly, it needs to be trained on huge amounts of speech data. Speech is a naturally occurring time sequence, which means a neural network that can process sequential data is required; Recurrent Neural Networks (RNNs) can be used for this purpose.
Now to deal with the linguistic aspect of the speech and inject the linguistic features into the transcriptions, a language model, pronunciation model and a rescoring algorithm are used.
The output of an acoustic model is the probability of possible phones at each timestep. Pronunciation models then map these phones onto words. Now instead of emitting the words with the highest probability as output transcript, the language model helps determine what is a more likely sentence by building a probability distribution over sequences of words it trained upon. It is used to re-score the probabilities depending on the context of the sentence. By using a language model, the linguistics properties can be injected into the output of an acoustic model and the accuracy of the transcriptions can be increased.
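The sketch below illustrates this rescoring idea with the “red”/“read” example from earlier: two competing hypotheses receive acoustic scores, and toy bigram language-model probabilities shift the decision toward the sentence that is more likely linguistically. All numbers are invented purely for illustration.

```python
# Minimal sketch of language-model rescoring with invented scores.
import math

hypotheses = {
    "this is a red book": {"acoustic_logprob": -42.0},
    "this is a read book": {"acoustic_logprob": -41.5},  # slightly better acoustically
}

# Toy bigram log-probabilities; a real LM is trained on large text corpora.
bigram_logprob = {
    ("a", "red"): math.log(0.02),
    ("red", "book"): math.log(0.05),
    ("a", "read"): math.log(0.0005),
    ("read", "book"): math.log(0.0005),
}

def lm_score(sentence: str) -> float:
    words = sentence.split()
    # Unseen bigrams get a small floor probability.
    return sum(bigram_logprob.get((w1, w2), math.log(1e-6))
               for w1, w2 in zip(words, words[1:]))

lm_weight = 1.0
for text, scores in hypotheses.items():
    combined = scores["acoustic_logprob"] + lm_weight * lm_score(text)
    print(f"{text!r}: combined log-score = {combined:.2f}")
# "this is a red book" wins once the language model is taken into account.
```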

End-to-End ASR
In a conventional ASR system, there exists an acoustic model, a language model, and a pronunciation model to map acoustic features all the way to words. Each of these modules is trained and adjusted independently with a distinct goal in mind, and errors made in one module may not be compensated for by the others. More recently it has become possible to replace this conventional setup with a single neural network. This brings advantages in model simplicity and model size, and optimization also becomes much easier. These types of models are called end-to-end models because they encompass the functionality of the conventional ASR components in one big model. Attention-based sequence-to-sequence models and Connectionist Temporal Classification (CTC) are examples of such models.
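As a concrete illustration of one of these training objectives, the sketch below computes a CTC loss over dummy per-frame log-probabilities using PyTorch; the phone inventory size, sequence lengths, and label values are all made up for the example.

```python
# Minimal sketch of the CTC objective used by end-to-end models (PyTorch).
import torch
import torch.nn as nn

num_phones = 40            # hypothetical phone inventory (CTC blank sits at index 0)
time_steps, batch, classes = 50, 1, num_phones + 1

# Pretend these are per-frame log-probabilities emitted by an acoustic encoder.
log_probs = torch.randn(time_steps, batch, classes).log_softmax(dim=-1)

# A hypothetical target phone sequence (indices into the phone inventory).
targets = torch.tensor([[7, 12, 3, 25]])
input_lengths = torch.tensor([time_steps])
target_lengths = torch.tensor([targets.shape[1]])

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```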


Mispronunciation detection systems also consist of the same components as that of an ASR system. The process of feature extraction, and acoustic modelling remains the same. Initial mispronunciation detection was ASR-based and made use of the Goodness of Pronunciation (GOP) algorithm to perform phone-level pronunciation error detection.
Goodness of Pronunciation
The aim of the GOP measure is to provide a score for each phone of an utterance. The orthographic transcription is known while computing this score. The GOP of a given phone p is defined as

GOP(p) = (1 / NF(p)) * | log( P(O(p) | p) / max over q in Q of P(O(p) | q) ) |

where Q is the set of all phone models, O(p) is the acoustic segment aligned to phone p, and NF(p) is the number of frames in O(p).
A block diagram of the resulting scoring mechanism is shown below. The front-end feature extraction converts the speech waveform into a sequence of mel frequency cepstral coefficients (MFCC). These coefficients are used in two recognition passes: the forced alignment pass and the phone recognition pass. In the forced alignment pass, the system aligns the speech waveform with the corresponding phonetic transcription. In the phone recognition pass, each phone can follow the previous one with equal probability.
Using the results obtained, individual GOP scores for each phone are calculated as per the equation above. Then, a threshold is applied to each GOP score to identify and reject phones that are badly pronounced. The specific threshold to use depends on how strict we want to be in our evaluation.
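The sketch below illustrates this thresholding step with a common posterior-based approximation of the GOP formula: per-frame probabilities of the expected phone are compared against the best-scoring phone, averaged over the aligned segment, and checked against a threshold. The posterior values and the threshold are invented for illustration.

```python
# Minimal sketch of a posterior-based GOP score with thresholding.
import numpy as np

def gop_score(frame_posteriors: np.ndarray, expected_phone: int) -> float:
    """frame_posteriors: (NF(p), |Q|) per-frame phone probabilities for the
    acoustic segment O(p) that forced alignment assigned to the expected phone."""
    eps = 1e-10
    expected = np.log(frame_posteriors[:, expected_phone] + eps)
    best = np.log(frame_posteriors.max(axis=1) + eps)
    # Average the log-ratio over the NF(p) frames; 0 means a perfect match.
    return float(np.abs(expected - best).mean())

# Hypothetical posteriors for a 4-frame segment over a 5-phone inventory.
posteriors = np.array([
    [0.10, 0.70, 0.05, 0.10, 0.05],
    [0.15, 0.60, 0.10, 0.10, 0.05],
    [0.20, 0.55, 0.10, 0.10, 0.05],
    [0.10, 0.65, 0.10, 0.10, 0.05],
])

score = gop_score(posteriors, expected_phone=2)   # the expected phone scores poorly
threshold = 1.0                                   # illustrative rejection threshold
print(f"GOP = {score:.2f} -> {'rejected' if score > threshold else 'accepted'}")
```

A higher GOP value means the expected phone explains the segment worse than some competing phone, so phones whose score exceeds the threshold are flagged as mispronounced.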

Forced Alignment
Forced alignment is a crucial step in assessing and detecting mispronunciations. A speech recognition system (often based on Hidden Markov Models or deep learning techniques) is used to perform forced alignment. This system uses the reference phonetic transcription to align the phonemes with the actual audio signal produced by the speaker.
The forced alignment process produces an alignment grid that maps each phoneme in the reference transcription to a specific time segment in the speech signal. This alignment grid provides a fine-grained mapping of when each phoneme begins and ends in the audio. The alignment information can be used to calculate scores that indicate the degree of similarity between the produced phonemes and the reference phonemes. These scores can be aggregated to provide an overall measure of pronunciation accuracy, often referred to as the "Goodness of Pronunciation" (GOP) score.
Drawbacks of GOP
GOP-based methods, however, could only provide mispronunciation detection and lacked the ability to provide a diagnosis for the mispronunciation. Extended Recognition Networks (ERNs) address this by providing diagnosis feedback for insertion, substitution, and deletion errors. An ERN extends the decoding network of the ASR with phonological rules and can thus provide diagnosis feedback by comparing the ASR output with the corresponding text prompt. However, ERN decoding networks require a large number of phonological rules, which degrades ASR performance and, in turn, the performance of MDD systems.
Recently, End-to-End ASR systems have also shown promising results for MD tasks. These methods do not use forced alignment and integrate the whole training pipeline. The conceptual simplicity and practical effectiveness of end-to-end neural networks have prompted considerable research effort into replacing conventional ASR architectures with integrated E2E modelling frameworks that learn the acoustic and language models jointly.

A great amount of research has been done on mispronunciation detection techniques for developing efficient Computer Assisted Pronunciation Training systems. A complete CAPT system would consist of methods for detecting mispronunciations and providing diagnosis feedback. Here are some important research studies that have been conducted on the topic of detecting mispronunciation. The references for the citations mentioned can be found at the end of this article.
Research studies using GOP based DNN-HMM systems
(Witt and Young, 2000) This paper explores a measure called 'Goodness of Pronunciation' (GOP) that assesses pronunciation accuracy by considering individual thresholds for each phone based on confidence scores and rejection statistics from human judges. The study incorporates models of the speaker's native language and includes expected pronunciation errors. The GOP measures are evaluated using a database of non-native speakers annotated for phone-level pronunciation errors. The results indicate that a likelihood-based pronunciation scoring metric can achieve usable performance with the enhancements. This technique is now commonly found in pronunciation assessment and identification of mispronunciation tasks.
(Lo et al., 2010) focuses on using phonological rules to capture language transfer effects and generate an extended recognition network for mispronunciation detection.
(Qian et al., 2010) proposes a discriminative training algorithm to minimize mispronunciation detection errors and diagnosis errors. It also compares handcrafted rules with data-driven rules and concludes that data-driven rules are more effective in capturing mispronunciations.
(Witt, 2012) This paper talks about the latest research in CAPT as of early 2012. It discusses all the important factors that contribute to pronunciation assessment. It also provides a summary of the research done so far. Furthermore, it gives an overview of how this research is used in commercial language learning software. The paper concludes with a discussion on the remaining challenges and possible directions for future research.
(Hu et al., 2015) This paper suggests different ways to improve the detection of mispronunciations. First, the acoustic model is refined using DNN training to better distinguish between different sounds. Then, F0 is added to the model to identify pronunciation errors caused by incorrect stress or tone. The measurement of pronunciation quality is enhanced using a DNN-HMM based system. Finally, a neural network based classifier is proposed to improve generalization. Experimental results on English and Chinese learning show the effectiveness of these approaches.
(Li et al., 2016) This paper suggests using speech attributes like voicing and aspiration to detect mispronunciation and provide feedback. It focuses on improving the detection of mispronunciations at the segmental and sub-segmental levels. In this study, speech attribute scores are used to assess the quality of pronunciation at a subsegmental level, such as how sounds are made. These scores are then combined using neural network classifiers to generate scores for each segment. This proposed framework reduces the error rate by 8.78% compared to traditional methods such as GOP, while still providing detailed feedback.
Most existing methods for detecting and diagnosing mispronunciations focus on categorical phoneme errors, where one native phoneme is replaced with another. However, they do not consider non-categorical errors. This study (Li et al., 2018) aims to improve mispronunciation detection by developing an Extended Phoneme Set in L2 speech (L2-EPS), which includes both categorical and non-categorical phoneme units. By analyzing clusters of phoneme-based phonemic posterior-grams (PPGs), L2-EPS is identified. Experimental results show that including non-categorical phonemes in L2-EPS enhances the representation of L2 speech and improves mispronunciation detection performance.
(Sudhakara et al., 2019) This study proposes a new formulation for GoP that considers both sub-phonemic posteriors and state transition probabilities (STPs). The proposed method is implemented in Kaldi, a speech recognition toolkit, and tested on English data collected from Indian speakers.
Research studies using End-to-end models
(Lo et al., 2020) This study presents a novel approach to the Mispronunciation Detection (MD) task using a hybrid CTC-Attention model. The approach combines the strengths of both models and eliminates the need for phone-level forced-alignment. Input augmentation with text prompt information is also used to customize the E2E model for the MD task. Two MD decision methods are adopted: decision-making based on recognition confidence or speech recognition results. Experiments show that this approach simplifies existing systems and improves performance. Input augmentation with text prompts shows promise for the E2E-based MD approach.
(Yan et al., 2020) Most Mispronunciation Detection and Diagnosis (MDD) methods focus on fixing categorical errors but struggle with non-categorical or distortion errors. This study uses a novel approach called end-to-end automatic speech recognition (E2E-based ASR) to improve MDD. By adding an anti-phone set to the original phone set, both categorical and non-categorical mispronunciations can be detected and diagnosed more accurately, resulting in better feedback. The study also introduces a transfer-learning approach to estimate the initial model of the E2E-based MDD system without using phonological rules. Extensive experiments on the L2-ARCTIC dataset show that the improved system outperforms existing baseline systems and pronunciation scoring methods (GOP) in terms of F1-score, with improvements of 11.05% and 27.71% respectively.
(Zhang et al., 2020) An end-to-end ASR system was initially built using the hybrid CTC/attention architecture. The system's performance was further enhanced by incorporating an adaptive parameter, resulting in good results for the automatic pronunciation error detection (APED) task for Mandarin. This new method eliminates the need for forced alignment, segmentation, and complex models, making it a convenient and suitable solution for L1-independent CAPT. The proposed system based on the improved hybrid CTC/attention architecture is comparable to the state-of-the-art DNN-HMM ASR system and has a stronger impact on F-measure metrics, which are important for the APED task.
(Feng et al., 2020) A mispronunciation detection and diagnosis (MDD) system typically consists of multiple stages, including an acoustic model, a language model, and a Viterbi decoder. To integrate these stages, a new model called SED-MDD is proposed for sentence-dependent mispronunciation detection and diagnosis. This model takes mel-spectrogram and characters as inputs and outputs the corresponding phone sequence. Experiments have shown that SED-MDD can learn the phonological rules directly from the training data, achieving an accuracy of 86.35% and a correctness of 88.61% on L2-ARCTIC, outperforming the existing model CNN-RNN-CTC.
(Fu et al., 2021) This paper presents a new text-dependent model that utilizes prior text in an end-to-end structure, similar to SED-MDD. The model aligns the audio with the phoneme sequences of the prior text using the attention mechanism, achieving a fully end-to-end system. To address the imbalance between positive and negative samples in the phoneme sequence, three simple data augmentation methods are proposed, effectively improving the model's ability to capture mispronounced phonemes. Experiments on L2-ARCTIC show that the model's best performance improved from 49.29% to 56.08% in the F-measure metric compared to the CNN-RNN-CTC model.

Witt, S.M. and Young, S.J. (2000) Phone-level pronunciation scoring and assessment for interactive language learning. Speech Communication, 30(2-3), pp. 95-108.
Lo, W.-K., Zhang, S., Meng, H. (2010) Automatic derivation of phonological rules for mispronunciation detection in a computer-assisted pronunciation training system. Proc. Interspeech 2010, 765-768, doi: 10.21437/Interspeech.2010-280
Qian, X., Soong, F.K., Meng, H. (2010) Discriminative acoustic model for improving mispronunciation detection and diagnosis in computer-aided pronunciation training (CAPT). Proc. Interspeech 2010, 757-760, doi: 10.21437/Interspeech.2010-278
Witt, Silke. (2012). Automatic Error Detection in Pronunciation Training: Where we are and where we need to go.
Hu, Wenping & Qian, Yao & Soong, Frank & Wang, Yong. (2015). Improved Mispronunciation Detection with Deep Neural Network Trained Acoustic Models and Transfer Learning based Logistic Regression Classifiers. Speech Communication. 67. 10.1016/j.specom.2014.12.008.
W. Li, S. M. Siniscalchi, N. F. Chen and C. -H. Lee, "Improving non-native mispronunciation detection and enriching diagnostic feedback with DNN-based speech attribute modeling," 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 2016, pp. 6135-6139, doi: 10.1109/ICASSP.2016.7472856.
Li, X., Mao, S., Wu, X., Li, K., Liu, X., Meng, H. (2018) Unsupervised Discovery of Non-native Phonetic Patterns in L2 English Speech for Mispronunciation Detection and Diagnosis. Proc. Interspeech 2018, 2554-2558, doi: 10.21437/Interspeech.2018-2027
Sudhakara, S., Ramanathi, M.K., Yarra, C., Ghosh, P.K. (2019) An Improved Goodness of Pronunciation (GoP) Measure for Pronunciation Evaluation with DNN-HMM System Considering HMM Transition Probabilities. Proc. Interspeech 2019, 954-958, doi: 10.21437/Interspeech.2019-2363
Lo, T.-H., Weng, S.-Y., Chang, H.-J., Chen, B. (2020) An Effective End-to-End Modeling Approach for Mispronunciation Detection. Proc. Interspeech 2020, 3027-3031, doi: 10.21437/Interspeech.2020-1605
Yan, B.-C., Wu, M.-C., Hung, H.-T., Chen, B. (2020) An End-to-End Mispronunciation Detection System for L2 English Speech Leveraging Novel Anti-Phone Modeling. Proc. Interspeech 2020, 3032-3036, doi: 10.21437/Interspeech.2020-1616
Zhang, L.; Zhao, Z.; Ma, C.; Shan, L.; Sun, H.; Jiang, L.; Deng, S.; Gao, C. End-to-End Automatic Pronunciation Error Detection Based on Improved Hybrid CTC/Attention Architecture. Sensors 2020, 20, 1809. https://doi.org/10.3390/s20071809
Feng, Y., Fu, G., Chen, Q. and Chen, K., "SED-MDD: Towards Sentence Dependent End-To-End Mispronunciation Detection and Diagnosis," ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 3492-3496, doi: 10.1109/ICASSP40776.2020.9052975.
Fu, K., Lin, J., Ke, D., Xie, Y., Zhang, J. and Lin, B. (2021) A Full Text-Dependent End to End Mispronunciation Detection and Diagnosis with Easy Data Augmentation Techniques. arXiv preprint arXiv:2104.08428. https://arxiv.org/abs/2104.08428