Sharatkumar Chilaka

Understanding Mispronunciation Detection Systems - Part 2

Updated: 3 days ago

This is the second part of a two-part article series on Understanding Mispronunciation Detection Systems. In this article, readers will learn about the step-by-step process of building and training the model and the tools and technologies needed for conducting the experiments. The proposed model section explains how different components of the model structure work together. It will also cover the evaluation metrics used to measure the model's performance.

| Automatic Speech Recognition Systems | Mispronunciation Detection Systems |
| --- | --- |
| The objective of an ASR system is to transcribe spoken language into text. | The objective of an MDD system is to identify and analyse mispronunciations. |
| The training data consists of audio paired with corresponding correct transcriptions of the spoken words. | The training data includes instances of correct pronunciations as well as various types of mispronunciations. This may involve annotations specifying the types of mispronunciations (e.g., phonemic errors and prosodic errors). |
| Acoustic modelling in ASR focuses on mapping acoustic features to linguistic units for accurate transcription. | Acoustic modelling in MDD involves recognizing acoustic patterns associated with correct and incorrect pronunciation for the purpose of mispronunciation detection. |
| The decoding process results in a final transcription of the spoken words, representing the system's best estimate of the spoken language given the observed acoustic features. | The decoding process results in an assessment of the pronunciation quality, indicating whether it is likely to be correct or whether there are potential mispronunciations. |
| Evaluation metrics include word error rate (WER), phoneme error rate (PER), and other transcription accuracy measures. | Evaluation metrics focus on the system's ability to correctly identify and classify mispronunciations, which may involve precision, recall, F1 score, or other classification metrics. |
| Widely used in applications such as voice assistants, transcription services, and voice command recognition. | Applied in language learning platforms, pronunciation assessment tools, and educational software to provide feedback on and help improve a user's pronunciation skills. |

While ASR and MDD share challenges related to acoustic modeling and handling diverse linguistic variations, MDD introduces a set of complexities related to the nuanced assessment of pronunciation correctness. The subjective nature of pronunciation makes MDD a task that involves understanding and learning the subtle differences between acceptable variations and actual mispronunciations. It underscores the importance of carefully curated training data, expert annotation, and a nuanced understanding of linguistic norms for effective MDD model development.

In Part 1 of this article, we saw that end-to-end ASR models are simpler to build and optimize than conventional ASR systems, as they do not have multiple modules to train separately. Initially, Graves & Jaitly presented a system based on the combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification (CTC) objective function. They demonstrated that character-level speech transcription can be performed by a recurrent neural network with minimal preprocessing and no explicit phonetic representation. They also introduced a novel objective function that allows the network to be directly optimised for word error rate, even in the absence of a lexicon or language model.

While CTC with RNNs makes it feasible to train end-to-end speech recognition systems, it is computationally expensive and sometimes difficult to train. (Zhang et al., 2016) suggested an end-to-end speech framework for sequence labeling that combines hierarchical CNNs with CTC, without the need for recurrent connections. The proposed model is not only computationally efficient but also has the capacity to learn the temporal relations required for it to be integrated with CTC.

(Chorowski et al., 2015) present an Attention-based Recurrent Sequence Generator (ARSG), a recurrent neural network that stochastically generates an output sequence from an input. It is based on a hybrid attention mechanism that combines both content and location information in order to select the next position in the input sequence for decoding. This proposed model could recognize utterances much longer than the ones it was trained on. Also, the deterministic nature of ARSG's alignment mechanism allows the beam search procedure to be simpler, which allows for faster decoding.

(Watanabe et al., 2017) propose a hybrid CTC/attention end-to-end ASR, which effectively utilizes the advantages of both architectures in training and decoding. During training, a multiobjective learning method is employed by attaching a CTC objective to an attention-based encoder network as a regularization. This greatly reduces the number of irregularly aligned utterances without any heuristic search techniques. A joint decoding approach is used by combining both attention-based and CTC scores in a one-pass beam search algorithm to further eliminate irregular alignments. This method has outperformed both CTC and an attention model on ASR tasks in real-world noisy conditions as well as in clean conditions. This work can potentially be applied to any sequence-to-sequence learning task.

(Lo et al., 2020; Yan et al., 2020; Zhang et al., 2020) propose a hybrid CTC and attention-based end-to-end architecture for MDD. (Zhang et al., 2020) introduce a dynamic adjustment method for the parameter α used by (Watanabe et al., 2017). (Yan et al., 2020) also use an anti-phone collection to generate additional speech training data with a label-shuffling scheme, as a novel data-augmentation operation. For every utterance in the original speech training dataset, the label of the phone at each position of its reference transcript is either kept unchanged or randomly replaced with an arbitrary anti-phone label. (Lo et al., 2020) perform input augmentation with text prompt information to make the resulting E2E-based model more tailored for MDD.
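As a rough sketch of what the weighting parameter does, the hybrid objective and the joint decoding score are usually written as interpolations between the CTC and attention terms. The symbol α follows the naming used in this article; the original papers may use a different symbol, and the training and decoding weights do not have to be equal.

```latex
% Multi-objective training loss: alpha balances the CTC and attention losses
\mathcal{L}_{\mathrm{MTL}} = \alpha \, \mathcal{L}_{\mathrm{CTC}} + (1 - \alpha) \, \mathcal{L}_{\mathrm{att}}, \qquad 0 \le \alpha \le 1

% Joint decoding: the beam search scores each hypothesis Y with both models
\hat{Y} = \arg\max_{Y} \left\{ \alpha \log p_{\mathrm{CTC}}(Y \mid X) + (1 - \alpha) \log p_{\mathrm{att}}(Y \mid X) \right\}
```

With α = 1 the model reduces to pure CTC, and with α = 0 to a pure attention model.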

(Leung et al., 2019) present the CNN-RNN-CTC approach, an end-to-end speech recognition approach for the task of MDD. This approach needs no phonemic or graphemic information, and forced alignment is not required. It is one of the first end-to-end models proposed for MDD and works well with the CU-CHLOE corpus spoken by Cantonese and Mandarin speakers.

(Feng et al., 2020) build SED-MDD, a sentence-dependent end-to-end model for MDD. It is the first model that uses both linguistic and acoustic features to deal with MDD problems. The model consists of a sentence encoder, whose goal is to extract robust sequential representations of a given sentence, and a sequence labelling model with an attention mechanism. It is trained from scratch with random initialisation and does not make use of phonological rules or forced alignment; it only needs audio files, transcriptions and annotation files. Moreover, the model is evaluated on two publicly available corpora, TIMIT and L2-ARCTIC, which makes it a strong baseline for researchers.

(Fu et al., 2021) present a system similar to SED-MDD, but instead of character sequences, phoneme sequences are fed to the sentence encoder. Since MDD systems aim to detect phoneme-level errors, it makes sense to use phoneme sequences. They also propose three easy data augmentation techniques to handle the imbalance between positive and negative samples in the L2-ARCTIC dataset, which also improves the accuracy of the model when compared with CNN-RNN-CTC and SED-MDD.

The figure below illustrates the step-by-step process of building MDD systems. It starts with preparing the data and extracting the acoustic features required to train the model. The next steps involve training the acoustic and language models and decoding the results.

During this phase, speech data is acquired, annotated, augmented and sorted. Each of these steps is an individual task that requires considerable time and effort. Although freely available speech corpora exist for experimentation purposes, they can be of limited use when it comes to building MDD systems for L2 English speakers. ASR and MDD systems must be trained and tested on different speakers; the more speakers you have, the better.

After acquiring relevant speech data, it needs to be annotated using tools such as Praat. This is essentially the stage of labelling the data. Data augmentation is performed to increase the number of training samples; it also addresses data imbalance issues. More details about data augmentation and freely available speech datasets will be explored in the third part of the article.

In the case of speech data, apart from having audio files, a transcript and annotation files are also required. One of the popular data preparation techniques is Kaldi-style data preparation. It is designed to ensure that audio data is well-structured, aligned with text transcriptions, and ready for feature extraction and model training. It follows a consistent workflow, making it easier for researchers and practitioners to work with large-scale speech datasets while maintaining data quality and consistency. The following is the list of files that need to be created.

  • text - This file contains the transcripts of all the utterances in the speech corpus. Each line has the format <utterance_id> <text_transcription>, where utterance_id is formed by appending the file name of the audio sample to the speaker ID.

  • wav.scp - This file contains the location of every audio file in the corpus. Each line has the format <utterance_id> <full_path_to_audio_file>, where utterance_id is the same as described above and full_path_to_audio_file is the location of the file on the hard disk. Kaldi requires the audio files to be in single-channel .wav format. If the audio files are in a different format, the SoX tool is used to convert them to WAV before the features are extracted.

  • utt2spk - This is a text file that maps each audio utterance to its corresponding speaker. Each line has the format <utterance_id> <speaker_id>, where utterance_id is the same as described above and speaker_id is the ID/code name given to the speaker.

  • segments - If the corpus provides one audio file per speaker containing several utterances, a segments file is needed. This is a text file containing the start and end times of each utterance in the audio file. Each line has the format <utterance_id> <file_name> <start_time> <end_time>. This file is optional and is required only when multiple utterances are present inside each audio file.

The above-mentioned files can be prepared by executing Bash or Python scripts. Kaldi also provides some example data preparation scripts for some of the commonly used speech corpora such as TIMIT, LibriSpeech, WSJ, Switchboard and many others.
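As a minimal sketch of such a script, the Python function below writes the text, wav.scp and utt2spk files from a list of utterances. The function name `write_kaldi_data_dir`, the tuple layout and the example paths are assumptions made for illustration, not part of Kaldi. The utterance ID is built as speaker ID plus file name, as described above, and the lines are sorted because Kaldi's tools expect these files in sorted order.

```python
import os
import tempfile


def _write_lines(path, lines):
    """Write one entry per line, ending each with a newline."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")


def write_kaldi_data_dir(utterances, out_dir):
    """Create text, wav.scp and utt2spk in out_dir.

    utterances: iterable of (speaker_id, file_stem, wav_path, transcript)
    tuples (a hypothetical layout chosen for this sketch).
    Returns the sorted list of utterance IDs that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    rows = []
    for speaker_id, file_stem, wav_path, transcript in utterances:
        # Prefixing with the speaker ID keeps utterances grouped per speaker.
        utt_id = f"{speaker_id}_{file_stem}"
        rows.append((utt_id, speaker_id, wav_path, transcript))
    rows.sort(key=lambda row: row[0])

    _write_lines(os.path.join(out_dir, "text"),
                 [f"{utt} {trans}" for utt, _, _, trans in rows])
    _write_lines(os.path.join(out_dir, "wav.scp"),
                 [f"{utt} {wav}" for utt, _, wav, _ in rows])
    _write_lines(os.path.join(out_dir, "utt2spk"),
                 [f"{utt} {spk}" for utt, spk, _, _ in rows])
    return [row[0] for row in rows]


if __name__ == "__main__":
    # Made-up example entries; replace with the real corpus listing.
    demo = [
        ("spk2", "utt1", "/data/spk2/utt1.wav", "HELLO WORLD"),
        ("spk1", "utt9", "/data/spk1/utt9.wav", "GOOD MORNING"),
    ]
    with tempfile.TemporaryDirectory() as tmp_dir:
        print(write_kaldi_data_dir(demo, tmp_dir))
```

The optional segments file could be written the same way once start and end times are available for each utterance.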

After organizing audio data, transcripts, and pronunciation information according to the Kaldi-style data preparation process, we can extract acoustic features like Mel-Frequency Cepstral Coefficients (MFCCs) and Filterbank (Fbank) features from the audio files. Kaldi provides tools and scripts to perform feature extraction. The feature extraction tools are often used in combination with Kaldi's data iteration scripts. These scripts iterate through the audio files, read the audio data, and pass it through the feature extraction tools to produce feature vectors for each frame. The extracted feature vectors are saved in Kaldi's native format, often as binary files that can be efficiently read during training and decoding.

The extracted feature vectors serve as input features for training ASR and MDD models. During training, the feature vectors are paired with the corresponding transcript information to learn the relationships between acoustic features and phonetic sequences. During decoding (transcription of unseen audio), the same features are used to generate hypotheses or predictions. Kaldi provides executables like compute-mfcc-feats for MFCC feature extraction and compute-fbank-feats for Fbank feature extraction.

MFCCs and Filter Banks

Mel-Frequency Cepstral Coefficients (MFCCs) and Filter banks are both commonly used acoustic features in the field of speech processing and automatic speech recognition (ASR). They serve as representations of the spectral characteristics of speech signals, but they are computed differently and have distinct properties.


| MFCC | Filter bank |
| --- | --- |
| MFCCs tend to capture both spectral and timbral characteristics. | Fbank features are more focused on spectral characteristics. |
| MFCCs typically have a lower dimensionality compared to filter banks due to the application of the discrete cosine transform (DCT). This can lead to faster training and reduced computational complexity. | Filter banks capture more spectral information than MFCCs, which can be beneficial for tasks that require fine-grained spectral detail. |
| The dimensionality reduction in MFCCs can lead to a loss of fine spectral detail, which might be important for some tasks. | The higher dimensionality of filter bank features can lead to increased computational requirements and slower training, especially when used with deep neural networks. |
| MFCCs are decorrelated. | Filter bank features are highly correlated features. |

In practice, both MFCCs and filter banks have been used successfully in deep learning-based ASR systems. The choice between them often depends on empirical experimentation and domain-specific considerations. It's common for researchers to experiment with both representations and choose the one that works best for their specific ASR task and dataset.
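To make the relationship between the two feature types concrete, here is a small sketch of the last step of MFCC computation: taking the logarithm of the filter bank energies of one frame and applying a DCT, then keeping only the first few coefficients. The energy values are made up, the unnormalised DCT-II is one of several conventions, and real pipelines (for example Kaldi's compute-mfcc-feats) add steps such as pre-emphasis, windowing and liftering that are left out here.

```python
import math


def dct_ii(values, num_coeffs):
    """Unnormalised DCT-II, returning only the first num_coeffs coefficients."""
    n = len(values)
    return [
        sum(x * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i, x in enumerate(values))
        for k in range(num_coeffs)
    ]


# Toy filter bank energies for a single frame (illustrative values only).
fbank_energies = [1.2, 3.4, 5.1, 4.0, 2.2, 1.5, 0.9, 0.6]

# Fbank features are typically the log energies themselves (8 values here).
log_fbank = [math.log(energy) for energy in fbank_energies]

# MFCC-like features keep only the first few DCT coefficients (4 values here),
# which is where the lower dimensionality comes from.
mfcc_like = dct_ii(log_fbank, num_coeffs=4)
```

A perfectly flat spectrum puts all of its energy into the first coefficient, which is one way to see that the DCT summarises the overall spectral shape and discards fine detail when truncated.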

After data preparation and feature extraction, the next step is to build the acoustic model. As explained in Part 1 of this article, an acoustic model is built not only to capture the relationship between the acoustic features of speech and the corresponding linguistic units but also to deal properly with the variations and nuances that come with the physicality of speech, such as age, gender, microphone, and environmental conditions. Attention-based sequence-to-sequence models and Connectionist Temporal Classification (CTC) are some examples of such models. CTC is a technique used with encoder-only architectures, whereas sequence-to-sequence models make use of an encoder-decoder architecture for ASR and MDD tasks.

What is an Encoder-Decoder Architecture?

Encoder-Decoder architectures are neural network structures that allow the model to capture and process complex sequential information, making them well suited for tasks involving variable-length input and output sequences, such as transcribing speech or detecting mispronunciations. These architectures consist of two main components: an encoder and a decoder.


The encoder processes the input sequence and transforms it into a fixed-size representation, often referred to as a context vector. This context vector captures the essential information from the input sequence in a condensed form. This means that the model is optimized to acquire understanding from the input.


The decoder takes the context vector produced by the encoder and generates the output sequence.

The output sequence is produced step by step, with the decoder using its internal state and the context vector to generate each element of the output sequence. This means that the model is optimized for generating outputs.

In ASR and MDD systems, the encoder processes the input acoustic features (such as spectrograms or MFCCs) of a speech signal, capturing relevant information about the spoken sequence. The decoder generates the corresponding sequence of words or phonemes, translating the encoded information back into a textual representation.

Connectionist Temporal Classification

(Graves et al., 2006) introduced Connectionist Temporal Classification (CTC), a type of neural network output and associated scoring function for training recurrent neural networks (RNNs), such as LSTM networks, to tackle sequence problems where the timing is variable. CTC is a way to get around not knowing the alignment between the input and the output. It's especially well suited to applications like speech and handwriting recognition. In a CTC setup, the neural network outputs a score for every character at each time step; these scores are used to calculate the loss while training the model and to decode the corresponding text/transcriptions for image/audio files.

CTC Loss function

CTC loss is calculated on the basis of the output matrix of the neural network and the corresponding ground truth text. In the case of a speech recognition task, the Neural network outputs probability distribution for all the characters/phonemes at each time step. Next, an alignment space is created using a Dynamic programming algorithm. This space includes all possible ways of mapping output symbols to input symbols. For each input sequence, there are multiple paths through the alignment space that produce the same output sequence. The probability of each possible path is computed based on the predicted probabilities for the output symbols at each time step. The sum of the probabilities over all possible paths corresponding to the target sequence is calculated. This is the likelihood of the target sequence given the input sequence. Now a negative logarithm of the likelihood is taken to obtain the CTC loss.

The objective function is derived from the principle of maximum likelihood. That is, minimising it maximises the log-likelihoods of the target labelling. Note that this is the same principle underlying the standard neural network objective functions.

Let us understand this by an example. Consider "success" as the target transcription that needs to be predicted. The Neural network generates a probability distribution of all characters/phonemes and blank symbols for every time step. The target is to calculate the sum of all probabilities of valid CTC alignments.

CTC Alignments

CTC introduces a new symbol, the blank symbol <b>, to the existing phoneme/character set. While creating an alignment, blanks are added between repeated labels (for example, <b> is inserted between the two consecutive c's and between the two consecutive s's in the word success). Blanks are also used when there is no target label for a time step, for example during silence between words in speech. During inference, consecutive repeated characters are first merged and the blanks are then removed; this way the expected word is derived from the alignment.

Alignments are created adhering to the following criteria.

  • The alignment length should be equal to the length of the input.

  • Alignments should be monotonic, which means output at the next time step can be the same label or can advance to the next label.

  • Alignments should have a many-to-one relationship, which means one or more input elements can align to a single output element.

Any alignment that follows the above criteria and which maps to Y after merging repeats and removing blanks is allowed. Below are some of the valid CTC alignments/paths for the target word "success".

Here we are considering the input size as 16 timesteps, hence input vector X can be represented as (x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16). Now Y is (y1, y2, y3, y4, y5, y6, y7) since the target word "success" has 7 letters. This also means that 16 time steps need to be mapped to the output size of 7 hence many-to-one relationship can be formed.
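The collapsing rule and the length criterion can be sketched in a few lines of Python. The alignments below are hand-written examples for 16 time steps, and the helper names `collapse` and `is_valid_alignment` are chosen for this illustration only.

```python
from itertools import groupby

BLANK = "<b>"


def collapse(alignment):
    """Merge consecutive repeated labels, then drop the blanks."""
    merged = [label for label, _ in groupby(alignment)]
    return "".join(label for label in merged if label != BLANK)


def is_valid_alignment(alignment, target, num_steps):
    """An alignment is valid if it spans every time step and collapses to the target."""
    return len(alignment) == num_steps and collapse(alignment) == target


# Two valid alignments of "success" over 16 time steps.
valid_a = ["s", "s", "u", "u", "c", BLANK, "c", "c",
           "e", "e", "s", BLANK, "s", "s", BLANK, BLANK]
valid_b = [BLANK, "s", "u", "c", BLANK, "c", "e", "s",
           BLANK, "s", BLANK, BLANK, BLANK, BLANK, BLANK, BLANK]

# Without blanks between the repeated letters, "cc" and "ss" are merged away.
invalid = ["s", "s", "u", "u", "c", "c", "c", "c",
           "e", "e", "s", "s", "s", "s", BLANK, BLANK]
```

The third alignment collapses to "suces", which shows why the blank between repeated labels is required.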

After the alignment space is created, the probability of each alignment is calculated by multiplying the probabilities that the network assigns, at each time step, to the label the alignment uses at that step. Once the probabilities of all valid alignments have been calculated, they are summed. This sum is the likelihood of the target, and the negative logarithm of the likelihood is taken to obtain the CTC loss.

Maximum likelihood training aims to maximise the log probability of the target; in our case, the equivalent aim is to minimise the negative log probability.
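The calculation described above can be sketched in plain Python. `brute_force_likelihood` enumerates every path, exactly as in the description, which is only feasible for tiny inputs; `ctc_likelihood` computes the same sum with the standard forward (dynamic programming) recursion. The function names, the convention that label index 0 is the blank, and the toy probabilities are assumptions for this sketch; deep learning frameworks ship their own optimised CTC loss implementations that work in the log domain.

```python
import math
from itertools import groupby, product


def brute_force_likelihood(probs, target, blank=0):
    """Sum the probabilities of every path that collapses to target."""
    num_labels = len(probs[0])
    total = 0.0
    for path in product(range(num_labels), repeat=len(probs)):
        merged = [label for label, _ in groupby(path)]
        if [label for label in merged if label != blank] == list(target):
            path_prob = 1.0
            for step, label in enumerate(path):
                path_prob *= probs[step][label]
            total += path_prob
    return total


def ctc_likelihood(probs, target, blank=0):
    """Same quantity via the forward recursion over the blank-extended target."""
    extended = [blank]
    for label in target:
        extended += [label, blank]
    size = len(extended)

    # alpha[s]: total probability of partial paths ending in extended[s].
    alpha = [0.0] * size
    alpha[0] = probs[0][extended[0]]
    if size > 1:
        alpha[1] = probs[0][extended[1]]

    for step in range(1, len(probs)):
        new_alpha = [0.0] * size
        for s in range(size):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one symbol
            # Skipping a blank is allowed only between different labels.
            if s >= 2 and extended[s] != blank and extended[s] != extended[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[step][extended[s]]
        alpha = new_alpha

    # Valid paths end on the last label or on the trailing blank.
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


def ctc_loss(probs, target, blank=0):
    """Negative log-likelihood of the target (assumes a non-zero likelihood)."""
    return -math.log(ctc_likelihood(probs, target, blank))


# Toy output matrix: 4 time steps, labels 0 = blank, 1 and 2 = two characters.
toy_probs = [
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],
    [0.6, 0.1, 0.3],
    [0.1, 0.7, 0.2],
]
toy_loss = ctc_loss(toy_probs, [1, 1])
```

Both functions return the same likelihood; the forward recursion just avoids enumerating a number of paths that grows exponentially with the number of time steps.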

Attention-based Sequence-to-sequence models

Most sequence-to-sequence models use an encoder-decoder architecture for ASR and MDD tasks. The performance of a basic encoder-decoder deteriorates rapidly as the length of the input speech audio increases. When the input features are longer than the ones used in the training corpus, it becomes difficult for the neural network to cope. This happens because the encoder has to compress all the necessary information of the audio into a fixed-length vector. To address this issue, (Bahdanau et al., 2014) introduced an attention mechanism to the encoder-decoder architecture, which allows the model to focus on relevant parts of the input sequence while generating each element of the output sequence. During output generation, the model performs a soft search for a set of positions in the source sentence where the most relevant information is concentrated. The model then predicts the target word by considering the context vectors associated with these source positions along with all previously generated target words.

For each input text sentence, the encoder, a bidirectional RNN, generates a sequence of concatenated forward and backward hidden states (h1, h2, h3, . . . hN), where N is the number of words in the input text sequence. Each annotation (the encoder output for each input time step) contains information about the whole input sequence with a strong focus on the parts surrounding the i-th word of the input sequence. The context vector is computed as a weighted sum of these annotations.

Taking a weighted sum of all the annotations amounts to computing an expected annotation. The context vector is therefore the expected annotation over all the annotations with probabilities αij.

The probability αij , or its associated energy eij , reflects the importance of the annotation hj with respect to the previous hidden state si−1 in deciding the next state si and generating yi. Intuitively, this implements a mechanism of attention in the decoder. The decoder decides parts of the source sentence to pay attention to.

Now 𝜶 is a vector of dimension (N, 1) whose elements represent the weights assigned to the words in the input sequence. For example, if 𝜶 is [0.3, 0.2, 0.2, 0.3] and the input sequence is “How are you doing”, the corresponding context vector would be ci = 0.3 ∗ h1 + 0.2 ∗ h2 + 0.2 ∗ h3 + 0.3 ∗ h4, where h1, h2, h3 and h4 are the hidden states (annotations) corresponding to the words “How”, “are”, “you” and “doing” respectively.
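The weighted sum in this example can be sketched as follows. The two-dimensional annotation vectors are toy values chosen so the arithmetic is easy to follow, and the `softmax` helper shows the usual way of turning the energies eij into weights αij that sum to one; in a real model both the annotations and the energies are produced by trained networks.

```python
import math


def softmax(energies):
    """Turn raw alignment energies into attention weights that sum to 1."""
    largest = max(energies)
    exps = [math.exp(e - largest) for e in energies]  # shift for stability
    total = sum(exps)
    return [value / total for value in exps]


def context_vector(weights, annotations):
    """Weighted sum of the annotation vectors (the expected annotation)."""
    dim = len(annotations[0])
    return [
        sum(w * h[d] for w, h in zip(weights, annotations))
        for d in range(dim)
    ]


# Toy annotations h1..h4 for "How", "are", "you", "doing".
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
alpha = [0.3, 0.2, 0.2, 0.3]

# First component: 0.3*1 + 0.2*0 + 0.2*1 + 0.3*2; second: 0.2*1 + 0.2*1.
c_i = context_vector(alpha, annotations)
```

Equal energies give equal weights, so `softmax([0.0, 0.0, 0.0, 0.0])` assigns 0.25 to each word.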

Global attention and Local attention

(Luong et al., 2015) introduced the concepts of global attention and local attention. In the global attention model, all encoder hidden states must be taken into consideration, which increases computation: to obtain the final feed-forward layer, all hidden states are stacked into a matrix and multiplied by a weight matrix of matching dimensions. To avoid this problem, the local attention mechanism considers only a few hidden states.

Transformer models

The Transformer was introduced in the paper "Attention is All You Need" by Vaswani et al. in 2017, and it has since become a foundational architecture in natural language processing tasks. It is the first sequence-to-sequence model based entirely on attention, replacing the recurrent layers (RNNs) with multi-headed self-attention. The Transformer architecture can be configured both as an Encoder-Decoder model and as an Encoder-Only model depending on the task to be achieved.

Model Architecture

In the "Attention is All You Need" paper, the encoder consists of a stack of N = 6 identical layers. Each layer comprises two sub-layers: the first utilizes a multi-head self-attention mechanism, while the second employs a straightforward, position-wise fully connected feed-forward network.

CTC relies on a conditional independence assumption regarding the characters in the output sequence. While the CTC model may generate an output transcript that sounds similar to the input audio, it may not precisely match the intended transcription. For instance, consider the following comparison between the output and target transcriptions:

Target transcript: To illustrate the point, a prominent Middle East analyst in Washington recounts a call from one campaign.

Output transcript: Twoalstrait the point, a prominent midille East analyst in Washington recouncacall from one campaign.

In this example, the output transcript sounds plausible but clearly lacks correct spelling and grammar. Such issues can be addressed by combining the CTC output with a language model during decoding to enhance accuracy. This language model essentially acts as a spellchecker on top of the CTC output. A language model can be built using the text present in the corpus. In the Kaldi setup, additional software packages like IRSTLM are installed. IRSTLM serves as a language modelling tool for creating n-gram language models.

The following commands illustrate how to build a 2-gram language model using the training corpus:

-i $srcdir/lm_train.text -n 2 -o lm_phone_bg.ilm.gz
compile-lm lm_phone_bg.ilm.gz -t=yes /dev/stdout > $srcdir/

The conditional independence assumption made by CTC is not always detrimental, however, as it makes the acoustic model more robust. For example, a speech recognizer trained on phone conversations between friends might not be suitable for transcribing customer support calls due to differences in language. The flexibility of a CTC acoustic model allows for the easy substitution of a new language model when transitioning between domains.

(Li et al., 2017) and (Leung et al., 2019) use a hierarchical evaluation structure to evaluate the performance of the Mispronunciation detection model. Below is a diagram for the same.

The expected outcomes for mispronunciation detection are True Acceptance and True Rejection, while the unexpected outcomes are False Acceptance and False Rejection.

  • True Acceptance (TA) is the number of phonemes annotated and recognized as correct pronunciation.

  • True Rejection (TR) is the number of phonemes annotated and recognized as mispronunciation.

  • False Rejection (FR) is the number of phonemes annotated as a correct pronunciation but recognized as a mispronunciation.

  • False Acceptance (FA) is the number of phonemes annotated as mispronunciation but recognized as correct pronunciation.

  • Correct Diagnosis (CD) is the number of phones correctly recognized as mispronunciations and correctly diagnosed as matching the annotated phonemes.

  • Diagnosis Error (DE) is the number of phones correctly recognized as mispronunciations but incorrectly diagnosed as different from the annotated phonemes.

We focus on True Rejection cases for mispronunciation diagnosis and take into account those with Diagnostic Errors. False Rejection Rate (FRR), False Acceptance Rate (FAR) and Diagnostic Error Rate (DER) can be calculated as below.

Other metrics such as Precision, Recall and F-measure are also widely used as performance measures for mispronunciation detection.

In addition, the accuracies of mispronunciation detection and mispronunciation diagnosis are calculated as follows:
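As a sketch, the function below computes these measures from the six counts defined above, using the definitions commonly given for this hierarchical evaluation structure: FRR = FR / (TA + FR), FAR = FA / (FA + TR), DER = DE / (CD + DE), with precision and recall measured on the mispronunciation class. These definitions are assumed here, since the formulas themselves are not reproduced in the text; the function name and the example counts are made up, and the code does not guard against zero denominators.

```python
def mdd_metrics(ta, tr, fr, fa, cd, de):
    """Compute detection and diagnosis measures from phoneme-level counts."""
    precision = tr / (tr + fr)          # rejected phonemes that were truly wrong
    recall = tr / (tr + fa)             # mispronunciations that were caught
    return {
        "FRR": fr / (ta + fr),          # correct phonemes wrongly rejected
        "FAR": fa / (fa + tr),          # mispronunciations wrongly accepted
        "DER": de / (cd + de),          # detected errors diagnosed wrongly
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "detection_accuracy": (ta + tr) / (ta + fr + fa + tr),
        "diagnosis_accuracy": cd / (cd + de),
    }


# Illustrative counts only: 1000 correct and 200 mispronounced phonemes,
# with the 160 true rejections split into 120 correct and 40 wrong diagnoses.
example = mdd_metrics(ta=900, tr=160, fr=100, fa=40, cd=120, de=40)
```

With these counts, FRR is 0.1, FAR is 0.2 and DER is 0.25; recall equals 1 − FAR and diagnosis accuracy equals 1 − DER by construction.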
