
Understanding Mispronunciation Detection Systems - Part 2

Updated: Jul 9


Welcome to the second installment of the three-part series on Understanding Mispronunciation Detection Systems. This article walks through how MDD models are constructed and trained. Each step, from data preparation and feature extraction to acoustic modeling and decoding, is examined in detail, so that readers come away with the foundational elements needed to develop MDD systems.

The figure below illustrates the step-by-step process of building end-to-end (E2E) MDD systems. It starts with preparing the data and extracting the acoustic features required to train the model. The next steps involve training the acoustic and language models and decoding the results.

MDD experiments workflow

During this phase, speech data is acquired, annotated, augmented, and sorted. Each step is an individual task requiring considerable time and effort. While freely available speech corpora exist for experimentation, they often have limited utility in building MDDs for L2 English speakers. It is crucial that ASR and MDD systems are trained and tested on different speakers, and having a greater number of speakers significantly enhances the system's effectiveness.

After relevant speech data has been acquired, it needs to be annotated using tools such as PRAAT; this is essentially the stage of labelling the data. Data augmentation is then performed to increase the number of training samples, which also helps with data imbalance issues. More details about data augmentation and freely available speech datasets will be explored in the third part of the article.

In the case of speech data, apart from having audio files, a transcript and annotation files are also required. One of the popular data preparation techniques is Kaldi-style data preparation. It is designed to ensure that audio data is well-structured, aligned with text transcriptions, and ready for feature extraction and model training. It follows a consistent workflow, making it easier for researchers and practitioners to work with large-scale speech datasets while maintaining data quality and consistency. The following is the list of generic files that need to be created.

  • text - This file contains the transcripts of all the utterances present in the speech corpus. Each line should follow the format <utterance_id> <text_transcription>, where utterance_id is formed by concatenating the speaker ID and the file name of the audio sample.

  • wav.scp - This file contains the location of all the audio files present in the corpus. Each line needs to have the format <utterance_id> <full_path_to_audio_file>, where utterance_id is the same as described above and full_path_to_audio_file is the location of the file on the hard disk. Kaldi requires the audio files to be in single-channel .wav format. If the audio files are in a different format, the SoX tool is used to convert them to WAV before the features are extracted.

  • utt2spk - This is a text file that maps each audio utterance to its corresponding speaker. Each line needs to have the format <utterance_id> <speaker_id>, where utterance_id is the same as described above and speaker_id is the ID or code name given to the speaker.

  • segments - This is an optional text file, required only when the corpus provides one audio file per speaker that contains several utterances. It lists the start and end times of each utterance within the audio file, in the format <utterance_id> <file_name> <start_time> <end_time>.

The above-mentioned files can be prepared by executing Bash or Python scripts. Additionally, Kaldi offers example data preparation scripts for widely used speech corpora, including TIMIT, LibriSpeech, WSJ, and Switchboard, among others.
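As a rough illustration of such a script, the sketch below writes text, wav.scp and utt2spk in Python. The corpus layout (one folder per speaker, with a .txt transcript next to each .wav file) and the function name write_kaldi_files are assumptions made for this example, not something Kaldi prescribes. The entries are sorted by utterance ID because Kaldi's validation scripts expect sorted files.

```python
from pathlib import Path


def write_kaldi_files(corpus_dir, out_dir):
    """Write text, wav.scp and utt2spk for a corpus laid out as
    <corpus_dir>/<speaker_id>/<name>.wav with a sibling <name>.txt transcript.
    This layout is an assumption made for the sketch."""
    corpus_dir, out_dir = Path(corpus_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    entries = []
    for wav in corpus_dir.glob("*/*.wav"):
        speaker_id = wav.parent.name
        # utterance_id = speaker ID + file name, as described in the list above
        utt_id = f"{speaker_id}_{wav.stem}"
        transcript = wav.with_suffix(".txt").read_text(encoding="utf-8").strip()
        entries.append((utt_id, speaker_id, wav.resolve(), transcript))
    entries.sort(key=lambda e: e[0])  # keep all three files in the same sorted order

    with open(out_dir / "text", "w", encoding="utf-8") as text, \
         open(out_dir / "wav.scp", "w", encoding="utf-8") as scp, \
         open(out_dir / "utt2spk", "w", encoding="utf-8") as u2s:
        for utt_id, speaker_id, wav_path, transcript in entries:
            text.write(f"{utt_id} {transcript}\n")
            scp.write(f"{utt_id} {wav_path}\n")
            u2s.write(f"{utt_id} {speaker_id}\n")
    return [e[0] for e in entries]
```

Calling write_kaldi_files("corpus", "data/train") would then produce the three files in data/train; both paths are placeholders.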

After organizing audio data, transcripts, and pronunciation information according to the Kaldi-style data preparation process, we can extract acoustic features like Mel-Frequency Cepstral Coefficients (MFCCs) and Filterbank (Fbank) features from the audio files. Kaldi provides tools and scripts to perform feature extraction. The feature extraction tools are often used in combination with Kaldi's data iteration scripts. These scripts iterate through the audio files, read the audio data, and pass it through the feature extraction tools to produce feature vectors for each frame. The extracted feature vectors are saved in Kaldi's native format, often as binary files that can be efficiently read during training and decoding.

The extracted feature vectors serve as input features for training ASR and MDD models. During training, the feature vectors are paired with the corresponding transcript information to learn the relationships between acoustic features and phonetic sequences. During decoding (transcription of unseen audio), the same features are used to generate hypotheses or predictions. Kaldi provides executables like compute-mfcc-feats for MFCC feature extraction and compute-fbank-feats for Fbank feature extraction.

Mel-Frequency Cepstral Coefficients (MFCCs) and Filter banks are both commonly used acoustic features in the field of speech processing and automatic speech recognition (ASR). They serve as representations of the spectral characteristics of speech signals, but they are computed differently and have distinct properties. Following is a comparison table highlighting their differences.

| MFCC | Filter bank |
|---|---|
| MFCCs tend to capture both spectral and timbral characteristics. | Fbank features are more focused on spectral characteristics. |
| MFCCs typically have a lower dimensionality compared to filter banks due to the application of the discrete cosine transform (DCT). This can lead to faster training and reduced computational complexity. | Filter banks capture more spectral information than MFCCs, which can be beneficial for tasks that require fine-grained spectral detail. |
| The dimensionality reduction in MFCCs can lead to a loss of fine spectral detail, which might be important for some tasks. | The higher dimensionality of filter bank features can lead to increased computational requirements and slower training, especially when used with deep neural networks. |
| MFCCs are decorrelated. | Filter bank features are highly correlated. |

In practice, both MFCCs and filter banks have been used successfully in deep learning-based ASR systems. The choice between them often depends on empirical experimentation and domain-specific considerations. It's common for researchers to experiment with both representations and choose the one that works best for their specific speech processing task and dataset.
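To make the comparison above more concrete, the following sketch shows how one frame of filter bank energies can be turned into cepstral coefficients by taking the logarithm and applying a DCT, which is where the dimensionality reduction comes from. It is a simplified illustration in plain Python: the frame values, the choice of 40 filters and 13 coefficients, and the function names are assumptions, and real toolkits such as Kaldi apply further steps that are not shown here.

```python
import math


def dct_ii(values, num_coeffs):
    """Unnormalized DCT-II of `values`, keeping the first `num_coeffs` outputs."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(num_coeffs)
    ]


def mfcc_from_fbank(fbank_energies, num_ceps=13):
    """Turn one frame of (positive) Mel filter bank energies into cepstral coefficients."""
    log_energies = [math.log(e) for e in fbank_energies]
    return dct_ii(log_energies, num_ceps)


# A hypothetical frame of 40 Mel filter bank energies.
frame = [1.0 + 0.5 * math.sin(0.3 * i) for i in range(40)]
mfcc = mfcc_from_fbank(frame)
print(len(frame), "filter bank values ->", len(mfcc), "MFCCs")
```

Keeping only the first few DCT outputs discards the fast-varying part of the log spectrum, which is the loss of fine spectral detail mentioned in the comparison.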

After data preparation and feature extraction, the next step is to build the acoustic model. As explained in Part 1 of this article, an acoustic model is built not only to capture the relationship between the acoustic features of speech and the corresponding linguistic units but also to deal properly with the variations and nuances that come with the physicality of speech, such as age, gender, microphone, and environmental conditions. Attention-based sequence-to-sequence models and Connectionist Temporal Classification (CTC) are examples of such models. CTC is a technique used with encoder-only architectures, whereas sequence-to-sequence models make use of an encoder-decoder architecture for ASR and MDD tasks.

Encoder-Decoder architectures are neural network structures that allow the model to capture and process complex sequential information, making them well-suited for tasks involving variable-length input and output sequences, such as transcribing speech or detecting mispronunciations. These architectures consist of two main components: an encoder and a decoder.


The encoder processes the input sequence and transforms it into a fixed-size representation, often referred to as a context vector. This context vector captures the essential information from the input sequence in a condensed form.

This means that the model is optimized to acquire understanding from the input.


The decoder takes the context vector produced by the encoder and generates the output sequence. The output sequence is produced step by step, with the decoder using its internal state and the context vector to generate each element of the output sequence.

This means that the model is optimized for generating outputs.

Basic block diagram of Encoder Decoder Architecture

In ASR and MDD systems, the encoder processes the input acoustic features (such as spectrograms or MFCCs) of a speech signal, capturing relevant information about the spoken sequence. The decoder generates the corresponding sequence of words or phonemes, translating the encoded information back into a textual representation.

Graves et al. (2006) introduced Connectionist Temporal Classification (CTC), a neural network output and scoring function designed for training recurrent neural networks (RNNs) such as LSTM networks to handle sequence problems with variable timing. CTC addresses the challenge of unknown alignment between input and output, making it well suited to applications such as speech and handwriting recognition. The intuition of CTC is to output a single character for every frame of the input, so that the output is the same length as the input, and then to apply a collapsing function that combines sequences of identical letters, resulting in a shorter sequence.

CTC Alignments

For an input sequence X, CTC considers label sequences of the same length as X, called CTC alignments, each of which maps to the output sequence Y. This alignment process is fundamental to how CTC works, allowing it to handle sequences of different lengths and account for the timing differences between input and output elements. In the following example of a CTC alignment, the input X is represented using 8 timesteps (x1, x2, ..., x8).

Alignments are created adhering to the following criteria.

  • The length of an alignment should be equal to the length of the input.

  • Alignments should be monotonic, which means output at the next time step can be the same label or can advance to the next label.

  • Alignments should have a many-to-one relationship, which means one or more input elements can align to a single output element.

CTC introduces a new symbol, the blank symbol <b>, to the existing phoneme/character set. While creating an alignment, a blank is placed between repeated labels (for example, <b> separates the two consecutive c's and the two consecutive s's in the word success, so that they are not merged into one). Blanks are also emitted when there is no target label for a time step, for example during silence between words in speech. During inference, consecutive repeated characters are merged first and the blanks are then removed; this way the expected word is derived from the alignment.

Any alignment that follows the above criteria and which maps to Y after merging repeats and removing blanks is allowed. 
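The collapsing rule (merge repeats, then remove blanks) can be sketched in a few lines of Python. The function name ctc_collapse and the 16-step alignment below are made up for illustration; any alignment that collapses to the same target would do.

```python
from itertools import groupby

BLANK = "<b>"


def ctc_collapse(alignment):
    """Merge consecutive repeated labels, then drop the blank symbols."""
    merged = [label for label, _ in groupby(alignment)]
    return [label for label in merged if label != BLANK]


# One possible 16-step alignment for the word "success".
alignment = ["s", "s", "u", "c", "c", BLANK, "c", "e", "e",
             "s", BLANK, "s", BLANK, BLANK, BLANK, BLANK]
print("".join(ctc_collapse(alignment)))  # success
```

Without the blanks between the two c's and the two s's, the repeated letters would be merged and the result would be "suces", which is why the blank symbol is needed.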

Let us understand this by an example. Consider "success" as the target transcription that needs to be predicted. The Neural network generates a probability distribution of all characters/phonemes and blank symbols for every time step. The target is to calculate the sum of all probabilities of valid CTC alignments. Below are some of the valid CTC alignments/paths for the target word "success".

Here we are considering the input size as 16 timesteps, hence input vector X can be represented as (x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16). Now Y is (y1, y2, y3, y4, y5, y6, y7) since the target word "success" has 7 letters. This also means that 16 time steps need to be mapped to an output of length 7, hence a many-to-one relationship is formed.

After the alignment space is created, the probability of each alignment is calculated by multiplying the probabilities that the network assigns to the alignment's characters at each time step. Once the probabilities of all valid alignments have been calculated, they are summed. This sum is the likelihood of the target, and the negative logarithm of the likelihood is taken to obtain the CTC loss.
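For a very small example this computation can be carried out by brute force: enumerate every possible alignment, keep those that collapse to the target, multiply the per-step probabilities and add everything up. The sketch below does this for a made-up network output with 3 time steps and the labels {<b>, a, b}. The probabilities and function names are assumptions, and the approach is only practical for tiny inputs because the number of alignments grows exponentially.

```python
import math
from itertools import groupby, product

BLANK = "<b>"


def collapse(alignment):
    """Merge consecutive repeats, then remove blanks."""
    merged = [label for label, _ in groupby(alignment)]
    return tuple(label for label in merged if label != BLANK)


def ctc_likelihood_bruteforce(probs, target):
    """probs[t][label] is the network's probability of `label` at time step t."""
    labels = list(probs[0])
    total = 0.0
    for alignment in product(labels, repeat=len(probs)):
        if collapse(alignment) == tuple(target):
            # Probability of one alignment: product over the time steps
            total += math.prod(probs[t][label] for t, label in enumerate(alignment))
    return total


# Hypothetical network output: 3 time steps over the labels {<b>, a, b}.
probs = [
    {BLANK: 0.1, "a": 0.8, "b": 0.1},
    {BLANK: 0.6, "a": 0.2, "b": 0.2},
    {BLANK: 0.1, "a": 0.1, "b": 0.8},
]
likelihood = ctc_likelihood_bruteforce(probs, "ab")
print("likelihood:", likelihood, "CTC loss:", -math.log(likelihood))
```

For the target "ab" there are five valid alignments here (aab, abb, a<b>b, <b>ab and ab<b>), and their probabilities sum to roughly 0.672.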

CTC Loss

CTC loss is calculated using the neural network's output matrix and the corresponding ground truth text. In speech recognition, the neural network outputs probability distributions for all characters or phonemes at each time step. An alignment space is then created using a dynamic programming algorithm, encompassing all possible mappings of output symbols to input symbols. For each input sequence, multiple paths through this alignment space can produce the same output sequence. The probability of each path is computed based on the predicted probabilities for the output symbols at each time step.

The sum of the probabilities of all paths corresponding to the target sequence is then calculated, representing the likelihood of the target sequence given the input sequence. The CTC loss is obtained by taking the negative logarithm of this likelihood. The objective function is based on the principle of maximum likelihood, meaning that minimizing the CTC loss maximizes the log-likelihoods of the target labels. This principle is the same as that underlying standard neural network objective functions.
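The dynamic programming idea can be sketched with the CTC forward algorithm, which sums over all valid paths without enumerating them. The version below works on plain probabilities for readability; practical implementations work in log space to avoid underflow. The function name and the toy probabilities are assumptions for illustration.

```python
import math

BLANK = "<b>"


def ctc_loss(probs, target):
    """Negative log-likelihood of `target` under CTC, using the forward algorithm.
    probs[t][label] is the probability of `label` at time step t."""
    # Interleave blanks around the labels: "ab" -> [<b>, a, <b>, b, <b>]
    ext = [BLANK]
    for label in target:
        ext += [label, BLANK]

    # alpha[s]: total probability of all partial paths that end in ext[s]
    alpha = [0.0] * len(ext)
    alpha[0] = probs[0][ext[0]]
    if len(ext) > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * len(ext)
        for s, label in enumerate(ext):
            total = alpha[s]                  # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]         # advance by one symbol
            # Skipping a blank is allowed only between two different labels
            if s >= 2 and label != BLANK and label != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][label]
        alpha = new

    # Valid paths end in the last label or in the trailing blank
    likelihood = alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)
    return -math.log(likelihood)


# Hypothetical network output: 3 time steps over the labels {<b>, a, b}.
probs = [
    {BLANK: 0.1, "a": 0.8, "b": 0.1},
    {BLANK: 0.6, "a": 0.2, "b": 0.2},
    {BLANK: 0.1, "a": 0.1, "b": 0.8},
]
print(ctc_loss(probs, "ab"))
```

The monotonicity and many-to-one criteria are built into the three allowed transitions (stay, advance, skip a blank), so only valid paths contribute to the sum.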

The CTC Objective for a single (X, Y) pair

Hannun, “Sequence Modeling with CTC”, Distill, 2017

The CTC Objective function of a training set S for network Nw

Maximum likelihood training aims to maximise the log probabilities of the targets; equivalently, in our case the aim is to minimise the negative log probabilities.
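The two objectives captioned above can be written out as follows. This is the standard formulation from the cited CTC literature, given here on the assumption that it matches the original figures; A_{X,Y} denotes the set of valid alignments of Y for the input X, and p_t(a_t | X) is the network's output probability for label a_t at time step t.

```latex
% CTC objective for a single (X, Y) pair: sum over all valid alignments
p(Y \mid X) = \sum_{A \in \mathcal{A}_{X,Y}} \prod_{t=1}^{T} p_t(a_t \mid X)

% Objective over a training set S for a network N_w: total negative log-likelihood
O(S, N_w) = - \sum_{(X, Y) \in S} \ln p(Y \mid X)
```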

Most sequence-to-sequence models use an encoder-decoder architecture for ASR and MDD tasks. The performance of a basic encoder-decoder deteriorates rapidly as the length of the input speech audio increases. When input features are longer than those seen in the training corpus, the neural network finds it difficult to cope. This happens because the encoder has to compress all the necessary information of the audio into a fixed-length vector.

To address this issue, Bahdanau et al. (2014) introduced an attention mechanism into the encoder-decoder architecture, which allows the model to focus on relevant parts of the input sequence while generating each element of the output sequence. During output generation, the model performs a soft search for a set of positions in the source sentence where the most relevant information is concentrated. The model then predicts the target word by considering the context vectors associated with these source positions along with all previously generated target words.

For each input text sentence, the encoder, a bidirectional RNN, generates a sequence of annotations (h1, h2, h3, . . . hN), each the concatenation of a forward and a backward hidden state, where N is the number of words in the input text sequence. Each annotation hj (the encoder output for input time step j) contains information about the whole input sequence with a strong focus on the parts surrounding the j-th word of the input sequence. The context vector is computed as a weighted sum of these annotations.

The approach involves taking a weighted sum of all the annotations. This can be understood as computing an expected annotation, where expectation is over possible alignments. To explain further, let αij represent the probability that the target word yi is aligned with, or translated from, a source word xj. Consequently, the i-th context vector ci becomes the expected annotation across all the annotations, weighted by the probabilities αij.

Here α is a vector of dimension (N, 1) whose elements represent the weights assigned to the words in the input sequence.

The probability αij, or its corresponding energy eij, signifies the relevance of annotation hj in relation to the previous hidden state si-1 for determining the subsequent state si and producing yi. Essentially, this establishes an attention mechanism within the decoder. This mechanism enables the decoder to selectively focus on specific segments of the source sentence.

For example, if α is [0.3, 0.2, 0.2, 0.3] and the input sequence is “How are you doing”, the corresponding context vector is ci = 0.3 ∗ h1 + 0.2 ∗ h2 + 0.2 ∗ h3 + 0.3 ∗ h4, where h1, h2, h3 and h4 are the hidden states (annotations) corresponding to the words “How”, “are”, “you” and “doing” respectively.
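The same weighted sum can be written as a short Python sketch. The 2-dimensional annotation vectors and the energies are made-up numbers chosen only to make the arithmetic easy to follow; in a trained model the annotations come from the encoder and the weights from a softmax over the energies eij.

```python
import math


def softmax(scores):
    """Turn arbitrary scores (energies) into weights that sum to 1."""
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(weights, annotations):
    """Weighted sum of annotation vectors: c_i = sum_j alpha_ij * h_j."""
    dim = len(annotations[0])
    return [sum(w * h[d] for w, h in zip(weights, annotations)) for d in range(dim)]


# Hypothetical 2-dimensional annotations for "How", "are", "you", "doing".
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
alpha = [0.3, 0.2, 0.2, 0.3]
print(context_vector(alpha, annotations))

# Attention weights derived from made-up energies.
energies = [2.0, 1.0, 1.0, 2.0]
print(softmax(energies))
```

With these numbers the context vector is approximately [1.1, 0.4]: the first component is 0.3 ∗ 1 + 0.2 ∗ 0 + 0.2 ∗ 1 + 0.3 ∗ 2 and the second is 0.2 ∗ 1 + 0.2 ∗ 1.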

Global attention and Local attention

Luong et al. (2015) introduced the concepts of global attention and local attention. In the global attention model, input from all hidden states must be taken into consideration, which results in increased computation. This happens because all hidden states are stacked into a matrix and multiplied by a weight matrix of matching dimensions to obtain the final layer of the feed-forward connection. To avoid this cost, a local attention mechanism is used in which only a few hidden states are considered.

The Transformer was introduced in the paper "Attention is All You Need" by (Vaswani et al., 2017). Since its introduction, it has become a foundational architecture in natural language processing tasks. The Transformer is the first sequence-to-sequence model based entirely on attention. It replaces the recurrent layers (RNN) with multi-headed self-attention. The Transformer architecture can be configured both as an Encoder-Decoder model and as an Encoder-Only model depending on the task to be achieved.

The Transformer - model architecture (Vaswani et al., 2017)

Encoder Stack 

As per the "Attention is all you need" paper the encoder consists of 6 identical layers, each with two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. Residual connections and layer normalization are employed around each sub-layer. The model, including embedding layers, produces outputs of dimension d = 512 to facilitate these connections.

Decoder Stack

The decoder also consists of 6 identical layers, but each has three sub-layers: in addition to the two found in the encoder, a third sub-layer performs multi-head attention over the output of the encoder stack. Residual connections and layer normalization are employed as in the encoder. The self-attention sub-layer is also modified to prevent positions from attending to subsequent positions, ensuring that predictions for position i depend only on known outputs at positions less than i.


An attention function maps a query and a set of key-value pairs to an output, calculated as a weighted sum of the values. The weights are determined by a compatibility function of the query with the corresponding key. The attention mechanism used in the Transformer is called 'Scaled Dot-Product Attention'; it takes queries and keys of dimension dk and values of dimension dv. It computes the dot products of the queries with the keys, scales them by 1/√dk, and applies a softmax to obtain the weights on the values. The attention function is computed on packed matrices of queries, keys, and values (Q, K, V) simultaneously.
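A minimal sketch of scaled dot-product attention, softmax(QKᵀ/√dk)V, is shown below in plain Python with lists of row vectors. Real implementations use batched matrix operations on a GPU. The function name and the toy inputs are assumptions for illustration.

```python
import math


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Compatibility of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        weights = [e / sum(exps) for e in exps]
        # Output is the weighted sum of the value vectors
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
print(scaled_dot_product_attention([[0.0, 0.0]], keys, values))    # equal weights
print(scaled_dot_product_attention([[100.0, 0.0]], keys, values))  # focuses on the first key
```

A query that matches no key better than another yields the plain average of the values, while a query strongly aligned with one key returns almost exactly that key's value.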

The Transformer model introduces the concept of multi-head attention. In this model, multiple attention heads operate in parallel. This parallel operation allows the model to focus on different aspects of the relationships within the sequence. Consequently, it provides richer representations. Additionally, it allows for parallel computation of attention scores for all positions in the sequence. This capability enables efficient training on hardware like GPUs and TPUs. As a result, the Transformer model is highly scalable and faster compared to the sequential nature of Attention mentioned in (Bahdanau et al., 2014).

While CTC and attention-based models are effective for ASR and MDD tasks, they do have limitations. CTC performance tends to decline with longer input sequences, while attention models struggle in the presence of noisy audio. The weakness of the attention model stems from the absence of left-to-right constraints. These constraints are present in DNN-HMM and CTC models. The lack of such constraints makes training the encoder network challenging for proper alignments. This is especially problematic in the context of noisy data or lengthy input sequences.

The research paper (Watanabe et al., 2017) introduced a hybrid CTC/attention end-to-end ASR, effectively leveraging the strengths of both architectures in both training and decoding phases. The multi-objective learning framework is applied during training to enhance robustness and facilitate swift convergence. In the decoding process, joint decoding occurs by integrating attention-based and CTC scores within the beam search algorithm, contributing to the elimination of irregular alignments. The suggested training approach incorporates a CTC objective function as an auxiliary task to train the attention model encoder within the multi-objective learning (MOL) framework.

A schematic depiction of the hybrid CTC Attention model architecture for MDD. (Lo et al., 2020; Yan et al., 2020; Zhang et al., 2020).

The diagram above depicts the overall architecture of our framework, showcasing the shared encoder network utilized by both CTC and attention models. In contrast to the attention model, the forward-backward algorithm of CTC has the ability to ensure monotonic alignment between speech and label sequences. Therefore the framework becomes more robust in achieving accurate alignments, particularly in noisy conditions. An additional benefit of incorporating CTC as an auxiliary task is that the network is learned quickly. Instead of relying solely on data-driven attention methods for estimating desired alignments in long input sequences, the forward-backward algorithm in CTC accelerates the alignment estimation process.

The training objective using both attention and CTC loss is shown below

The CTC model loss to be minimised is defined as the negative log likelihood of the ground truth character sequence y*

The Attention model loss to be minimised is computed as follows
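The three losses referred to above are commonly written as follows in the hybrid CTC/attention literature. This is given on the assumption that it matches the original figures; λ is a tunable weight between 0 and 1, x is the input and y* is the ground truth sequence with elements y*_u.

```latex
% Multi-objective training loss: interpolation of the CTC and attention losses
\mathcal{L}_{\mathrm{MOL}} = \lambda \, \mathcal{L}_{\mathrm{CTC}} + (1 - \lambda) \, \mathcal{L}_{\mathrm{Attention}}, \qquad 0 \le \lambda \le 1

% CTC loss: negative log-likelihood of the ground truth sequence
\mathcal{L}_{\mathrm{CTC}} = - \ln P(y^{*} \mid x)

% Attention loss: each token is conditioned on the previous ground truth tokens
\mathcal{L}_{\mathrm{Attention}} = - \sum_{u} \ln P(y^{*}_{u} \mid x, y^{*}_{1:u-1})
```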

Integrating CTC and Attention in a hybrid model not only capitalizes on the alignment precision of CTC and the contextual understanding of Attention but also accelerates network training, particularly in the presence of noisy data. ESPnet is one such open source framework designed along the lines of the hybrid CTC/attention architecture; more details are presented in Part 3 of this article.

Decoding in ASR and MDD systems using Encoder-Decoder architecture is the process of transforming the encoded speech representation into a meaningful sequence of words or phonemes. This sequence represents the recognized speech content based on the input audio signal.

During decoding, the model predicts the likelihood of different tokens (phonemes, subwords, or words) at each time step. The simplest decoding algorithm selects the token with the highest probability at each step to form the final output sequence; this is called greedy decoding. In practice, however, naive greedy decoding is rarely used, because it makes choices that are locally optimal but may not be the best in hindsight. An extension to greedy decoding called beam search is therefore used for ASR, MDD and machine translation tasks. In other settings, such as text generation with LLMs, sampling methods are commonly used instead; top-k sampling, nucleus (top-p) sampling and temperature sampling are examples.

In the following figure a search tree is employed to generate the target string T = t1, t2, ... from the vocabulary V = {a, b, <eos>}, illustrating the probability associated with generating each token from the current state. In a greedy search approach, the choice of 'a' followed by 'b' takes precedence over selecting the globally most probable sequence 'ba'.

Beam search is a heuristic search algorithm in which, instead of selecting only the best token at each timestep, the k most probable candidates are retained at each step. This fixed-size memory footprint k is referred to as the beam width, by analogy with a flashlight beam that can be adjusted to be wider or narrower.

For ease of understanding, let us assume that an Encoder-Decoder outputs words instead of phones. In the initial decoding step, a softmax is calculated across the entire vocabulary, assigning probabilities to individual words. The top k options are then chosen from this softmax distribution. These initial k outputs form the search frontier, and the corresponding words are referred to as hypotheses. A hypothesis represents an output sequence, or translation-so-far, along with its associated probability.

In the following steps, the top k hypotheses are expanded iteratively by feeding each one into separate decoders. Each decoder generates a softmax over the entire vocabulary to predict the next possible token and extend the hypothesis. These k × V  hypotheses are evaluated based on P(yi|x,y<i), a score calculated as the product of the current word's probability and the probability of the path leading to it. The set of k × V hypotheses is then pruned to retain only the top k, ensuring that the search frontier never has more than k hypotheses and there are never more than k decoders.

The process iterates until an End-of-Sequence (EOS) token is generated, signifying the discovery of a complete candidate output. Upon reaching this point, the finalized hypothesis is eliminated from the search frontier, and the beam size is decreased by one. The search persists until the beam size reaches zero, ultimately yielding a set of k hypotheses.

The mentioned algorithm faces an issue where completed hypotheses may vary in length, potentially favoring shorter strings due to lower probabilities assigned by language models to longer strings. To address this, length normalization methods, such as dividing the log probability by the number of words, are commonly employed to ensure fair comparisons and mitigate the length-related bias during the decoding process.
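The procedure described above can be sketched as follows. The toy_model function is a hypothetical stand-in for a decoder: it returns made-up next-token probabilities over the vocabulary {a, b, <eos>}, chosen so that greedy decoding ends up with a less probable sequence than beam search. Length normalization here divides the log probability by the number of tokens including <eos>, which is one of several possible conventions.

```python
import math

EOS = "<eos>"


def toy_model(tokens):
    """Hypothetical next-token distribution over the vocabulary {a, b, <eos>}."""
    table = {
        (): {"a": 0.5, "b": 0.4, EOS: 0.1},
        ("a",): {"a": 0.3, "b": 0.4, EOS: 0.3},
        ("b",): {"a": 0.9, "b": 0.05, EOS: 0.05},
    }
    return table.get(tokens, {EOS: 1.0})   # force the end after two tokens


def beam_search(next_token_probs, beam_width, max_len=10):
    """Return finished hypotheses as (tokens, length-normalized log-probability)."""
    frontier = [((), 0.0)]      # unfinished hypotheses: (tokens, log-probability)
    finished = []
    k = beam_width
    for _ in range(max_len):
        if k == 0 or not frontier:
            break
        # Expand every hypothesis on the frontier by every possible next token.
        candidates = []
        for tokens, logp in frontier:
            for token, p in next_token_probs(tokens).items():
                if p > 0.0:
                    candidates.append((tokens + (token,), logp + math.log(p)))
        # Prune to the k most probable extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        frontier = []
        for tokens, logp in candidates[:k]:
            if tokens[-1] == EOS:
                finished.append((tokens, logp / len(tokens)))  # length normalization
            else:
                frontier.append((tokens, logp))
        k = beam_width - len(finished)   # the beam shrinks as hypotheses complete
    return sorted(finished, key=lambda h: h[1], reverse=True)


print(beam_search(toy_model, beam_width=1)[0][0])  # greedy: ('a', 'b', '<eos>')
print(beam_search(toy_model, beam_width=2)[0][0])  # beam:   ('b', 'a', '<eos>')
```

With a beam width of 1 the search reduces to greedy decoding and returns "a b" (probability 0.5 × 0.4 = 0.2), whereas a beam width of 2 keeps "b" alive long enough to find "b a" (probability 0.4 × 0.9 = 0.36).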

CTC relies on a conditional independence assumption regarding the characters in the output sequence and does not implicitly learn a language model over the data (unlike attention based seq-to-seq models). While the CTC model may generate an output transcript that sounds similar to the input audio, it may not precisely match the intended transcription. 

For instance, consider the following comparison between the output and target transcriptions:

Target transcript: To illustrate the point, a prominent Middle East analyst in Washington recounts a call from one campaign.


Output transcript: Twoalstrait the point, a prominent midille East analyst in Washington recouncacall from one campaign.

In this example, the output transcript sounds plausible but clearly lacks correct spelling and grammar. Such issues can be addressed by incorporating a language model into CTC decoding to enhance accuracy. This language model essentially acts as a spellchecker on top of the CTC output. A language model can be built using the text present in the corpus. In the Kaldi setup, additional software packages like IRSTLM are installed. IRSTLM serves as a language modelling tool for creating n-gram language models.

The following commands illustrate how to build a 2-gram language model using the training corpus:

  • build-lm.sh -i $srcdir/lm_train.text -n 2 -o lm_phone_bg.ilm.gz

  • compile-lm lm_phone_bg.ilm.gz -t=yes /dev/stdout > $srcdir/

The conditional independence assumption made by CTC is not always detrimental, however, as it makes the model more robust when the domain changes. For example, a speech recognizer trained on phone conversations between friends might not be suitable for transcribing customer support calls due to differences in language. The flexibility of a CTC acoustic model allows for the easy substitution of a new language model when transitioning between domains.

An encoder-decoder model functions as a conditional language model, as it inherently learns language patterns for the output domain from its training data. However, the availability of training data, which consists of speech paired with text transcriptions, may not offer enough textual content to effectively train a robust language model. Access to vast amounts of standalone text data is generally more abundant than text paired with speech. Consequently, integrating a substantial external language model can often lead to modest enhancements in model performance.

The straightforward approach involves employing beam search to obtain a final set of hypothesized sentences, often referred to as an n-best list. Subsequently, a language model is used to rescore each hypothesis in the beam. The final score combines the scores from the language model, the attention model and the CTC model.
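A rescoring step of this kind might look like the sketch below, where each hypothesis carries log-probabilities from the attention model, the CTC model and the language model. The weights, the scores and the linear way of combining them are assumptions made for illustration; in practice the weights are tuned on held-out data.

```python
def rescore(nbest, ctc_weight=0.3, lm_weight=0.5):
    """Re-rank hypotheses by a weighted sum of log-scores.
    Each hypothesis is a dict with 'text', 'att', 'ctc' and 'lm' log-probabilities."""
    def combined(h):
        return (1 - ctc_weight) * h["att"] + ctc_weight * h["ctc"] + lm_weight * h["lm"]
    return sorted(nbest, key=combined, reverse=True)


# Made-up n-best list: the second hypothesis sounds right but is not valid text,
# so the language model gives it a much lower score.
nbest = [
    {"text": "recounts a call", "att": -4.0, "ctc": -5.0, "lm": -6.0},
    {"text": "recouncacall", "att": -3.5, "ctc": -4.5, "lm": -20.0},
]
print(rescore(nbest)[0]["text"])                 # recounts a call
print(rescore(nbest, lm_weight=0.0)[0]["text"])  # recouncacall
```

Without the language model term the acoustically plausible but misspelled hypothesis wins; adding it moves the well-formed hypothesis to the top.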

This article explains the complex process of building Mispronunciation Detection Systems (MDDs). It starts by explaining the detailed steps involved in data preparation, emphasizing the importance of acquiring, annotating, and augmenting speech data to ensure the robustness of MDD models. Next, it dives into feature extraction, focusing on acoustic characteristics like Mel-Frequency Cepstral Coefficients (MFCCs) and Filter banks, which capture the spectral qualities of speech signals.

The article then explores acoustic modeling, highlighting the crucial role of models such as Attention-based Sequence-to-Sequence and Connectionist Temporal Classification (CTC). It also examines how architectures like Encoder-Decoder and Transformer can enhance the effectiveness of MDD systems. Finally, it covers the challenges and methods in decoding, explaining the process of translating encoded speech representations into meaningful sequences.

Following are links to some of the topics discussed above.

  • Boersma, Paul & Weenink, David (2024). Praat: doing phonetics by computer [Computer program]. Version 6.4.13, retrieved 10 June 2024 from

  • Watanabe, S., Hori, T., Kim, S., Hershey, J. R., & Hayashi, T. (2017, December). Hybrid CTC/Attention Architecture for End-to-End Speech Recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8), 1240–1253.

