
Understanding Mispronunciation Detection Systems - Part 1

Updated: Jul 9


Abbreviations

The table below lists the full forms of abbreviations used frequently throughout this three-part article series.

| Abbreviation | Full Form |
| --- | --- |
| CALL | Computer Assisted Language Learning |
| CAPT | Computer Assisted Pronunciation Training |
| MDD | Mispronunciation Detection and Diagnosis |
| ASR | Automatic Speech Recognition |
| DNN | Deep Neural Network |
| HMM | Hidden Markov Model |
| GMM | Gaussian Mixture Model |
| CNN | Convolutional Neural Network |
| RNN | Recurrent Neural Network |
| CTC | Connectionist Temporal Classification |
| ATT | Attention Architecture-based Model |
| GOP | Goodness of Pronunciation |
| MFCCs | Mel-Frequency Cepstral Coefficients |
| E2E | End-to-End |






Speaking is the most natural way for humans to communicate with each other. As the world becomes more globalised, there is a growing demand for learning foreign languages, and particularly for acquiring good English pronunciation. However, traditional pronunciation teaching, which involves one-on-one interaction between student and teacher, is often too expensive for many learners. As a result, automated pronunciation teaching has become a popular area of research. This article discusses state-of-the-art research in the field of Computer Assisted Language Learning (CALL). Mispronunciation detection is one of the core components of Computer Assisted Pronunciation Training (CAPT) systems, a subset of CALL.




This three-part series provides readers with a thorough understanding of the evolution and challenges in developing Mispronunciation Detection Systems, aiding the advancement of CALL technologies.


Part 1 lays the groundwork by explaining essential terms in Speech Recognition, with a focus on Mispronunciation Detection. It covers the basics of signal processing and dives into Automatic Speech Recognition (ASR) systems, examining methods like Goodness of Pronunciation (GOP) and DNN-HMM systems. Additionally, it highlights the significance of End-to-end (E2E) architectures in Speech Recognition.


Part 2 delves into the complex workflow components, from Data Preparation to Language Modelling, revealing the nuances of each phase. It discusses Feature Extraction techniques such as MFCCs and Filter Banks. It explores Acoustic Modelling methods like Encoder-Decoder Architecture and Attention-Based seq-to-seq models, shedding light on the mechanisms driving mispronunciation detection.


Part 3 charts the progress of E2E ASR and MDD systems, exploring strategies to improve efficiency and address the challenges in the MDD domain. It introduces key evaluation metrics for MDD, including a Hierarchical Evaluation Structure, and provides insights into Speech Datasets and essential Speech processing toolkits.


This series investigates methods to refine mispronunciation detection techniques for L2 English speakers (speakers whose first language is not English), emphasizing streamlined model-building processes and the use of End-to-End architectures.




The article thoroughly explores the complexities of MDD systems, examining their components and mechanisms in detail. Mastering this expansive subject demands a working understanding of Digital Signal Processing, Natural Language Processing, and Deep Learning techniques. Each of these areas will be discussed briefly, with online reference links and research paper citations provided to facilitate a deeper understanding. Notably, the article covers popular model architectures for MDD and highlights trends in current research, drawing upon widely cited papers. The datasets analysed are widely utilised and some are publicly accessible; the focus is on well-known corpora such as L2-ARCTIC and TIMIT, which were recorded under clean, controlled conditions.





Making computers understand what people say, or transcribing speech into words, has been one of the earliest goals of computer language processing. Speech recognition has been used for a variety of tasks, such as voice assistants on smartphones and in automobiles (Siri and Alexa). It is also a great way for people with physical disabilities to use computers. In some situations, speech may be a more natural interface for communicating with software applications than a keyboard or mouse.




In the 1920s, a toy called “Radio Rex” may have been the first machine to recognize speech to some extent. It was a simple wooden dog that reacted when someone called out "Rex!" by popping out of a doghouse. It worked by responding to a specific sound frequency associated with the word "Rex." While it couldn't differentiate between different words or sounds, it showed a basic principle of speech recognition: identifying distinguishing features of a desired sound and matching them with incoming speech. Later, in the 1950s, Bell Labs introduced a Digit recognizer that could recognize digits from a single speaker, but it had limitations like being speaker-dependent and needing adjustments for each speaker.


During the 1960s and 1970s, significant advancements occurred in speech recognition research. Breakthrough techniques like the Fast Fourier Transform (FFT), Linear Predictive Coding (LPC), Dynamic Time Warping (DTW), and Hidden Markov Models (HMMs) were developed. IBM researchers in the 1970s built an early HMM-based automatic speech recognition system, pioneering continuous speech recognition on the New Raleigh Language task. By the mid-to-late 1980s, HMMs had become the primary recognition method. In the mid-1990s, Cambridge University created and released the Hidden Markov Model Toolkit (HTK), a versatile toolkit for building and manipulating HMMs that is still widely used in speech recognition research and many other applications worldwide.




Studies on automated pronunciation error detection started in the 1990s. However, the development of comprehensive CAPT systems has only accelerated in the last decade, owing to increased computing power and the widespread availability of mobile devices capable of recording speech for pronunciation analysis. Early techniques were based on a posterior-likelihood measure called Goodness of Pronunciation (GOP), built on GMM-HMM and later DNN-HMM models. These methods are more difficult to implement than the newer E2E MDD systems.


There are two main uses of pronunciation error detection in industry:

  1. to measure pronunciation quality and

  2. to teach pronunciation


Both applications have their own difficulties, particularly when it comes to pronunciation training. Let's start by understanding some basic terms: what is a phoneme, and what is a pronunciation error?




A Phoneme represents the smallest possible unit of speech audio when compared with a syllable, word, or phrase. In linguistics, phonemes are abstract representations of the sounds used in speech. They are the basic building blocks of spoken language and are used to differentiate between words. Each phoneme sound is represented by a phonetic symbol. For example, the word "cat" can be phonetically transcribed as /kæt/ using IPA symbols, where each symbol represents a specific phoneme: /k/  for the initial sound, /æ/  for the vowel sound, and /t/  for the final sound. 


Phonetic transcription is the representation of a spoken language using symbols from the International Phonetic Alphabet (IPA) or ARPABET. The IPA is a standardized system of phonetic notation that uses symbols to represent the sounds of speech in any language. Phonetic transcription provides a precise and detailed representation of how words are pronounced, capturing the specific sounds (phonemes) and their variations, including nuances of pronunciation such as accents, stress, and intonation. 



ARPABET

ARPABET and the IPA are both systems used for phonetic transcription, but they serve different purposes and differ in several ways. ARPABET is a set of phonetic transcription codes created in the 1970s by the Advanced Research Projects Agency (ARPA) as part of its speech understanding research project. It uses distinct sequences of ASCII characters to represent the phonemes and allophones of General American English.


ARPABET transcriptions play a crucial role in building Automatic Speech Recognition (ASR) systems. ASR systems require large amounts of transcribed audio data for training Machine learning models. During data collection, human annotators often use ARPABET transcriptions to phonetically transcribe the spoken words in the training corpus. These transcriptions map the spoken words to their corresponding phonetic representations, capturing the pronunciation variations that occur in natural speech. During the decoding phase of ASR, acoustic models convert acoustic features from the input audio into a sequence of phonetic symbols.


Following are some examples of phonetically transcribed words:

| Word | Phonetic Transcription |
| --- | --- |
| dog | D AO G |
| cat | K AE T |
| rain | R EY N |
| tree | T R IY |
| sun | S AH N |
| elephant | EH L AH F AH N T |
| language | L AE NG G W AH JH |
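As a minimal sketch, mapping the words of a training prompt to their ARPABET phoneme targets amounts to a lexicon lookup. The tiny hand-written dictionary below stands in for a full pronunciation lexicon such as CMUdict.

```python
# A minimal sketch of how ARPABET transcriptions serve as training targets.
# The tiny lexicon below is hand-written for illustration; real systems load
# a full pronunciation dictionary such as CMUdict.

ARPABET_LEXICON = {
    "cat": ["K", "AE", "T"],
    "dog": ["D", "AO", "G"],
    "rain": ["R", "EY", "N"],
}

def words_to_phonemes(sentence: str) -> list:
    """Map each word in a prompt to its ARPABET phoneme sequence."""
    phonemes = []
    for word in sentence.lower().split():
        if word not in ARPABET_LEXICON:
            raise KeyError(f"'{word}' missing from lexicon (out-of-vocabulary word)")
        phonemes.extend(ARPABET_LEXICON[word])
    return phonemes

print(words_to_phonemes("dog cat rain"))
# ['D', 'AO', 'G', 'K', 'AE', 'T', 'R', 'EY', 'N']
```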

Phonetic transcription is not only used in training and evaluating automatic speech recognition (ASR) systems, but also for building text-to-speech (TTS) synthesis models, analyzing speech patterns, and investigating phonetic characteristics of spoken language. While ARPABET is specific to American English, similar phonetic transcription systems exist for other languages to represent their unique phonological features.




Pronunciation refers to how a phrase or word is spoken. It is difficult to determine what constitutes a 'pronunciation error' because there is no clear definition of correct or incorrect pronunciation. Instead, there is a wide range of speech styles, ranging from sounding like a native speaker to being completely unintelligible.


It is often agreed that the emphasis should be on students' ability to be understood rather than sounding exactly like native speakers. While it is important for advanced learners to sound more like native English speakers, it is not as crucial as being easily understood.


Phonemic and prosodic errors are the two major categories of pronunciation errors.


Phonemic Pronunciation Errors

Phonemic pronunciation errors involve mispronouncing phonemes, which can lead to misunderstandings or misinterpretations of words. These errors typically occur when a speaker substitutes one phoneme for another, omits a phoneme, or adds an extra phoneme. For example, confusing the sounds /b/ and /p/ in English can result in words like "bit" being pronounced as "pit" or vice versa. Phonemic errors involve mistakes related to individual speech sounds (phonemes) that can change the meaning of words.
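As a small illustration (the phone sequences are hard-coded; a real system would obtain the produced phones from a phone recogniser), substitutions, deletions, and insertions can be identified by aligning the canonical phone sequence against what the learner actually produced:

```python
import difflib

# A small sketch of categorising phonemic errors by aligning the canonical
# phone sequence with the phones a learner actually produced.
canonical = ["B", "IH", "T"]          # "bit"
produced  = ["P", "IH", "T"]          # learner said "pit"

matcher = difflib.SequenceMatcher(a=canonical, b=produced)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == "replace":
        print(f"substitution: {canonical[i1:i2]} -> {produced[j1:j2]}")
    elif tag == "delete":
        print(f"deletion: {canonical[i1:i2]} was omitted")
    elif tag == "insert":
        print(f"insertion: {produced[j1:j2]} was added")
# Output: substitution: ['B'] -> ['P']
```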


Prosodic Pronunciation Errors

Prosodic pronunciation errors involve misinterpreting or misusing prosodic features such as stress, rhythm and intonation, which can affect the natural flow and meaning of speech. These errors might lead to sentences sounding robotic, monotone, or lacking the appropriate emphasis. For instance, if a speaker uses the wrong intonation pattern in a question, it can make the sentence sound like a statement instead. Prosodic errors involve mistakes related to the rhythmic and intonation patterns of speech, affecting the overall delivery and interpretation of spoken language.


Identifying and teaching pronunciation errors as a whole is a challenging problem. As a result, previous studies have mainly focused on phonemic and prosodic errors.



Classification of Phonemic and Prosodic Errors





Speech Processing involves studying and using concepts like Signal Processing, Deep Learning, and Natural Language Processing. In simple terms, it means converting audio data from sound waves to digital format and extracting important sound features. This is done using Digital Signal Processing techniques. Then, these sound features are used to train machine learning models that can understand and predict the sounds and letters in the audio data. Deep Learning models have shown good results in this area. Finally, a Pronunciation and Language model is used to arrange the words in the correct order. The language model can predict the next word based on the previous words.




Digital Signal Processing is the process of analyzing and working with sound signals that are recorded by digital devices like microphones. These signals are used in various CALL applications. Here are some simple explanations of the fundamental ideas behind signal processing.




The first step in digitizing a sound wave is to convert the analog representation into a digital signal. This analog-to-digital conversion has two steps: sampling and quantization. Audio sampling is the process of converting a continuous audio signal into a series of discrete values. The sampling rate refers to how often the sound wave is measured per second (for CD-quality audio this is typically 44.1 kHz, meaning that 44,100 samples are taken per second). Each sample represents the amplitude of the wave at a particular moment in time, and the bit depth determines the level of detail in each sample, also known as the dynamic range of the signal. A 16-bit sample can take one of 2^16 = 65,536 distinct values (0 to 65,535). A low sampling frequency loses more information but is cheaper and easier to compute with; a high sampling frequency preserves more information but requires more computing power and storage.
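The following toy sketch illustrates these two steps, generating a short tone and quantizing it to 16-bit samples (the signal and values are purely illustrative):

```python
import numpy as np

# A toy sketch of sampling and 16-bit quantization (illustrative values only).
sample_rate = 44_100                      # CD quality: 44,100 samples per second
duration_s = 0.01
t = np.arange(int(sample_rate * duration_s)) / sample_rate

# "Analog" signal: a 440 Hz tone with amplitude in [-1.0, 1.0]
analog = 0.8 * np.sin(2 * np.pi * 440 * t)

# Quantize to 16-bit signed integers: 2**16 = 65,536 distinct levels
quantized = np.round(analog * 32767).astype(np.int16)

print(len(t), "samples,", quantized.dtype, "range:", quantized.min(), quantized.max())
```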



Using higher sampling rates can enhance ASR accuracy. However, it's crucial to maintain consistent sampling rates between training and testing data. Similarly, when training on multiple corpora, you must downsample all corpus data to match the lowest sampling rate among them.
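As an example, downsampling 44.1 kHz audio to 16 kHz with SciPy might look like the sketch below; the waveform here is a random placeholder for a loaded recording.

```python
import numpy as np
from scipy.signal import resample_poly

# A sketch of downsampling a 44.1 kHz recording to 16 kHz so that all training
# corpora share the lowest common sampling rate. File I/O is omitted; `audio`
# stands in for a loaded waveform.
orig_sr, target_sr = 44_100, 16_000
audio = np.random.randn(orig_sr)  # one second of placeholder audio

# resample_poly resamples by a rational factor (up/down reduced by their GCD)
g = np.gcd(orig_sr, target_sr)
downsampled = resample_poly(audio, up=target_sr // g, down=orig_sr // g)
print(len(downsampled))  # ~16,000 samples
```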




An audio spectrogram is a visual representation of an audio signal showing the amplitude of its frequency components over time. Spectrograms are created by applying the Fast Fourier Transform to short, overlapping windows of the audio signal. Words are made up of distinct phoneme sounds, each with its own characteristic frequency pattern (for example, vowel formants), so spectrograms can be used to phonetically identify the words a person has spoken. The figure below shows a spectrogram of the spoken words “nineteenth century”. Time is represented on the X-axis, frequency on the Y-axis, and the legend on the right shows colour intensity, which is proportional to amplitude.


Spectrogram of spoken words “nineteenth century” (Wikipedia, 2021)
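A short sketch of computing a spectrogram with SciPy's short-time Fourier transform is shown below; a synthetic frequency sweep stands in for recorded speech.

```python
import numpy as np
from scipy.signal import spectrogram

# A sketch of computing a spectrogram. A synthetic frequency sweep stands in
# for recorded speech; in practice the waveform would be loaded from a file.
sr = 16_000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * (200 + 300 * t) * t)   # simple sweep, 1 second

# 25 ms windows with a 10 ms hop -- typical framing for speech analysis
freqs, times, power = spectrogram(audio, fs=sr, nperseg=400, noverlap=240)
print(power.shape)  # (frequency bins, time frames)
```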




Mel-Frequency Cepstral Coefficients (MFCCs) are derived from the spectrogram and have proven to better approximate the human auditory system's response. Hence, MFCCs are features widely used in automatic speech and speaker recognition. They represent the envelope of the short-term power spectrum of the speech signal. The MFCCs of a signal are a small set of features (usually about 10–20) that concisely describe the overall shape of the spectral envelope.


Plot of Mel Frequency Cepstral Coefficients (MFCCs) Spectral Envelope
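A brief sketch of extracting MFCCs with librosa follows; 13 coefficients per frame is a common choice, and the input here is synthetic noise standing in for speech.

```python
import numpy as np
import librosa

# A sketch of MFCC extraction with librosa. The audio is synthetic noise
# purely for illustration; a real pipeline would load a speech recording.
sr = 16_000
audio = np.random.randn(sr).astype(np.float32)   # stand-in for 1 s of speech

mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
print(mfccs.shape)  # (13, number of frames)
```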


Spectrograms vs MFCCs

Spectrograms provide a visual representation of frequency content over time, capturing detailed temporal dynamics and prosodic features. MFCCs, on the other hand, offer a more compact representation of the spectral characteristics of audio signals, particularly suited for tasks like speech recognition. While spectrograms are valuable for a wide range of audio analysis tasks, MFCCs are a powerful tool in speech and audio processing, especially when efficient feature extraction is crucial.





A majority of the components used in speech recognition systems are also used in building mispronunciation detection systems. Hence, it is essential to understand how an automatic speech recognition system works. The task of an ASR system is to recognize the sequence of words present in a speech signal. It works by breaking the audio down into individual sounds, converting these sounds into digital features, and using machine learning models to find the most probable word sequence in a particular language. Each phoneme sound has a distinct frequency pattern that can be learned by machine learning algorithms.




Speech encompasses two primary types of properties: physical and linguistic. Physical properties consist of factors such as a speaker’s age, gender, personality, accent, and background noise. These elements affect how speech is produced. The variations in these physical properties create significant challenges in establishing comprehensive rules for speech recognition.


In addition to physical properties, linguistic properties also play a critical role. For example, consider the sentences "I read the book last night" and "This is a red book." Although the words "read" and "red" sound similar, their meanings differ based on context. The complexity and nuances of language demand the development of exhaustive rules for effective speech recognition.


Both the physical and the linguistic properties of speech present substantial challenges. Addressing these challenges effectively is essential for building accurate speech recognition systems.




To deal properly with the variations and nuances inherent in speech, such as age, gender, accent, and background noise, an acoustic model is built. An acoustic model, commonly a Deep Neural Network–Hidden Markov Model (DNN-HMM) hybrid, takes speech features (MFCCs) as input and estimates which phonetic units were spoken, from which the transcribed text is derived. For the neural network to transcribe speech accurately, it must be trained on large datasets of speech. Given that speech is a naturally occurring time sequence, a neural network capable of handling sequential data is necessary. Recurrent Neural Networks (RNNs) are well-suited for this purpose.


Addressing the linguistic aspects of speech involves incorporating linguistic features into the transcriptions. This is achieved using a language model, a pronunciation model, and a rescoring algorithm. The acoustic model's output consists of probabilities for potential phonemes at each timestep. Pronunciation models map these phonemes to words. Instead of simply emitting the words with the highest probability, the language model constructs a probability distribution over word sequences based on its training data. This distribution helps determine the most likely sentence by rescoring the probabilities based on sentence context.


By integrating a language model, linguistic properties are embedded into the acoustic model's output, thereby enhancing the accuracy of the transcriptions.
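As a toy illustration of rescoring (all scores below are invented, not produced by any real model), consider two competing hypotheses for the "read"/"red" example above:

```python
# A toy sketch (not a real decoder) of how a language model rescores competing
# hypotheses from the acoustic and pronunciation models. All scores are invented.

# Two word-sequence hypotheses with their acoustic log-probabilities
hypotheses = {
    "this is a red book": -12.4,
    "this is a read book": -12.1,   # slightly better acoustically
}

# Toy language model scores: how plausible each word sequence is
lm_scores = {
    "this is a red book": -3.0,
    "this is a read book": -9.5,    # "a read book" is an unlikely phrase
}

lm_weight = 1.0
rescored = {h: ac + lm_weight * lm_scores[h] for h, ac in hypotheses.items()}
best = max(rescored, key=rescored.get)
print(best)   # "this is a red book" wins after language-model rescoring
```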


Block diagram of a conventional ASR system (Microsoft Research, 2017)




In a conventional ASR system, an acoustic model, a language model, and a pronunciation model work together to map acoustic features all the way to words. Each of these modules is trained and tuned independently, with its own distinct objective. Due to the sequential nature of this pipeline, errors generated in one module may therefore propagate to the others.


More recently, it has become possible to replace this conventional setup with a single neural network. This offers advantages in model simplicity and size, and optimization also becomes much easier. These models are called End-to-end (E2E) models because they fold the functionality of the conventional ASR components into one large model. Attention-based sequence-to-sequence models and Connectionist Temporal Classification (CTC) are examples of such approaches.
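Below is a minimal sketch of the CTC objective that many E2E models optimise, using PyTorch's built-in loss; the shapes and random inputs are placeholders for a real encoder's output.

```python
import torch
import torch.nn as nn

# A minimal sketch of the CTC loss used by many E2E ASR/MDD models. Shapes and
# values are arbitrary; a real model would produce log_probs from an encoder.
T, N, C = 50, 2, 30            # time steps, batch size, classes (29 phones + blank)
log_probs = torch.randn(T, N, C).log_softmax(dim=2)

targets = torch.randint(1, C, (N, 10), dtype=torch.long)   # target phone IDs
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```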



Block diagram of an end-to-end ASR system (Microsoft Research, 2017).





Mispronunciation detection and diagnosis (MDD) systems consist of the same components as an ASR system. The processes of feature extraction and acoustic modelling remain the same. MDD is, however, more challenging than automatic speech recognition. In ASR, a language model can mitigate the impact of inaccurate acoustics, producing a valid character sequence despite errors. In MDD, relying on a language model in this way can result in missed mispronunciations. Therefore, robust acoustic modelling is crucial for distinguishing between native pronunciations with standard phonetic patterns and non-native pronunciations that deviate from these norms.




The following comparison table helps draw a clear distinction between the ASR and MDD tasks.


| Automatic Speech Recognition Systems | Mispronunciation Detection Systems |
| --- | --- |
| The objective of an ASR system is to transcribe spoken language into text. | The objective of an MDD system is to identify and analyse mispronunciations. |
| The training data consists of audio paired with corresponding correct transcriptions of the spoken words. | The training data includes instances of correct pronunciations as well as various types of mispronunciations. This may involve annotations specifying the types of mispronunciations (e.g., phonemic errors and prosodic errors). |
| Acoustic modelling in ASR focuses on mapping acoustic features to linguistic units for accurate transcription. | Acoustic modelling in MDD involves recognizing acoustic patterns associated with correct and incorrect pronunciation for the purpose of mispronunciation detection. |
| The decoding process results in a final transcription of the spoken words, representing the system's best estimate of the spoken language given the observed acoustic features. | The decoding process results in an assessment of the pronunciation quality, indicating whether it is likely to be correct or whether there are potential mispronunciations. |
| Evaluation metrics include word error rate (WER), phoneme error rate (PER), and other transcription accuracy measures. | Evaluation metrics focus on the system's ability to correctly identify and classify mispronunciations, which may involve FRR, FAR, DER, Precision, Recall, F1 score and other metrics. |
| Widely used in applications such as voice assistants, transcription services, and voice command recognition. | Applied in language learning platforms, pronunciation assessment tools, and educational software to provide feedback on and help improve a user's pronunciation skills. |


ASR and MDD both face challenges with acoustic modeling and diverse linguistic variations. However, MDD is more complex because it involves the detailed assessment of pronunciation accuracy. Since pronunciation is subjective, MDD requires distinguishing between acceptable variations and actual mispronunciations. This highlights the need for meticulously curated training data, expert annotation, and a deep understanding of linguistic norms to develop effective MDD models.
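As a brief aside, the word error rate mentioned in the comparison above is computed from a word-level edit distance between the reference and the hypothesis; applying the same routine to phone sequences gives the phoneme error rate. A minimal sketch:

```python
# A small sketch of word error rate (WER), computed with a standard
# edit-distance (Levenshtein) dynamic programme.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j
    # hypothesis words (substitutions, deletions, insertions all cost 1)
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("i read the book last night", "i red the book night"))
# 2 edits over 6 reference words -> 0.333...
```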




Initial mispronunciation detection was ASR-based and made use of the Goodness of Pronunciation (GOP) algorithm to perform phone-level pronunciation error detection. The goal of the GOP measure is to assign a score to each phone in an utterance. This scoring process assumes that the orthographic transcription is known and that a set of HMMs is available to determine the likelihood of different phonemes given acoustic features. 


Given these assumptions, for each phoneme p in the transcription, the algorithm computes the likelihood of the acoustic segment matching phoneme p. This involves calculating the posterior probability P(p|acoustic segment). The GOP score for each phoneme is then obtained by taking the absolute log posterior probability P(p|O^(p)), normalised by the duration of the segment. A simplified equation proposed in (Witt and Young, 2000) is presented below.
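In LaTeX form (with O^(p) denoting the acoustic segment aligned to phone p), the simplified measure can be written as:

```latex
\mathrm{GOP}(p)
  = \frac{\left|\log P\!\left(p \mid O^{(p)}\right)\right|}{NF(p)}
  \approx \frac{1}{NF(p)}
    \left|\log \frac{p\!\left(O^{(p)} \mid p\right)}
                    {\max_{q \in Q} p\!\left(O^{(p)} \mid q\right)}\right|
```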

The basic GOP measure equation (Witt and Young, 2000)


where Q is the set of all phone models and NF(p) is the number of frames in the acoustic segment O^(p).


A block diagram of the resulting scoring mechanism is shown below. The front-end feature extraction converts the speech waveform into a sequence of mel frequency cepstral coefficients (MFCC). These coefficients are used in two recognition passes: the forced alignment pass and the phone recognition pass. In the forced alignment pass, the system aligns the speech waveform with the corresponding phonetic transcription. In the phone recognition pass, each phone can follow the previous one with equal probability.


Using these results, an individual GOP score for each phone is calculated as per the equation above. A threshold is then applied to each GOP score to identify and reject phones that are badly pronounced. The specific threshold to use depends on how strict the evaluation needs to be.


Block-diagram of the pronunciation scoring system: phones whose scores are above the predefined threshold are assumed to be badly pronounced and are therefore rejected (Witt and Young, 2000).
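The sketch below illustrates the per-phone scoring and thresholding step in the style of later DNN-based GOP variants, which work directly from frame-level phone posteriors rather than the two-pass likelihood ratio above. All posteriors, phone indices, and the threshold are fabricated for illustration.

```python
import numpy as np

# A simplified sketch of phone-level GOP scoring. `frame_posteriors` would
# come from an acoustic model and `segments` from forced alignment; both are
# fabricated here, as are the phone-model indices and the threshold.
phones = ["K", "AE", "T"]                      # canonical phones for "cat"
segments = [(0, 8), (8, 20), (20, 26)]         # frame ranges from forced alignment
num_frames, num_phone_models = 26, 40
rng = np.random.default_rng(0)
frame_posteriors = rng.dirichlet(np.ones(num_phone_models), size=num_frames)
phone_to_index = {"K": 5, "AE": 11, "T": 29}   # made-up model indices

threshold = 4.0                                # tuned on annotated data in practice
for phone, (start, end) in zip(phones, segments):
    seg = frame_posteriors[start:end]
    # duration-normalised absolute log posterior of the canonical phone
    gop = abs(np.log(seg[:, phone_to_index[phone]]).sum()) / (end - start)
    verdict = "rejected (badly pronounced)" if gop > threshold else "accepted"
    print(f"{phone}: GOP={gop:.2f} -> {verdict}")
```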



What is Forced Alignment?

Forced alignment is a crucial step in assessing and detecting mispronunciations. A speech recognition system (often based on Hidden Markov Models or deep learning techniques) is used to perform forced alignment. This system uses the reference phonetic transcription to align the phonemes with the actual audio signal produced by the speaker.


The forced alignment process produces an alignment grid that maps each phoneme in the reference transcription to a specific time segment in the speech signal. This alignment grid provides a fine-grained mapping of when each phoneme begins and ends in the audio. The alignment information can be used to calculate scores that indicate the degree of similarity between the produced phonemes and the reference phonemes. These scores can be aggregated to provide an overall measure of pronunciation accuracy, often referred to as the "Goodness of Pronunciation" (GOP) score.



Drawbacks of GOP

GOP-based methods, however, could only detect mispronunciations and lacked the ability to provide a diagnosis. Extended Recognition Networks (ERNs) address this by providing diagnostic feedback for insertion, substitution, and deletion errors. They do so by extending the ASR decoding network with phonological rules, so that diagnosis feedback can be derived by comparing the ASR output with the corresponding text prompt. However, ERN decoding networks require a large number of phonological rules, which degrades ASR performance and in turn leads to poor MDD performance.
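As an illustration (the rules and phones below are invented examples of common L1 transfer errors, not taken from the cited papers), phonological rules expand a canonical phone sequence into the alternative paths that make up an extended recognition network:

```python
from itertools import product

# An illustrative sketch of how phonological substitution rules expand a
# canonical phone sequence into the alternative paths of an extended
# recognition network. Rules and phones are invented examples.
rules = {
    "TH": ["TH", "S", "T"],   # e.g. "think" realised as "sink" or "tink"
    "V":  ["V", "W"],         # v/w confusion
}

def expand(canonical):
    """Yield every pronunciation path allowed by the substitution rules."""
    options = [rules.get(p, [p]) for p in canonical]
    for path in product(*options):
        yield list(path)

for path in expand(["TH", "IH", "NG", "K"]):   # "think"
    print(path)
# ['TH', 'IH', 'NG', 'K'], ['S', 'IH', 'NG', 'K'], ['T', 'IH', 'NG', 'K']
```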


Recently, E2E ASR systems have also shown promising results for MDD tasks. These methods do not use forced alignment and integrate the whole training pipeline. The conceptual simplicity and practical effectiveness of E2E neural networks have prompted considerable research into replacing conventional ASR architectures with integrated E2E modelling frameworks that learn the acoustic and language models jointly.




A great amount of research has been done on mispronunciation detection techniques for developing efficient CAPT systems. A complete CAPT system consists of methods for detecting mispronunciations and for providing diagnostic feedback. Here are some important research studies on detecting mispronunciations using DNN-HMM and GOP approaches. Research articles using End-to-End MDD architectures will be discussed in Part 3 of this series.


  • (Witt and Young, 2000) This paper explores a measure called 'Goodness of Pronunciation' (GOP) that assesses pronunciation accuracy by considering individual thresholds for each phone based on confidence scores and rejection statistics from human judges. The study incorporates models of the speaker's native language and includes expected pronunciation errors. The GOP measures are evaluated using a database of non-native speakers annotated for phone-level pronunciation errors. The results indicate that a likelihood-based pronunciation scoring metric can achieve usable performance with the enhancements. This technique is now commonly found in pronunciation assessment and identification of mispronunciation tasks.


  • (Lo et al., 2010) focuses on using phonological rules to capture language transfer effects and generate an extended recognition network for mispronunciation detection.


  • (Qian et al., 2010) proposes a discriminative training algorithm to minimize mispronunciation detection errors and diagnosis errors. It also compares handcrafted rules with data-driven rules and concludes that data-driven rules are more effective in capturing mispronunciations.


  • (Witt, 2012) This paper talks about the latest research in CAPT as of early 2012. It discusses all the important factors that contribute to pronunciation assessment. It also provides a summary of the research done so far. Furthermore, it gives an overview of how this research is used in commercial language learning software. The paper concludes with a discussion on the remaining challenges and possible directions for future research.


  • (Hu et al., 2015) propose several methods to enhance mispronunciation detection. They refine the GMM-HMM acoustic model using DNN training to improve discrimination and incorporate F0 into the DNN-based model for detecting errors related to lexical stress and tone, particularly beneficial for L2 learners. Additionally, they enhance the GOP measure in the DNN-HMM system to assess non-native pronunciation more effectively with native speakers' models. Introducing a neural network-based logistic regression classifier further streamlines classification and boosts generalization. Experimental results on English and Chinese datasets confirm the effectiveness of their approaches.


  • (Li et al., 2016) suggests using speech attributes like voicing and aspiration to detect mispronunciation and provide feedback. It focuses on improving the detection of mispronunciations at the segmental and sub-segmental levels. In this study, speech attribute scores are used to assess the quality of pronunciation at a subsegmental level, such as how sounds are made. These scores are then combined using neural network classifiers to generate scores for each segment. This proposed framework reduces the error rate by 8.78% compared to traditional methods such as GOP, while still providing detailed feedback.


  • (Li et al., 2017) This paper investigates MDD using multi-distribution DNNs. It begins by building a traditional acoustic model with a DNN to align canonical pronunciations with L2 English speech. Then, an Acoustic Phonological Model (APM) is constructed, also using a DNN. The input features for the APM include MFCC features, assumed to have a Gaussian distribution, and binary values representing the corresponding canonical pronunciations. This model implicitly learns phonological rules from canonical productions and annotated mispronunciations in the training data. Additionally, the APM captures the relationship between phonological rules and related acoustic features. In the final step, a 5-gram phone-based language model is used with the DNN during Viterbi decoding. This paper is also considered one of the baseline works for MDD tasks.


  • (Sudhakara et al., 2019) This study proposes a new formulation for GOP that considers both sub-phonemic posteriors and state transition probabilities (STPs). The proposed method is implemented in Kaldi, a speech recognition toolkit, and tested on English data collected from Indian speakers.





In this article, the evolution and intricacies of Computer Assisted Pronunciation Training (CAPT) systems, specifically focusing on Mispronunciation Detection and Diagnosis (MDD), are thoroughly explored. Beginning with a historical overview of Speech Recognition, the article traces its development and introduces the relevance of CAPT in language learning. Essential concepts in Speech Processing, like Audio Sampling and Mel-Frequency Cepstral Coefficients (MFCCs), are explained to lay the groundwork for understanding Mispronunciation Detection methodologies. The functioning of ASR systems is explained, emphasizing components such as Acoustic Models and Language Models. By drawing parallels between ASR and MDD systems, unique challenges in mispronunciation detection are discussed, alongside methodologies like Goodness of Pronunciation (GOP) and recent advancements in End-to-End ASR models. Through the exploration of research studies, the article also highlights the ongoing efforts to enhance MDD accuracy. Ultimately, it equips readers with a comprehensive understanding of MDD systems' construction, functioning, and challenges, underscoring their significance in advancing Computer Assisted Language Learning (CALL) technologies for effective pronunciation teaching.





  • Witt, S., & Young, S. (2000). Phone-level pronunciation scoring and assessment for interactive language learning. Speech Communication, 30(2–3), 95–108. https://doi.org/10.1016/s0167-6393(99)00044-8

  • Lo, W. K., Zhang, S., & Meng, H. (2010). Automatic derivation of phonological rules for mispronunciation detection in a computer-assisted pronunciation training system. https://doi.org/10.21437/interspeech.2010-280

  • Qian, X., Soong, F. K., & Meng, H. (2010). Discriminative acoustic model for improving mispronunciation detection and diagnosis in computer-aided pronunciation training (CAPT). https://doi.org/10.21437/interspeech.2010-278

  • Witt, S. (2012). Automatic Error Detection in Pronunciation Training: Where we are and where we need to go.

  • Hu, W., Qian, Y., Soong, F. K., & Wang, Y. (2015). Improved mispronunciation detection with deep neural network trained acoustic models and transfer learning based logistic regression classifiers. Speech Communication, 67, 154–166. https://doi.org/10.1016/j.specom.2014.12.008

  • Li, W., Siniscalchi, S. M., Chen, N. F., & Lee, C. H. (2016). Improving non-native mispronunciation detection and enriching diagnostic feedback with DNN-based speech attribute modeling. https://doi.org/10.1109/icassp.2016.7472856

  • Li, K., Qian, X., & Meng, H. (2017). Mispronunciation Detection and Diagnosis in L2 English Speech Using Multidistribution Deep Neural Networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(1), 193–207. https://doi.org/10.1109/taslp.2016.2621675

  • Sudhakara, S., Ramanathi, M. K., Yarra, C., & Ghosh, P. K. (2019). An Improved Goodness of Pronunciation (GoP) Measure for Pronunciation Evaluation with DNN-HMM System Considering HMM Transition Probabilities. https://doi.org/10.21437/interspeech.2019-2363



Online reference links


Following are some popular and informative links on ASR that will give you a solid understanding of Speech Recognition.


  • In this Stanford University YouTube video, Navdeep Jaitly confidently asserts that end-to-end models are superior for ASR tasks. He provides a concise overview of CTC and attention-based networks.



  • Here is the Hugging Face Audio course link. It offers a comprehensive overview of ASR technologies and provides detailed tutorials for building audio applications.



