Sharatkumar Chilaka
Understanding Mispronunciation Detection Systems - Part 1
Updated: Aug 28

Table of Contents
What kind of knowledge will be shared?
Which questions will be addressed?
What is a Pronunciation error?
What are the types of Pronunciation errors?
Automatic Speech Recognition Systems
Mispronunciation Detection Systems
Research studies using GOP based DNN-HMM systems
Research studies using End-to-end models
Abbreviations
The table below lists the full forms of some frequently used abbreviations in this article
Abbreviations | Full Forms |
---|---|
CALL | Computer Assisted Language Learning |
CAPT | Computer Assisted Pronunciation Training |
MDD | Mispronunciation Detection and Diagnosis |
ASR | Automatic Speech Recognition |
DNN | Deep Neural Network |
HMM | Hidden Markov Model |
GMM | Gaussian Mixture Model |
CNN | Convolutional Neural Network |
RNN | Recurrent Neural Network |
CTC | Connectionist Temporal Classification |
ATT | Attention architecture-based model |
GOP | Goodness of Pronunciation |
MFCCs | Mel-Frequency Cepstral Coefficients |

Speaking is the most natural way for humans to communicate with each other. As the world becomes more globalised, there is a growing demand for learning foreign languages, particularly English pronunciation. However, traditional pronunciation teaching, which involves one-on-one interaction between a student and a teacher, is often too expensive for many students. As a result, automated pronunciation teaching has become a popular area of research.
What kind of knowledge will be shared?
Through this three-part series, the reader will gain a fair understanding of how mispronunciation detection systems are built. An end-to-end mispronunciation detection algorithm for L2 English speakers based on the hybrid CTC-ATT approach will be explored, and models based on CNN-RNN-CTC and their variations will be discussed.
Which questions will be addressed?
How to improve mispronunciation detection techniques using End-to-End model architectures for L2 English speakers?
How to simplify the model-building process for mispronunciation-related systems?
This article discusses state-of-the-art research done in the field of Computer Assisted Language Learning (CALL). Mispronunciation detection is one of the core components of Computer Assisted Pronunciation Training (CAPT) systems which is a subset of CALL. Studies on automated pronunciation error detection began in the 1990s, but the development of full-fledged CAPTs has only accelerated in the last decade due to an increase in computing power and the availability of mobile devices for recording speech required for pronunciation analysis.

Early techniques were built around a posterior-likelihood measure called Goodness of Pronunciation (GOP), computed with GMM-HMM and DNN-HMM approaches. These methods are more difficult to implement than the newer ASR-based End-to-End mispronunciation detection systems. This article will explore models that use the End-to-End (E2E) approach with Connectionist Temporal Classification and an attention-based sequence decoder. Recently, these models have shown significant improvement in accurately detecting mispronunciations. The article will also compare different CNN-RNN-CTC models to help determine a better approach for developing an efficient mispronunciation detection system.
First, let's start by understanding some basic terms like what is a pronunciation error and what are the different types of pronunciation errors.
What is a Pronunciation error?
Pronunciation refers to how a phrase or word is spoken. It is difficult to determine what constitutes a 'pronunciation error' because there is no clear definition of correct or incorrect pronunciation. Instead, there is a wide range of speech styles, ranging from sounding like a native speaker to being completely unintelligible.
What are the types of Pronunciation errors?
Phonemic and Prosodic are the two major categories of pronunciation errors.
Phonemic Pronunciation Errors
Phonemic pronunciation errors involve mispronouncing phonemes, which can lead to misunderstandings or misinterpretations of words. These errors typically occur when a speaker substitutes one phoneme for another, omits a phoneme, or adds an extra phoneme. For example, confusing the sounds /b/ and /p/ in English can result in words like "bit" being pronounced as "pit" or vice versa. Phonemic errors involve mistakes related to individual speech sounds (phonemes) that can change the meaning of words.
Prosodic Pronunciation Errors
Prosodic pronunciation errors involve misinterpreting or misusing prosodic features such as stress, rhythm and intonation, which can affect the natural flow and meaning of speech. These errors might lead to sentences sounding robotic, monotone, or lacking the appropriate emphasis. For instance, if a speaker uses the wrong intonation pattern in a question, it can make the sentence sound like a statement instead. Prosodic errors involve mistakes related to the rhythmic and intonation patterns of speech, affecting the overall delivery and interpretation of spoken language.

Identifying and teaching pronunciation errors as a whole is a challenging problem. As a result, previous studies have mainly focused on phonemic and prosodic errors.
A phoneme is the smallest unit of speech audio compared with a syllable, word, or phrase. The shorter the unit, the more uncertain the judgment of pronunciation accuracy becomes.
A discussion about the purpose of learning pronunciation has reached the conclusion that the emphasis should be on students' ability to be understood rather than sounding exactly like native speakers. While it is important for advanced learners to sound more like native English speakers, it is not as crucial as being easily understood.
There are two main uses of pronunciation error detection in industries:
to measure pronunciation and
to teach pronunciation
Both applications have their own difficulties, particularly when it comes to pronunciation training.

The focus of this article is limited to mispronunciation error identification in adult read speech only. The study can also be used to create mispronunciation detection systems for children's speech data. This is especially useful to pupils belonging to rural areas, where the shortage of teachers is a major problem.
Also, the experiments are conducted on the L2-ARCTIC and TIMIT datasets, which are recorded in ideal conditions. When a CAPT system is used inside a mobile app, the recorded speech for evaluation would also contain background noise. This issue is not addressed in this article.


Speech Processing involves studying and using concepts like Signal Processing, Deep Learning, and Natural Language Processing. In simple terms, it means converting audio data from sound waves to digital format and extracting important sound features. This is done using Digital Signal Processing techniques. Then, these sound features are used to train machine learning models that can understand and predict the sounds and letters in the audio data. Deep Learning models have shown good results in this area. Finally, a Pronunciation and Language model is used to arrange the words in the correct order. The language model can predict the next word based on the previous words.
Basics of Signal Processing
Audio data analysis is the process of analyzing and working with sound signals that are recorded by digital devices like microphones. These signals are used in various applications, such as CALL systems. Here are some simple explanations of the fundamental ideas behind signal processing.
Sampling Frequency
Audio sampling is the process of converting a continuous audio signal into a series of separate values in signal processing. The sampling rate refers to how often sound waves are converted into digital form at specific intervals (for CD-quality audio, this is typically 44.1 kHz, meaning that 44,100 samples are taken per second). Each sample represents the amplitude of the wave at a particular moment in time, and the bit depth determines the level of detail in each sample, also known as the dynamic range of the signal. A 16-bit sample can take one of 65,536 possible values (for example, 0 to 65,535, or -32,768 to 32,767 for signed samples).

Sampling Frequency refers to the number of samples taken within a specific timeframe. When the sampling frequency is low, there is a greater loss of information, but it is cheaper and easier to calculate. On the other hand, a high sampling frequency results in less loss of information, but it requires more computing power and is more expensive.
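To make the sampling and quantization idea concrete, here is a minimal sketch in Python using NumPy; the sample rate, tone frequency, and duration are arbitrary choices for illustration only.

```python
# Minimal sketch of sampling and 16-bit quantization (illustrative values).
import numpy as np

sample_rate = 16_000          # samples per second (16 kHz, common for speech)
duration = 1.0                # seconds
frequency = 440.0             # Hz, an arbitrary test tone

# Sample the continuous sine wave at discrete time steps.
t = np.arange(int(sample_rate * duration)) / sample_rate
signal = 0.8 * np.sin(2 * np.pi * frequency * t)   # amplitude in [-1, 1]

# Quantize to 16-bit signed integers: 65,536 possible levels (-32,768..32,767).
quantized = np.round(signal * 32767).astype(np.int16)

print(f"{len(quantized)} samples, dtype={quantized.dtype}")
print(f"value range: {quantized.min()} to {quantized.max()}")
```

Halving the sample rate or the bit depth in this sketch shows the trade-off directly: fewer samples and coarser levels mean less data to store and process, but also more information lost.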
Audio Spectrogram
An Audio spectrogram is a visual representation of the audio signal in terms of the amplitude of frequencies over a time period. Audio spectrograms are created by applying the Fast Fourier Transform on the audio signals.
All words are made up of distinct phoneme sounds, each with its own characteristic frequency pattern (for vowels, the formant frequencies), hence spectrograms can be used to phonetically identify words spoken by humans. The figure below shows a spectrogram of the spoken words “nineteenth century”. In this figure, time is represented on the X-axis, frequency is represented on the Y-axis, and the legend on the right shows color intensity, which is proportional to amplitude intensity.

Figure 3 Spectrogram of spoken words “nineteenth century” (Wikipedia, 2021)
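As an illustration of how such a spectrogram is produced, the short sketch below applies the Short-Time Fourier Transform to an audio file. The use of the librosa library, the file name, and the window parameters are assumptions made for the example, not requirements from the article.

```python
# Minimal sketch: spectrogram via the Short-Time Fourier Transform (STFT).
import numpy as np
import librosa

# Hypothetical input file, resampled to 16 kHz.
audio, sr = librosa.load("speech.wav", sr=16000)

# Apply the FFT to short overlapping windows of the signal.
stft = librosa.stft(audio, n_fft=512, hop_length=160, win_length=400)

# Convert magnitudes to decibels, which is what spectrogram plots usually show.
spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

print(spectrogram_db.shape)   # (frequency bins, time frames)
```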
MFCCs
Mel-Frequency Cepstral Coefficients (MFCCs) are derived from the spectrogram and have proven to be more accurate in simulating the human auditory system's response. Hence MFCCs are features widely used in automatic speech and speaker recognition. They represent the envelope of the short-term power spectrum of the speech signal. The MFCCs of a signal are a small set of features (usually about 10–20) that concisely describe the overall shape of the spectral envelope.

Plot of Mel Frequency Cepstral Coefficients (MFCCs) Spectral Envelope
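A minimal sketch of extracting MFCC features is shown below; the librosa library and the specific parameter values are assumptions, with 13 coefficients chosen from the 10–20 range mentioned above.

```python
# Minimal sketch: MFCC extraction (assumed librosa API and parameter values).
import librosa

audio, sr = librosa.load("speech.wav", sr=16000)   # hypothetical input file

# 13 coefficients per frame, computed every 10 ms (hop of 160 samples at 16 kHz).
mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                             n_fft=512, hop_length=160)
print(mfccs.shape)   # (13, number of frames)
```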
Spectrograms vs MFCCs
Spectrograms provide a visual representation of frequency content over time, capturing detailed temporal dynamics and prosodic features. MFCCs, on the other hand, offer a more compact representation of the spectral characteristics of audio signals, particularly suited for tasks like speech recognition. While spectrograms are valuable for a wide range of audio analysis tasks, MFCCs are a powerful tool in speech and audio processing, especially when efficient feature extraction is crucial.
ARPABET
ARPABET is a series of phonetic transcription codes created in the 1970s by the Advanced Research Projects Agency (ARPA) as part of their Speech Understanding Research project. It uses distinct sequences of ASCII characters to reflect phonemes and allophones of General American English.
ARPABET transcriptions play a crucial role in building Automatic Speech Recognition (ASR) systems. ASR systems require large amounts of transcribed audio data for training Machine learning models. During data collection, human annotators often use ARPABET transcriptions to phonetically transcribe the spoken words in the training corpus. These transcriptions map the spoken words to their corresponding phonetic representations, capturing the pronunciation variations that occur in natural speech. During the decoding phase of ASR, acoustic models convert acoustic features from the input audio into a sequence of phonetic symbols.
Following are some examples of Phonetically transcribed words
Words | Phonetic Transcriptions |
---|---|
dog | D AO G |
cat | K AE T |
rain | R EY N |
tree | T R IY |
sun | S AH N |
elephant | EH L AH F AH N T |
language | L AE NG G W AH JH |
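To illustrate how such transcriptions are used programmatically, here is a small sketch that maps words to ARPABET phone sequences with a toy hand-built dictionary; a real system would rely on a full pronunciation lexicon such as the CMU Pronouncing Dictionary.

```python
# Minimal sketch: word-to-phone lookup with a toy ARPABET lexicon.
ARPABET_LEXICON = {
    "dog": ["D", "AO", "G"],
    "cat": ["K", "AE", "T"],
    "rain": ["R", "EY", "N"],
    "tree": ["T", "R", "IY"],
    "sun": ["S", "AH", "N"],
}

def transcribe(sentence: str) -> list[str]:
    """Convert a sentence into a flat ARPABET phone sequence."""
    phones = []
    for word in sentence.lower().split():
        # Words missing from the toy lexicon are marked as unknown.
        phones.extend(ARPABET_LEXICON.get(word, ["<unk>"]))
    return phones

print(transcribe("the dog in the rain"))
# ['<unk>', 'D', 'AO', 'G', '<unk>', '<unk>', 'R', 'EY', 'N']
```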
Phonetic transcription is not only used in training and evaluating automatic speech recognition (ASR) systems, but also for building text-to-speech (TTS) synthesis models, analyzing speech patterns, and investigating phonetic characteristics of spoken language. While ARPABET is specific to American English, similar phonetic transcription systems exist for other languages to represent their unique phonological features.
Automatic Speech Recognition Systems
A majority of the components used in speech recognition systems are also used in building mispronunciation detection systems. Hence it is essential to understand how an automatic speech recognition system works.
The task of an ASR system is to recognize the sequence of words present in the speech signal. It works by breaking down audio into individual sounds, converting these sounds into digital format, and using machine learning models to find the most probable word fit in a particular language. All phoneme sounds have different frequency patterns which can be learned by machine learning algorithms.
Speech Properties
Speech essentially has two types of properties: physical and linguistic. A speaker's age, gender, personality, and accent, along with background noise during recording, affect the way speech is produced. All these aspects combined form the physical properties of speech.
Now since there are so many variations and nuances in these physical properties of speech it is extremely hard to come up with all rules possible for speech recognition. Not only do we have to deal with the physical properties of speech, but we must deal with linguistic properties as well.
For example, consider two sentences: “I read the book last night” and “This is a red book”. Observe that the words “read” and “red” have the same pronunciation here but are interpreted differently based on their context. Language itself is complex, with nuances and variations of its own, so rules would need to cover these as well for effective speech recognition.
DNN-HMM based ASR
To properly deal with the variations and nuances that come with the physicality of speech, such as age, gender, microphone, and environmental conditions, an acoustic model is built. The acoustic model here is a Deep Neural Network and Hidden Markov Model (DNN-HMM) that takes speech features (MFCCs) as input and outputs probabilities over phonetic units, from which the transcribed text is derived. For the neural network to transcribe speech properly, it needs to be trained on huge amounts of speech data. Speech is a naturally occurring time sequence, which means a neural network that can process sequential data is required; Recurrent Neural Networks (RNNs) can be used for this purpose.
Now to deal with the linguistic aspect of the speech and inject the linguistic features into the transcriptions, a language model, pronunciation model and a rescoring algorithm are used.
The output of an acoustic model is the probability of possible phones at each timestep. Pronunciation models then map these phones onto words. Now instead of emitting the words with the highest probability as output transcript, the language model helps determine what is a more likely sentence by building a probability distribution over sequences of words it trained upon. It is used to re-score the probabilities depending on the context of the sentence. By using a language model, the linguistics properties can be injected into the output of an acoustic model and the accuracy of the transcriptions can be increased.
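The sketch below illustrates this rescoring idea with the “red”/“read” example from earlier: two competing hypotheses receive acoustic scores, and toy bigram language-model probabilities shift the decision toward the sentence that is more likely linguistically. All numbers are invented purely for illustration.

```python
# Minimal sketch of language-model rescoring with invented scores.
import math

hypotheses = {
    "this is a red book": {"acoustic_logprob": -42.0},
    "this is a read book": {"acoustic_logprob": -41.5},  # slightly better acoustically
}

# Toy bigram log-probabilities; a real LM is trained on large text corpora.
bigram_logprob = {
    ("a", "red"): math.log(0.02),
    ("red", "book"): math.log(0.05),
    ("a", "read"): math.log(0.0005),
    ("read", "book"): math.log(0.0005),
}

def lm_score(sentence: str) -> float:
    words = sentence.split()
    # Unseen bigrams get a small floor probability.
    return sum(bigram_logprob.get((w1, w2), math.log(1e-6))
               for w1, w2 in zip(words, words[1:]))

lm_weight = 1.0
for text, scores in hypotheses.items():
    combined = scores["acoustic_logprob"] + lm_weight * lm_score(text)
    print(f"{text!r}: combined log-score = {combined:.2f}")
# "this is a red book" wins once the language model is taken into account.
```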

End-to-End ASR
In a conventional ASR system, there exists an acoustic model, a language model, and a pronunciation model to map acoustic features all the way to words. Each of these modules is trained and adjusted independently with a distinct goal in mind, and errors made in one module may not be compensated for by the others. More recently it has become possible to replace this conventional setup with a single neural network. This brings advantages in model simplicity and model size, and optimization also becomes much easier. These types of models are called end-to-end models because they encompass the functionality of the conventional ASR components in one big model. Attention-based sequence-to-sequence models and Connectionist Temporal Classification (CTC) are examples of such models.
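As a concrete illustration of one of these training objectives, the sketch below computes a CTC loss over dummy per-frame log-probabilities using PyTorch; the phone inventory size, sequence lengths, and label values are all made up for the example.

```python
# Minimal sketch of the CTC objective used by end-to-end models (PyTorch).
import torch
import torch.nn as nn

num_phones = 40            # hypothetical phone inventory (CTC blank sits at index 0)
time_steps, batch, classes = 50, 1, num_phones + 1

# Pretend these are per-frame log-probabilities emitted by an acoustic encoder.
log_probs = torch.randn(time_steps, batch, classes).log_softmax(dim=-1)

# A hypothetical target phone sequence (indices into the phone inventory).
targets = torch.tensor([[7, 12, 3, 25]])
input_lengths = torch.tensor([time_steps])
target_lengths = torch.tensor([targets.shape[1]])

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```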


Mispronunciation detection systems also consist of the same components as that of an ASR system. The process of feature extraction, and acoustic modelling remains the same. Initial mispronunciation detection was ASR-based and made use of the Goodness of Pronunciation (GOP) algorithm to perform phone-level pronunciation error detection.
Goodness of Pronunciation
The aim of the GOP measure is to provide a score for each phone of an utterance. The orthographic transcription is known while computing this score. The GOP of a given phone p is defined as

GOP(p) = (1 / NF(p)) * | log( P(O(p) | p) / max over q in Q of P(O(p) | q) ) |

where Q is the set of all phone models, O(p) is the acoustic segment aligned to phone p, and NF(p) is the number of frames in O(p).
A block diagram of the resulting scoring mechanism is shown below. The front-end feature extraction converts the speech waveform into a sequence of mel frequency cepstral coefficients (MFCC). These coefficients are used in two recognition passes: the forced alignment pass and the phone recognition pass. In the forced alignment pass, the system aligns the speech waveform with the corresponding phonetic transcription. In the phone recognition pass, each phone can follow the previous one with equal probability.
Using the results obtained, individual GOP scores for each phone are calculated as per the equation above. Then, a threshold is applied to each GOP score to identify and reject phones that are badly pronounced. The specific threshold to use depends on how strict we want to be in our evaluation.
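The sketch below illustrates this thresholding step with a common posterior-based approximation of the GOP formula: per-frame probabilities of the expected phone are compared against the best-scoring phone, averaged over the aligned segment, and checked against a threshold. The posterior values and the threshold are invented for illustration.

```python
# Minimal sketch of a posterior-based GOP score with thresholding.
import numpy as np

def gop_score(frame_posteriors: np.ndarray, expected_phone: int) -> float:
    """frame_posteriors: (NF(p), |Q|) per-frame phone probabilities for the
    acoustic segment O(p) that forced alignment assigned to the expected phone."""
    eps = 1e-10
    expected = np.log(frame_posteriors[:, expected_phone] + eps)
    best = np.log(frame_posteriors.max(axis=1) + eps)
    # Average the log-ratio over the NF(p) frames; 0 means a perfect match.
    return float(np.abs(expected - best).mean())

# Hypothetical posteriors for a 4-frame segment over a 5-phone inventory.
posteriors = np.array([
    [0.10, 0.70, 0.05, 0.10, 0.05],
    [0.15, 0.60, 0.10, 0.10, 0.05],
    [0.20, 0.55, 0.10, 0.10, 0.05],
    [0.10, 0.65, 0.10, 0.10, 0.05],
])

score = gop_score(posteriors, expected_phone=2)   # the expected phone scores poorly
threshold = 1.0                                   # illustrative rejection threshold
print(f"GOP = {score:.2f} -> {'rejected' if score > threshold else 'accepted'}")
```

A higher GOP value means the expected phone explains the segment worse than some competing phone, so phones whose score exceeds the threshold are flagged as mispronounced.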

Forced Alignment
Forced alignment is a crucial step in assessing and detecting mispronunciations. A speech recognition system (often based on Hidden Markov Models or deep learning techniques) is used to perform forced alignment. This system uses the reference phonetic transcription to align the phonemes with the actual audio signal produced by the speaker.
The forced alignment process produces an alignment grid that maps each phoneme in the reference transcription to a specific time segment in the speech signal. This alignment grid provides a fine-grained mapping of when each phoneme begins and ends in the audio. The alignment information can be used to calculate scores that indicate the degree of similarity between the produced phonemes and the reference phonemes. These scores can be aggregated to provide an overall measure of pronunciation accuracy, often referred to as the "Goodness of Pronunciation" (GOP) score.
Drawbacks of GOP
GOP-based methods, however, could only provide mispronunciation detection and lacked the ability to provide a diagnosis for the mispronunciation. Extended Recognition Networks (ERNs) address this by providing diagnosis feedback for insertion, substitution, and deletion errors. An ERN extends the decoding network of the ASR with phonological rules and can thus provide diagnosis feedback by comparing the ASR output with the corresponding text prompt. However, ERN decoding networks require a large number of phonological rules, which degrades ASR performance and, in turn, the performance of MDD systems.
Recently, End-to-End ASR systems have also shown promising results for MD tasks. These methods do not use forced alignment and integrate the whole training pipeline. The conceptual simplicity and practical effectiveness of end-to-end neural networks have prompted considerable research effort into replacing conventional ASR architectures with integrated E2E modelling frameworks that learn the acoustic and language models jointly.

A great amount of research has been done on mispronunciation detection techniques for developing efficient Computer Assisted Pronunciation Training systems. A complete CAPT system would consist of methods for detecting mispronunciations and providing diagnosis feedback. Here are some important research studies that have been conducted on the topic of detecting mispronunciation. The references for the citations mentioned can be found at the end of this article.
Research studies using GOP based DNN-HMM systems
(Witt and Young, 2000) This paper explores a measure called 'Goodness of Pronunciation' (GOP) that assesses pronunciation accuracy by considering individual thresholds for each phone based on confidence scores and rejection statistics from human judges. The study incorporates models of the speaker's native language and includes expected pronunciation errors. The GOP measures are evaluated using a database of non-native speakers annotated for phone-level pronunciation errors. The results indicate that a likelihood-based pronunciation scoring metric can achieve usable performance with the enhancements. This technique is now commonly found in pronunciation assessment and identification of mispronunciation tasks.
(Lo et al., 2010) focuses on using phonological rules to capture language transfer effects and generate an extended recognition network for mispronunciation detection.
(Qian et al., 2010) proposes a discriminative training algorithm to minimize mispronunciation detection errors and diagnosis errors. It also compares handcrafted rules with data-driven rules and concludes that data-driven rules are more effective in capturing mispronunciations.
(Witt, 2012) This paper talks about the latest research in CAPT as of early 2012. It discusses all the important factors that contribute to pronunciation assessment. It also provides a summary of the research done so far. Furthermore, it gives an overview of how this research is used in commercial language learning software. The paper concludes with a discussion on the remaining challenges and possible directions for future research.
(Hu et al., 2015) This paper suggests different ways to improve the detection of mispronunciations. First, the acoustic model is refined using DNN training to better distinguish between different sounds. Then, F0 is added to the model to identify pronunciation errors caused by incorrect stress or tone. The measurement of pronunciation quality is enhanced using a DNN-HMM based system. Finally, a neural network based classifier is proposed to improve generalization. Experimental results on English and Chinese learning show the effectiveness of these approaches.
(Li et al., 2016) This paper suggests using speech attributes like voicing and aspiration to detect mispronunciation and provide feedback. It focuses on improving the detection of mispronunciations at the segmental and sub-segmental levels. In this study, speech attribute scores are used to assess the quality of pronunciation at a subsegmental level, such as how sounds are made. These scores are then combined using neural network classifiers to generate scores for each segment. This proposed framework reduces the error rate by 8.78% compared to traditional methods such as GOP, while still providing detailed feedback.
Most existing methods for detecting and diagnosing mispronunciations focus on categorical phoneme errors, where one native phoneme is replaced with another. However, they do not consider non-categorical errors. This study (Li et al., 2018) aims to improve mispronunciation detection by developing an Extended Phoneme Set in L2 speech (L2-EPS), which includes both categorical and non-categorical phoneme units. By analyzing clusters of phoneme-based phonemic posterior-grams (PPGs), L2-EPS is identified. Experimental results show that including non-categorical phonemes in L2-EPS enhances the representation of L2 speech and improves mispronunciation detection performance.
(Sudhakara et al., 2019) This study proposes a new formulation for GoP that considers both sub-phonemic posteriors and state transition probabilities (STPs). The proposed method is implemented in Kaldi, a speech recognition toolkit, and tested on English data collected from Indian speakers.
Research studies using End-to-end models
(Lo et al., 2020) This study presents a novel approach to the Mispronunciation Detection (MD) task using a hybrid CTC-Attention model. The approach combines the strengths of both models and eliminates the need for phone-level forced-alignment. Input augmentation with text prompt information is also used to customize the E2E model for the MD task. Two MD decision methods are adopted: decision-making based on recognition confidence or speech recognition results. Experiments show that this approach simplifies existing systems and improves performance. Input augmentation with text prompts shows promise for the E2E-based MD approach.
(Yan et al., 2020) Most Mispronunciation Detection and Diagnosis (MDD) methods focus on fixing categorical errors but struggle with non-categorical or distortion errors. This study uses a novel approach called end-to-end automatic speech recognition (E2E-based ASR) to improve MDD. By adding an anti-phone set to the original phone set, both categorical and non-categorical mispronunciations can be detected and diagnosed more accurately, resulting in better feedback. The study also introduces a transfer-learning approach to estimate the initial model of the E2E-based MDD system without using phonological rules. Extensive experiments on the L2-ARCTIC dataset show that the improved system outperforms existing baseline systems and pronunciation scoring methods (GOP) in terms of F1-score, with improvements of 11.05% and 27.71% respectively.
(Zhang et al., 2020) An end-to-end ASR system was initially built using the hybrid CTC/attention architecture. The system's performance was further enhanced by incorporating an adaptive parameter, resulting in good results for the automatic pronunciation error detection (APED) task for Mandarin. This new method eliminates the need for forced alignment, segmentation, and complex models, making it a convenient and suitable solution for L1-independent CAPT. The proposed system based on the improved hybrid CTC/attention architecture is comparable to the state-of-the-art DNN-HMM ASR system and has a stronger impact on F-measure metrics, which are important for the APED task.
(Feng et al., 2020) A mispronunciation detection and diagnosis (MDD) system typically consists of multiple stages, including an acoustic model, a language model, and a Viterbi decoder. To integrate these stages, a new model called SED-MDD is proposed for sentence-dependent mispronunciation detection and diagnosis. This model takes mel-spectrogram and characters as inputs and outputs the corresponding phone sequence. Experiments have shown that SED-MDD can learn the phonological rules directly from the training data, achieving an accuracy of 86.35% and a correctness of 88.61% on L2-ARCTIC, outperforming the existing model CNN-RNN-CTC.
(Fu et al., 2021) This paper presents a new text-dependent model that utilizes prior text in an end-to-end structure, similar to SED-MDD. The model aligns the audio with the phoneme sequences of the prior text using the attention mechanism, achieving a fully end-to-end system. To address the imbalance between positive and negative samples in the phoneme sequence, three simple data augmentation methods are proposed, effectively improving the model's ability to capture mispronounced phonemes. Experiments on L2-ARCTIC show that the model's best performance improved from 49.29% to 56.08% in the F-measure metric compared to the CNN-RNN-CTC model.

Witt, S.M. and Young, S.J. (2000) Phone-level pronunciation scoring and assessment for interactive language learning. Speech Communication, 30(2-3), pp. 95-108.
Lo, W.-K., Zhang, S., Meng, H. (2010) Automatic derivation of phonological rules for mispronunciation detection in a computer-assisted pronunciation training system. Proc. Interspeech 2010, 765-768, doi: 10.21437/Interspeech.2010-280
Qian, X., Soong, F.K., Meng, H. (2010) Discriminative acoustic model for improving mispronunciation detection and diagnosis in computer-aided pronunciation training (CAPT). Proc. Interspeech 2010, 757-760, doi: 10.21437/Interspeech.2010-278
Witt, Silke. (2012). Automatic Error Detection in Pronunciation Training: Where we are and where we need to go.
Hu, Wenping & Qian, Yao & Soong, Frank & Wang, Yong. (2015). Improved Mispronunciation Detection with Deep Neural Network Trained Acoustic Models and Transfer Learning based Logistic Regression Classifiers. Speech Communication. 67. 10.1016/j.specom.2014.12.008.
W. Li, S. M. Siniscalchi, N. F. Chen and C. -H. Lee, "Improving non-native mispronunciation detection and enriching diagnostic feedback with DNN-based speech attribute modeling," 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 2016, pp. 6135-6139, doi: 10.1109/ICASSP.2016.7472856.
Li, X., Mao, S., Wu, X., Li, K., Liu, X., Meng, H. (2018) Unsupervised Discovery of Non-native Phonetic Patterns in L2 English Speech for Mispronunciation Detection and Diagnosis. Proc. Interspeech 2018, 2554-2558, doi: 10.21437/Interspeech.2018-2027
Sudhakara, S., Ramanathi, M.K., Yarra, C., Ghosh, P.K. (2019) An Improved Goodness of Pronunciation (GoP) Measure for Pronunciation Evaluation with DNN-HMM System Considering HMM Transition Probabilities. Proc. Interspeech 2019, 954-958, doi: 10.21437/Interspeech.2019-2363
Lo, T.-H., Weng, S.-Y., Chang, H.-J., Chen, B. (2020) An Effective End-to-End Modeling Approach for Mispronunciation Detection. Proc. Interspeech 2020, 3027-3031, doi: 10.21437/Interspeech.2020-1605
Yan, B.-C., Wu, M.-C., Hung, H.-T., Chen, B. (2020) An End-to-End Mispronunciation Detection System for L2 English Speech Leveraging Novel Anti-Phone Modeling. Proc. Interspeech 2020, 3032-3036, doi: 10.21437/Interspeech.2020-1616
Zhang, L.; Zhao, Z.; Ma, C.; Shan, L.; Sun, H.; Jiang, L.; Deng, S.; Gao, C. End-to-End Automatic Pronunciation Error Detection Based on Improved Hybrid CTC/Attention Architecture. Sensors 2020, 20, 1809. https://doi.org/10.3390/s20071809
Feng, Y., Fu, G., Chen, Q. and Chen, K., "SED-MDD: Towards Sentence Dependent End-To-End Mispronunciation Detection and Diagnosis," ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 3492-3496, doi: 10.1109/ICASSP40776.2020.9052975.
Fu, K., Lin, J., Ke, D., Xie, Y., Zhang, J. and Lin, B. (2021) A Full Text-Dependent End to End Mispronunciation Detection and Diagnosis with Easy Data Augmentation Techniques. arXiv preprint arXiv:2104.08428. https://arxiv.org/abs/2104.08428