Jinadu Olayinka   and    Johnson Olanrewaju

Department of Computer Science, Rufus Giwa Polytechnic, Owo





Language modeling for conversational speech with a neural network model is a challenging task owing to unconstrained speaking style, frequent grammatical errors, hesitations, start-overs and other variability associated with audio signal transcriptions. These factors make speech language modeling inadequate, because collecting large amounts of frame-based training data with detailed descriptions remains very costly. This paper therefore proposes a new method of language modeling to capture the acoustic knowledge required by neural networks to map speech text to speakers. The speaker identification system (SIS), implemented in C++, uses data abstraction to adapt to many speakers under various conditions and so achieve speaker independence. This increases robustness and reduces the out-of-vocabulary (OOV) rate significantly. An accelerated training algorithm was implemented to recognize spoken words, achieving more efficient supervised learning than frame-based speech data allows. The developed system is applicable wherever non-audio communication is desired.


Keywords: Neural network, Hidden Markov model, Speech recognition, Perplexity, Adaptation



In most existing speech recognisers, the voice (audio) signal is used as the training set. The speech signal is first captured by a microphone or other transducer and converted into an electrical signal, then sampled at some frequency to store a transmittable, finite number of amplitudes. The signal is then quantized into a discrete number of bits per sample, grouped into frames. The neural network is trained with this ample data to achieve the required classification (Jinadu, 2010). Although most such systems can recognize a large number of words, they are speaker dependent, costly and can only recognize isolated words. This is a threat to secured communication in information and communication technologies.


These frame-based speech recognisers are also adversely affected by language constraints, including pronunciation variations caused by dialect (intonation) and the confusability of the E-set. It has also been demonstrated in Fosier-Lussier (1999) and Ken Chen (2002) that auxiliary factors such as stress, syllabification, syntax and prosody all affect pronunciation. In all, speech recognizers have always been characterized by adverse conditions of mismatch, environmental variability and high cost. Usually, these neural networks map noisy-speech hidden Markov models (HMMs) to clean HMMs to achieve noisy speech recognition (Furui et al., 2002). To achieve a better word-level recognition rate, Frank (2006) modeled speaker-dependent and adaptive models for each speaker with triphone left-right HMMs with Gaussian mixtures, decoded with the Viterbi algorithm on a lexical-tree structure augmented with a context-free grammar. That system was equally speaker dependent. Owing to these inadequacies in existing recognisers, seeking a language model (LM) that accurately accounts for pronunciation variation while providing speaker independence became an important research concern.


Similarly, the discrete word indices representing probability distributions used in modeling speech data are not smooth, and this limitation of the n-gram LM results in very long training times (Morgan and Scofield, 1991). To achieve the required robustness, Schwenk and Gauvain (2002) projected the word indices onto a continuous space to estimate a better generalization to unknown n-grams. This only increased the vocabulary size; long training time remained an issue.


This paper proposes the use of an enumerated training-set LM for the neural network to identify speakers in real-time mode. The developed system achieves supervised learning, implemented via the training set, to provide the required mapping for effective classification. This expert system is designed for use in environments where non-audible real-time communication is desired between users who cannot receive audibly spoken words, especially on high-bandwidth networks.



The models

The architecture of the neural network LM for a standard fully connected multi-layer perceptron (MLP), shown in Figure 1, consists of the input, projection, hidden and output layers. The inputs to the neural network are the word indices of the n-1 previous words in the vocabulary, represented as

h_j = w_{j-n+1}, ..., w_{j-2}, w_{j-1}                (2.1)

where h_j denotes the context of the j-th word and P is the size of one projection. H and N are the sizes of the hidden and output layers respectively, and the outputs are the posterior probabilities of all words in the vocabulary. A single output is modeled by equation 2.2, with w_j representing the current word:

P_i = P(w_j = i | h_j)                (2.2)


Fig. 1: Architecture of the neural network language model (input word indices feeding the projection layer, hidden layer and output layer)

            Adapted from Schwenk and Gauvain (2002)


The value of each output neuron P_i corresponds directly to the probability P(w_j = i | h_j) of the i-th word in the vocabulary.


The neural network simultaneously learns the projection of the words onto a continuous space, and the LM (bigram, trigram, etc.) probability estimation is easily achieved. The aim is to demonstrate improved perplexity using the new LM over the n-gram approach commonly used with frame-based acoustic models. When word lists are used, as characterized by the 'enumerated' data structure, the realized size of the output layer is effectively smaller than the size of the vocabulary. This is the basis for improved perplexity.


Perplexity (PP), the average branching factor of a speech recogniser, is defined in Schwenk and Gauvain (2002) as the approximate number of words that could follow a word in the grammar of a sentence. In computing PP, the LM is applied to the training data provided to predict the next probable words. Ken Chen (2002) and Graeme (2004) both confirmed that restricted vocabularies minimise the perplexity of training data, because computing probabilities for a full vocabulary is very expensive. With a flexible vocabulary, PP improves compared to a frame-based speech vocabulary, and lower perplexity means better recognition accuracy (Graeme, 2004). The learning rate is therefore accelerated, because the flexible LM implemented here improves the perplexity.
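As a minimal illustration of how perplexity is computed in practice, the C++ sketch below takes the per-word probabilities an LM assigns to a test text and returns the exponential of the average negative log-probability; the function name and data are illustrative, not part of the SIS.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Perplexity is the exponential of the average negative log-probability
// the language model assigns to each word of the test data. A lower
// value means the model is less "surprised" by the data.
double perplexity(const std::vector<double>& wordProbs) {
    double logSum = 0.0;
    for (double p : wordProbs)
        logSum += std::log(p);                 // accumulate log P(w_i | h_i)
    return std::exp(-logSum / wordProbs.size());
}
```

A uniform model over a vocabulary of size V assigns every word probability 1/V, giving a perplexity of exactly V, which matches the "average branching factor" reading above.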


A second-order HMM is used to capture the dynamic acoustic features of spoken words. This 3-state HMM generates b, a finite set called the observation probabilities.


The dictionary and the transcription of the LM

Structurally, the neural network consists of a weighted sum of inputs and an activation function characterized by a nonlinear, memoryless equation: the outputs are a function only of the current inputs. The activation function performs the threshold function.


Also, the trigram language model approximates the original probability of any utterance U = w_1, w_2, ..., w_m as

P(U) ≈ Π_{i=1..m} P(w_i | w_{i-2}, w_{i-1})

and the HMM of the acoustic model generates the observation b as a finite set called the observation probabilities

b_j(o_t) = P(o_t | q_t = j)

which, for the second-order Markov model with transitions P(q_t | q_{t-1}, q_{t-2}), is equivalent to conditioning each observation on the two preceding states.
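The trigram product over an utterance can be sketched as below; the table of trigram probabilities and the small floor assigned to unseen trigrams are illustrative assumptions (a real model would apply smoothing).

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// P(U) under the trigram approximation: the product over the utterance
// of each word's probability given its two predecessors. Trigrams
// missing from the table receive a small floor probability.
double utteranceProbability(
    const std::vector<std::string>& words,
    const std::map<std::tuple<std::string, std::string, std::string>, double>& trigram) {
    double p = 1.0;
    for (size_t i = 2; i < words.size(); ++i) {
        auto it = trigram.find(std::make_tuple(words[i - 2], words[i - 1], words[i]));
        p *= (it != trigram.end()) ? it->second : 1e-6;  // floor for unseen trigrams
    }
    return p;
}
```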


With the feature extraction statistically implemented using the observation sequence O = o_1, o_2, ..., o_T to obtain the probability of an utterance from the total weighted input (combining the previous and current state inputs), the most probable utterance is given as

Û = argmax_U P(U) P(O | U)

and the n vectors obtained from the input words (presented via the keyboard) are known in the training set. Since the distribution is given in the word list, the feature vector of the 'enumerated' enlisted words of the dictionary is thereby obtained.
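A minimal sketch of the 'enumerated' word list, assuming an illustrative four-word vocabulary: C++ binds each enlisted word to an integer index at compile time, and that index serves directly as the word's feature value.

```cpp
#include <cassert>
#include <string>

// The 'enumerated' dictionary: each enlisted word is bound to an
// integer index, so a word's feature value is simply its position in
// the vocabulary. The words themselves are illustrative.
enum Vocabulary { HELLO, WORLD, SPEAKER, IDENTIFY, OOV };

// Map an input string to its enumerated index; anything not enlisted
// falls through to the out-of-vocabulary (OOV) entry.
Vocabulary lookup(const std::string& word) {
    if (word == "hello")    return HELLO;
    if (word == "world")    return WORLD;
    if (word == "speaker")  return SPEAKER;
    if (word == "identify") return IDENTIFY;
    return OOV;
}
```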


The convergence between the language-model and acoustic-model equations above achieves parallelism. The product in the trigram model is used because the words in an utterance U are linked and concatenated to form sentences, while the statistical LM reduces the search space as the input data is fed to the system via the keyboard interface. The back-propagation (BP) algorithm is the basic and most effective weight-updating method used by the multi-layer feed-forward neural network model in performing the specific computing task of classification. The speech text is finally recognized as the concatenated words of the utterance.



The tree library function initializes the search on the training set, implemented by the buildTreeSearch function. This represents the search engine, the Viterbi algorithm, which finds the best hypothesized words from the dictionary using the test utterance supplied via the user interface. The output is finally displayed by the printf function after the declared end-of-utterance function END is called.
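A much-simplified sketch of the tree-based dictionary search follows: the real engine scores hypotheses with the Viterbi algorithm, whereas here each token of the test utterance is merely looked up in a balanced search tree. buildTreeSearch follows the name used above; everything else is illustrative.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Initialise the search structure over the training set. std::set is
// tree-based (a red-black tree), so each lookup follows a path from
// the root, like the tree library function described above.
std::set<std::string> buildTreeSearch(const std::vector<std::string>& dictionary) {
    return {dictionary.begin(), dictionary.end()};
}

// Hypothesise each token of the test utterance against the dictionary
// tree; words not found are mapped to the "OOV" output. The end of
// utterance is reached when the token stream is exhausted.
std::vector<std::string> recognise(const std::set<std::string>& tree,
                                   const std::string& utterance) {
    std::vector<std::string> result;
    std::istringstream in(utterance);
    std::string token;
    while (in >> token)
        result.push_back(tree.count(token) ? token : "OOV");
    return result;
}
```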



Since the acoustic and language models feature as a form of Bayesian classifier, which is optimal for speaker identification or voice recognition systems, the probability of error is minimal and robustness is achieved in the neural network via the feed-forward architecture. With the parallel structure, the system stores the words of the vocabulary and at the same time conducts a search through a complex non-linear mapping.


Using the 'enumeration' capability of C++, the enlisted words, declared as strings, are converted into integers and training is implemented via the back-propagation algorithm, with the weight update defined as

w_i ← w_i + η (t - y) x_i

where x = (x_1, ..., x_n) is the input vector, w = (w_1, ..., w_n) the weight vector and Σ w_i x_i the weighted sum; the error (t - y) is added or subtracted, via the firing rule, to obtain the output, with η the learning rate.


The targets are set to 1.0 for the next word in the training sentence and 0.0 for all the others. The outputs of a neural network trained in this manner converge to the posterior probabilities discussed earlier and modeled by equation 2.2.
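One training step under this target scheme can be sketched with a simple delta-rule update for a single output neuron; the learning rate and the numeric values below are illustrative, not the system's actual parameters.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One gradient step for a single output neuron: the error between the
// target (1.0 for the next word in the training sentence, 0.0 for all
// others) and the neuron's output is scaled by the learning rate eta
// and fed back into each weight.
void updateWeights(std::vector<double>& w, const std::vector<double>& x,
                   double target, double output, double eta) {
    for (size_t i = 0; i < w.size(); ++i)
        w[i] += eta * (target - output) * x[i];   // delta rule
}
```

Repeated over the training set, the updates push each output toward its target, which is why the outputs converge to the posterior probabilities described above.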


Supervised learning is realized by making the training data a measurement and accompanying it with labels, using the data abstraction capability of the system programming tool to indicate the class of events the measurements represent. This class produces a desired response to the measurements using an algorithm of the object-oriented programming paradigm. The speaker identification system (SIS) is thus obtained.


The SIS extracts the required features from the training data, and the speaker is identified as the text contained in his speech is recognized as words in the vocabulary. Recognized words are concatenated to represent speech sentences, while all out-of-vocabulary (OOV) words are mapped through the exception-handling capability to an OOV output. The outputs displayed after the declared end-of-utterance function END is called, for the three options, are shown in the appendix.
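The exception-handling mapping of OOV words might look like the following sketch, where std::map::at throws std::out_of_range for any word not enlisted; the vocabulary shown is illustrative.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// std::map::at throws std::out_of_range for an unknown key, so the
// exception handler maps every out-of-vocabulary word to a single OOV
// output, as described in the text.
std::string classify(const std::map<std::string, int>& vocabulary,
                     const std::string& word) {
    try {
        vocabulary.at(word);            // throws if the word is not enlisted
        return word;
    } catch (const std::out_of_range&) {
        return "OOV";
    }
}
```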

Conclusion and performance measure

The quality of the SIS is based on the recognition rate and the rejection rate. In line with Farrell (1994), the MLP is efficient as a classifier for speech recognition and speaker identification alike. In this paper, the quality of the neural network is evaluated based on the LM, its PP and the learning (recognition) rate. The designed SIS, characterized by an enumerated LM, has improved perplexity compared with the frame-based LM approach because the network is trained to classify words directly. There is an improved word-error rate because the standard error in the back-propagation algorithm is eliminated.


Also, the SIS is speaker independent because no separate training is required for different speakers, and the vocabulary size can be increased as desired. The enumerated training data suggests more accurate and cheaper feature extraction strategies compared with speech (voice) training data, because all problems of signal mismatch, language confusability, echoes and other acoustic variability associated with audio speech processing are eliminated.


Finally, with the improved LM used for the SIS, an efficient speaker identifier (expert system), characterized by a highly economical feature extraction technique, is realized. It is better to recognize words directly than to employ phoneme classifiers, which sometimes make mistakes in most speech recognisers. With all these, the SIS is more secure and more applicable where the receiver may not be able to audibly receive spoken words.



References

Akinyokun O.C. (2002). "Neuro-Fuzzy Expert System for Evaluation of Human Resource Performance". First Bank of Nigeria Plc Endowment Fund Lecture Series I, delivered at Federal University of Technology, Akure, Nigeria.


Andre G.A., Dante A.C., Gustavo B. and Enio F.F. (2003). "A comparison between features for a residential security prototype based on speaker identification with a model of Artificial Neural Network". A paper presented at the Department of Informatics, University of Caxias du Sul.


Andre G., Gustavo B., Enio F. and Dante A. (2003). "Speaker Identification with a Model of Artificial Neural Network". IEEE International Joint Conference on Neural Networks. 10(5):111-124.


Emami A., Xu P. and Jelinek F. (2003). "Using a connectionist model in a syntactical based language model". In International Conference on Acoustics, Speech and Signal Processing, 2003, pp. 272-375.


Farrell K. (1994). "Speaker recognition using Neural Networks and Conventional Classifiers". IEEE Transactions on Acoustics, Speech, and Signal Processing, 2(1), 1994, pp. 194-205.


Fosier-Lussier F. (1991). "Contextual word and Syllable Pronunciation Models". In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, Keystone, Colorado, USA.


Frank R. (2006). "Comparing Speaker-Dependent and Speaker-Adaptive Acoustic Models for Recognising Dysarthric Speech". A paper presented at the Department of Computer Science, University of Toronto, Canada.


Furui S., Itoh D. and Zhang Z. (2002). "Neural-network-based HMM adaptation for noisy speech recognition". A paper presented at the Institute of Technology, Tokyo, Japan.


Graeme W.B. (2004). "Neural Network-based Language Model for Conversational Telephone Speech Recognition". Ph.D. Thesis submitted to the Institute of Technology in Computer Science, Tokyo.


Jinadu O.T. (2010). "Modeling and Simulation of Neural Network based Speaker Identification System". M.Tech. Thesis submitted to Federal University of Technology, Akure, Nigeria.


Ken Chen M. (2002). "Modeling Pronunciation variation using Artificial Neural Networks for English Spontaneous speech". A paper presented to the Department of Electrical and Computer Engineering, University of Illinois, Urbana.


Morgan D. and Scofield C. (1991). Neural Networks and Speech Processing. Norwell, Kluwer Academic Publishers.


Schwenk H. and Gauvain J. (2002). "Connectionist Language Modeling for Large Vocabulary Continuous Speech Recognition". In International Conference on Acoustics, Speech and Signal Processing, 2002, pp. 765-768.


Appendix showing sample outputs of the SIS