Speech Recognition

What is speech recognition?

Speech recognition is the ability of a machine to recognise and interpret human speech. It is an interdisciplinary field that draws on computer science, linguistics, mathematics and related disciplines. The first recognised system was developed at Bell Laboratories in 1952, and the technologies and methods behind it have gone through several waves of innovation since then. At present, the accuracy of the best speech recognition systems is reported to be on par with that of humans. The technology lets us talk to our devices instead of relying on tiresome typing and clicking. Whether it is playing music, switching off the lights or making a call, many everyday tasks can now be done at your command: all you need to do is speak. This is a particular boon for people with physical disabilities. Such a system is commonly called an Automatic Speech Recognition (ASR) system.

Working Principle of Speech Recognition

Speech recognition is a case of pattern recognition, and its basic working principle is as follows:

  • Input: The input to the system is speech, which acts as its trigger. A high-quality noise-cancelling microphone is typically used to capture it.
  • Pre-processing: The speech, which is an analog signal, is digitized by an analog-to-digital converter, usually working alongside a DSP (Digital Signal Processor), into a series of 8- or 16-bit values at a particular sampling frequency. Unwanted background noise is filtered out before the features are extracted.
  • Feature Extraction: The pre-processed speech is used to extract the features relevant for recognition. Since speech varies from person to person, the features that are independent of the speaker are the most important ones for ASR. These features act as discriminants and help to classify the input into different class labels. They may be temporal, spectral or both, depending on the purpose. In temporal analysis, the features are extracted directly from the speech signal. Commonly used temporal techniques include power estimation, fundamental frequency estimation, the Gold-Rabiner algorithm and cepstrum-based pitch determination. In spectral analysis, the speech signal is first transformed from the time domain to the frequency domain. Commonly used spectral techniques include critical band filter bank analysis, cepstrum analysis, mel cepstrum analysis, Linear Predictive Coding (LPC) analysis and Perceptual Linear Predictive (PLP) analysis. This stage is common to both the training and the testing phase.
  • Training/Classification: At this stage, the system is trained on a dataset. The model parameters are estimated from the training features, and the resulting classification model is stored in memory as a template for reference. Some common approaches are listed below:
      • Dynamic Time Warping (DTW): Time warping techniques are used to find the distance between the reference pattern and the unknown input speech pattern, even when the two are spoken at different speeds.
      • Hidden Markov Model (HMM): This statistical approach, built on the Markov model, is simple and effective for modelling time-varying sequences such as continuous speech.
      • Vector Quantization (VQ): It maps vectors from a large vector space onto a small, finite set of regions, each of which represents a cluster.
      • Artificial Neural Network (ANN): It is a computational model inspired by the biological neurons found in animals. During the training phase it is given inputs together with their expected outputs (results), from which it learns a mapping function between the two.
  • Testing/Recognizing: During the testing (recognizing) phase, the features of the input are matched against the stored templates, and the input is assigned to the class that fits it best.
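
The pre-processing and feature extraction stages can be illustrated with a minimal Python sketch. It is not taken from any particular ASR system: the function names, the 8000 Hz sampling rate, the frame sizes and the synthetic 200 Hz tone standing in for a voiced sound are all assumptions made for illustration. It covers only two of the temporal features mentioned above (power estimation and a simple autocorrelation-based fundamental frequency estimate); noise filtering and spectral features such as the mel cepstrum are left out.

```python
import math

def quantize_16bit(samples):
    """Clip floats to [-1.0, 1.0] and map them to signed 16-bit integers."""
    return [int(round(max(-1.0, min(1.0, x)) * 32767)) for x in samples]

def split_into_frames(signal, frame_len, hop):
    """Cut the signal into overlapping frames of frame_len samples."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def short_time_power(frame):
    """Mean squared amplitude of one frame (a simple power estimate)."""
    return sum(s * s for s in frame) / len(frame)

def estimate_f0(frame, sample_rate, f_min=50.0, f_max=400.0):
    """Estimate the fundamental frequency from the autocorrelation peak.

    The search range f_min..f_max is an assumed range for voiced speech.
    """
    min_lag = int(sample_rate / f_max)
    max_lag = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0

# Assumed test input: 0.1 s of a 200 Hz tone sampled at 8000 Hz.
SAMPLE_RATE = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(800)]

digital = quantize_16bit(tone)                                   # pre-processing
frame_list = split_into_frames(digital, frame_len=400, hop=200)  # 50 ms frames
features = [(short_time_power(f), estimate_f0(f, SAMPLE_RATE)) for f in frame_list]

for power, f0 in features:
    print(f"power={power:.0f}  f0={f0:.1f} Hz")
```

For this synthetic tone the frequency estimate should come out close to 200 Hz in every frame. Real speech is far less regular, which is why practical systems rely on more robust pitch and spectral methods.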

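
The training and recognition steps of the working principle can be sketched with Dynamic Time Warping, the simplest of the classification approaches listed. This is a hedged illustration only: the function names, the one-dimensional feature vectors and the two word templates are invented for the example, and a real system would store templates built from actual extracted features.

```python
import math

def dtw_distance(seq_a, seq_b):
    """Accumulated cost of the best time alignment between two
    sequences of feature vectors (classic DTW recurrence)."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])  # Euclidean distance
            cost[i][j] = local + min(cost[i - 1][j],      # stretch seq_b
                                     cost[i][j - 1],      # stretch seq_a
                                     cost[i - 1][j - 1])  # step both
    return cost[n][m]

def recognize(features, templates):
    """Return the label of the stored template closest to the input."""
    return min(templates, key=lambda label: dtw_distance(features, templates[label]))

# "Training": store one reference template per word (made-up values).
templates = {
    "yes": [[0.0], [1.0], [2.0], [1.0], [0.0]],
    "no": [[2.0], [2.0], [0.0], [0.0]],
}

# "Testing": the same shape as "yes", but spoken more slowly.
unknown = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0], [1.0], [0.0]]
label = recognize(unknown, templates)
print(label)
```

Because the warping absorbs the difference in speaking rate, the slow utterance aligns with the "yes" template at zero cost, while any alignment with "no" accumulates a positive distance.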

Design Constraints

ASR systems take different forms depending on the constraints imposed on them. Some of these constraints are discussed below:

  • Number of speakers: A system that can recognise speech no matter who the speaker is, is said to be speaker-independent. A speaker-dependent system, by contrast, has to be trained on the voice of each speaker who will use it. Many adaptive ASR systems have also emerged, which make it possible to add new speakers.
  • Nature of utterance: ASR systems are designed differently depending on the type of utterance. In isolated word recognition systems, for instance, the speaker must pause between words to make them intelligible to the device, whereas a continuous speech recognition system can recognise whole sentences as they are spoken.
  • Size of vocabulary: Based on the size of the vocabulary, ASR systems range from small (fewer than a hundred words) and medium (a few hundred) to large (thousands) and very large (tens of thousands).
  • Spectral bandwidth: An ASR system may be wideband or narrowband, depending on whether the spectrum of the speech signal it handles is broad or narrow.
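
One way to keep these constraints together is as a small configuration record. The sketch below is purely illustrative: the class name, the field names and the exact cut-offs between vocabulary classes (100, 1,000 and 10,000 words) are assumptions, since the ranges above are only approximate.

```python
from dataclasses import dataclass

def vocabulary_class(size):
    """Bucket a vocabulary size; the exact cut-offs are assumptions."""
    if size < 100:
        return "small"
    if size < 1_000:
        return "medium"
    if size < 10_000:
        return "large"
    return "very large"

@dataclass(frozen=True)
class AsrConstraints:
    """Hypothetical record of the design constraints of one ASR system."""
    speaker_independent: bool
    continuous_speech: bool
    vocabulary_size: int
    wideband: bool

    def describe(self):
        parts = [
            "speaker-independent" if self.speaker_independent else "speaker-dependent",
            "continuous speech" if self.continuous_speech else "isolated words",
            vocabulary_class(self.vocabulary_size) + " vocabulary",
            "wideband" if self.wideband else "narrowband",
        ]
        return ", ".join(parts)

# Example: a simple digit dialler trained for a single user.
digit_dialler = AsrConstraints(
    speaker_independent=False,
    continuous_speech=False,
    vocabulary_size=10,
    wideband=False,
)
print(digit_dialler.describe())
```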

Applications

Speech recognition systems are found everywhere, from our homes to our workplaces. They are cost-effective and make life easier, and for people with disabilities in particular they offer a new ray of hope. The list could go on, but some of the applications are listed below:

  • Office: Traditional tasks such as printing documents, plotting graphs, online conferences and meetings, voice browsing of the Internet, voice dialling and dictation can all be controlled by voice command.
  • Banking and Industry: ASR reduces the customer service workload today. It lowers friction for customers and gives them more satisfaction. Online payments, transaction history and other services can be controlled through a personalised voice-activated device. In addition, speech translators improve communication by removing the language barrier, making customers feel comfortable and at home.
  • Telecommunications: Automation of operator services, customer care and voice calls are some examples where these systems can reduce manual labour.
  • Military: Speech-to-speech translation and voice control in fighter jets are some examples.
  • Automobile: ASR can assist drivers by displaying maps, showing the shortest routes, reporting traffic status and so on, all on command. It can also be used to play music and switch the radio on or off.
  • Healthcare: It can help doctors to create reports and search databases, thereby reducing human labour. People with no medical background can use ASR to understand the common symptoms of a disease, call a doctor and so on. Likewise, people with physical or visual disabilities can use speech to carry out many activities at home as well as at the workplace. There are also many voice-activated games.

Some of the speech-activated digital assistants that are available today and can be used in the fields mentioned above are:

  • Amazon’s Alexa
  • Apple’s Siri
  • Google’s Google Assistant
  • Microsoft’s Cortana