Speech as Data

The first step in building any automated speech recognition system is feature extraction. In other words, the first task is to identify the components of the audio signal that are useful for recognizing the linguistic content and to discard everything else, such as background noise.

In humans, the sound that leaves the body is filtered by the shape of the vocal tract, including the tongue and teeth, and this shape determines which sound comes out. To identify the phoneme being produced, we need to determine this shape accurately. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and the job of Mel Frequency Cepstral Coefficients (MFCCs) is to represent this envelope accurately.

Speech can also be represented as data by converting it to a spectrogram, which shows how the energy in each frequency band changes over time.
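
To make the idea concrete, here is a minimal sketch of a spectrogram computed with NumPy. The function name `spectrogram`, the Hamming window, and the test tone are illustrative choices rather than anything prescribed above. The window length, step, and FFT size are simply borrowed from the MFCC defaults listed later in this post.

```python
import numpy as np

def spectrogram(signal, samplerate, winlen=0.025, winstep=0.01, nfft=512):
    """Return a (frames x nfft//2 + 1) array of short-time power spectra."""
    frame_len = int(round(winlen * samplerate))
    frame_step = int(round(winstep * samplerate))
    signal = np.asarray(signal, dtype=float)
    # Pad very short signals so that at least one full frame exists.
    if len(signal) < frame_len:
        signal = np.pad(signal, (0, frame_len - len(signal)))
    # This sketch drops a trailing partial frame instead of padding it.
    num_frames = 1 + (len(signal) - frame_len) // frame_step
    window = np.hamming(frame_len)
    frames = np.stack([
        signal[i * frame_step:i * frame_step + frame_len] * window
        for i in range(num_frames)
    ])
    # rfft zero-pads each frame to nfft samples before transforming.
    return np.abs(np.fft.rfft(frames, n=nfft, axis=1)) ** 2 / nfft

# Example: one second of a 1 kHz tone sampled at 16 kHz.
rate = 16000
t = np.arange(rate) / rate
tone = np.sin(2 * np.pi * 1000 * t)
spec = spectrogram(tone, rate)

# Locate the frequency bin that carries the most energy on average.
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * rate / 512
```

Each row of `spec` is one time frame and each column is one frequency bin, so speech becomes a matrix that an image-style model could consume.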

Speech Features (MFCC) that map speech to a matrix

MFCCs are a feature widely used in automated speech and speaker recognition. The Mel scale relates perceived frequency, or pitch, of a pure tone to its actual measured frequency.

You can convert a frequency in hertz to the Mel scale using the following formula
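
The formula itself did not survive in this copy of the post. A widely used form of the conversion is m = 2595 * log10(1 + f / 700), and the sketch below assumes that this is the variant intended. The helper names `hz_to_mel` and `mel_to_hz` are only illustrative.

```python
import math

def hz_to_mel(hz):
    """Convert a frequency in Hz to mels (assumed 2595 * log10 variant)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel: convert mels back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The scale is built so that 1000 Hz lands very close to 1000 mels.
mel_1k = hz_to_mel(1000.0)
```

Below roughly 1000 Hz the mapping is close to linear, and above that it grows logarithmically, which mirrors how pitch is perceived.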

In Python, MFCC features are extracted with the mfcc function from the python_speech_features package.

Here are the parameter descriptions:

  • signal: The audio signal from which to compute the MFCC features. It should be an N*1 array (for example, the samples read from a wav file).
  • samplerate: The sample rate of the signal, in Hz.
  • winlen: The length of the analysis window in seconds. By default, it is 0.025 s.
  • winstep: The step between successive windows in seconds. By default, it is 0.01 s.
  • numcep: The number of cepstral coefficients the function should return. By default, it is 13.
  • nfilt: The number of filters in the filterbank. By default, it is 26.
  • nfft: The size of the FFT. By default, it is 512.
  • lowfreq: The lowest band edge, in Hz. By default, it is 0.
  • highfreq: The highest band edge, in Hz. By default, it is samplerate/2.
  • preemph: The coefficient of the preemphasis filter. 0 means no filter. By default, it is 0.97.
  • ceplifter: The lifter applied to the final cepstral coefficients. 0 means no lifter. By default, it is 22.
  • appendEnergy: If set to true, the zeroth cepstral coefficient is replaced with the log of the total frame energy.
  • Returns: A numpy array containing the features. Each row contains one feature vector.
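
The parameters above can be put together in a call like the one sketched below, with every default written out explicitly. The wrapper `extract_mfcc` and the helper `expected_num_frames` are hypothetical names introduced for illustration. The frame count assumes that the library zero-pads the last partial frame, which is my understanding of its framing and may differ slightly in edge cases because of rounding.

```python
import math

def expected_num_frames(num_samples, samplerate, winlen=0.025, winstep=0.01):
    """Estimate how many rows the feature matrix will have.

    Assumption: the final partial frame is zero-padded and kept.
    """
    frame_len = int(round(winlen * samplerate))
    frame_step = int(round(winstep * samplerate))
    if num_samples <= frame_len:
        return 1
    return 1 + math.ceil((num_samples - frame_len) / frame_step)

def extract_mfcc(signal, samplerate):
    """Call mfcc with the documented defaults spelled out."""
    # Imported here so the helper above works without the package installed.
    from python_speech_features import mfcc
    return mfcc(
        signal,
        samplerate=samplerate,
        winlen=0.025,       # 25 ms analysis window
        winstep=0.01,       # 10 ms step between windows
        numcep=13,          # cepstral coefficients per frame
        nfilt=26,           # filters in the Mel filterbank
        nfft=512,           # FFT size
        lowfreq=0,          # lowest band edge in Hz
        highfreq=None,      # None means samplerate / 2
        preemph=0.97,       # preemphasis coefficient
        ceplifter=22,       # lifter applied to the cepstra
        appendEnergy=True,  # coefficient 0 becomes log frame energy
    )

# One second of audio at 16 kHz should give this many feature vectors.
rows_for_one_second = expected_num_frames(16000, 16000)
```

With these settings, one second of 16 kHz audio should produce a matrix of roughly `rows_for_one_second` rows by 13 columns.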

Building a classifier for speech recognition through MFCC features

To build a classifier for speech recognition, you need to have the Python package python_speech_features installed.

You can use

pip install python_speech_features

to install this package.

The MFCC function creates a feature matrix for an audio file. To build a classifier that recognizes the voices of different people, you need to collect speech data from each of them in wav format and then convert every audio file into a matrix using the MFCC function. The code to extract the features from a wav file is given below.
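
The original listing is missing from this copy of the post, so what follows is a minimal sketch rather than the book's code. It assumes 16-bit mono PCM wav files and reads them with the standard-library `wave` module (`scipy.io.wavfile.read` is a common alternative). The names `read_wav_mono16` and `wav_to_mfcc` are hypothetical.

```python
import array
import sys
import wave

def read_wav_mono16(path):
    """Read a 16-bit mono PCM wav file and return (samplerate, samples)."""
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1 or wf.getsampwidth() != 2:
            raise ValueError("expected a 16-bit mono PCM wav file")
        rate = wf.getframerate()
        samples = array.array("h")
        samples.frombytes(wf.readframes(wf.getnframes()))
    # wav data is little-endian, so swap bytes on big-endian machines.
    if sys.byteorder == "big":
        samples.byteswap()
    return rate, samples

def wav_to_mfcc(path):
    """Return the MFCC feature matrix (one row per frame) for a wav file."""
    # Third-party imports are kept local so the reader above has no
    # dependencies beyond the standard library.
    import numpy as np
    from python_speech_features import mfcc
    rate, samples = read_wav_mono16(path)
    return mfcc(np.asarray(samples, dtype=np.int16), samplerate=rate)
```

Calling `wav_to_mfcc` on each speaker's recordings gives one feature matrix per file, and those matrices, labeled by speaker, are what the classifier would be trained on.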

For text classification and text segmentation, we need to convert the text corpus (sentences, paragraphs, etc.) into a sequence of numbers and then apply deep learning (or machine learning) algorithms such as LSTM.
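
As a rough illustration of that conversion, the sketch below builds a word-level vocabulary and turns each sentence into a fixed-length list of integer ids. Whitespace tokenization, the `<pad>` and `<unk>` entries, and the function names are assumptions made for the example. Real pipelines usually rely on a library tokenizer.

```python
def build_vocab(sentences):
    """Map every distinct lowercase token to an integer id."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for sentence in sentences:
        for token in sentence.lower().split():
            # Assign the next free id the first time a token is seen.
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(sentence, vocab, max_len):
    """Convert a sentence to ids, then truncate or pad to max_len."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in sentence.lower().split()]
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

corpus = ["speech is data", "text is data too"]
vocab = build_vocab(corpus)
encoded = [encode(s, vocab, max_len=5) for s in corpus]
```

The resulting lists all have the same length, so they can be stacked into a matrix and fed to a sequence model such as an LSTM.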

The above content is taken from the book Deep Learning with Applications Using Python.



