Mel Frequency Cepstral Coefficients (MFCCs) are features commonly used in speech and audio processing. They provide a compact representation of the power spectrum of a signal, capturing essential characteristics for tasks such as speech recognition, speaker identification, and emotion analysis.
Steps to Compute MFCCs:
- Pre-emphasis: A high-pass filter is applied to the signal to boost high frequencies, compensating for the natural drop in energy toward the upper end of the speech spectrum.
- Framing: The signal is divided into short, usually overlapping frames (e.g., 20-40 ms) so that it can be treated as approximately stationary within each frame.
- Windowing: Each frame is multiplied by a window function (e.g., Hamming window) to reduce spectral leakage.
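- Fast Fourier Transform (FFT): Converts each windowed frame from the time domain to the frequency domain, from which the power spectrum is computed.
- Mel Filter Bank Processing: The power spectrum is passed through a bank of overlapping triangular filters spaced on the Mel scale, which approximates how humans perceive pitch.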
- Logarithm and Discrete Cosine Transform (DCT): The logarithm of the filter bank energies is taken, followed by a DCT that decorrelates them and yields the cepstral coefficients.
- Selecting Coefficients: Typically, only the first 12-13 coefficients are kept as features. The zeroth coefficient, which reflects the overall energy of the frame, is sometimes dropped or replaced with a separate energy term.
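The steps above can be sketched end to end with NumPy alone. This is a minimal illustration, not the code from the linked gist and not how any particular library implements MFCCs. The function names (hz_to_mel, mel_filterbank, dct_ii, mfcc) are hypothetical, and the parameter values (pre-emphasis factor 0.97, 25 ms frames with a 10 ms hop, a 512-point FFT, 26 filters, 13 coefficients) are commonly seen defaults assumed here for the example. The Mel conversion uses one widely used formula; other variants exist.

```python
import numpy as np


def hz_to_mel(hz):
    # One common Hz-to-Mel formula (assumed here; other variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    # Inverse of hz_to_mel above.
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, centre, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def dct_ii(x, n_out):
    """Orthonormal DCT-II along the last axis, keeping the first n_out terms."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n)) * np.sqrt(2.0 / n)
    basis[0] *= np.sqrt(0.5)
    return x @ basis.T


def mfcc(signal, sample_rate, frame_ms=25.0, hop_ms=10.0,
         n_fft=512, n_filters=26, n_coeffs=13, pre_emphasis=0.97):
    """Return an array of shape (n_frames, n_coeffs).

    Assumes the signal is at least one frame long and that the frame
    length in samples does not exceed n_fft.
    """
    signal = np.asarray(signal, dtype=float)

    # 1. Pre-emphasis: y[n] = x[n] - a * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

    # 2. Framing: slice into overlapping frames (trailing samples are dropped)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = emphasized[idx]

    # 3. Windowing: taper each frame with a Hamming window
    frames = frames * np.hamming(frame_len)

    # 4. FFT: power spectrum of each (zero-padded) frame
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft

    # 5. Mel filter bank: energy captured by each triangular filter
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T

    # 6. Logarithm and DCT (small offset avoids log(0))
    log_energies = np.log(energies + 1e-10)

    # 7. Keep the first n_coeffs coefficients (index 0 is the energy-like term)
    return dct_ii(log_energies, n_coeffs)


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr                      # one second of audio
    tone = np.sin(2 * np.pi * 440.0 * t)        # synthetic 440 Hz test tone
    features = mfcc(tone, sr)
    print(features.shape)                       # (98, 13)
```

To drop the energy-like zeroth coefficient, slice the result as features[:, 1:]. For real work, an established library is likely a better choice than this sketch, since details such as filter normalization and liftering vary between implementations and affect the numeric values.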
Applications of MFCCs:
- Speech Recognition (e.g., Google Assistant, Siri)
- Speaker Identification
- Emotion Recognition
- Music Classification
- Environmental Sound Classification
Python example
Compute MFCCs in Python
https://gist.github.com/viadean/17f05f66656b4b57fe82cea3ddb871c3
Explanation of the Code:
- Load Audio: Uses Librosa to load an example speech file.