This post is for people who have a good understanding of deep learning and a basic understanding of how data is represented for images and text. In this post, we will explore how TTS (text-to-speech) systems work.
- What is sound?: Sound is a signal produced by variations in air pressure. Sound has properties like period, amplitude, and frequency.
- How is sound represented digitally?: Sound is captured by measuring the amplitude of the signal at fixed intervals of time (defined by the sample rate). A commonly used sample rate is 44,100 Hz, so a 10-second clip has 441,000 samples. See the sketch below.
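  A minimal NumPy sketch tying these ideas together (the 440 Hz tone and 0.5 amplitude are arbitrary choices for illustration):

  ```python
  import numpy as np

  sr = 44100                   # sample rate: samples per second
  duration = 10.0              # seconds
  t = np.linspace(0.0, duration, int(sr * duration), endpoint=False)

  # A pure tone: amplitude 0.5, frequency 440 Hz (so period = 1/440 s)
  signal = 0.5 * np.sin(2 * np.pi * 440.0 * t)

  print(len(signal))           # 441000 samples for a 10 s clip at 44.1 kHz
  ```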
- Like in computer vision and NLP, traditional ML models for audio relied on hand-crafted features such as phonemes and other phonetic concepts.
- With deep learning, a commonly used approach is to convert the audio into an image (a spectrogram) and use standard CNNs to process / generate these images.
- What is a spectrum?: Signals of different frequencies can be added together to create composite signals, representing any sound that occurs in the real world. (Read up on the Fourier transform and its math: converting from the time domain to the frequency domain and back.)
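  As a sketch, this is roughly how a spectrogram is computed with librosa (the file name and the STFT parameters are placeholders, not recommendations):

  ```python
  import numpy as np
  import librosa

  # Load a clip; sr=None keeps the file's original sample rate
  y, sr = librosa.load("speech.wav", sr=None)

  # Short-time Fourier transform: time domain -> frequency domain, frame by frame.
  # Result is a complex matrix of shape (1 + n_fft/2, n_frames).
  stft = librosa.stft(y, n_fft=1024, hop_length=256)

  # Magnitude spectrogram, converted to decibels for visualization
  spec_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)
  ```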
- Pipeline for ASR: raw audio → spectrogram → augmentation / cleaning → CNN → embedding → LSTM → text. A sketch of such a model follows.
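  A minimal PyTorch sketch of this pipeline; this is my own illustrative architecture, not a reference implementation. The layer sizes and the 29-symbol vocabulary are assumptions, and the output is per-frame character logits as you would feed to a CTC loss:

  ```python
  import torch
  import torch.nn as nn

  class ASRSketch(nn.Module):
      """Hypothetical model: spectrogram -> CNN -> embedding -> LSTM -> logits."""

      def __init__(self, n_mels=80, hidden=256, vocab_size=29):
          super().__init__()
          # Downsample the frequency axis (stride 2) while keeping the time axis
          self.cnn = nn.Sequential(
              nn.Conv2d(1, 32, kernel_size=3, stride=(2, 1), padding=1),
              nn.ReLU(),
              nn.Conv2d(32, 32, kernel_size=3, stride=(2, 1), padding=1),
              nn.ReLU(),
          )
          self.embed = nn.Linear(32 * (n_mels // 4), hidden)
          self.lstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
          self.out = nn.Linear(2 * hidden, vocab_size)

      def forward(self, spec):                     # spec: (batch, 1, n_mels, time)
          x = self.cnn(spec)                       # (batch, 32, n_mels/4, time)
          x = x.permute(0, 3, 1, 2).flatten(2)     # (batch, time, 32 * n_mels/4)
          x = self.embed(x)                        # per-frame embedding
          x, _ = self.lstm(x)                      # temporal modeling
          return self.out(x)                       # (batch, time, vocab_size)

  logits = ASRSketch()(torch.randn(2, 1, 80, 200))  # -> (2, 200, 29)
  ```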
- The Mel Spectrogram is a widely used spectrum for audio processing: “Humans do not perceive frequencies linearly. We are more sensitive to differences between lower frequencies than higher frequencies.” “We hear them on a logarithmic scale rather than a linear scale.”
- The Mel Scale was developed to take this into account by conducting experiments with a large number of listeners. It is a scale of pitches, such that each unit is judged by listeners to be equal in pitch distance from the next.
- Mel Spectrograms: A Mel Spectrogram makes two important changes relative to a regular spectrogram that plots frequency vs. time: it uses the Mel scale instead of frequency on the y-axis, and it uses the decibel scale instead of amplitude to indicate color.
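  With librosa, both changes look roughly like this (file name and parameters are placeholders):

  ```python
  import numpy as np
  import librosa

  y, sr = librosa.load("speech.wav", sr=None)

  # Change 1: bin frequencies onto the Mel scale (n_mels Mel bands on the y-axis)
  mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                       hop_length=256, n_mels=80)

  # Change 2: express amplitude (power) in decibels
  mel_db = librosa.power_to_db(mel, ref=np.max)
  print(mel_db.shape)          # (80, n_frames)
  ```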
- MFCC (for Human Speech): Mel Spectrograms work well for most audio deep learning applications. However, for problems dealing with human speech, like Automatic Speech Recognition, you might find that MFCCs (Mel Frequency Cepstral Coefficients) sometimes work better.
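  A quick sketch with librosa (13 coefficients per frame is a conventional choice for speech):

  ```python
  import librosa

  y, sr = librosa.load("speech.wav", sr=None)

  # MFCCs: a compact description of the spectral envelope of each frame
  mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
  print(mfcc.shape)            # (13, n_frames)
  ```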
- Data Augmentation: common techniques include SpecAugment (frequency masking and time masking), time shift, time stretch, silence addition, and adding small amounts of noise. See the sketch below.
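  Frequency and time masking, the core of SpecAugment, are simple to sketch in NumPy; the maximum mask widths below are arbitrary assumptions:

  ```python
  import numpy as np

  def spec_augment(spec, max_freq_mask=8, max_time_mask=16, rng=None):
      """Zero out one random frequency band and one random time band,
      in the spirit of SpecAugment (Park et al., 2019)."""
      rng = rng or np.random.default_rng()
      spec = spec.copy()
      n_mels, n_frames = spec.shape

      f = int(rng.integers(0, max_freq_mask + 1))     # band height in mel bins
      f0 = int(rng.integers(0, max(1, n_mels - f)))
      spec[f0:f0 + f, :] = 0.0                        # frequency mask

      t = int(rng.integers(0, max_time_mask + 1))     # band width in frames
      t0 = int(rng.integers(0, max(1, n_frames - t)))
      spec[:, t0:t0 + t] = 0.0                        # time mask
      return spec
  ```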
- Mean Opinion Score (MOS): the standard metric for evaluating TTS-generated audio. Human listeners rate generated samples (typically on a 1–5 scale) and the scores are averaged.
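  MOS is just the arithmetic mean of the listener ratings (the ratings below are made up):

  ```python
  # Hypothetical 1-5 ratings from six listeners for one generated sample
  ratings = [4, 5, 4, 3, 5, 4]
  mos = sum(ratings) / len(ratings)
  print(f"MOS = {mos:.2f}")    # MOS = 4.17
  ```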
- Griffin-Lim Reconstruction Algorithm: GLA is used to convert a spectrogram back into audio. Since a magnitude spectrogram discards phase, GLA iteratively estimates a phase that is consistent with the given magnitudes.
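  librosa ships an implementation; a sketch (file name and parameters are placeholders):

  ```python
  import numpy as np
  import librosa

  y, sr = librosa.load("speech.wav", sr=None)

  # Keep only the magnitudes, throwing away phase, then let
  # Griffin-Lim reconstruct a waveform from them.
  mag = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))
  y_rec = librosa.griffinlim(mag, n_iter=32, hop_length=256)
  ```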