We often overlook or compromise on the audio quality of videos. Background noise frequently degrades recordings, leaving speech distorted and difficult to understand. To address this, an audiovisual speech enhancement feature was recently introduced in YouTube Stories, allowing creators to record better selfie videos by automatically enhancing their voices and reducing background noise. The feature is based on machine learning and uses both audio and visual signals to distinguish the voices of people in a video from background noise.
The feature blends the enhanced speech with just 10% of the original background noise to improve video quality on iOS devices. YouTube Stories users can access it from the volume controls editing tool. After a video is recorded, its audio and visual features are processed by the speech separation model to produce enhanced speech. To avoid unnecessary computation on videos that already contain clean speech, the model is first run on the first two seconds of the video, and the speech-enhanced output is compared to the original input audio. If the difference is large enough (meaning the model cleaned up the speech), the enhancement is applied to the entire video.
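The gating step described above can be sketched roughly as follows. This is a minimal illustration, not YouTube's implementation: the `separate_speech` stand-in, the sample rate, and the difference threshold are all hypothetical placeholders for the real audiovisual model and its tuned constants.

```python
import numpy as np

SAMPLE_RATE = 16_000   # assumed audio sample rate for this sketch
PROBE_SECONDS = 2      # the article: only the first two seconds are checked
DIFF_THRESHOLD = 0.01  # hypothetical tuning constant

def separate_speech(audio: np.ndarray) -> np.ndarray:
    """Stand-in for the audiovisual separation model; here it simply
    attenuates the signal so the sketch is runnable."""
    return audio * 0.8

def should_enhance(audio: np.ndarray) -> bool:
    """Run the model on the first two seconds and compare the enhanced
    output to the original input; enhance only if they differ enough."""
    probe = audio[: SAMPLE_RATE * PROBE_SECONDS]
    enhanced = separate_speech(probe)
    # Mean absolute difference as a simple distance measure between
    # the enhanced probe and the original probe.
    diff = float(np.mean(np.abs(enhanced - probe)))
    return diff > DIFF_THRESHOLD
```

If `should_enhance` returns `False`, the video is assumed to already have clean speech and the model is skipped for the rest of the clip, saving computation.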
Earlier versions of the model were designed to eliminate all background noise, but subsequent research found that some users prefer to keep a small amount of it in their videos. Based on this user study, the final output was changed to a linear combination of the enhanced speech and the original audio.
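The linear combination amounts to a simple weighted blend. A minimal sketch, assuming the 90/10 split mentioned earlier (the function name and default fraction are illustrative, not from the source):

```python
import numpy as np

def mix_output(enhanced: np.ndarray, original: np.ndarray,
               noise_fraction: float = 0.1) -> np.ndarray:
    """Blend enhanced speech with a fraction of the original (noisy)
    audio, so a little ambience is retained in the final track."""
    return (1.0 - noise_fraction) * enhanced + noise_fraction * original
```

With `noise_fraction=0.1`, 90% of each output sample comes from the enhanced speech and 10% from the unprocessed recording, matching the blend the feature applies.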