One of the biggest obstacles facing automatic speech recognition (ASR) systems is their inability to adapt to novel, unbounded domains. Audiovisual ASR (AV-ASR) improves ASR accuracy in multimodal video, especially when the audio is noisy. The visual signal is invaluable for videos shot “in the wild”, where the speaker’s mouth may not even be in view. Models for this task are typically large, comprising both visual and audio encoders, while the datasets available for it tend to be small.
Like other AV-ASR work, such models are usually trained and tested only on instructional videos. As experiments by Google’s research team demonstrate, they perform poorly when applied to novel domains after training on a single dataset. Meanwhile, several recently released large audio-only models have been heavily optimized through self-supervised pretraining and large-scale supervised training on audio-only data from audiobooks, such as LibriLight and LibriSpeech. These models have billions of parameters, are widely available, and generalize impressively across domains. The idea is to recycle the massive investment in training such models by reusing their weights, inspired by recent work that adapts frozen foundation models across a variety of domains.
While retaining the benefits of audio-only pretraining for zero-shot generalization, the resulting models integrate visual inputs in a lightweight manner to enable AV-ASR. The AVFormer framework injects visual input into a frozen ASR model using lightweight projection layers and trainable adapters.
The researchers demonstrate that these components can be trained with minimal extra time and parameters on a modest amount of weakly labeled video data. This reduces the domain shift and catastrophic forgetting associated with end-to-end finetuning. They also introduce a simple curriculum during training to stabilize the finetuning of these adapters, which they show is essential for the model to correctly interpret auditory and visual data in tandem. Finally, they show that the model beats state-of-the-art zero-shot approaches on three AV-ASR benchmarks from different domains while maintaining respectable performance on audio-only benchmarks.
The target is zero-shot generalization across all AV domains without sacrificing quality on audio-only benchmarks. A state-of-the-art ASR model serves as the starting point and is then adapted for unconstrained AV-ASR. Visual features derived from a strong pretrained visual model are incorporated into the model through the following two elements:
- They use a linear projection to map visual features into the space of the audio tokens.
- To facilitate domain adaptation, they introduce minimally invasive adapters into the frozen ASR model’s encoder.
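A minimal sketch of the first element, the visual projection. The dimensions here are assumptions for illustration (a hypothetical 768-d CLIP feature and 512-d audio tokens), not values from the paper: a single trainable linear layer maps visual features into the audio token space, and the projected visual tokens are prepended to the audio token sequence before the frozen encoder consumes it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: visual features (768-d) are mapped into the
# audio token embedding space (512-d) by one trainable linear layer.
D_VISUAL, D_AUDIO = 768, 512
W_proj = rng.normal(scale=0.02, size=(D_VISUAL, D_AUDIO))  # trainable
b_proj = np.zeros(D_AUDIO)                                 # trainable

def project_visual(visual_features):
    """Linear projection of visual features into the audio token space."""
    return visual_features @ W_proj + b_proj

# A handful of visual tokens and a short audio token sequence.
visual_feats = rng.normal(size=(4, D_VISUAL))    # 4 visual tokens
audio_tokens = rng.normal(size=(100, D_AUDIO))   # 100 audio tokens

# Projected visual tokens are prepended to the audio tokens; the frozen
# Conformer encoder then processes the combined sequence.
visual_tokens = project_visual(visual_feats)
combined = np.concatenate([visual_tokens, audio_tokens], axis=0)
print(combined.shape)  # (104, 512)
```

Only `W_proj` and `b_proj` receive gradient updates; the encoder that consumes `combined` stays frozen.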
Here are some of the architecture’s most crucial parts:
- A frozen Conformer encoder and decoder
- A visual encoder and projection layers for extracting visual features and projecting them into the audio token space
- Lightweight adapter layers added to the audio backbone
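The adapters in the last bullet are the standard residual bottleneck design; the sketch below shows one such module under assumed sizes (512-d model width, 64-d bottleneck), with the common zero-initialized up-projection so the adapter starts as an identity function and cannot disturb the frozen model at the beginning of training.

```python
import numpy as np

rng = np.random.default_rng(1)

D_MODEL, D_BOTTLENECK = 512, 64  # hypothetical sizes

# Bottleneck adapter: down-project, nonlinearity, up-project, residual add.
W_down = rng.normal(scale=0.02, size=(D_MODEL, D_BOTTLENECK))  # trainable
W_up = np.zeros((D_BOTTLENECK, D_MODEL))  # zero-init: identity at start

def adapter(x):
    """Residual bottleneck adapter inserted into a frozen encoder layer."""
    h = np.maximum(x @ W_down, 0.0)  # ReLU after down-projection
    return x + h @ W_up              # residual connection

tokens = rng.normal(size=(10, D_MODEL))
out = adapter(tokens)
print(np.allclose(out, tokens))  # True: zero-init leaves the input unchanged
```

The bottleneck keeps the trainable parameter count tiny relative to the frozen backbone, which is what makes the approach parameter efficient.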
To facilitate domain adaptation across modalities, the architecture combines a frozen Conformer encoder-decoder model and a frozen CLIP encoder (frozen layers shown in grey with a lock symbol) with two lightweight trainable modules: a visual projection layer (shown in orange) and bottleneck adapters (shown in blue). The researchers recommend a two-stage curriculum: the first phase trains the adapters (blue) without any visual tokens, and the second phase tunes the visual projection layer (orange) while keeping the rest of the model frozen.
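The two-stage curriculum above can be sketched as a simple parameter-group schedule. The module names here are illustrative labels, not identifiers from the released code:

```python
# Hypothetical registry of modules and their roles in the curriculum.
modules = {
    "conformer_encoder": "frozen",
    "conformer_decoder": "frozen",
    "clip_encoder": "frozen",
    "adapters": "stage1",            # trained first, on audio alone
    "visual_projection": "stage2",   # trained second, with visual tokens
}

def trainable(stage):
    """Return the modules that receive gradient updates in a given stage."""
    if stage == 1:
        # Stage 1: train only the adapters, with no visual tokens in the input.
        return sorted(k for k, v in modules.items() if v == "stage1")
    # Stage 2: tune only the visual projection layer; everything else stays fixed.
    return sorted(k for k, v in modules.items() if v == "stage2")

print(trainable(1))  # ['adapters']
print(trainable(2))  # ['visual_projection']
```

Separating the stages this way means the adapters first learn domain adaptation from audio alone, and only then does the model learn to consume visual tokens, which the researchers report is essential for stable joint audiovisual finetuning.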
The researchers evaluate AVFormer’s zero-shot performance on the How2, VisSpeech, and Ego4D AV-ASR benchmarks against BEST-RQ, the audio-only version of the model, and AVATAR, the state of the art in AV-ASR. AVFormer surpasses both even when AVATAR and BEST-RQ are trained on LibriSpeech and the complete HowTo100M dataset. Notably, this training requires 600M parameters for BEST-RQ but only 4M for AVFormer, which therefore needs only a small subset of the training data (5% of HowTo100M). They also evaluate on the audio-only LibriSpeech benchmark, where AVFormer again outperforms both baselines.
Zero-shot performance is compared against the state of the art on several AV-ASR datasets, with performance on the audio-only LibriSpeech benchmark also reported. Lower WER percentages indicate better performance. While AVATAR and BEST-RQ are finetuned in their entirety on HowTo100M, AVFormer’s small set of finetuned parameters lets it perform well with as little as 5% of the dataset.
The researchers unveil AVFormer, an efficient method for converting frozen state-of-the-art ASR models into models suited to AV-ASR. Its zero-shot performance shows the method is both practical and effective. As ASR models grow in size and complexity, tuning the full parameter set of a pretrained model for each domain becomes problematic; this method is parameter efficient, enabling domain transfer and visual input blending simultaneously.