How a Multitask Model Seamlessly Translates and Transcribes Across Speech and Text

In a world where interactions are increasingly global, being multilingual can bridge gaps, foster understanding, and open doors to diverse opportunities. Learning multiple languages can provide insights into language structure and linguistics, deepening one's understanding of the mechanics of communication and thought. This can be especially valuable in today's globalized world, where cross-cultural interactions are common. Don't you think this gap needs to be bridged between humans and AI as well?

Researchers from Meta AI and UC Berkeley propose a foundational multilingual and multitask model that seamlessly translates and transcribes across speech and text. They call it "SeamlessM4T". The M4T in the name stands for Massively Multilingual and Multimodal Machine Translation. It is a single AI model that supports speech-to-text, speech-to-speech, text-to-speech, and text-to-text translation, as well as automatic speech recognition, for up to 100 languages.

Who isn't familiar with Babel Fish (an online translator)? What is the problem with it? Babel Fish is a speech-to-speech translation system. Existing systems of this kind tend to focus on high-resource languages such as English, Spanish, and French, leaving many low-resource languages behind. They mostly translate from English into other languages and not vice versa. They also rely on cascaded pipelines composed of multiple subsystems, because direct, unified models have so far not matched the performance of their cascaded counterparts.

To resolve these limitations, the researchers used over 1 million hours of open speech audio data to learn self-supervised speech representations. They also created a multimodal corpus of more than 470,000 hours of automatically aligned speech translations. To evaluate the model's robustness against background noise and speaker variation, they created open robustness benchmarks and found improvements of 38% and 49%, respectively.

The researchers say that they ran systematic evaluations of their system throughout their workflow to ensure safe and robust performance. They used parallel data mining as an alternative to relying on closed data. This method involves encoding sentences from various languages into a shared, fixed-size embedding space and finding parallel instances based on a similarity metric.
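
The mining idea described above can be illustrated with a minimal Python sketch. It is not the researchers' actual pipeline: the three-dimensional vectors are hand-made stand-ins for the output of a real multilingual sentence encoder, and the function name `mine_parallel_pairs`, the use of plain cosine similarity, and the 0.95 threshold are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_parallel_pairs(src, tgt, threshold=0.95):
    """Pair each source sentence with its most similar target sentence.

    src, tgt: dicts mapping a sentence to its embedding (hypothetical
    toy vectors here; a real system would use an encoder's output).
    A pair is kept only if its similarity reaches the threshold.
    """
    pairs = []
    for s_text, s_vec in src.items():
        best_text, best_score = None, -1.0
        for t_text, t_vec in tgt.items():
            score = cosine(s_vec, t_vec)
            if score > best_score:
                best_text, best_score = t_text, score
        if best_score >= threshold:
            pairs.append((s_text, best_text, best_score))
    return pairs


# Toy embeddings: sentences with the same meaning get nearby vectors.
english = {
    "Good morning": [0.9, 0.1, 0.0],
    "Where is the station?": [0.1, 0.9, 0.2],
    "I like tea": [0.0, 0.2, 0.9],
}
german = {
    "Guten Morgen": [0.88, 0.12, 0.05],
    "Wo ist der Bahnhof?": [0.15, 0.85, 0.25],
    "Es regnet": [0.6, 0.6, 0.5],
}

mined = mine_parallel_pairs(english, german)
for s_text, t_text, score in mined:
    print(f"{s_text!r} <-> {t_text!r} (similarity {score:.3f})")
```

In this toy data, "I like tea" has no close counterpart on the German side, so it falls below the threshold and is left unpaired. That filtering is what lets mining pull aligned pairs out of otherwise unaligned collections.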

Creating a unified large model that can handle the full suite of tasks involved in text and speech translation lays important groundwork for the next generation of on-device and on-demand multimodal translation. The researchers say that when language technologies are developed with this ideology in mind, the needs of half of the world's population can be addressed. Their future work involves bridging the gap between speakers of high-resource and low-resource languages, moving toward a world that is more interconnected than ever.

The researchers say that SeamlessM4T's performance may not yet be consistent when translating slang or proper nouns across high-resource and low-resource languages. Their future work aims to resolve this limitation so that conversations feel more natural, reflecting one's mother tongue and slang.