With some great advancements being made in the field of Artificial Intelligence, natural language systems are rapidly progressing. Large Language Models (LLMs) are getting significantly better and more popular with each upgrade and innovation. A new feature or modification is being added nearly daily, enabling LLMs to serve in different applications in almost every domain. LLMs are everywhere, from Machine translation and text summarization to sentiment analysis and question answering.
Though mostly in English, the open-source community has made some impressive strides in the development of chat-based LLMs. The development of a multilingual chat feature comparable to this in an LLM has received a bit less attention. In order to solve this, BLOOMChat, an open-source, multilingual chat LLM, has been developed by SambaNova, a software firm that specializes in generative AI solutions. BLOOMChat is a 176 billion parameter multilingual chat LLM developed in partnership with Together, an open, scalable, and decentralized cloud for artificial intelligence. It is based on top of the BLOOM paradigm.
The BLOOM model has the ability to generate text in 46 natural languages and 13 programming languages. For languages such as Spanish, French, and Arabic, BLOOM represents the first language model ever created with over 100 billion parameters. BLOOM was developed by the BigScience organization, which is an international collaboration of over 1000 researchers. By fine-tuning BLOOM on open conversation and alignment datasets from projects like OpenChatKit, Dolly 2.0, and OASST1, the core capabilities of BLOOM were extended into the chat domain.
For the development of the multilingual chat LLM, BLOOMChat, SambaNova, and Together have used the SambaNova DataScale systems that utilize SambaNova’s unique Reconfigurable Dataflow Architecture for the training process. Synthetic conversation data and human-written samples have been combined to create BLOOMChat. A big synthetic dataset called OpenChatKit has served as the basis for chat functionality, and higher-quality human-generated datasets like Dolly 2.0 and OASST1 have been used to enhance performance significantly. The code and scripts used for instruction-tuning on the OpenChatKit and Dolly-v2 datasets have been made available on SambaNova’s GitHub.
In human evaluations conducted across six languages, BLOOMChat responses were preferred over GPT-4 responses 45.25% of the time. Compared to four other open-source chat-aligned models in the same six languages, BLOOMChat’s responses ranked as the best 65.92% of the time. This accomplishment successfully closes the open-source market’s multilingual chat capability gap. In the WMT translation test, BLOOMChat performed better than additional BLOOM model iterations as well as popular open-source conversation models.
BLOOMChat, like other chat LLMs, has limitations. It may produce factually incorrect or irrelevant information or may switch languages by mistake. It can even repeat phrases, have limited coding or math capabilities, and sometimes generate toxic content. Further research is working towards addressing these challenges and ensuring better usage.
In conclusion, BLOOMChat builds upon the extensive work of the open-source community and is a great addition to the list of some highly useful and multilingual LLMs. By releasing it under an open-source license, SambaNova and Together aims to expand access to advanced multilingual chat capabilities and encourage further innovation in the AI research community.