LMSYS ORG Presents Chatbot Arena: A Crowdsourced LLM Benchmark Platform With Anonymous, Randomized Battles

Many open-source projects have released large language models (LLMs) that are fine-tuned to follow instructions. These models can provide useful responses to questions and commands from users. Notable examples include the LLaMA-based Alpaca and Vicuna and the Pythia-based OpenAssistant and Dolly.

Even though new models are being released every week, the community still struggles to benchmark them properly. Because the questions put to LLM assistants are often open-ended, it is difficult to build a benchmarking system that automatically assesses the quality of their answers. Human evaluation via pairwise comparison is often required instead. An ideal benchmark system based on pairwise comparison would be scalable, incremental, and able to produce a unique ordering of all models.

Few of the current LLM benchmarking systems meet all of these requirements. Classic LLM benchmark frameworks like HELM and lm-evaluation-harness provide multi-metric measurements on tasks commonly used in research. However, they do not evaluate free-form questions well because they are not based on pairwise comparisons.

LMSYS ORG is an organization that develops large models and systems that are open, scalable, and accessible. Their new work presents Chatbot Arena, a crowdsourced LLM benchmark platform with anonymous, randomized battles. Chatbot Arena uses the Elo rating system, which is widely used in chess and other competitive games. The Elo rating system shows promise for delivering the desirable properties described above.
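As a rough illustration of how Elo ratings can be derived from pairwise battles, the sketch below applies the standard Elo update to a list of battle outcomes. The function name `compute_elo`, the K-factor of 32, the initial rating of 1000, and the sample battles are illustrative assumptions, not details taken from the Chatbot Arena implementation.

```python
from collections import defaultdict


def compute_elo(battles, k=32, scale=400, base=10, init_rating=1000):
    """Apply the standard Elo update over (model_a, model_b, winner) tuples.

    winner is "model_a", "model_b", or "tie". All defaults are assumptions
    chosen for illustration.
    """
    ratings = defaultdict(lambda: float(init_rating))
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the logistic Elo model.
        expected_a = 1 / (1 + base ** ((rb - ra) / scale))
        if winner == "model_a":
            score_a = 1.0
        elif winner == "model_b":
            score_a = 0.0
        else:
            score_a = 0.5  # a tie counts as half a win for each side
        # The winner gains exactly what the loser gives up.
        delta = k * (score_a - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return dict(ratings)


# Invented outcomes, for illustration only (not real arena data).
sample_battles = [
    ("vicuna", "alpaca", "model_a"),
    ("dolly", "vicuna", "model_b"),
    ("alpaca", "dolly", "tie"),
]

if __name__ == "__main__":
    ranked = sorted(compute_elo(sample_battles).items(), key=lambda kv: -kv[1])
    for name, rating in ranked:
        print(f"{name}: {rating:.1f}")
```

Because each update depends only on the two current ratings and the outcome, a new model can be added by playing it against a subset of existing models rather than every possible pair. Note that a sequential update like this one depends on the order in which battles are processed.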

They began collecting data a week ago, when they opened the arena with several well-known open-source LLMs. Because the data is crowdsourced, it reflects some of the ways people actually use LLMs in the real world. In the arena, a user chats with two anonymous models side by side and compares their answers.

The arena is hosted at https://arena.lmsys.org using FastChat, a multi-model serving system. A person entering the arena starts a conversation with two anonymous models. After receiving responses from both models, the user can continue the conversation or vote for the one they prefer. Once a vote is cast, the models' identities are revealed. The user can then keep chatting with the same two models or start a fresh battle with two new anonymous models. The system records all user activity, but only votes cast while the model names were hidden are used in the analysis. About 7,000 valid, anonymous votes have been tallied since the arena went live a week ago.
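The filtering rule described above can be sketched as follows. The `Vote` record and its field names are hypothetical, since the article does not describe the actual log schema; the point is only that votes cast after the model names were revealed are excluded.

```python
from dataclasses import dataclass


@dataclass
class Vote:
    # Hypothetical record layout, assumed for illustration.
    model_a: str
    model_b: str
    winner: str      # "model_a", "model_b", or "tie"
    anonymous: bool  # True if model names were still hidden at vote time


def valid_votes(votes):
    """Keep only votes cast while the model names were hidden."""
    return [v for v in votes if v.anonymous]


def to_battles(votes):
    """Convert valid votes to (model_a, model_b, winner) tuples."""
    return [(v.model_a, v.model_b, v.winner) for v in valid_votes(votes)]
```

Dropping non-anonymous votes keeps a voter's knowledge of which model is which from influencing the rankings.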

In the future, they plan to implement improved sampling algorithms, tournament mechanisms, and serving systems to accommodate a greater variety of models and to provide fine-grained rankings for different task types.




Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advancements in technology and their real-life applications.