⚔️ LMSYS Chatbot Arena (Multimodal): Benchmarking LLMs and VLMs in the Wild
Blog | GitHub | Paper | Dataset | Twitter | Discord | Kaggle Competition
📜 Rules
- Ask any question to two anonymous models (e.g., ChatGPT, Gemini, Claude, Llama) and vote for the better one!
- You can continue chatting until you identify a winner.
- Your vote won't be counted if a model's identity is revealed during the conversation.
- NEW Image Support: Upload an image on your first turn to unlock the multimodal arena! Images should be less than 15MB.
🏆 Chatbot Arena Leaderboard
- We've collected 1,000,000+ human votes to compute an LLM Elo leaderboard for 100+ models. Find out who is the 🥇LLM Champion here!
👇 Chat now!
| GPT-4o: The flagship model across audio, vision, and text by OpenAI | Grok-2: Grok-2 by xAI | Gemini: Gemini by Google |
| Claude 3.5: Claude by Anthropic | Llama 3.1: Open foundation and chat models by Meta | Mixtral of experts: A Mixture-of-Experts model by Mistral AI |
| GPT-4-Turbo: GPT-4-Turbo by OpenAI | Jamba 1.5: Jamba by AI21 Labs | Gemma 2: Gemma 2 by Google |
| Claude: Claude by Anthropic | DeepSeek Coder v2: An advanced code model by DeepSeek | Nemotron-4 340B: Cutting-edge Open model by Nvidia |
| Llama 3: Open foundation and chat models by Meta | Athene-70B: A large language model by NexusFlow | Qwen Max: The Frontier Qwen Model by Alibaba |
| GPT-3.5: GPT-3.5-Turbo by OpenAI | Yi-Large: State-of-the-art model by 01 AI | Yi-Chat: A large language model by 01 AI |
| Phi-3: A capable and cost-effective family of small language models (SLMs) by Microsoft | Reka Core: Frontier Multimodal Language Model by Reka | Reka Flash: Multimodal model by Reka |
| Command-R-Plus: Command R+ by Cohere | Command R: Command R by Cohere | Qwen 1.5: The First 100B+ Model of the Qwen1.5 Series |
| GLM-4: Next-Gen Foundation Model by Zhipu AI | DBRX Instruct: DBRX by Databricks Mosaic AI | InternVL 2: Multimodal Model developed by OpenGVLab |
| internlm2_5-20b-chat: Register the description at fastchat/model/model_registry.py |
Terms of Service
Users are required to agree to the following terms before using the service:
The service is a research preview. It only provides limited safety measures and may generate offensive content. It must not be used for any illegal, harmful, violent, racist, or sexual purposes. Please do not upload any private information. The service collects user dialogue data, including both text and images, and reserves the right to distribute it under a Creative Commons Attribution (CC-BY) or a similar license.
Please report any bugs or issues in the #arena-feedback channel on our Discord.
Acknowledgment
We thank UC Berkeley SkyLab, Kaggle, MBZUAI, a16z, Together AI, Hyperbolic, RunPod, Anyscale, and HuggingFace for their generous sponsorship.
📜 Rules (side-by-side mode)
- Ask any question to two chosen models (e.g., ChatGPT, Gemini, Claude, Llama) and vote for the better one!
- You can chat for multiple turns until you identify a winner.
Note: You can only chat with one image per conversation. Images must be smaller than 15 MB. Click the "Random Example" button to chat with a random image.
❗️ For research purposes, we log user prompts and images, and may release this data to the public in the future. Please do not upload any confidential or personal information.
🏔️ Chat with Large Vision-Language Models
❗️ For research purposes, we log user prompts and images, and may release this data to the public in the future. Please do not upload any confidential or personal information.
Note: You can only chat with one image per conversation. You can upload images less than 15MB. Click the "Random Example" button to chat with a random image.
LMSYS Chatbot Arena is a crowdsourced open platform for LLM evaluation. We have collected over 1,000,000 human pairwise comparisons to rank LLMs with the Bradley-Terry model, and we display the model ratings on an Elo scale. You can find more details in our paper. Chatbot Arena depends on community participation, so please contribute by casting your vote!
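The rating procedure described above can be sketched as follows. This is a minimal illustration of fitting Bradley-Terry strengths to pairwise battle outcomes, not the exact lmsys pipeline; the 1000-point anchor and the Elo-style scaling constant are assumptions for the example:

```python
import numpy as np

def fit_bradley_terry(battles, n_models, lr=0.1, steps=2000):
    """Fit Bradley-Terry log-strengths from pairwise battles.

    battles: list of (winner_idx, loser_idx) tuples.
    Returns ratings on an Elo-like scale, anchored so the mean is 1000.
    """
    s = np.zeros(n_models)  # log-strengths, one per model
    for _ in range(steps):
        grad = np.zeros(n_models)
        for w, l in battles:
            # P(w beats l) under Bradley-Terry = sigmoid(s_w - s_l)
            p = 1.0 / (1.0 + np.exp(-(s[w] - s[l])))
            grad[w] += 1.0 - p  # push the winner up
            grad[l] -= 1.0 - p  # push the loser down
        s += lr * grad / len(battles)
    # Map log-strengths to an Elo-like scale (400 / ln(10) points per unit)
    return 1000 + s * (400 / np.log(10))

# Toy example: model 0 beats model 1 in 7 of 10 battles
battles = [(0, 1)] * 7 + [(1, 0)] * 3
ratings = fit_bradley_terry(battles, n_models=2)
assert ratings[0] > ratings[1]
```

Because each battle contributes equal and opposite gradient terms, the mean rating stays at the anchor value throughout the fit.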
Total #models: 145. Total #votes: 1,898,013. Last updated: 2024-09-17.
Code to recreate leaderboard tables and plots in this notebook. You can contribute your vote at lmarena.ai!
Overall Questions
#models: 145 (100%) #votes: 1,898,013 (100%)
| Rank* (UB) | Model | Arena Score | 95% CI | Votes | Organization | License | Knowledge Cutoff |
|---|---|---|---|---|---|---|---|
| 104 | | 1355 | +12/-11 | 165503 | Cognitive Computations | Falcon-180B TII License | 2023/10 |
*Rank (UB): the model's upper-bound ranking, defined as one plus the number of models that are statistically better than the target model. Model A is statistically better than model B when A's lower-bound score is greater than B's upper-bound score (at a 95% confidence level). See Figure 1 below for a visualization of the confidence intervals of model scores.
Note: in each category, we exclude models with fewer than 300 votes as their confidence intervals can be large.
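The upper-bound rank defined above can be computed directly from the confidence intervals. A minimal sketch (the function name and toy data are illustrative, not from the lmsys codebase):

```python
def rank_upper_bound(lower, upper):
    """Rank (UB) = 1 + number of models whose CI lower bound
    exceeds this model's CI upper bound."""
    n = len(lower)
    ranks = []
    for i in range(n):
        better = sum(1 for j in range(n)
                     if j != i and lower[j] > upper[i])
        ranks.append(1 + better)
    return ranks

# Toy example: models 0 and 1 have overlapping intervals,
# so neither is statistically better; both clearly beat model 2.
lower = [1290, 1280, 1190]
upper = [1310, 1300, 1210]
print(rank_upper_bound(lower, upper))  # [1, 1, 3]
```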
More Statistics for Chatbot Arena (Overall)
Task Leaderboard
Chatbot Arena Overview
| Model | Overall | Overall w/ Style Control | Hard Prompts (Overall) | Hard Prompts (Overall) w/ Style Control | Instruction Following | Coding | Math | Multi-Turn | Longer Query |
|---|---|---|---|---|---|---|---|---|---|
| phi-3-mini-4k-instruct-june-2024 | 100 | 101 | 105 | 104 | 106 | 104 | 101 | 100 | 107 |
Language Leaderboard
Chatbot Arena Overview
| Model | English | Chinese | German | French | Spanish | Russian | Japanese | Korean |
|---|---|---|---|---|---|---|---|---|
| phi-3-mini-4k-instruct-june-2024 | 100 | 105 | 100 | 107 | 104 | 101 | 111 | 102 |
Total #models: 26. Total #votes: 101,842. Last updated: 2024-09-17.
Code to recreate leaderboard tables and plots in this notebook. You can contribute your vote at lmarena.ai!
Overall Questions
#models: 26 (100%) #votes: 101,842 (100%)
| Rank* (UB) | Model | Arena Score | 95% CI | Votes | Organization | License | Knowledge Cutoff |
|---|---|---|---|---|---|---|---|
| 13 | | 1231 | +11/-12 | 10089 | Anthropic | Proprietary | 2023/11 |
*Rank (UB): the model's upper-bound ranking, defined as one plus the number of models that are statistically better than the target model. Model A is statistically better than model B when A's lower-bound score is greater than B's upper-bound score (at a 95% confidence level). See Figure 1 below for a visualization of the confidence intervals of model scores.
Note: in each category, we exclude models with fewer than 300 votes as their confidence intervals can be large.
More Statistics for Chatbot Arena (Overall)
Last Updated: 2024-07-31
Arena-Hard-Auto v0.1 - an automatic evaluation tool for instruction-tuned LLMs with 500 challenging user queries curated from Chatbot Arena.
We prompt GPT-4-Turbo as a judge to compare each model's responses against a baseline model (default: GPT-4-0314). If you are curious how well your model might perform on Chatbot Arena, we recommend trying Arena-Hard-Auto. Check out our paper for more details on how Arena-Hard-Auto works as a fully automated data pipeline that converts crowdsourced data into high-quality benchmarks: [Paper | Repo]
| Rank* (UB) | Model | Win-rate | 95% CI | Average Tokens | Organization |
|---|---|---|---|---|---|
| 10 | | 82.63 | +2.0/-1.9 | 662 | DeepSeek AI |
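The win-rate and 95% CI columns above can be estimated from per-query judge verdicts with a simple bootstrap. The following sketch is illustrative only; the real Arena-Hard-Auto pipeline differs in detail:

```python
import random

def winrate_ci(outcomes, n_boot=2000, seed=0):
    """outcomes: 1.0 = model beats the baseline on a query,
    0.5 = tie, 0.0 = loss.
    Returns (win-rate, 2.5th pct, 97.5th pct) via bootstrap resampling."""
    rng = random.Random(seed)
    n = len(outcomes)
    point = sum(outcomes) / n
    means = sorted(
        sum(rng.choice(outcomes) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int(0.025 * n_boot)]
    hi = means[int(0.975 * n_boot)]
    return point, lo, hi

# Toy data: 60 wins, 10 ties, 30 losses against the baseline
outcomes = [1.0] * 60 + [0.5] * 10 + [0.0] * 30
wr, lo, hi = winrate_ci(outcomes)
assert lo <= wr <= hi
```

The asymmetric +x/-y intervals in the table arise naturally here, since the bootstrap percentiles need not be symmetric around the point estimate.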
Three benchmarks are displayed: Arena Elo, MT-Bench and MMLU.
- Chatbot Arena - a crowdsourced, randomized battle platform. We use 1M+ user votes to compute model strength.
- MT-Bench: a set of challenging multi-turn questions. We use GPT-4 to grade the model responses.
- MMLU (5-shot): a test to measure a model's multitask accuracy on 57 tasks.
💻 Code: The MT-bench scores (single-answer grading on a scale of 10) are computed by fastchat.llm_judge. The MMLU scores are mostly computed by InstructEval. Higher values are better for all benchmarks. Empty cells mean not available.
| Model | Arena Score | arena-hard-auto | MT-bench | MMLU | Organization | License |
|---|---|---|---|---|---|---|
| | 1355 | 79.21 | 9.32 | 88.7 | Cognitive Computations | Falcon-180B TII License |
Citation
Please cite the following paper if you find our leaderboard or dataset helpful.
@misc{chiang2024chatbot,
title={Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference},
author={Wei-Lin Chiang and Lianmin Zheng and Ying Sheng and Anastasios Nikolas Angelopoulos and Tianle Li and Dacheng Li and Hao Zhang and Banghua Zhu and Michael Jordan and Joseph E. Gonzalez and Ion Stoica},
year={2024},
eprint={2403.04132},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
About Us
Chatbot Arena is an open-source research project developed by members of LMSYS and UC Berkeley SkyLab. Our mission is to build an open platform to evaluate LLMs by human preference in the real world. We open-source our FastChat project on GitHub and release chat and human-feedback datasets. We invite everyone to join us!
Open-source contributors
- Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Lisa Dunlap, Anastasios Angelopoulos, Christopher Chou, Tianle Li, Siyuan Zhuang
- Advisors: Ion Stoica, Joseph E. Gonzalez, Hao Zhang, Trevor Darrell
Learn more
Contact Us
- Follow us on X or Discord, or email us at lmsys.org@gmail.com
- File issues on GitHub
- Download our datasets and models on HuggingFace
Acknowledgment
We thank the SkyPilot and Gradio teams for their system support. We also thank UC Berkeley SkyLab, Kaggle, MBZUAI, a16z, Together AI, Hyperbolic, RunPod, Anyscale, and HuggingFace for their generous sponsorship. Learn more about partnership here.