From votes to a leaderboard
Thank you for your contributions!
The arena ranking is based on all votes and reactions from the blind comparison of the models, collected since the service opened to the public in October 2024.
Developed in partnership with the Digital Regulation Expertise Center (PEReN), the model ranking is based on a satisfaction score calculated using the Bradley-Terry statistical model, a widely used method for converting binary votes into a probabilistic ranking.
The compar:IA ranking is not intended to be an official recommendation or to evaluate the technical performance of the models. It reflects the subjective preferences of the platform's users and not the factual accuracy or veracity of the responses.
Updated on 6/16/2026
Column notes:
- BT satisfaction score: Estimated statistical score based on the Bradley-Terry model, reflecting the probability that one model is preferred over another. This score is calculated from all user votes and reactions. For more information, visit the methodology tab.
- Confidence (±): Interval indicating the reliability of the rank: the narrower the interval, the more reliable the rank estimate. There is a 95% chance that the model's true rank is within this range.
- Average energy (per 1000 tokens): Measured in watt-hours (values in the table are shown in milliwatt-hours, mWh), energy consumption represents the electricity used by the model to process a request and generate the corresponding response. Model energy consumption depends on the model's size and architecture. We have chosen to display proprietary models for which we do not have transparent information about their size and architecture in gray as "unanalyzed" (N/A).
- Size (parameters): Model size in billions of parameters, categorized into five classes. For proprietary models, this size is not reported.
- Architecture: The architecture of an LLM refers to the design principles that define how the components of a neural network are arranged and interact to transform input data into predictive outputs, including the activation mode of parameters (dense vs. sparse), the specialization of components, and the information processing mechanisms (transformers, convolutional networks, hybrid architectures).

| Model (rank) | BT satisfaction score | Confidence (±) | Total votes | Average energy (per 1000 tokens) | Size (parameters) | Architecture | Release date | Organization | Licence | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1153 | -7/+1 | 3320 | N / A | XL - (estimate) | Proprietary | 12/25 | |||
| 2 | 1146 | -7/+2 | 5604 | N / A | L - (estimate) | Proprietary | 8/25 | Mistral AI | ||
| 3 | 1137 | -11/+3 | 4368 | 4134 mWh | XL - 675 billion | MoE | 12/25 | Mistral AI | ||
| 4 | 1134 | -12/+4 | 3081 | N / A | XL - (estimate) | Proprietary | 6/25 | |||
| 5 | 1131 | -18/+5 | 2221 | N / A | L - (estimate) | Proprietary | 3/26 | |||
| 6 | 1130 | -17/+6 | 4267 | N / A | XL - (estimate) | Proprietary | 2/26 | Anthropic | ||
| 7 | 1119 | -19/+5 | 3039 | N / A | XL - (estimate) | Proprietary | 9/25 | Alibaba | ||
| 8 | 1119 | -20/+8 | 588 | 1892 mWh | L - 355 billion | MoE | 7/25 | Zhipu | ||
| 9 | 1119 | -17/+7 | 6362 | N / A | XL - (estimate) | Proprietary | 12/24 | |||
| 10 | 1114 | -18/+10 | 868 | N / A | XL - (estimate) | Proprietary | 11/25 | |||
| 11 | 1114 | -17/+9 | 1532 | 84 mWh | S - 26 billion | MoE | 4/26 | |||
| 12 | 1109 | -26/+11 | 499 | 1524 mWh | L - 398 billion | MoE | 4/26 | Arcee | ||
| 13 | 1107 | -15/+10 | 1530 | 117 mWh | S - 31 billion | Dense | 4/26 | |||
| 14 | 1106 | -15/+12 | 1252 | 332 mWh | M - 80 billion | MoE | 2/26 | Alibaba | ||
| 15 | 1105 | -13/+11 | 2799 | N / A | XL - (estimate) | Proprietary | 2/26 | |||
| 16 | 1105 | -12/+12 | 3622 | 3979 mWh | XL - 685 billion | MoE | 3/25 | DeepSeek | ||
| 17 | 1104 | -11/+13 | 2004 | 347 mWh | L - 119 billion | MoE | 3/26 | Mistral AI | ||
| 18 | 1103 | -10/+14 | 2551 | N / A | L - (estimate) | Proprietary | 3/26 | OpenAI | ||
| 19 | 1103 | -17/+17 | 766 | N / A | L - (estimate) | Proprietary | 11/25 | OpenAI | ||
| 20 | 1103 | -16/+17 | 1195 | N / A | L - (estimate) | Proprietary | 6/25 | Mistral AI | ||
| 21 | 1099 | -15/+17 | 1120 | N / A | L - (estimate) | Proprietary | 12/25 | OpenAI | ||
| 22 | 1095 | -19/+18 | 839 | N / A | XL - (estimate) | Proprietary | 5/26 | |||
| 23 | 1094 | -13/+17 | 3783 | 3979 mWh | XL - 671 billion | MoE | 12/24 | DeepSeek | ||
| 24 | 1094 | -14/+18 | 1424 | N / A | L - (estimate) | Proprietary | 4/26 | OpenAI | ||
| 25 | 1093 | -13/+19 | 1163 | 3979 mWh | XL - 685 billion | MoE | 8/25 | DeepSeek | ||
| 26 | 1091 | -9/+18 | 6466 | 112 mWh | S - 27 billion | Dense | 3/25 | |||
| 27 | 1090 | -20/+23 | 487 | 1892 mWh | L - 357 billion | MoE | 9/25 | Zhipu | ||
| 28 | 1078 | -19/+20 | 952 | 3785 mWh | XL - 1000 billion | MoE | 4/26 | Moonshot AI | ||
| 29 | 1076 | -14/+13 | 4677 | N / A | XL - (estimate) | Proprietary | 9/25 | Anthropic | ||
| 30 | 1069 | -19/+13 | 2142 | N / A | S - (estimate) | Proprietary | 3/26 | OpenAI | ||
| 31 | 1068 | -18/+14 | 1767 | 3979 mWh | XL - 685 billion | MoE | 5/25 | DeepSeek | ||
| 32 | 1066 | -18/+14 | 1993 | N / A | XS - (estimate) | Proprietary | 3/26 | OpenAI | ||
| 33 | 1065 | -17/+16 | 2002 | 3785 mWh | XL - 1000 billion | MoE | 1/26 | Moonshot AI | ||
| 34 | 1063 | -15/+8 | 6318 | 94 mWh | XS - 12 billion | Dense | 3/25 | |||
| 35 | 1061 | -18/+18 | 1128 | N / A | XL - (estimate) | Proprietary | 4/26 | Alibaba | ||
| 36 | 1059 | -19/+19 | 969 | 3785 mWh | XL - 1000 billion | MoE | 11/25 | Moonshot AI | ||
| 37 | 1059 | -14/+11 | 4194 | 109 mWh | S - 24 billion | Dense | 6/25 | Mistral AI | ||
| 38 | 1058 | -16/+15 | 1160 | 8890 mWh | XL - 1600 billion | MoE | 4/26 | DeepSeek | ||
| 39 | 1058 | -15/+16 | 1150 | 1524 mWh | L - 284 billion | MoE | 4/26 | DeepSeek | ||
| 40 | 1058 | -11/+14 | 2472 | 3979 mWh | XL - 685 billion | MoE | 12/25 | DeepSeek | ||
| 41 | 1055 | -21/+24 | 633 | 1951 mWh | XL - 480 billion | MoE | 7/25 | Alibaba | ||
| 42 | 1054 | -11/+15 | 2238 | N / A | XL - (estimate) | Proprietary | 5/25 | Anthropic | ||
| 43 | 1053 | -9/+15 | 4824 | 857 mWh | L - 111 billion | Dense | 3/25 | Cohere | ||
| 44 | 1052 | -13/+17 | 1782 | 3785 mWh | XL - 1000 billion | MoE | 9/25 | Moonshot AI | ||
| 45 | 1046 | -17/+17 | 1454 | N / A | L - (estimate) | Proprietary | 3/26 | OpenAI | ||
| 46 | 1045 | -13/+18 | 2736 | N / A | XL - (estimate) | Proprietary | 2/25 | Anthropic | ||
| 47 | 1040 | -15/+17 | 2319 | 109 mWh | S - 24 billion | Dense | 6/25 | Mistral AI | ||
| 48 | 1036 | -14/+13 | 4920 | 658 mWh | M - 70 billion | Dense | 10/24 | Nvidia | ||
| 49 | 1035 | -16/+16 | 1988 | 1601 mWh | L - 397 billion | MoE | 2/26 | Alibaba | ||
| 50 | 1030 | -27/+22 | 456 | 118 mWh | S - 32 billion | Dense | 4/25 | Alibaba | ||
| 51 | 1030 | -14/+11 | 7137 | 84 mWh | XS - 4 billion | Dense | 3/25 | |||
| 52 | 1023 | -20/+10 | 2315 | 3979 mWh | XL - 671 billion | MoE | 1/25 | DeepSeek | ||
| 53 | 1022 | -23/+16 | 987 | 4095 mWh | XL - 744 billion | MoE | 4/26 | Zhipu | ||
| 54 | 1022 | -17/+11 | 4968 | N / A | M - (estimate) | Proprietary | 4/25 | OpenAI | ||
| 55 | 1020 | -17/+10 | 4314 | 84 mWh | XS - 8 billion | Matformer | 5/25 | |||
| 56 | 1020 | -26/+26 | 373 | N / A | XL - (estimate) | Proprietary | 5/26 | Alibaba | ||
| 57 | 1019 | -16/+12 | 3148 | 342 mWh | L - 117 billion | MoE | 8/25 | OpenAI | ||
| 58 | 1016 | -21/+20 | 923 | 733 mWh | L - 230 billion | MoE | 10/25 | MiniMax | ||
| 59 | 1016 | -17/+16 | 1512 | 4095 mWh | XL - 744 billion | MoE | 2/26 | Zhipu | ||
| 60 | 1013 | -17/+16 | 1600 | 733 mWh | L - 229 billion | MoE | 2/26 | MiniMax | ||
| 61 | 1013 | -16/+16 | 1837 | N / A | S - (estimate) | Proprietary | 8/25 | OpenAI | ||
| 62 | 1013 | -16/+18 | 1412 | 82 mWh | S - 24 billion | MoE | 2/26 | Liquid | ||
| 63 | 1008 | -14/+12 | 4204 | 1601 mWh | XL - 400 billion | MoE | 4/25 | Meta | ||
| 64 | 1007 | -13/+13 | 5761 | N / A | XL - (estimate) | Proprietary | 9/24 | |||
| 65 | 1005 | -18/+16 | 1240 | 1892 mWh | L - 357 billion | MoE | 12/25 | Zhipu | ||
| 66 | 1004 | -17/+17 | 1428 | 376 mWh | L - 120 billion | MoE | 3/26 | Nvidia | ||
| 67 | 1002 | -16/+16 | 2160 | 166 mWh | S - 35 billion | MoE | 2/26 | Alibaba | ||
| 68 | 1001 | -15/+20 | 1069 | 733 mWh | L - 230 billion | MoE | 3/26 | MiniMax | ||
| 69 | 994 | -14/+18 | 1873 | 106 mWh | S - 22 billion | Dense | 12/25 | EuroLLM | ||
| 70 | 993 | -12/+17 | 3257 | N / A | S - (estimate) | Proprietary | 2/25 | Mistral AI | ||
| 71 | 992 | -11/+17 | 6454 | 400 mWh | L - 109 billion | MoE | 4/25 | Meta | ||
| 72 | 992 | -12/+22 | 1358 | N / A | L - (estimate) | Proprietary | 8/25 | OpenAI | ||
| 73 | 983 | -14/+12 | 3423 | 109 mWh | S - 24 billion | Dense | 3/25 | Mistral AI | ||
| 74 | 980 | -20/+24 | 513 | 89 mWh | XS - 8 billion | Dense | 4/26 | IBM | ||
| 75 | 980 | -15/+21 | 1542 | N / A | S - (estimate) | Proprietary | 4/25 | OpenAI | ||
| 76 | 976 | -14/+14 | 1628 | 81 mWh | XS - 8 billion | MoE | 10/25 | Liquid | ||
| 77 | 970 | -16/+13 | 1488 | 1524 mWh | L - 398 billion | MoE | 1/26 | Arcee | ||
| 78 | 967 | -13/+8 | 3899 | 83 mWh | S - 21 billion | MoE | 8/25 | OpenAI | ||
| 79 | 967 | -16/+15 | 1058 | 83 mWh | S - 30 billion | MoE | 5/25 | Alibaba | ||
| 80 | 962 | -13/+8 | 3245 | 658 mWh | M - 70 billion | Dense | 9/25 | Swiss AI | ||
| 81 | 960 | -13/+8 | 3479 | 118 mWh | S - 32 billion | Dense | 12/24 | Cohere | ||
| 82 | 959 | -11/+8 | 7402 | 658 mWh | M - 70 billion | Dense | 12/24 | Meta | ||
| 83 | 957 | -13/+18 | 679 | 112 mWh | S - 27 billion | Dense | 6/24 | |||
| 84 | 955 | -11/+11 | 2272 | 109 mWh | S - 24 billion | Dense | 1/25 | Mistral AI | ||
| 85 | 952 | -10/+13 | 1142 | N / A | S - (estimate) | Proprietary | 11/24 | OpenAI | ||
| 86 | 950 | -8/+13 | 4940 | N / A | S - (estimate) | Proprietary | 7/24 | OpenAI | ||
| 87 | 941 | -8/+11 | 3959 | N / A | S - (estimate) | Proprietary | 4/25 | OpenAI | ||
| 88 | 937 | -13/+13 | 1347 | N / A | XS - (estimate) | Proprietary | 4/25 | OpenAI | ||
| 89 | 935 | -10/+10 | 4209 | N / A | XL - (estimate) | Proprietary | 10/24 | Anthropic | ||
| 90 | 929 | -10/+10 | 4407 | 658 mWh | M - 70 billion | Dense | 7/24 | Meta | ||
| 91 | 929 | -10/+16 | 905 | 89 mWh | XS - 8 billion | Dense | 10/24 | Cohere | ||
| 92 | 921 | -9/+6 | 6664 | 96 mWh | XS - 14 billion | Dense | 12/24 | Microsoft | ||
| 93 | 921 | -8/+10 | 3958 | N / A | XL - (estimate) | Proprietary | 8/24 | OpenAI | ||
| 94 | 915 | -6/+9 | 7937 | 9134 mWh | XL - 405 billion | Dense | 7/24 | Meta | ||
| 95 | 905 | -7/+9 | 3994 | 90 mWh | XS - 9 billion | Dense | 6/24 | |||
| 96 | 905 | -8/+11 | 985 | 118 mWh | S - 32 billion | Dense | 4/25 | Alibaba | ||
| 97 | 904 | -9/+26 | 142 | 118 mWh | S - 32 billion | Dense | 9/24 | Alibaba | ||
| 98 | 899 | -5/+10 | 2816 | 658 mWh | M - 70 billion | Dense | 8/25 | Nous | ||
| 99 | 898 | -6/+13 | 1693 | 658 mWh | M - 70 billion | Dense | 1/25 | DeepSeek | ||
| 100 | 891 | -5/+6 | 4818 | 9134 mWh | XL - 405 billion | Dense | 7/24 | Nous | ||
| 101 | 873 | -4/+8 | 1315 | 88 mWh | XS - 7 billion | Dense | 9/24 | Alibaba | ||
| 102 | 872 | -3/+6 | 7502 | 89 mWh | XS - 8 billion | Dense | 7/24 | Meta | ||
| 103 | 862 | -1/+11 | 752 | 118 mWh | S - 32 billion | Dense | 11/25 | Ai2 | ||
| 104 | 824 | -2/+2 | 2297 | 193 mWh | S - 56 billion | MoE | 12/23 | Mistral AI | ||
| 105 | 807 | -2/+2 | 2789 | N / A | S - (estimate) | Proprietary | 9/24 | Liquid | ||
| 106 | 800 | -2/+3 | 2383 | 83 mWh | XS - 3.8 billion | Dense | 8/24 | Microsoft | ||
| 107 | 773 | -2/+2 | 5046 | 94 mWh | XS - 12 billion | Dense | 7/24 | Mistral AI | ||
| 108 | 758 | -1/+2 | 4355 | 1063 mWh | L - 176 billion | MoE | 4/24 | Mistral AI | ||
| 109 | 717 | -3/+0 | 1386 | 88 mWh | XS - 14 billion | Dense | 2/25 | jpacifico | ||
| 110 | 709 | -2/+6 | 65 | 90 mWh | XS - 9 billion | Dense | 5/24 | 01-ai | ||
| 111 | 695 | -1/+2 | 308 | 96 mWh | XS - 14 billion | Dense | 9/24 | jpacifico | ||
| 112 | 673 | -0/+3 | 80 | 88 mWh | XS - 7 billion | Dense | 7/24 | Alibaba |
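The page does not say how the Confidence (±) rank interval is computed. One common way to obtain such an interval is a bootstrap: resample the duels with replacement, refit the scores, and record where the model's rank falls. The sketch below illustrates that idea on made-up duels. The function names, the model labels "A", "B", "C", the number of resamples and the simple iterative Bradley-Terry fit are all assumptions for illustration, not the platform's actual procedure.

```python
import random
from collections import defaultdict

def bt_strengths(duels, iterations=100):
    """Fit Bradley-Terry strengths from (winner, loser) pairs with MM updates."""
    wins = defaultdict(int)
    games = defaultdict(lambda: defaultdict(int))
    for winner, loser in duels:
        wins[winner] += 1
        games[winner][loser] += 1
        games[loser][winner] += 1
    s = {m: 1.0 for m in games}
    for _ in range(iterations):
        new = {
            i: wins[i] / sum(n / (s[i] + s[j]) for j, n in games[i].items())
            for i in games
        }
        total = sum(new.values())
        s = {m: v / total for m, v in new.items()}  # keep strengths summing to 1
    return s

def ranks(strengths):
    """Map each model to its rank (1 = highest strength)."""
    ordered = sorted(strengths, key=strengths.get, reverse=True)
    return {m: r for r, m in enumerate(ordered, start=1)}

def bootstrap_rank_interval(duels, model, n_boot=200, seed=0):
    """Approximate 95% interval for a model's rank via duel resampling."""
    rng = random.Random(seed)
    sampled = []
    for _ in range(n_boot):
        resample = rng.choices(duels, k=len(duels))
        sampled.append(ranks(bt_strengths(resample))[model])
    sampled.sort()
    return sampled[int(0.025 * n_boot)], sampled[int(0.975 * n_boot) - 1]

# Hypothetical duels: A clearly ahead of B, B clearly ahead of C.
duels = (
    [("A", "B")] * 90 + [("B", "A")] * 10
    + [("B", "C")] * 90 + [("C", "B")] * 10
    + [("A", "C")] * 95 + [("C", "A")] * 5
)
interval_a = bootstrap_rank_interval(duels, "A")
interval_c = bootstrap_rank_interval(duels, "C")
```

With outcomes this lopsided the resampled ranks barely move, so the intervals are narrow. With fewer votes or closer matchups the interval widens, which is the behaviour the Confidence (±) column is meant to convey.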
Are the most popular models energy efficient?
This graph shows the satisfaction score (Bradley-Terry score) of each model as a function of its estimated average energy consumption per 1000 tokens. Energy consumption is estimated using the Ecologits methodology and is based on two parameters: the size of the models (number of parameters) and their architecture. For proprietary models, this information is either not provided or only partially available, so they are excluded from the graph below.
Bradley-Terry (BT) Satisfaction Score VS Average Consumption per 1000 Tokens
Model architecture
- MoE: The Mixture of Experts (MoE) architecture uses a routing mechanism to activate, depending on the input, only certain specialized subsets (“experts”) of the neural network. This allows for the construction of very large models while keeping computational costs low, because only a part of the network is used at each step.
- Dense: Dense architecture refers to a type of neural network in which each neuron in one layer is connected to all neurons in the next layer. This allows all parameters in the layer to contribute to the output calculation.
- Matformer: Imagine **Russian dolls** (matryoshkas → matryoshka transformer → Matformer): each block contains several **nested sub-models** of increasing size that share the same parameters. For each request, this makes it possible to select a model of suitable capacity, depending on the available memory or latency, without retraining separate models.
How to find the right balance between perceived performance and energy efficiency?
Examples of how to read the graph
- The higher a model is located on the graph, the higher its Bradley-Terry satisfaction score. The further to the left a model is located, the less energy it consumes compared to other models.
- At the top left are the models that are popular and consume less energy compared to other models.
- Beyond size, architecture has an impact on the average energy consumption of models: for example, at a similar size, the Llama 3 405B model (dense architecture, 405 billion parameters) consumes about 5 times more energy on average than the GLM 4.5 model (MoE architecture, 355 billion parameters, of which 32 billion are active): 9134 mWh versus 1892 mWh per 1000 tokens in the table.
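The "top left" reading above can be made concrete with a Pareto filter: a model is worth a closer look if no other model has both a higher-or-equal score and a lower-or-equal energy estimate. The sketch below applies that idea to a handful of (score, energy) pairs copied from the table on this page. The labels are hypothetical descriptions built from the size and architecture columns, and the function and type names are illustrative.

```python
from typing import NamedTuple

class Point(NamedTuple):
    label: str       # hypothetical label built from size and architecture
    score: int       # BT satisfaction score
    energy_mwh: int  # average energy per 1000 tokens, in mWh

def pareto_front(points):
    """Keep points that no other point beats on both score and energy."""
    front = []
    for p in points:
        dominated = any(
            q.score >= p.score
            and q.energy_mwh <= p.energy_mwh
            and (q.score > p.score or q.energy_mwh < p.energy_mwh)
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front, key=lambda p: p.energy_mwh)

# A subset of rows from the table (not the full leaderboard).
POINTS = [
    Point("XS 8B MoE", 976, 81),
    Point("S 24B MoE", 1013, 82),
    Point("S 26B MoE", 1114, 84),
    Point("M 80B MoE", 1106, 332),
    Point("M 70B Dense", 1036, 658),
    Point("L 355B MoE", 1119, 1892),
    Point("XL 675B MoE", 1137, 4134),
    Point("XL 405B Dense", 915, 9134),
]

front = pareto_front(POINTS)
for p in front:
    print(f"{p.label}: score {p.score}, {p.energy_mwh} mWh")
```

On this subset, the 80B MoE and 70B dense rows drop out because the 26B MoE row scores higher while using less energy. The front would differ if the full table were used.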
Why are the proprietary models not displayed on the graph?
The estimation of energy consumption for model inference relies on the Ecologits methodology, which takes into account the size and architecture of the models. However, this information is not made public by model developers for proprietary models.
We have therefore decided not to include proprietary models in the graph until the information used to calculate their energy consumption is made transparent.
How is the energy impact of the models calculated?
The arena uses the methodology developed by Ecologits (GenAI Impact) to estimate the energy footprint of inference for conversational generative AI models. This estimate allows users to compare the environmental impact of different AI models for the same query. This transparency is essential to encourage the development and adoption of more eco-friendly AI models.
Ecologits applies the principles of life cycle assessment (LCA) in accordance with ISO 14044, focusing for the moment on the impact of inference (i.e., the use of models to answer queries) and the manufacturing of graphics cards (resource extraction, manufacturing and transport).
The model's power consumption is estimated by taking into account various parameters such as the size and architecture of the AI model used, the location of the servers where the models are deployed, and the number of output tokens. The calculation of the global warming potential indicator, expressed in CO2 equivalent, is derived from the measurement of the model's power consumption.
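As a rough illustration of the arithmetic described above, the sketch below scales a per-1000-token energy figure to a response length and converts the result to CO2 equivalent using a grid carbon intensity. It covers usage-phase electricity only and is not the Ecologits implementation. The function names are illustrative, and the grid intensity value is a placeholder chosen for the example, not a measured figure.

```python
def energy_wh(output_tokens: int, mwh_per_1000_tokens: float) -> float:
    """Energy in Wh for a response of `output_tokens` tokens."""
    mwh = mwh_per_1000_tokens * output_tokens / 1000  # scale to response length
    return mwh / 1000                                  # mWh -> Wh

def emissions_g_co2eq(energy_in_wh: float, grid_g_co2eq_per_kwh: float) -> float:
    """Usage-phase emissions in grams of CO2 equivalent."""
    return energy_in_wh / 1000 * grid_g_co2eq_per_kwh  # Wh -> kWh, then x intensity

# Placeholder intensity for illustration only; real values depend on server location.
PLACEHOLDER_GRID_INTENSITY = 50.0  # gCO2eq per kWh (hypothetical)

# 1892 mWh per 1000 tokens is one of the values listed in the table on this page.
wh = energy_wh(output_tokens=500, mwh_per_1000_tokens=1892)
grams = emissions_g_co2eq(wh, PLACEHOLDER_GRID_INTENSITY)
print(f"{wh:.3f} Wh, {grams:.4f} gCO2eq")
```

The grid intensity is where server location enters the calculation: the same response produces different emissions depending on the electricity mix of the country hosting the model.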
It is important to note that methodologies for assessing the environmental impact of AI are still under development.
Chart data in table form
Updated on 6/16/2026
| BT satisfaction score | Average energy (per 1000 tokens) | Size (parameters) | Architecture | Organization | Licence | |
|---|---|---|---|---|---|---|
| 976 | 81 mWh | XS - 8 billion | MoE | Liquid | Open-weight | |
| 1013 | 82 mWh | S - 24 billion | MoE | Liquid | Open-weight | |
| 967 | 83 mWh | S - 21 billion | MoE | OpenAI | Open-weight | |
| 967 | 83 mWh | S - 30 billion | MoE | Alibaba | Open-weight | |
| 800 | 83 mWh | XS - 3.8 billion | Dense | Microsoft | Open-weight | |
| 1114 | 84 mWh | S - 26 billion | MoE | | Open-weight | |
| 1030 | 84 mWh | XS - 4 billion | Dense | | Open-weight | |
| 1020 | 84 mWh | XS - 8 billion | Matformer | | Open-weight | |
| 873 | 88 mWh | XS - 7 billion | Dense | Alibaba | Open-weight | |
| 717 | 88 mWh | XS - 14 billion | Dense | jpacifico | Open-weight | |
| 673 | 88 mWh | XS - 7 billion | Dense | Alibaba | Open-weight | |
| 980 | 89 mWh | XS - 8 billion | Dense | IBM | Open-weight | |
| 929 | 89 mWh | XS - 8 billion | Dense | Cohere | Open-weight | |
| 872 | 89 mWh | XS - 8 billion | Dense | Meta | Open-weight | |
| 905 | 90 mWh | XS - 9 billion | Dense | | Open-weight | |
| 709 | 90 mWh | XS - 9 billion | Dense | 01-ai | Open-weight | |
| 1063 | 94 mWh | XS - 12 billion | Dense | | Open-weight | |
| 773 | 94 mWh | XS - 12 billion | Dense | Mistral AI | Open-weight | |
| 921 | 96 mWh | XS - 14 billion | Dense | Microsoft | Open-weight | |
| 695 | 96 mWh | XS - 14 billion | Dense | jpacifico | Open-weight | |
| 994 | 106 mWh | S - 22 billion | Dense | EuroLLM | Open-weight | |
| 1059 | 109 mWh | S - 24 billion | Dense | Mistral AI | Open-weight | |
| 1040 | 109 mWh | S - 24 billion | Dense | Mistral AI | Open-weight | |
| 983 | 109 mWh | S - 24 billion | Dense | Mistral AI | Open-weight | |
| 955 | 109 mWh | S - 24 billion | Dense | Mistral AI | Open-weight | |
| 1091 | 112 mWh | S - 27 billion | Dense | | Open-weight | |
| 957 | 112 mWh | S - 27 billion | Dense | | Open-weight | |
| 1107 | 117 mWh | S - 31 billion | Dense | | Open-weight | |
| 1030 | 118 mWh | S - 32 billion | Dense | Alibaba | Open-weight | |
| 960 | 118 mWh | S - 32 billion | Dense | Cohere | Open-weight | |
| 905 | 118 mWh | S - 32 billion | Dense | Alibaba | Open-weight | |
| 904 | 118 mWh | S - 32 billion | Dense | Alibaba | Open-weight | |
| 862 | 118 mWh | S - 32 billion | Dense | Ai2 | Open source | |
| 1002 | 166 mWh | S - 35 billion | MoE | Alibaba | Open-weight | |
| 824 | 193 mWh | S - 56 billion | MoE | Mistral AI | Open-weight | |
| 1106 | 332 mWh | M - 80 billion | MoE | Alibaba | Open-weight | |
| 1019 | 342 mWh | L - 117 billion | MoE | OpenAI | Open-weight | |
| 1104 | 347 mWh | L - 119 billion | MoE | Mistral AI | Open-weight | |
| 1004 | 376 mWh | L - 120 billion | MoE | Nvidia | Open-weight | |
| 992 | 400 mWh | L - 109 billion | MoE | Meta | Open-weight | |
| 1036 | 658 mWh | M - 70 billion | Dense | Nvidia | Open-weight | |
| 962 | 658 mWh | M - 70 billion | Dense | Swiss AI | Open source | |
| 959 | 658 mWh | M - 70 billion | Dense | Meta | Open-weight | |
| 929 | 658 mWh | M - 70 billion | Dense | Meta | Open-weight | |
| 899 | 658 mWh | M - 70 billion | Dense | Nous | Open-weight | |
| 898 | 658 mWh | M - 70 billion | Dense | DeepSeek | Open-weight | |
| 1016 | 733 mWh | L - 230 billion | MoE | MiniMax | Open-weight | |
| 1013 | 733 mWh | L - 229 billion | MoE | MiniMax | Open-weight | |
| 1001 | 733 mWh | L - 230 billion | MoE | MiniMax | Open-weight | |
| 1053 | 857 mWh | L - 111 billion | Dense | Cohere | Open-weight | |
| 758 | 1063 mWh | L - 176 billion | MoE | Mistral AI | Open-weight | |
| 1109 | 1524 mWh | L - 398 billion | MoE | Arcee | Open-weight | |
| 1058 | 1524 mWh | L - 284 billion | MoE | DeepSeek | Open-weight | |
| 970 | 1524 mWh | L - 398 billion | MoE | Arcee | Open-weight | |
| 1035 | 1601 mWh | L - 397 billion | MoE | Alibaba | Open-weight | |
| 1008 | 1601 mWh | XL - 400 billion | MoE | Meta | Open-weight | |
| 1119 | 1892 mWh | L - 355 billion | MoE | Zhipu | Open-weight | |
| 1090 | 1892 mWh | L - 357 billion | MoE | Zhipu | Open-weight | |
| 1005 | 1892 mWh | L - 357 billion | MoE | Zhipu | Open-weight | |
| 1055 | 1951 mWh | XL - 480 billion | MoE | Alibaba | Open-weight | |
| 1078 | 3785 mWh | XL - 1000 billion | MoE | Moonshot AI | Open-weight | |
| 1065 | 3785 mWh | XL - 1000 billion | MoE | Moonshot AI | Open-weight | |
| 1059 | 3785 mWh | XL - 1000 billion | MoE | Moonshot AI | Open-weight | |
| 1052 | 3785 mWh | XL - 1000 billion | MoE | Moonshot AI | Open-weight | |
| 1105 | 3979 mWh | XL - 685 billion | MoE | DeepSeek | Open-weight | |
| 1094 | 3979 mWh | XL - 671 billion | MoE | DeepSeek | Open-weight | |
| 1093 | 3979 mWh | XL - 685 billion | MoE | DeepSeek | Open-weight | |
| 1068 | 3979 mWh | XL - 685 billion | MoE | DeepSeek | Open-weight | |
| 1058 | 3979 mWh | XL - 685 billion | MoE | DeepSeek | Open-weight | |
| 1023 | 3979 mWh | XL - 671 billion | MoE | DeepSeek | Open-weight | |
| 1022 | 4095 mWh | XL - 744 billion | MoE | Zhipu | Open-weight | |
| 1016 | 4095 mWh | XL - 744 billion | MoE | Zhipu | Open-weight | |
| 1137 | 4134 mWh | XL - 675 billion | MoE | Mistral AI | Open-weight | |
| 1058 | 8890 mWh | XL - 1600 billion | MoE | DeepSeek | Open-weight | |
| 915 | 9134 mWh | XL - 405 billion | Dense | Meta | Open-weight | |
| 891 | 9134 mWh | XL - 405 billion | Dense | Nous | Open-weight |
How to choose the model ranking method?
Since 2024, thousands of users have used the arena to compare the responses of different models, generating hundreds of thousands of votes. Simply counting the number of wins is not enough to establish a ranking. A fair system must be statistically robust, adjust after each matchup, and truly reflect the value of the performances achieved.
With this in mind, a ranking based on the Bradley-Terry model was established in collaboration with the teams of the French Center of expertise for digital platform regulation (PEReN), using all the votes and reactions collected on the platform. To learn more, see our methodological notebook.
Two ways to rank models
Ranking by win rate
Definition: An empirical ranking system for models based on the percentage of duels won by a model against all other models.
Main problems
- Game count bias: A model that has won three out of three duels has a 100% win rate, but this score is not very meaningful because it is based on very little data.
- No consideration of duel difficulty: beating a “beginner” or an “expert” model counts the same. Win rates are unfair since they do not take match difficulty into account.
- Stagnation: In the long run, many good models end up around 50% win rate because they are facing models of their own skill level, which makes the rankings less discriminating.
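The first problem is easy to reproduce. The sketch below computes raw win rates from a list of duels and ranks models by them. The duels and the model labels "A" to "D" are made up for illustration, and `win_rates` is a hypothetical helper, not part of the platform's code.

```python
from collections import defaultdict

def win_rates(duels):
    """Return {model: (wins, games, win_rate)} from (winner, loser) pairs."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in duels:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {m: (wins[m], games[m], wins[m] / games[m]) for m in games}

# Hypothetical data: C has played only three duels, A and B a hundred.
duels = [("C", "D")] * 3 + [("A", "B")] * 60 + [("B", "A")] * 40

rates = win_rates(duels)
ranking = sorted(rates, key=lambda m: rates[m][2], reverse=True)
for m in ranking:
    w, g, r = rates[m]
    print(f"{m}: {w}/{g} wins ({r:.0%})")
```

Model C tops this ranking with a 100% win rate from three duels, ahead of model A with 60 wins out of 100, even though far less is known about C.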
Bradley-Terry (BT) leaderboard
Definition: A ranking system in which the gain or loss of points depends on the result (victory, defeat or draw) and on the estimated level of the opponent: if a weaker model beats a stronger model, its progression in the ranking is greater.
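For reference, the standard Bradley-Terry formulation, which is not spelled out on this page, assigns each model $i$ a positive strength $\pi_i$ and models the probability that $i$ is preferred over $j$ as below. The symbols $\pi_i$ and $\beta_i$ are notation chosen here. How the published scores are rescaled from these strengths, and how draws are handled, is not specified in this section.

```latex
P(i \succ j) \;=\; \frac{\pi_i}{\pi_i + \pi_j}
\;=\; \frac{1}{1 + e^{-(\beta_i - \beta_j)}},
\qquad \beta_i = \log \pi_i
```

Only the difference $\beta_i - \beta_j$ matters, so beating a much stronger opponent is a less likely event than beating a weaker one and carries more weight in the fit.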
Benefits
- Probabilistic model: we can estimate the probable outcome of any matchup, even between models that have never faced each other directly.
- Taking match difficulty into account: the scores estimated from the Bradley-Terry model take into account the level of the opponents encountered, allowing for a fair comparison between models.
- Better uncertainty management: the confidence interval integrates the entire network of comparisons. This allows for a more accurate estimation of uncertainty, especially for models with few direct confrontations but many common opponents.
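The first benefit can be illustrated with a small fit. The sketch below estimates Bradley-Terry strengths from (winner, loser) pairs using the classic iterative minorization-maximization (MM) update, then predicts a matchup that never occurred in the data. It is a minimal sketch on made-up duels: draws are ignored, the function names are illustrative, and it is not the platform's own implementation.

```python
from collections import defaultdict

def fit_bradley_terry(duels, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs."""
    wins = defaultdict(int)                         # total wins per model
    games = defaultdict(lambda: defaultdict(int))   # games[i][j]: duels between i and j
    for winner, loser in duels:
        wins[winner] += 1
        games[winner][loser] += 1
        games[loser][winner] += 1
    strength = {m: 1.0 for m in games}
    for _ in range(iterations):
        # MM update: wins divided by the sum of n_ij / (s_i + s_j) over opponents.
        new = {
            i: wins[i] / sum(n / (strength[i] + strength[j])
                             for j, n in games[i].items())
            for i in games
        }
        total = sum(new.values())
        strength = {m: v / total for m, v in new.items()}  # fix the overall scale
    return strength

def win_probability(strength, i, j):
    """Predicted probability that model i is preferred over model j."""
    return strength[i] / (strength[i] + strength[j])

# Hypothetical duels: A meets B, B meets C, but A and C never meet.
duels = (
    [("A", "B")] * 7 + [("B", "A")] * 3
    + [("B", "C")] * 7 + [("C", "B")] * 3
)
strength = fit_bradley_terry(duels)
p_a_over_c = win_probability(strength, "A", "C")
print(f"P(A preferred over C) = {p_a_over_c:.3f}")
```

Because A and C share B as a common opponent, the fit can still place them relative to each other. A raw win rate has nothing to say about that pair.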
Impact of methodological choice on model ranking
Top 10 models in the ranking based on "empirical" win rates
Based solely on the average win rate, an overall ranking can be obtained, but this calculation assumes that each model has played against all others.
This method is not ideal because it requires data from all combinations of models, and as the number of models grows it quickly becomes expensive and cumbersome to maintain.
Top 10 models in the ranking based on estimated win rate with the Bradley-Terry model
The Bradley-Terry model transforms a set of local and potentially incomplete comparisons into a consistent and statistically robust global ranking system, whereas the empirical win rate remains limited to direct observations.