227,753 votes
Goal: 300,000
Discuss, vote, and help us reach this goal!
Your votes matter: they feed the compar:IA dataset, which is freely available to help refine future models in less-resourced languages.
This digital commons contributes to better respect for linguistic and cultural diversity in future language models.

From votes to a leaderboard

Thank you for your contributions!
The arena ranking is based on all votes and reactions from the blind comparison of the models, collected since the service opened to the public in October 2024.
Developed in partnership with the Digital Regulation Expertise Center (PEReN), the model ranking is based on a satisfaction score calculated with the Bradley-Terry statistical model, a widely used method for converting binary votes into a probabilistic ranking.
The compar:IA ranking is not intended to be an official recommendation or to evaluate the technical performance of the models. It reflects the subjective preferences of the platform's users and not the factual accuracy or veracity of the responses.
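As a rough illustration of how two Bradley-Terry scores translate into a preference probability, the sketch below assumes the scores sit on an Elo-like scale (base 10, 400 points per factor of 10 in odds). That scaling is an assumption made for illustration and is not stated on this page; the methodology notebook remains the reference for the actual scale. The function name and the `scale` parameter are hypothetical.

```python
def bt_preference_probability(score_a: float, score_b: float, scale: float = 400.0) -> float:
    """Probability that model A is preferred over model B.

    Assumes an Elo-like scale (hypothetical here): a gap of `scale`
    points corresponds to 10-to-1 odds in favour of the higher score.
    """
    return 1.0 / (1.0 + 10.0 ** ((score_b - score_a) / scale))


# Scores of the models ranked 1 and 2 in the leaderboard
p = bt_preference_probability(1153, 1146)
print(f"{p:.3f}")  # slightly above 0.5: a 7-point gap is a near coin flip
```

Under this assumed scale, small score gaps between neighbouring models correspond to preferences that are close to even, which is consistent with reading the ranking together with its confidence column.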

Total models: 112
Total votes: 228,000

Updated on 6/16/2026

  • Rank: assigned based on the Bradley-Terry satisfaction score.
  • Model
  • BT satisfaction score: estimated statistical score based on the Bradley-Terry model, reflecting the probability that one model is preferred over another. This score is calculated from all user votes and reactions. For more information, visit the methodology tab.
  • Confidence (±): interval indicating the reliability of the rank: the narrower the interval, the more reliable the rank estimate. There is a 95% chance that the model's true rank is within this range.
  • Total votes
  • Average energy (per 1000 tokens): measured in watt-hours (shown in mWh in the table), energy consumption represents the electricity used by the model to process a request and generate the corresponding response. It depends on the model's size and architecture. Proprietary models, for which no transparent information about size and architecture is available, are displayed in gray as "unanalyzed" (N/A).
  • Size (parameters): model size in billions of parameters, categorized into five classes. For proprietary models, this size is not reported.
  • Architecture: the architecture of an LLM refers to the design principles that define how the components of a neural network are arranged and interact to transform input data into predictive outputs, including the activation mode of parameters (dense vs. sparse), the specialization of components, and the information processing mechanisms (transformers, convolutional networks, hybrid architectures).
  • Release date
  • Organization
  • Licence
Rank | BT score | Confidence (±) and total votes | Average energy (per 1000 tokens) | Size (parameters) | Architecture | Release date | Organization | Licence
1 | 1153 | -7/+13320 | N/A | XL - (estimate) | Proprietary | 12/25 | Google | Proprietary
2 | 1146 | -7/+25604 | N/A | L - (estimate) | Proprietary | 8/25 | Mistral AI | Proprietary
3 | 1137 | -11/+34368 | 4134 mWh | XL - 675 billion | MoE | 12/25 | Mistral AI | Open-weight
4 | 1134 | -12/+43081 | N/A | XL - (estimate) | Proprietary | 6/25 | Google | Proprietary
5 | 1131 | -18/+52221 | N/A | L - (estimate) | Proprietary | 3/26 | Google | Proprietary
6 | 1130 | -17/+64267 | N/A | XL - (estimate) | Proprietary | 2/26 | Anthropic | Proprietary
7 | 1119 | -19/+53039 | N/A | XL - (estimate) | Proprietary | 9/25 | Alibaba | Proprietary
8 | 1119 | -20/+8588 | 1892 mWh | L - 355 billion | MoE | 7/25 | Zhipu | Open-weight
9 | 1119 | -17/+76362 | N/A | XL - (estimate) | Proprietary | 12/24 | Google | Proprietary
10 | 1114 | -18/+10868 | N/A | XL - (estimate) | Proprietary | 11/25 | Google | Proprietary
11 | 1114 | -17/+91532 | 84 mWh | S - 26 billion | MoE | 4/26 | Google | Open-weight
12 | 1109 | -26/+11499 | 1524 mWh | L - 398 billion | MoE | 4/26 | Arcee | Open-weight
13 | 1107 | -15/+101530 | 117 mWh | S - 31 billion | Dense | 4/26 | Google | Open-weight
14 | 1106 | -15/+121252 | 332 mWh | M - 80 billion | MoE | 2/26 | Alibaba | Open-weight
15 | 1105 | -13/+112799 | N/A | XL - (estimate) | Proprietary | 2/26 | Google | Proprietary
16 | 1105 | -12/+123622 | 3979 mWh | XL - 685 billion | MoE | 3/25 | DeepSeek | Open-weight
17 | 1104 | -11/+132004 | 347 mWh | L - 119 billion | MoE | 3/26 | Mistral AI | Open-weight
18 | 1103 | -10/+142551 | N/A | L - (estimate) | Proprietary | 3/26 | OpenAI | Proprietary
19 | 1103 | -17/+17766 | N/A | L - (estimate) | Proprietary | 11/25 | OpenAI | Proprietary
20 | 1103 | -16/+171195 | N/A | L - (estimate) | Proprietary | 6/25 | Mistral AI | Proprietary
21 | 1099 | -15/+171120 | N/A | L - (estimate) | Proprietary | 12/25 | OpenAI | Proprietary
22 | 1095 | -19/+18839 | N/A | XL - (estimate) | Proprietary | 5/26 | Google | Proprietary
23 | 1094 | -13/+173783 | 3979 mWh | XL - 671 billion | MoE | 12/24 | DeepSeek | Open-weight
24 | 1094 | -14/+181424 | N/A | L - (estimate) | Proprietary | 4/26 | OpenAI | Proprietary
25 | 1093 | -13/+191163 | 3979 mWh | XL - 685 billion | MoE | 8/25 | DeepSeek | Open-weight
26 | 1091 | -9/+186466 | 112 mWh | S - 27 billion | Dense | 3/25 | Google | Open-weight
27 | 1090 | -20/+23487 | 1892 mWh | L - 357 billion | MoE | 9/25 | Zhipu | Open-weight
28 | 1078 | -19/+20952 | 3785 mWh | XL - 1000 billion | MoE | 4/26 | Moonshot AI | Open-weight
29 | 1076 | -14/+134677 | N/A | XL - (estimate) | Proprietary | 9/25 | Anthropic | Proprietary
30 | 1069 | -19/+132142 | N/A | S - (estimate) | Proprietary | 3/26 | OpenAI | Proprietary
31 | 1068 | -18/+141767 | 3979 mWh | XL - 685 billion | MoE | 5/25 | DeepSeek | Open-weight
32 | 1066 | -18/+141993 | N/A | XS - (estimate) | Proprietary | 3/26 | OpenAI | Proprietary
33 | 1065 | -17/+162002 | 3785 mWh | XL - 1000 billion | MoE | 1/26 | Moonshot AI | Open-weight
34 | 1063 | -15/+86318 | 94 mWh | XS - 12 billion | Dense | 3/25 | Google | Open-weight
35 | 1061 | -18/+181128 | N/A | XL - (estimate) | Proprietary | 4/26 | Alibaba | Proprietary
36 | 1059 | -19/+19969 | 3785 mWh | XL - 1000 billion | MoE | 11/25 | Moonshot AI | Open-weight
37 | 1059 | -14/+114194 | 109 mWh | S - 24 billion | Dense | 6/25 | Mistral AI | Open-weight
38 | 1058 | -16/+151160 | 8890 mWh | XL - 1600 billion | MoE | 4/26 | DeepSeek | Open-weight
39 | 1058 | -15/+161150 | 1524 mWh | L - 284 billion | MoE | 4/26 | DeepSeek | Open-weight
40 | 1058 | -11/+142472 | 3979 mWh | XL - 685 billion | MoE | 12/25 | DeepSeek | Open-weight
41 | 1055 | -21/+24633 | 1951 mWh | XL - 480 billion | MoE | 7/25 | Alibaba | Open-weight
42 | 1054 | -11/+152238 | N/A | XL - (estimate) | Proprietary | 5/25 | Anthropic | Proprietary
43 | 1053 | -9/+154824 | 857 mWh | L - 111 billion | Dense | 3/25 | Cohere | Open-weight
44 | 1052 | -13/+171782 | 3785 mWh | XL - 1000 billion | MoE | 9/25 | Moonshot AI | Open-weight
45 | 1046 | -17/+171454 | N/A | L - (estimate) | Proprietary | 3/26 | OpenAI | Proprietary
46 | 1045 | -13/+182736 | N/A | XL - (estimate) | Proprietary | 2/25 | Anthropic | Proprietary
47 | 1040 | -15/+172319 | 109 mWh | S - 24 billion | Dense | 6/25 | Mistral AI | Open-weight
48 | 1036 | -14/+134920 | 658 mWh | M - 70 billion | Dense | 10/24 | Nvidia | Open-weight
49 | 1035 | -16/+161988 | 1601 mWh | L - 397 billion | MoE | 2/26 | Alibaba | Open-weight
50 | 1030 | -27/+22456 | 118 mWh | S - 32 billion | Dense | 4/25 | Alibaba | Open-weight
51 | 1030 | -14/+117137 | 84 mWh | XS - 4 billion | Dense | 3/25 | Google | Open-weight
52 | 1023 | -20/+102315 | 3979 mWh | XL - 671 billion | MoE | 1/25 | DeepSeek | Open-weight
53 | 1022 | -23/+16987 | 4095 mWh | XL - 744 billion | MoE | 4/26 | Zhipu | Open-weight
54 | 1022 | -17/+114968 | N/A | M - (estimate) | Proprietary | 4/25 | OpenAI | Proprietary
55 | 1020 | -17/+104314 | 84 mWh | XS - 8 billion | Matformer | 5/25 | Google | Open-weight
56 | 1020 | -26/+26373 | N/A | XL - (estimate) | Proprietary | 5/26 | Alibaba | Proprietary
57 | 1019 | -16/+123148 | 342 mWh | L - 117 billion | MoE | 8/25 | OpenAI | Open-weight
58 | 1016 | -21/+20923 | 733 mWh | L - 230 billion | MoE | 10/25 | MiniMax | Open-weight
59 | 1016 | -17/+161512 | 4095 mWh | XL - 744 billion | MoE | 2/26 | Zhipu | Open-weight
60 | 1013 | -17/+161600 | 733 mWh | L - 229 billion | MoE | 2/26 | MiniMax | Open-weight
61 | 1013 | -16/+161837 | N/A | S - (estimate) | Proprietary | 8/25 | OpenAI | Proprietary
62 | 1013 | -16/+181412 | 82 mWh | S - 24 billion | MoE | 2/26 | Liquid | Open-weight
63 | 1008 | -14/+124204 | 1601 mWh | XL - 400 billion | MoE | 4/25 | Meta | Open-weight
64 | 1007 | -13/+135761 | N/A | XL - (estimate) | Proprietary | 9/24 | Google | Proprietary
65 | 1005 | -18/+161240 | 1892 mWh | L - 357 billion | MoE | 12/25 | Zhipu | Open-weight
66 | 1004 | -17/+171428 | 376 mWh | L - 120 billion | MoE | 3/26 | Nvidia | Open-weight
67 | 1002 | -16/+162160 | 166 mWh | S - 35 billion | MoE | 2/26 | Alibaba | Open-weight
68 | 1001 | -15/+201069 | 733 mWh | L - 230 billion | MoE | 3/26 | MiniMax | Open-weight
69 | 994 | -14/+181873 | 106 mWh | S - 22 billion | Dense | 12/25 | EuroLLM | Open-weight
70 | 993 | -12/+173257 | N/A | S - (estimate) | Proprietary | 2/25 | Mistral AI | Proprietary
71 | 992 | -11/+176454 | 400 mWh | L - 109 billion | MoE | 4/25 | Meta | Open-weight
72 | 992 | -12/+221358 | N/A | L - (estimate) | Proprietary | 8/25 | OpenAI | Proprietary
73 | 983 | -14/+123423 | 109 mWh | S - 24 billion | Dense | 3/25 | Mistral AI | Open-weight
74 | 980 | -20/+24513 | 89 mWh | XS - 8 billion | Dense | 4/26 | IBM | Open-weight
75 | 980 | -15/+211542 | N/A | S - (estimate) | Proprietary | 4/25 | OpenAI | Proprietary
76 | 976 | -14/+141628 | 81 mWh | XS - 8 billion | MoE | 10/25 | Liquid | Open-weight
77 | 970 | -16/+131488 | 1524 mWh | L - 398 billion | MoE | 1/26 | Arcee | Open-weight
78 | 967 | -13/+83899 | 83 mWh | S - 21 billion | MoE | 8/25 | OpenAI | Open-weight
79 | 967 | -16/+151058 | 83 mWh | S - 30 billion | MoE | 5/25 | Alibaba | Open-weight
80 | 962 | -13/+83245 | 658 mWh | M - 70 billion | Dense | 9/25 | Swiss AI | Open source
81 | 960 | -13/+83479 | 118 mWh | S - 32 billion | Dense | 12/24 | Cohere | Open-weight
82 | 959 | -11/+87402 | 658 mWh | M - 70 billion | Dense | 12/24 | Meta | Open-weight
83 | 957 | -13/+18679 | 112 mWh | S - 27 billion | Dense | 6/24 | Google | Open-weight
84 | 955 | -11/+112272 | 109 mWh | S - 24 billion | Dense | 1/25 | Mistral AI | Open-weight
85 | 952 | -10/+131142 | N/A | S - (estimate) | Proprietary | 11/24 | OpenAI | Proprietary
86 | 950 | -8/+134940 | N/A | S - (estimate) | Proprietary | 7/24 | OpenAI | Proprietary
87 | 941 | -8/+113959 | N/A | S - (estimate) | Proprietary | 4/25 | OpenAI | Proprietary
88 | 937 | -13/+131347 | N/A | XS - (estimate) | Proprietary | 4/25 | OpenAI | Proprietary
89 | 935 | -10/+104209 | N/A | XL - (estimate) | Proprietary | 10/24 | Anthropic | Proprietary
90 | 929 | -10/+104407 | 658 mWh | M - 70 billion | Dense | 7/24 | Meta | Open-weight
91 | 929 | -10/+16905 | 89 mWh | XS - 8 billion | Dense | 10/24 | Cohere | Open-weight
92 | 921 | -9/+66664 | 96 mWh | XS - 14 billion | Dense | 12/24 | Microsoft | Open-weight
93 | 921 | -8/+103958 | N/A | XL - (estimate) | Proprietary | 8/24 | OpenAI | Proprietary
94 | 915 | -6/+97937 | 9134 mWh | XL - 405 billion | Dense | 7/24 | Meta | Open-weight
95 | 905 | -7/+93994 | 90 mWh | XS - 9 billion | Dense | 6/24 | Google | Open-weight
96 | 905 | -8/+11985 | 118 mWh | S - 32 billion | Dense | 4/25 | Alibaba | Open-weight
97 | 904 | -9/+26142 | 118 mWh | S - 32 billion | Dense | 9/24 | Alibaba | Open-weight
98 | 899 | -5/+102816 | 658 mWh | M - 70 billion | Dense | 8/25 | Nous | Open-weight
99 | 898 | -6/+131693 | 658 mWh | M - 70 billion | Dense | 1/25 | DeepSeek | Open-weight
100 | 891 | -5/+64818 | 9134 mWh | XL - 405 billion | Dense | 7/24 | Nous | Open-weight
101 | 873 | -4/+81315 | 88 mWh | XS - 7 billion | Dense | 9/24 | Alibaba | Open-weight
102 | 872 | -3/+67502 | 89 mWh | XS - 8 billion | Dense | 7/24 | Meta | Open-weight
103 | 862 | -1/+11752 | 118 mWh | S - 32 billion | Dense | 11/25 | Ai2 | Open source
104 | 824 | -2/+22297 | 193 mWh | S - 56 billion | MoE | 12/23 | Mistral AI | Open-weight
105 | 807 | -2/+22789 | N/A | S - (estimate) | Proprietary | 9/24 | Liquid | Proprietary
106 | 800 | -2/+32383 | 83 mWh | XS - 3.8 billion | Dense | 8/24 | Microsoft | Open-weight
107 | 773 | -2/+25046 | 94 mWh | XS - 12 billion | Dense | 7/24 | Mistral AI | Open-weight
108 | 758 | -1/+24355 | 1063 mWh | L - 176 billion | MoE | 4/24 | Mistral AI | Open-weight
109 | 717 | -3/+01386 | 88 mWh | XS - 14 billion | Dense | 2/25 | jpacifico | Open-weight
110 | 709 | -2/+665 | 90 mWh | XS - 9 billion | Dense | 5/24 | 01-ai | Open-weight
111 | 695 | -1/+2308 | 96 mWh | XS - 14 billion | Dense | 9/24 | jpacifico | Open-weight
112 | 673 | -0/+380 | 88 mWh | XS - 7 billion | Dense | 7/24 | Alibaba | Open-weight

Are the most popular models energy efficient?

This graph shows the satisfaction score (Bradley-Terry score) of each model as a function of its estimated average energy consumption per 1000 tokens. Energy consumption is estimated using the Ecologits methodology and is based on two parameters: the size of the model (number of parameters) and its architecture. For proprietary models, this information is either not provided or only partially available, so they are excluded from the graph below.

Figure: Bradley-Terry (BT) satisfaction score versus average consumption per 1000 tokens. Y-axis: Bradley-Terry (BT) score, 700 to 1150. X-axis: average consumption per 1000 tokens (mWh), 500 to 4000.

Model architecture

  • MoE: the Mixture of Experts (MoE) architecture uses a routing mechanism to activate, depending on the input, only certain specialized subsets (“experts”) of the neural network. This allows for the construction of very large models while keeping computational costs low, because only a part of the network is used at each step.
  • Dense: a dense architecture is a type of neural network in which each neuron in one layer is connected to all neurons in the next layer. This allows all parameters in the layer to contribute to the output calculation.
  • Matformer: picture Russian dolls (matryoshkas, hence “matryoshka transformer”, or Matformer): each block contains several nested sub-models of increasing size that share the same parameters. For each request, a model of suitable capacity can then be selected according to the available memory or latency, without re-training separate models.

How to find the right balance between perceived performance and energy efficiency?

Examples of how to read the graph

  • The higher a model is located on the graph, the higher its Bradley-Terry satisfaction score. The further to the left a model is located, the less energy it consumes compared to other models.
  • At the top left are the models that are popular and consume less energy compared to other models.
  • Beyond size, architecture has an impact on the average energy consumption of models: for example, at a similar size, the Llama 3 405B model (dense architecture, 405 billion parameters) consumes about five times more energy on average than the GLM 4.5 model (MoE architecture, 355 billion parameters, of which 32 billion are active): 9134 mWh versus 1892 mWh per 1000 tokens in the table on this page.
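
The dense versus MoE comparison can be checked against the per-1000-token estimates published in the chart data table on this page. The short sketch below only divides the two listed values; the variable names are illustrative.

```python
# Estimated average energy per 1000 tokens (mWh), as listed in the chart data table
dense_405b_mwh = 9134  # Meta, 405 billion parameters, dense
moe_355b_mwh = 1892    # Zhipu, 355 billion parameters, MoE

ratio = dense_405b_mwh / moe_355b_mwh
print(f"dense / MoE energy ratio: {ratio:.1f}x")  # about 4.8x with these values
```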

Why are the proprietary models not displayed on the graph?

The estimation of energy consumption for model inference relies on the Ecologits methodology, which takes into account the size and architecture of the models. However, this information is not made public by model developers for proprietary models.

We have therefore decided not to include proprietary models in the graph until the information used to calculate their energy consumption is made transparent.

How is the energy impact of the models calculated?

The arena uses the methodology developed by Ecologits (GenAI Impact) to estimate the energy footprint of inference by conversational generative AI models. This estimate allows users to compare the environmental impact of different AI models for the same query. This transparency is essential to encourage the development and adoption of more eco-friendly AI models.

Ecologits applies the principles of life cycle assessment (LCA) in accordance with ISO 14044, focusing for the moment on the impact of inference (i.e., the use of models to answer queries) and the manufacturing of graphics cards (resource extraction, manufacturing and transport).

The model's power consumption is estimated by taking into account various parameters such as the size and architecture of the AI model used, the location of the servers where the models are deployed, and the number of output tokens. The calculation of the global warming potential indicator, expressed in CO2 equivalent, is derived from the measurement of the model's power consumption.
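
To make the last step concrete, the sketch below converts an energy estimate into grams of CO2 equivalent by multiplying it by the carbon intensity of the electricity used. This is a simplified illustration of the usage phase only, not the Ecologits implementation: the function name is hypothetical, the carbon intensity of 100 gCO2eq/kWh is a placeholder value, and the impact of hardware manufacturing is left out.

```python
def gwp_grams_co2eq(energy_mwh_per_1k_tokens: float,
                    output_tokens: int,
                    grid_gco2eq_per_kwh: float) -> float:
    """Usage-phase global warming potential, in grams of CO2 equivalent.

    energy_mwh_per_1k_tokens: estimated energy per 1000 tokens, in mWh
    output_tokens: number of tokens generated for the response
    grid_gco2eq_per_kwh: carbon intensity of the electricity mix (assumed value)
    """
    energy_mwh = energy_mwh_per_1k_tokens * output_tokens / 1000
    energy_kwh = energy_mwh / 1_000_000  # 1 kWh = 1,000,000 mWh
    return energy_kwh * grid_gco2eq_per_kwh


# A 70-billion-parameter dense model is listed at 658 mWh per 1000 tokens.
# The 500-token response and the 100 gCO2eq/kWh intensity are placeholders.
grams = gwp_grams_co2eq(658, output_tokens=500, grid_gco2eq_per_kwh=100)
print(f"{grams:.4f} gCO2eq")
```

Because the result scales linearly with the carbon intensity, the same query can have a very different footprint depending on where the servers are located.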

It is important to note that methodologies for assessing the environmental impact of AI are still under development.

Chart data in table form

BT score | Average energy (per 1000 tokens) | Size (parameters) | Architecture | Organization | Licence
976 | 81 mWh | XS - 8 billion | MoE | Liquid | Open-weight
1013 | 82 mWh | S - 24 billion | MoE | Liquid | Open-weight
967 | 83 mWh | S - 21 billion | MoE | OpenAI | Open-weight
967 | 83 mWh | S - 30 billion | MoE | Alibaba | Open-weight
800 | 83 mWh | XS - 3.8 billion | Dense | Microsoft | Open-weight
1114 | 84 mWh | S - 26 billion | MoE | Google | Open-weight
1030 | 84 mWh | XS - 4 billion | Dense | Google | Open-weight
1020 | 84 mWh | XS - 8 billion | Matformer | Google | Open-weight
873 | 88 mWh | XS - 7 billion | Dense | Alibaba | Open-weight
717 | 88 mWh | XS - 14 billion | Dense | jpacifico | Open-weight
673 | 88 mWh | XS - 7 billion | Dense | Alibaba | Open-weight
980 | 89 mWh | XS - 8 billion | Dense | IBM | Open-weight
929 | 89 mWh | XS - 8 billion | Dense | Cohere | Open-weight
872 | 89 mWh | XS - 8 billion | Dense | Meta | Open-weight
905 | 90 mWh | XS - 9 billion | Dense | Google | Open-weight
709 | 90 mWh | XS - 9 billion | Dense | 01-ai | Open-weight
1063 | 94 mWh | XS - 12 billion | Dense | Google | Open-weight
773 | 94 mWh | XS - 12 billion | Dense | Mistral AI | Open-weight
921 | 96 mWh | XS - 14 billion | Dense | Microsoft | Open-weight
695 | 96 mWh | XS - 14 billion | Dense | jpacifico | Open-weight
994 | 106 mWh | S - 22 billion | Dense | EuroLLM | Open-weight
1059 | 109 mWh | S - 24 billion | Dense | Mistral AI | Open-weight
1040 | 109 mWh | S - 24 billion | Dense | Mistral AI | Open-weight
983 | 109 mWh | S - 24 billion | Dense | Mistral AI | Open-weight
955 | 109 mWh | S - 24 billion | Dense | Mistral AI | Open-weight
1091 | 112 mWh | S - 27 billion | Dense | Google | Open-weight
957 | 112 mWh | S - 27 billion | Dense | Google | Open-weight
1107 | 117 mWh | S - 31 billion | Dense | Google | Open-weight
1030 | 118 mWh | S - 32 billion | Dense | Alibaba | Open-weight
960 | 118 mWh | S - 32 billion | Dense | Cohere | Open-weight
905 | 118 mWh | S - 32 billion | Dense | Alibaba | Open-weight
904 | 118 mWh | S - 32 billion | Dense | Alibaba | Open-weight
862 | 118 mWh | S - 32 billion | Dense | Ai2 | Open source
1002 | 166 mWh | S - 35 billion | MoE | Alibaba | Open-weight
824 | 193 mWh | S - 56 billion | MoE | Mistral AI | Open-weight
1106 | 332 mWh | M - 80 billion | MoE | Alibaba | Open-weight
1019 | 342 mWh | L - 117 billion | MoE | OpenAI | Open-weight
1104 | 347 mWh | L - 119 billion | MoE | Mistral AI | Open-weight
1004 | 376 mWh | L - 120 billion | MoE | Nvidia | Open-weight
992 | 400 mWh | L - 109 billion | MoE | Meta | Open-weight
1036 | 658 mWh | M - 70 billion | Dense | Nvidia | Open-weight
962 | 658 mWh | M - 70 billion | Dense | Swiss AI | Open source
959 | 658 mWh | M - 70 billion | Dense | Meta | Open-weight
929 | 658 mWh | M - 70 billion | Dense | Meta | Open-weight
899 | 658 mWh | M - 70 billion | Dense | Nous | Open-weight
898 | 658 mWh | M - 70 billion | Dense | DeepSeek | Open-weight
1016 | 733 mWh | L - 230 billion | MoE | MiniMax | Open-weight
1013 | 733 mWh | L - 229 billion | MoE | MiniMax | Open-weight
1001 | 733 mWh | L - 230 billion | MoE | MiniMax | Open-weight
1053 | 857 mWh | L - 111 billion | Dense | Cohere | Open-weight
758 | 1063 mWh | L - 176 billion | MoE | Mistral AI | Open-weight
1109 | 1524 mWh | L - 398 billion | MoE | Arcee | Open-weight
1058 | 1524 mWh | L - 284 billion | MoE | DeepSeek | Open-weight
970 | 1524 mWh | L - 398 billion | MoE | Arcee | Open-weight
1035 | 1601 mWh | L - 397 billion | MoE | Alibaba | Open-weight
1008 | 1601 mWh | XL - 400 billion | MoE | Meta | Open-weight
1119 | 1892 mWh | L - 355 billion | MoE | Zhipu | Open-weight
1090 | 1892 mWh | L - 357 billion | MoE | Zhipu | Open-weight
1005 | 1892 mWh | L - 357 billion | MoE | Zhipu | Open-weight
1055 | 1951 mWh | XL - 480 billion | MoE | Alibaba | Open-weight
1078 | 3785 mWh | XL - 1000 billion | MoE | Moonshot AI | Open-weight
1065 | 3785 mWh | XL - 1000 billion | MoE | Moonshot AI | Open-weight
1059 | 3785 mWh | XL - 1000 billion | MoE | Moonshot AI | Open-weight
1052 | 3785 mWh | XL - 1000 billion | MoE | Moonshot AI | Open-weight
1105 | 3979 mWh | XL - 685 billion | MoE | DeepSeek | Open-weight
1094 | 3979 mWh | XL - 671 billion | MoE | DeepSeek | Open-weight
1093 | 3979 mWh | XL - 685 billion | MoE | DeepSeek | Open-weight
1068 | 3979 mWh | XL - 685 billion | MoE | DeepSeek | Open-weight
1058 | 3979 mWh | XL - 685 billion | MoE | DeepSeek | Open-weight
1023 | 3979 mWh | XL - 671 billion | MoE | DeepSeek | Open-weight
1022 | 4095 mWh | XL - 744 billion | MoE | Zhipu | Open-weight
1016 | 4095 mWh | XL - 744 billion | MoE | Zhipu | Open-weight
1137 | 4134 mWh | XL - 675 billion | MoE | Mistral AI | Open-weight
1058 | 8890 mWh | XL - 1600 billion | MoE | DeepSeek | Open-weight
915 | 9134 mWh | XL - 405 billion | Dense | Meta | Open-weight
891 | 9134 mWh | XL - 405 billion | Dense | Nous | Open-weight
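
One way to use the chart data is to look for the models in the "top left" of the graph, meaning those that no other model beats on both satisfaction score and energy consumption at once. The sketch below applies a simple Pareto filter to a handful of rows copied from the table above. Since the table carries no model names, rows are labelled by organization and size; the labels and the `pareto_front` helper are illustrative and not part of the platform.

```python
def pareto_front(models):
    """Return the names of models not dominated by any other model.

    A model is dominated if another one has a score at least as high and
    an energy at least as low, with a strict improvement on one of the two.
    """
    front = []
    for name, score, energy in models:
        dominated = any(
            s >= score and e <= energy and (s > score or e < energy)
            for _, s, e in models
        )
        if not dominated:
            front.append(name)
    return front


# (label, BT score, mWh per 1000 tokens): a subset of rows from the table
models = [
    ("Liquid 8B MoE", 976, 81),
    ("Google 26B MoE", 1114, 84),
    ("Google 4B dense", 1030, 84),
    ("Mistral AI 119B MoE", 1104, 347),
    ("Zhipu 355B MoE", 1119, 1892),
    ("Mistral AI 675B MoE", 1137, 4134),
    ("Meta 405B dense", 915, 9134),
]

front = pareto_front(models)
print(front)
```

On this subset, each further gain in score along the front comes with a large jump in estimated energy, which is the trade-off the graph is meant to expose.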

How to choose the model ranking method?

Since 2024, thousands of users have used the arena to compare the responses of different models, generating hundreds of thousands of votes. Simply counting the number of wins is not enough to establish a ranking. A fair system must be statistically robust, adjust after each matchup, and truly reflect the value of the performances achieved.

With this in mind, a ranking based on the Bradley-Terry model was established in collaboration with the teams of the Digital Regulation Expertise Center (PEReN), using all the votes and reactions collected on the platform. To learn more, see our methodological notebook.

Two ways to rank models

Ranking by win rate

Definition: an empirical ranking system based on the percentage of duels a model wins against all other models.

Main problems

  • Game count bias: a model that has won three out of three duels has a 100% win rate, but this score is not very meaningful because it is based on very little data.
  • No consideration of duel difficulty: beating a “beginner” model counts the same as beating an “expert” model, so win rates are unfair because they ignore match difficulty.
  • Stagnation: In the long run, many good models end up around 50% win rate because they are facing models of their own skill level, which makes the rankings less discriminating.
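
The game count bias in the first point above can be made visible by attaching an uncertainty range to the raw win rate. The sketch below uses the Wilson score interval, a standard interval for a proportion; it is brought in here purely for illustration and is not the method used by the platform. The 300-out-of-400 record is a made-up comparison point.

```python
import math


def win_rate(wins: int, games: int) -> float:
    """Empirical win rate: share of duels won."""
    return wins / games


def wilson_interval(wins: int, games: int, z: float = 1.96):
    """Wilson score interval for a win rate (z = 1.96 for about 95%)."""
    p = wins / games
    denom = 1 + z**2 / games
    center = (p + z**2 / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z**2 / (4 * games**2)) / denom
    return center - half, center + half


for wins, games in [(3, 3), (300, 400)]:
    low, high = wilson_interval(wins, games)
    print(f"{wins}/{games}: win rate {win_rate(wins, games):.0%}, "
          f"plausible range {low:.0%} to {high:.0%}")
```

The 3-out-of-3 model shows a higher raw win rate, yet its plausible range is far wider than that of the model with 400 duels, so the raw percentage alone says little about which of the two is better.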

Bradley-Terry (BT) leaderboard

Definition: a ranking system in which the gain or loss of points depends on the result (victory, defeat, or draw) and on the estimated level of the opponent: if a weaker model beats a stronger model, its progression in the ranking is greater.

Benefits

  • Probabilistic model: the probable outcome of any matchup can be estimated, even between models that have never competed directly.
  • Taking match difficulty into account: the scores estimated from the Bradley-Terry model take into account the level of the opponents encountered, allowing for a fair comparison between models.
  • Better uncertainty management: the confidence interval integrates the entire network of comparisons. This allows for a more accurate estimation of uncertainty, especially for models with few direct confrontations but many common opponents.
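
To illustrate the first benefit, here is a minimal sketch of a Bradley-Terry fit on a toy set of duels, using the classic iterative (minorization-maximization) update. It is not the platform's implementation: the vote counts are invented, draws are ignored, and the strengths are left on a raw probability scale instead of the leaderboard's score scale. Models A and C never meet, yet the fit still yields an estimate for that matchup through their common opponent B.

```python
def fit_bradley_terry(wins, n_iter=1000):
    """Fit Bradley-Terry strengths from a matrix of pairwise wins.

    wins[i][j] is the number of times model i was preferred over model j.
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    strength = [1.0] * n
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Each opponent contributes (games played) / (sum of strengths)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(n) if j != i
            )
            updated.append(total_wins / denom)
        norm = sum(updated)
        strength = [s / norm for s in updated]
    return strength


# Toy data (invented): A beats B 7-3, B beats C 6-4, A and C never meet
wins = [
    [0, 7, 0],  # A
    [3, 0, 6],  # B
    [0, 4, 0],  # C
]
strength = fit_bradley_terry(wins)
p_a_over_c = strength[0] / (strength[0] + strength[2])
print(f"Estimated P(A preferred over C) = {p_a_over_c:.3f}")
```

With these counts the fitted strengths are in the ratio 7 : 3 : 2, so A is expected to be preferred over C in roughly 78% of duels even though no such duel was observed.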

Impact of methodological choice on model ranking

Top 10 models in the ranking based on "empirical" win rates

Figure: bar chart of empirical win rates (axis from 58% to 70%). Models shown, in chart order: gemini-2.0-flash, mistral-medium-2508, gemini-2.5-flash, gemini-3-flash-preview, glm-4.5, qwen3-max-2025-09-23, gemini-3.1-flash-lite-preview, mistral-large-2512, claude-4-6-sonnet, gemini-3-pro-preview.

Based solely on the average win rate, an overall ranking can be obtained, but this calculation assumes that each model has played against all others.

This method is not ideal because it requires data from all combinations of models and as soon as the number of models increases, it quickly becomes expensive and cumbersome to maintain.

Top 10 models in the ranking based on estimated win rate with the Bradley-Terry model

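Figure: bar chart of win rates estimated with the Bradley-Terry model (axis from 58% to 70%). Models shown, in chart order: gemini-3-flash-preview, mistral-medium-2508, mistral-large-2512, gemini-2.5-flash, gemini-3.1-flash-lite-preview, claude-4-6-sonnet, qwen3-max-2025-09-23, glm-4.5, gemini-2.0-flash, gemini-3-pro-preview.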

The Bradley-Terry model transforms a set of local and potentially incomplete comparisons into a consistent and statistically robust global ranking system, whereas the empirical win rate remains limited to direct observations.