AI Energy Score v2: Refreshed Leaderboard, now with Reasoning 🧠
Today, we’re excited to launch a refreshed AI Energy Score leaderboard, featuring a new cohort of text generation models and introducing reasoning as a newly benchmarked task. We’ve also improved the benchmarking code and submission process, enabling a more streamlined evaluation workflow. With the community’s growing interest in measuring and comparing the energy use of AI models, our benchmarking efforts are more important than ever for informing sustainability-minded AI development and policymaking.
Background 📜
The AI Energy Score project combines insights from prior benchmarking efforts to deliver a unified framework for evaluating the energy efficiency of AI models. Initially launched in February 2025, the leaderboard compares the energy efficiency of AI models across 10 tasks and multiple modalities (text, image, audio) using a standardized approach built on custom datasets and the latest generation of GPUs. The launch was covered in media outlets like The Economist, NPR, Newsweek, Nature and many more, was recognized at the Paris Peace Forum in the context of the Paris AI Summit, and was recently highlighted during Sasha's New York Climate Week TED Talk.
In the months since the launch, the push for a standardized benchmark has only accelerated, with regulatory efforts like the EU AI Act Code of Practice (signed by key industry players like OpenAI, Microsoft, and Google) now explicitly calling for an inference energy benchmark, and organizations like the IEEE and the Green Software Foundation working towards a standardized way of measuring energy use and carbon emissions. Recent environmental disclosures from companies such as Google and Mistral have shed light on these impacts, but since they do not use the same methodology, the disclosures are hard to compare. Each of these efforts is valuable, but without a standard, comparing them is like comparing apples to bananas to pineapples. That’s where the AI Energy Score comes in: if all providers used this approach, we could finally compare models on a level playing field.
V2 of the AI Energy Score Leaderboard 🏆
For the second version of the leaderboard, we partnered with Scott Chamberlin from Neuralwatt to streamline the benchmarking approach and to add the ability to test reasoning models. Under the hood, we are still using Code Carbon and the same datasets that we developed for the first version of the leaderboard (see the documentation for more details). To streamline the approach, we created a new open-source package, AI Energy Benchmarks, which we hope will become the underlying package enabling energy benchmarking across a variety of hardware and software configurations.
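To give a sense of what the measurement loop looks like, here is a minimal sketch of per-query GPU energy tracking with Code Carbon. This is not the AI Energy Benchmarks API: the model, prompts, and generation settings are illustrative placeholders, and attribute names may vary across Code Carbon versions.

```python
# Minimal sketch of per-query energy measurement with Code Carbon (illustrative only;
# not the AI Energy Benchmarks API). Model, prompts, and settings are placeholders.
from codecarbon import EmissionsTracker
from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM3-3B", device=0)
prompts = ["Explain photosynthesis in one sentence."] * 10  # stand-in for a benchmark dataset

tracker = EmissionsTracker(measure_power_secs=1, log_level="error")
tracker.start()
for prompt in prompts:
    generator(prompt, max_new_tokens=256)
tracker.stop()

# Convert total GPU energy (kWh) to Wh per 1,000 queries, the unit used on the leaderboard.
# final_emissions_data.gpu_energy is reported in kWh in recent Code Carbon versions.
gpu_energy_kwh = tracker.final_emissions_data.gpu_energy
wh_per_1k_queries = gpu_energy_kwh * 1000 / len(prompts) * 1000
print(f"{wh_per_1k_queries:.2f} Wh per 1,000 queries")
```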
Key Findings 🔎
Reasoning comes at a cost 🧠
This year has seen a rise in the popularity of reasoning models, which use an internal monologue to “reason” through questions, with the intent of increasing performance. Many of the most recent LLMs now include reasoning modes, either through a simple switch that turns the feature on or off, such as in Microsoft’s Phi 4, or through multiple levels of reasoning, as seen in OpenAI’s GPT-OSS models, which offer low, medium, and high modes.
According to our analysis, reasoning models use, on average, 100 times more energy than models with no reasoning capabilities (or with reasoning turned off). Homing in on specific models with and without reasoning enabled, we see a huge difference: the same model uses between 150 and 700 times more energy with reasoning enabled than without:
| Model name | Params | GPU energy (Wh) per 1k queries, reasoning off | GPU energy (Wh) per 1k queries, reasoning on | Energy increase due to reasoning |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Llama-70B | 70B | 49.53 | 7,626.53 | 154× |
| Phi-4-reasoning-plus | 15B | 18.42 | 9,461.61 | 514× |
| SmolLM3-3B | 3B | 18.35 | 12,791.22 | 697× |
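To make the last column concrete, the increase factor is simply the ratio of the reasoning-on to reasoning-off energy figures; a quick check using the values in the table:

```python
# Reasoning-on vs. reasoning-off GPU energy, in Wh per 1k queries (values from the table above).
pairs = {
    "DeepSeek-R1-Distill-Llama-70B": (7626.53, 49.53),
    "Phi-4-reasoning-plus": (9461.61, 18.42),
    "SmolLM3-3B": (12791.22, 18.35),
}
for model, (on, off) in pairs.items():
    print(f"{model}: {on / off:.0f}x more energy with reasoning enabled")
# -> 154x, 514x, and 697x respectively
```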
This difference is largely explained by the number of output tokens the models generate (to “reason” through their answers): models with reasoning enabled produce between 300 and 800 times more tokens than their base equivalents. This adds up as reasoning models see wider use in consumer-facing tools and applications, since they tend to output longer responses (a pattern also documented in recent research).
Additionally, the energy use of reasoning models is less predictable than that of standard LLMs. LLM impacts have traditionally shown a “strong correlation between a model’s size and its footprint”. However, reasoning models produce their reasoning traces with different styles and degrees of verbosity, making size-based estimates of energy use unreliable. This matters because many people still assume that smaller models are always better, but the intensity of the reasoning process now has to be considered as well. This is yet another reason we need standardized and transparent benchmarks in this area.
Models with multiple levels of reasoning, like the GPT-OSS series, offer useful insight into the dynamics between model size and reasoning intensity. The 20B class shows a 4.8x difference between its high and low reasoning modes, while the 120B class has a much smaller 1.6x swing. Comparing the reasoning modes across the two classes reveals a 4.7x difference on the low setting, with much smaller deltas for the medium and high modes at roughly 1.6x (See here for another GPT-OSS energy analysis).
Are newer models more efficient? Results are mixed 🤷
For this update, we’ve added 39 new models, including 21 in the text generation task: 11 Class A models that fit on a single consumer GPU, 3 Class B models that require a cloud GPU, and 7 Class C models that require multiple GPUs.
Comparing the energy use of models from this cohort to the February 2025 cohort can help us understand potential efficiency progress in AI over the last nine months. To compare fairly, we selected models with no reasoning (or with reasoning turned off) and no mixture-of-experts architecture (since these were rarer earlier this year), and compared their energy usage to a reference model of similar size (in terms of active parameters) from the previous leaderboard. The results were mixed:
Of the 15 models that met these criteria, the majority (9) had energy use greater than or equal to that of similarly sized models from February. The range was large, with some models using only 3% of the energy of their reference while others used 4x more!
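As a rough illustration of the comparison (not the actual analysis code), each new model is matched to the February model closest in active-parameter count, and the ratio of their energy figures is taken. All entries below are hypothetical placeholders, not leaderboard values:

```python
# Sketch of the cohort comparison: match each new (non-reasoning, non-MoE) model
# to the closest-sized February 2025 reference and compare energy per 1k queries.
# All numbers and names are hypothetical placeholders.
feb_2025 = [  # (active params in billions, Wh per 1k queries)
    (3, 20.0), (8, 45.0), (70, 400.0),
]
new_cohort = [
    ("new-3b-model", 3, 19.0),
    ("new-9b-model", 9, 60.0),
    ("new-70b-model", 70, 1600.0),
]

for name, params, energy in new_cohort:
    ref_params, ref_energy = min(feb_2025, key=lambda r: abs(r[0] - params))
    ratio = energy / ref_energy
    verdict = "more efficient" if ratio < 1 else "same or less efficient"
    print(f"{name}: {ratio:.1f}x the energy of the ~{ref_params}B reference ({verdict})")
```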
These mixed results run counter to the prevailing narrative that AI is becoming more efficient and underscore the need for users and developers to choose the right model for each task. Selecting appropriately helps avoid wasting compute on queries that a simpler, more efficient model could handle well. Approaches such as routers, which choose the most appropriate model for each incoming query, will be increasingly useful here, and energy data from AI Energy Score can be used alongside performance-based metrics to route user queries to the right model at the right time, as sketched below.
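Here is a minimal sketch of what such an energy-aware router could look like: it picks the least energy-intensive model that clears a per-query quality threshold. The model names, quality scores, and energy figures are hypothetical placeholders, not leaderboard data.

```python
# Hypothetical energy-aware router sketch: choose the cheapest model (in Wh per
# 1k queries) whose quality score meets a per-query threshold. All values below
# are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class ModelProfile:
    name: str
    quality_score: float        # e.g., an accuracy-style benchmark score in [0, 1]
    wh_per_1k_queries: float    # energy figure from an AI Energy Score-style benchmark

CANDIDATES = [
    ModelProfile("small-no-reasoning", quality_score=0.62, wh_per_1k_queries=18.0),
    ModelProfile("mid-no-reasoning", quality_score=0.74, wh_per_1k_queries=50.0),
    ModelProfile("large-reasoning", quality_score=0.91, wh_per_1k_queries=9500.0),
]

def route(required_quality: float) -> ModelProfile:
    """Return the lowest-energy model that meets the required quality."""
    eligible = [m for m in CANDIDATES if m.quality_score >= required_quality]
    if not eligible:  # fall back to the best-quality model if nothing qualifies
        return max(CANDIDATES, key=lambda m: m.quality_score)
    return min(eligible, key=lambda m: m.wh_per_1k_queries)

print(route(0.7).name)   # -> mid-no-reasoning: good enough, far cheaper than reasoning
print(route(0.9).name)   # -> large-reasoning: only the reasoning model clears the bar
```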
Adoption in practice 🚀
One story we’re excited to share: Salesforce has integrated the AI Energy Score into its internal model benchmarking suite. Model Cards are now automatically generated with energy transparency included, and Salesforce has committed to publishing this information for all production models going forward. This shows how easy it is to integrate our Docker-based process, and serves as an example for other organizations.
We’ve also seen the AI Energy Score project featured by the Coalition for Sustainable AI as an example of best practices in benchmarking AI, and we have presented it at events such as the ITU AI for Good conference and the IEA Forum on Energy and AI as a concrete initiative that policymakers and developers alike can adopt to quantify and reduce the environmental impacts of AI.
What’s next for AI Energy Score 🔮
AI Energy Score is a dynamic project that will continue to evolve as new models and tasks are added and as the AI community grows. In the future, we hope to add additional modalities such as video generation, which has been shown to be very energy intensive even compared to image generation, as well as agentic tasks including computer use, coding, and tool use. We also aim to build more interest and buy-in from companies developing AI models, and we hope that more proprietary models will be tested and benchmarked alongside the current models, which are primarily open weights.
The future of AI Energy Score depends on community support. Together, we can build the transparency foundation the industry needs to align AI innovation with our planetary boundaries. If you’re interested in contributing, integrating the Score into your own systems, or exploring other forms of collaboration, feel free to start a discussion here.