Undervolting Your GPU for Local Inference: Lower Heat, Same Tokens/sec

TL;DR

Thorsten Meyer AI has published a guide and interactive infographic recommending GPU power limits and undervolting as a first step for reducing heat and noise in high-power local AI workstations. The guide says local inference is often memory-bandwidth-bound, so lowering GPU power can reduce watts and temperature while preserving much of tokens-per-second throughput, though results vary by card, model and workload.

Thorsten Meyer AI has published a GPU tuning guide arguing that local AI workstation users can often reduce heat and fan noise by power-limiting or undervolting their graphics cards while keeping most of their tokens-per-second throughput. The guide is aimed at people running local LLM inference on high-power NVIDIA GPUs.

The guide recommends starting with a power limit rather than a manual voltage curve. It describes power limiting as a one-slider change in tools such as MSI Afterburner on Windows or nvidia-smi on Linux, with a suggested starting point around 70% of the GPU’s rated power. According to the source, this approach restricts the card rather than pushing it beyond stock settings, and the card automatically adjusts voltage and clocks.
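As a rough illustration of the 70% starting point, the sketch below builds the nvidia-smi command for a given default power limit. The helper name `power_limit_command` and the 450-watt example value are assumptions for illustration, not taken from the guide; the card's real default limit can be read with `nvidia-smi --query-gpu=power.default_limit --format=csv`. The script only prints the command and does not run it.

```python
import shlex


def power_limit_command(default_limit_watts, fraction=0.70, gpu_index=0):
    """Return the nvidia-smi argument list that caps a GPU at a fraction
    of its default power limit (hypothetical helper for illustration)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]: a cap restricts the card")
    target_watts = round(default_limit_watts * fraction)
    # -i selects the GPU, -pl sets the power limit in watts.
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(target_watts)]


# 450 W is an assumed example default limit, not a value from the guide.
print(shlex.join(power_limit_command(450)))
```

Applying the printed command normally requires administrator or root rights, and the driver rejects values outside the card's supported minimum and maximum limits.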

The article’s main performance claim is based on sustained RTX 4090 workload data presented in the guide. At stock settings, the guide lists 390 watts, 72°C and 100% speed. At a 70% power limit, it lists 300 watts, 67°C and 93.4% of speed, implying a 90-watt reduction for about a 6.6% throughput loss. At 60%, the guide lists 260 watts, 62°C and 91.5% speed. At 40%, performance falls more sharply to 61.3%, which the guide treats as past the useful range.
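The efficiency gain implied by these figures can be checked with simple arithmetic. The sketch below uses only the three rows for which the article gives both wattage and speed (the 40% row has no wattage listed, so it is left out) and computes throughput per watt relative to stock. The function name is an illustrative choice.

```python
# (label, watts, percent of stock speed) as listed in the article.
ROWS = [
    ("stock", 390, 100.0),
    ("70% limit", 300, 93.4),
    ("60% limit", 260, 91.5),
]


def relative_efficiency(rows):
    """Throughput per watt for each row, normalized to the first row."""
    _, base_watts, base_speed = rows[0]
    baseline = base_speed / base_watts
    return {label: (speed / watts) / baseline for label, watts, speed in rows}


for label, gain in relative_efficiency(ROWS).items():
    print(f"{label}: {gain:.2f}x throughput per watt vs stock")
# stock: 1.00x, 70% limit: 1.21x, 60% limit: 1.37x
```

On these numbers, each watt at the 70% limit delivers about 21% more throughput than at stock, which is the "watts fall faster than tokens per second" pattern the guide describes.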

The source also cites card-by-card examples, including an RTX 4090 capped at 300 watts that it says keeps 97.8% of performance, and RTX 5090 power-cap figures that it says show roughly 5% lower speed at 450 watts and about 10% lower speed at 400 watts. The 97.8% figure is higher than the 93.4% the guide’s sustained-workload data lists at the same wattage, which is consistent with the guide presenting all of these numbers as workload-dependent. The guide says users should measure their own tokens per second, power draw, held clock and temperature under the actual model and quantization they run.
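One way to collect those four measurements on an NVIDIA card is to poll nvidia-smi while the model runs. The following is a minimal sketch, not the guide's method: the helper names are hypothetical, and it assumes the driver reports numeric values for `power.draw`, `clocks.sm` and `temperature.gpu` (some cards report `[N/A]` for a field, which this parser does not handle).

```python
import subprocess

# Query fields understood by nvidia-smi --query-gpu.
QUERY_FIELDS = "power.draw,clocks.sm,temperature.gpu"


def parse_sample(csv_line):
    """Parse one 'watts, MHz, degrees C' line of csv,noheader,nounits output."""
    watts, clock_mhz, temp_c = (float(part.strip()) for part in csv_line.split(","))
    return {"watts": watts, "sm_clock_mhz": clock_mhz, "temp_c": temp_c}


def read_gpu_sample(gpu_index=0):
    """Run nvidia-smi once and return the parsed sample for one GPU."""
    result = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_sample(result.stdout.strip().splitlines()[0])


def tokens_per_second(tokens_generated, elapsed_seconds):
    """Throughput for one generation run, timed by the caller."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return tokens_generated / elapsed_seconds
```

Calling `read_gpu_sample()` every few seconds during a generation run, and `tokens_per_second()` on the token count and wall-clock time the inference engine reports, gives one comparable row per power setting.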

Why It Matters

The development matters because local inference workloads are moving into deskside workstations where heat, power draw and fan noise can be limiting factors. A large GPU running near stock power can add hundreds of watts of heat to a room, raise case temperatures and make long inference sessions less practical.

If the guide’s results hold for a user’s workload, power limiting offers a lower-cost first step than buying a new cooler, replacing a case or rebuilding airflow. The significance is not that every system will keep the same throughput, but that the source’s data shows a broad efficiency band where watts fall faster than tokens per second. For workstation owners paying for power, managing noise or trying to run long local jobs, that tradeoff can affect daily usability.

Background

The guide frames power limiting and undervolting as the first lever in a broader series on reducing heat and noise in high-power AI workstations. Its reasoning is specific to inference rather than gaming: the source says many local LLM workloads are constrained by memory bandwidth, so the GPU core can spend time waiting on VRAM rather than running at full compute saturation.

That distinction matters because a gaming workload may lose frames when core clocks are reduced, while a memory-bound inference workload may lose less throughput. The guide says factory voltage curves include margin so cards remain stable across silicon quality and operating conditions. Manual undervolting attempts to keep a target clock at a lower voltage, while power limiting lets the card manage that tradeoff automatically.
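The memory-bandwidth argument can be made concrete with a small roofline-style estimate. The sketch below is an idealized model, not the guide's: it assumes each generated token requires reading all model weights from VRAM once and costs about two floating-point operations per parameter. The 4 GB weight size, 7-billion parameter count and 80 TFLOP/s peak are hypothetical round numbers; 1008 GB/s is the RTX 4090's rated memory bandwidth.

```python
def tokens_per_second(weights_gb, bandwidth_gb_s, flops_per_token,
                      peak_flops, compute_scale=1.0):
    """Idealized decode rate: each token takes the longer of the time to
    stream the weights from VRAM and the time to do the arithmetic."""
    memory_time = weights_gb / bandwidth_gb_s
    compute_time = flops_per_token / (peak_flops * compute_scale)
    return 1.0 / max(memory_time, compute_time)


# Hypothetical workload: 7e9 parameters quantized to about 4 GB.
WORKLOAD = dict(weights_gb=4.0, bandwidth_gb_s=1008.0,
                flops_per_token=2 * 7e9, peak_flops=80e12)

full = tokens_per_second(**WORKLOAD)
half = tokens_per_second(**WORKLOAD, compute_scale=0.5)
print(f"full compute: {full:.0f} tok/s, half compute: {half:.0f} tok/s")
```

In this toy model, halving available compute leaves throughput unchanged because memory time dominates. Real workloads are not purely memory-bound, which is in line with the guide reporting a small loss rather than none.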

The source advises users who go beyond a power limit to test undervolting under their real workload. It gives 0.9 to 0.95 volts as a common starting target, but says stability over a short run does not prove stability over multi-hour inference sessions.
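A long-run stability check can be scripted around whatever inference call the user already has. The sketch below is one possible shape for such a soak test, with hypothetical names throughout: `run_once` stands in for a single generation call that returns the number of tokens produced, and the `clock` parameter exists only so the loop can be driven by a fake timer.

```python
import time


def soak_test(run_once, duration_s, clock=time.monotonic):
    """Call run_once repeatedly for duration_s seconds and summarize
    throughput; any exception from run_once is counted as a failure."""
    rates = []
    failures = 0
    deadline = clock() + duration_s
    while clock() < deadline:
        start = clock()
        try:
            tokens = run_once()
        except Exception:
            failures += 1
            continue
        elapsed = clock() - start
        if elapsed > 0:
            rates.append(tokens / elapsed)
    return {
        "runs": len(rates),
        "failures": failures,
        "min_tps": min(rates) if rates else None,
        "mean_tps": sum(rates) / len(rates) if rates else None,
    }
```

A multi-hour run would pass something like `duration_s=4 * 3600`. A falling `min_tps` or a non-zero failure count suggests the voltage target is too aggressive. This only catches crashes and slowdowns; it does not detect an unstable card silently producing wrong output, so spot-checking generated text remains worthwhile.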

“This is the first thing you should do to a high-power AI workstation”

— Thorsten Meyer AI guide

“Local inference is memory-bound”

— Thorsten Meyer AI guide

“Power limiting moves one slider”

— Thorsten Meyer AI guide

“you make changes at your own risk”

— Thorsten Meyer AI disclosure

What Remains Unclear

Several points remain workload-specific. The source says its figures vary by card, model, quantization and workload, and the supplied material does not include a full test methodology, full hardware configuration or reproducible benchmark logs. It is also not clear from the source material how results differ across AMD GPUs, multi-GPU systems, laptop GPUs or inference engines beyond the cited NVIDIA-focused examples.

The guide treats power limiting as reversible and widely used, but it does not remove the need for user testing. A power cap that works well for one model size, context length or batch setting may produce a different tokens-per-second result under another setup.

What’s Next

Readers applying the guide’s recommendation would next set a conservative power limit, run their actual local inference workload for an extended period, and compare tokens per second, wattage, temperature and fan noise against stock settings. The next data point to watch is whether more independent tests across GPUs, inference engines and model types confirm the same efficiency band.

Key Questions

What is the actual news development?

Thorsten Meyer AI has published a guide and interactive infographic making the case that GPU power limiting and undervolting should be an early step for reducing heat and noise in local AI inference workstations.

Is the performance claim confirmed for every GPU?

No. The source provides RTX 4090 and RTX 5090 examples and says the pattern depends on the card, model, quantization and workload. Users still need to measure their own tokens-per-second results.

What is the difference between power limiting and undervolting?

Power limiting caps how much power the GPU may draw and lets the card adjust clocks and voltage. Undervolting changes the voltage-frequency curve directly, which may preserve more performance for the same heat cut but needs more testing.

Why would inference lose less speed than gaming?

According to the guide, many local inference workloads are limited by memory bandwidth rather than GPU core compute. If the core is waiting on VRAM, reducing core power may have a smaller effect on tokens per second than it would have on a compute-heavy workload.

What remains unclear?

The source material does not provide a full independent benchmark suite, and results outside the cited NVIDIA examples are not established here. Stability and throughput under long runs remain user-specific.

Source: Thorsten Meyer AI

Author

Cryptogram Platform Team
