I want to be honest from the start: I don't have the credentials you'd expect from someone publishing research on AI optimization. I don't have a computer science degree. I don't work at a Google or Meta research lab. I'm a freelance consultant who decided to take an intuition seriously — and follow it with the rigor a university researcher would use.
The result? Adaptive-K, a technique that saves up to 33% of computational costs on the world's most advanced AI models. A technique now under review for integration into NVIDIA's TensorRT-LLM, and one that has caught the attention of the DeepSeek team.
The problem I discovered
Every time we ask ChatGPT to write an email or Claude to summarize a document, we consume electricity equivalent to a light bulb running for several minutes. Multiplied by billions of requests per day, AI's energy bill has become a significant line item in big tech budgets — and in global CO₂ emissions.
But studying the architectures of the most advanced models, I noticed something that seemed absurd: up to 33% of this energy is wasted. Why? Because AI models use the same amount of "brainpower" to answer any question, whether it's "what's 2+2" or "write a doctoral thesis on quantum physics."
My journey: from curiosity to method
I started studying machine learning in 2023, self-taught. No university, no mentor, just an internet connection, lots of curiosity, and a conviction: if I follow the same rules as real researchers, I can produce real results.
I spent months studying not just algorithms, but the scientific method itself. I read Karl Popper on falsifiability. I studied the guidelines from conferences like NeurIPS and ICML. I learned that the difference between an intuition and a scientific discovery lies in the rigor with which it's tested.
🔬 The VERTEX-RESEARCH Protocol
I developed a 10-phase research protocol, inspired by methods from top universities:
- 1. Problem identification — Define what's not working and why it matters
- 2. Systematic literature review — Map everything that's been done
- 3. Theoretical framework — Build the mathematical foundations
- 4. Falsifiable hypothesis formulation — Specify what would prove me wrong
- 5. Experimental design — Plan experiments that can falsify the hypotheses
- 6. Implementation — Reproducible, versioned, tested code
- 7. Experimentation — Multi-seed, controls, ablation studies
- 8. Validation/Falsification — Actively try to prove myself wrong
- 9. Analysis and interpretation — Understand the why, not just the what
- 10. Writing and publication — Communicate reproducibly
📄 The full protocol is available open source: VERTEX-RESEARCH Protocol v1.0
AI as a tool, not an author
I know what you're thinking: "He used ChatGPT to do the research." Yes and no. I used AI extensively, but with strict rules:
- ✓ AI can: search papers, explain concepts, generate boilerplate code, check mathematical errors, suggest experiments
- ✗ AI cannot: formulate hypotheses (I do that), interpret results (I do that), decide if an experiment is valid (I do that)
AI was a speed multiplier, not a replacement for critical thinking. It allowed me to cover in weeks ground that would have taken months. But every scientific decision was mine — and every mistake too.
The technical problem: all tokens are equal (but shouldn't be)
The most advanced language models — like Mistral's Mixtral, Alibaba's Qwen, or NVIDIA's Nemotron — use an architecture called Mixture-of-Experts (MoE). Imagine a team of specialized consultants: there's the math expert, the history one, the programming one.
The problem is that the current system always summons the same number of experts for every question: two for Mixtral, four for Qwen, six for Nemotron, regardless of the request's complexity.
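To make the fixed-k behavior concrete, here is a minimal sketch of standard top-k MoE routing. This is illustrative only, not the actual routing code of Mixtral or any other model: the router softmaxes its logits over the experts and always activates exactly k of them.

```python
import math

def topk_route(router_logits, k):
    """Standard MoE routing: softmax the router logits and always
    activate exactly k experts, however confident the router is."""
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # indices of the k most likely experts
    experts = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in experts)
    weights = [probs[i] / norm for i in experts]  # renormalized mixing weights
    return experts, weights

# Even when the router is almost certain about expert 0, k=2 experts still run:
experts, weights = topk_route([8.0, 0.1, 0.2, 0.1], k=2)
print(experts)  # [0, 2]
```

Notice that the second expert is paid for even when its contribution is negligible; this is exactly the waste Adaptive-K targets.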
Figure: How it works today vs. Adaptive-K
The solution: measuring uncertainty
The key to the solution lies in an information theory concept: entropy. When the model is confident about its answer, entropy is low. When it's uncertain, entropy is high.
Adaptive-K uses this information to dynamically decide how many experts to activate:
- ● Low entropy (model is confident) → fewer experts → savings
- ● Medium entropy → standard number of experts
- ● High entropy (model is uncertain) → more experts → quality preserved
Results: validated on 4 industrial models
The research wasn't limited to theoretical simulations. Adaptive-K was tested on four of the most widely used MoE models in the world, with concrete results:
Figure: Computational savings by model (source: Adaptive-K Technical Paper, DOI: 10.5281/zenodo.18282008)
The multiplicative effect: up to 70% savings
One of the most surprising discoveries from my research concerns the combination with other optimization techniques. When Adaptive-K is used together with quantization (which reduces numerical precision) and speculative decoding (which uses a small draft model to propose tokens the large model then verifies), the savings multiply.
This wasn't obvious. I could have expected the techniques to "cannibalize" each other. Instead, I mathematically proved (and experimentally validated) that they're orthogonal: each acts on a different aspect of computation.
Figure: Technique combination (multiplicative effect)
Formula: 0.69 × 0.67 × 0.65 = 0.30 → 70% overall savings
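The arithmetic behind the multiplicative claim can be checked in a few lines; the per-technique fractions below are the ones from the formula above.

```python
# Each technique keeps a fraction of the original compute; because the
# techniques are orthogonal, the kept fractions multiply.
kept = {
    "adaptive_k": 0.69,
    "quantization": 0.67,
    "speculative_decoding": 0.65,
}

remaining = 1.0
for fraction in kept.values():
    remaining *= fraction

print(f"compute remaining: {remaining:.2f}")      # 0.30
print(f"overall savings:   {1 - remaining:.0%}")  # 70%
```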
Why this matters: my experience as an outsider
When I started sharing the first results, the most common reaction was skepticism. "Who are you to say that NVIDIA researchers missed something?" Fair question.
But here's the point: I didn't say they got something wrong. I said there's an opportunity that hadn't been explored. And I proved it with data, not opinions. When I opened the pull request on TensorRT-LLM, NVIDIA's reviewers didn't ask for my resume. They looked at the code, the tests, the benchmarks. And they assigned a reviewer.
Economic implications
For companies managing large-scale AI infrastructure, the numbers are significant:
Table: savings scenario at 1 billion tokens per day
But beyond the economic savings, there's an environmental issue. Training a single large LLM has been estimated to emit as much CO₂ as five cars over their entire lifetimes. And inference — the daily use of these models — is rapidly overtaking training as the main source of AI energy consumption.
How it works technically
For those who want to dive deeper, here's a more technical explanation. An MoE model's router produces a probability distribution over available experts. The entropy of this distribution measures how "undecided" the router is:
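For a router distribution $p = (p_1, \dots, p_N)$ over $N$ experts, the Shannon entropy is:

```latex
H(p) = -\sum_{i=1}^{N} p_i \log p_i
```

$H$ is $0$ when the router puts all its mass on a single expert, and reaches its maximum, $\log N$, on the uniform distribution.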
Adaptive-K defines entropy thresholds that determine how many experts to activate:
- If H < 0.6 → use 1 expert (router is very confident)
- If 0.6 ≤ H < 1.2 → use 2 experts (moderate uncertainty)
- If H ≥ 1.2 → use 4 experts (high uncertainty)
Thresholds are automatically calibrated on a small representative dataset, ensuring output quality doesn't degrade.
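Put together, the thresholding logic above fits in a few lines. This is a hedged sketch: the threshold values come from the text, but the function names are illustrative and not the API of the released package, and I'm assuming the thresholds are expressed in nats (natural-log entropy).

```python
import math

def router_entropy(probs):
    """Shannon entropy (in nats) of the router's expert distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def choose_k(probs, low=0.6, high=1.2):
    """Map router entropy to the number of experts to activate."""
    h = router_entropy(probs)
    if h < low:
        return 1  # router is very confident: one expert suffices
    if h < high:
        return 2  # moderate uncertainty: standard cost
    return 4      # high uncertainty: spend more to preserve quality

print(choose_k([0.97, 0.01, 0.01, 0.01]))  # 1 (confident)
print(choose_k([0.25, 0.25, 0.25, 0.25]))  # 4 (maximally uncertain)
```

In a real deployment the `low`/`high` thresholds would come from the calibration step described above rather than being hard-coded.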
Availability and next steps
The technology is already available as an open source package on PyPI:
A collaboration with NVIDIA is also underway to integrate Adaptive-K into TensorRT-LLM, the optimized GPU inference framework. A pull request (#10672) is currently under review.
2026 Roadmap
A message for fellow autodidacts
If you're reading this article and feeling discouraged because you don't have the "right credentials," I want to tell you something: the scientific method doesn't ask for passports.
I'm not saying it's easy. I spent nights studying papers I didn't understand. I threw away weeks of work when experiments failed. I had to learn to distinguish between "my code has a bug" and "my hypothesis is wrong" — two very different things.
But if you commit to following the rules of the game — falsifiable hypotheses, reproducible experiments, honest interpretation of results — your contributions will have value. Not because someone gave you a stamp, but because they work.
Conclusion: what I learned
Adaptive-K isn't just an optimization technique. For me, it's proof that:
- 1. Rigor beats credentials — If you follow the scientific method, results speak for themselves
- 2. AI is a multiplier — Used correctly, it lets you do research at speeds that were previously impossible for an individual
- 3. "Obvious" problems are often unsolved — Nobody had thought to use fewer experts when the router is confident. It was right in front of everyone.
- 4. Open source opens doors — NVIDIA doesn't know me. But they looked at my code, and they're evaluating it for integration.
Artificial intelligence is becoming increasingly pervasive in our lives, and with it grow energy and environmental costs. Adaptive-K demonstrates that we don't have to choose between performance and efficiency: with smart approaches, we can have both.
And sometimes, these ideas come from where you least expect them.
Want to learn more?
Read the complete technical paper, explore the research protocol, or try the interactive demo.
Not a computer engineer. An autodidact who developed the VERTEX-RESEARCH protocol for conducting rigorous ML research. Author of Adaptive-K routing, contributor to TensorRT-LLM (NVIDIA). Believes the scientific method doesn't ask for credentials, only rigor.