
From Autodidact to Researcher:
How I Reduced AI Costs by 33%

I'm not a computer engineer. I don't have a PhD. But through a rigorous research protocol, self-directed learning, and disciplined use of AI as a tool, I developed a technique that's attracting attention from NVIDIA and DeepSeek.

By Gabriele Balsamo • January 19, 2026

I want to be honest from the start: I don't have the credentials you'd expect from someone publishing research on AI optimization. I don't have a computer science degree. I don't work at a Google or Meta research lab. I'm a freelance consultant who decided to take an intuition seriously — and follow it with the rigor a university researcher would use.

The result? Adaptive-K, a technique that saves up to 33% of computational costs on the world's most advanced AI models. It's now under review for integration into NVIDIA's TensorRT-LLM, and it has caught the attention of the DeepSeek team.

💡 My philosophy: AI is not a replacement for critical thinking — it's a multiplier. I used Claude and GPT-4 as research assistants, never as authors. Every hypothesis was formulated by me, every experiment designed following academic protocols, every result manually verified. AI helped me move faster, not think less.

The problem I discovered

Every time we ask ChatGPT to write an email or Claude to summarize a document, we consume electricity equivalent to a light bulb running for several minutes. Multiplied by billions of requests per day, AI's energy bill has become a significant line item in big tech budgets — and in global CO₂ emissions.

But studying the architectures of the most advanced models, I noticed something that seemed absurd: up to 33% of this energy is wasted. Why? Because AI models use the same amount of "brainpower" to answer any question, whether it's "what's 2+2" or "write a doctoral thesis on quantum physics."

"It's like turning on all the lights in your house just to go to the bathroom at night. It works, but it's a colossal waste. And nobody seemed to have thought about it."
33% maximum validated savings • 4 industrial models tested • 0 computer science degrees • 1 rigorous protocol

My journey: from curiosity to method

I started studying machine learning in 2023, self-taught. No university, no mentor, just an internet connection, lots of curiosity, and a conviction: if I follow the same rules as real researchers, I can produce real results.

I spent months studying not just algorithms, but the scientific method itself. I read Karl Popper on falsifiability. I studied the guidelines from conferences like NeurIPS and ICML. I learned that the difference between an intuition and a scientific discovery lies in the rigor with which it's tested.

🔬 The VERTEX-RESEARCH Protocol

I developed a 10-phase research protocol, inspired by methods from top universities:

  1. Problem identification — Define what's not working and why it matters
  2. Systematic literature review — Map everything that's been done
  3. Theoretical framework — Build the mathematical foundations
  4. Falsifiable hypothesis formulation — Specify what would prove me wrong
  5. Experimental design — Plan experiments that can falsify the hypotheses
  6. Implementation — Reproducible, versioned, tested code
  7. Experimentation — Multi-seed, controls, ablation studies
  8. Validation/Falsification — Actively try to prove myself wrong
  9. Analysis and interpretation — Understand the why, not just the what
  10. Writing and publication — Communicate reproducibly

📄 The full protocol is available open source: VERTEX-RESEARCH Protocol v1.0

AI as a tool, not an author

I know what you're thinking: "He used ChatGPT to do the research." Yes and no. I used AI extensively, but under strict rules: every hypothesis, every experimental design, and every final verification remained mine.

AI was a speed multiplier, not a replacement for critical thinking. It allowed me to cover in weeks ground that would have taken months. But every scientific decision was mine — and every mistake too.

"Artificial intelligence is like having a tireless research assistant who knows all the literature. But intuition, critical judgment, the ability to say 'this doesn't add up' — that remains human."

The technical problem: all tokens are equal (but shouldn't be)

The most advanced language models — like Mistral's Mixtral, Alibaba's Qwen, or NVIDIA's Nemotron — use an architecture called Mixture-of-Experts (MoE). Imagine a team of specialized consultants: there's the math expert, the history one, the programming one.

The problem is that the current system always summons the same number of experts for every question: two for Mixtral, four for Qwen, six for Nemotron, regardless of the complexity of the request.
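
For context, here's a minimal sketch of that fixed top-k routing in PyTorch (my illustration, not any specific model's code):

import torch

def fixed_topk_route(router_logits, k=2):
    """Standard MoE routing: always activate the same k experts per token,
    no matter how easy or hard the token is."""
    probs = torch.softmax(router_logits, dim=-1)
    weights, experts = torch.topk(probs, k, dim=-1)  # top-k values and their expert indices
    return experts, weights / weights.sum(dim=-1, keepdim=True)  # renormalized weights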

How it works today vs. Adaptive-K

❌ Traditional system
Simple question ("Hello!") → 2 experts (wasteful!)
Complex question ("Explain relativity...") → 2 experts (insufficient?)

✅ Adaptive-K
Simple question ("Hello!") → 1 expert (50% savings!)
Complex question ("Explain relativity...") → 4 experts (when needed!)

The solution: measuring uncertainty

The key to the solution lies in an information theory concept: entropy. When the model is confident about its answer, entropy is low. When it's uncertain, entropy is high.

Adaptive-K uses this information to dynamically decide how many experts to activate:
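
Here's a minimal sketch of the decision rule in PyTorch (the threshold values and the one-to-four expert range are illustrative placeholders, not the calibrated values from the paper):

import torch

def adaptive_k_route(router_logits, k_max=4, thresholds=(0.5, 1.0, 1.5)):
    """Entropy-gated routing: confident (low-entropy) tokens get one expert,
    uncertain (high-entropy) tokens get up to k_max."""
    probs = torch.softmax(router_logits, dim=-1)
    entropy = float(-(probs * torch.log(probs + 1e-9)).sum())
    k = min(1 + sum(entropy > t for t in thresholds), k_max)  # 1..k_max experts
    weights, experts = torch.topk(probs, k)
    return experts, weights / weights.sum()  # renormalized routing weights

A near-one-hot router distribution has entropy close to zero and activates a single expert; a flat, undecided distribution activates all four.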

Results: validated on 4 industrial models

The research wasn't limited to theoretical simulations. Adaptive-K was tested on four of the most widely used MoE models in the world, with concrete results:

Computational savings by model

NVIDIA Nemotron 3 Nano: 33.3% savings (128 experts • Mamba2-Transformer hybrid architecture)
Alibaba Qwen-MoE: 32.4% savings (60 experts • 2.7B active parameters)
Mistral Mixtral 8×7B: 31.0% savings (8 experts • 46.7B total parameters)
Allen AI OLMoE-1B-7B: 24.7% savings (64 experts • 1B active parameters)

Source: Adaptive-K Technical Paper • DOI: 10.5281/zenodo.18282008

The multiplicative effect: up to 70% savings

One of the most surprising discoveries from my research concerns the combination with other optimization techniques. When Adaptive-K is used together with quantization (which reduces number precision) and speculative decoding (which "predicts" answers), savings multiply.

This wasn't obvious. One might have expected the techniques to "cannibalize" each other. Instead, I mathematically proved (and experimentally validated) that they're orthogonal: each acts on a different aspect of the computation.

Technique combination: multiplicative effect

Each technique leaves a fraction of the original compute in place: Adaptive-K leaves 69%, 4-bit quantization 67%, and speculative decoding 65%. Because the techniques are orthogonal, these fractions compose multiplicatively:

0.69 × 0.67 × 0.65 ≈ 0.30 → only 30% of the compute remains, roughly 70% overall savings
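
A quick sanity check of that arithmetic in plain Python (no assumptions beyond the figures above):

remaining = 1.0
for fraction in (0.69, 0.67, 0.65):  # per-technique remaining-compute ratios
    remaining *= fraction
print(f"{remaining:.2f} remaining -> {1 - remaining:.0%} total savings")
# 0.30 remaining -> 70% total savings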

Why this matters: my experience as an outsider

When I started sharing the first results, the most common reaction was skepticism. "Who are you to say that NVIDIA researchers missed something?" Fair question.

But here's the point: I didn't say they got something wrong. I said there's an opportunity that hadn't been explored. And I proved it with data, not opinions. When I opened the pull request on TensorRT-LLM, NVIDIA's reviewers didn't ask for my resume. They looked at the code, the tests, the benchmarks. And they assigned a reviewer.

🎯 The lesson: In the open source world and in research, results speak louder than credentials. If you follow the scientific method correctly, your data will be evaluated for what it is — not for who you are.

Economic implications

For companies managing large-scale AI infrastructure, the numbers are significant:

Scenario: 1 billion tokens per day

Current cost (cloud GPU): ~$300/day → $109,500/year
With Adaptive-K (31% savings): ~$207/day → $75,555/year (a saving of $33,945)

But beyond the economic savings, there's an environmental issue. Training a single large LLM is estimated to emit as much CO₂ as five cars over their entire lifetimes. And inference, the day-to-day use of these models, is rapidly surpassing training as the main source of AI energy consumption.

"Reducing AI inference energy consumption by 30% isn't just a cost issue. It's a necessity for the sector's sustainability."

How it works technically

For those who want to dive deeper, here's a more technical explanation. An MoE model's router produces a probability distribution over available experts. The entropy of this distribution measures how "undecided" the router is:

# Routing entropy: H = -Σ p_i × log(p_i),
# where p_i is the probability assigned to expert i
import torch
probs = torch.softmax(router_logits, dim=-1)  # router_logits: the router's raw scores
H = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)  # small epsilon avoids log(0)

Adaptive-K defines entropy thresholds that determine how many experts to activate: the lower the entropy, the fewer experts are summoned.

Thresholds are automatically calibrated on a small representative dataset, ensuring output quality doesn't degrade.
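
The exact calibration procedure isn't reproduced here; as one plausible sketch (an assumption on my part, not the published method), the thresholds could be placed at fixed quantiles of the routing entropies observed on that calibration set:

import torch

def calibrate_thresholds(calibration_entropies, quantiles=(0.25, 0.5, 0.75)):
    """Hypothetical calibration: set entropy thresholds at chosen quantiles
    of the entropies measured on a small representative dataset."""
    e = torch.as_tensor(calibration_entropies, dtype=torch.float32)
    return tuple(torch.quantile(e, q).item() for q in quantiles)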

Availability and next steps

The technology is already available as an open source package on PyPI:

$ pip install adaptive-k-routing

A collaboration with NVIDIA is also underway to integrate Adaptive-K into TensorRT-LLM, the optimized GPU inference framework. A pull request (#10672) is currently under review.

2026 Roadmap

January 2026: ✅ Validation on NVIDIA Nemotron 3 Nano (33.3% savings)
Q1 2026: TensorRT-LLM integration (PR #10672 under review)
Q2 2026: Integration with vLLM and HuggingFace Transformers
Q3 2026: Validation on DeepSeek-V3 (256 experts)

A message for fellow autodidacts

If you're reading this article and feeling discouraged because you don't have the "right credentials," I want to tell you something: the scientific method doesn't ask for passports.

I'm not saying it's easy. I spent nights studying papers I didn't understand. I threw away weeks of work when experiments failed. I had to learn to distinguish between "my code has a bug" and "my hypothesis is wrong" — two very different things.

But if you commit to following the rules of the game — falsifiable hypotheses, reproducible experiments, honest interpretation of results — your contributions will have value. Not because someone gave you a stamp, but because they work.

Conclusion: what I learned

Adaptive-K isn't just an optimization technique. For me, it's proof that the scientific method, applied with rigor, can matter more than credentials.

Artificial intelligence is becoming increasingly pervasive in our lives, and with it grow energy and environmental costs. Adaptive-K demonstrates that we don't have to choose between performance and efficiency: with smart approaches, we can have both.

And sometimes, these ideas come from where you least expect them.

Want to learn more?

Read the complete technical paper, explore the research protocol, or try the interactive demo.

Author
Gabriele Balsamo
Self-taught researcher • Vertex Data Founder

Not a computer engineer. An autodidact who developed the VERTEX-RESEARCH protocol for conducting rigorous ML research. Author of Adaptive-K routing, contributor to TensorRT-LLM (NVIDIA). Believes the scientific method doesn't ask for credentials, only rigor.
