
From Autodidact to Researcher:
How I Reduced AI Costs by 33%

I'm not a computer engineer. I don't have a PhD. But through a rigorous research protocol, self-directed learning, and disciplined use of AI as a tool, I developed a technique that's attracting attention from NVIDIA and DeepSeek.

By Gabriele Balsamo • January 19, 2026

I want to be honest from the start: I don't have the credentials you'd expect from someone publishing research on AI optimization. I don't have a computer science degree. I don't work at a Google or Meta research lab. I'm a freelance consultant who decided to take an intuition seriously — and follow it with the rigor a university researcher would use.

The result? Adaptive-K, a technique that saves up to 33% of computational costs on the world's most advanced AI models. It's now under review for integration into NVIDIA's TensorRT-LLM, and it has caught the attention of the DeepSeek team.

💡 My philosophy: AI is not a replacement for critical thinking — it's a multiplier. I used Claude and GPT-4 as research assistants, never as authors. Every hypothesis was formulated by me, every experiment designed following academic protocols, every result manually verified. AI helped me move faster, not think less.

The problem I discovered

Every time we ask ChatGPT to write an email or Claude to summarize a document, we consume electricity equivalent to a light bulb running for several minutes. Multiplied by billions of requests per day, AI's energy bill has become a significant line item in big tech budgets — and in global CO₂ emissions.

But studying the architectures of the most advanced models, I noticed something that seemed absurd: up to 33% of this energy is wasted. Why? Because AI models use the same amount of "brainpower" to answer any question, whether it's "what's 2+2" or "write a doctoral thesis on quantum physics."

"It's like turning on all the lights in your house just to go to the bathroom at night. It works, but it's a colossal waste. And nobody seemed to have thought about it."
33% maximum validated savings • 4 industrial models tested • 0 computer science degrees • 1 rigorous protocol

My journey: from curiosity to method

I started studying machine learning in 2023, self-taught. No university, no mentor, just an internet connection, lots of curiosity, and a conviction: if I follow the same rules as real researchers, I can produce real results.

I spent months studying not just algorithms, but the scientific method itself. I read Karl Popper on falsifiability. I studied the guidelines from conferences like NeurIPS and ICML. I learned that the difference between an intuition and a scientific discovery lies in the rigor with which it's tested.

🔬 The VERTEX-RESEARCH Protocol

I developed a 10-phase research protocol, inspired by methods from top universities:

  1. Problem identification — Define what's not working and why it matters
  2. Systematic literature review — Map everything that's been done
  3. Theoretical framework — Build the mathematical foundations
  4. Falsifiable hypothesis formulation — Specify what would prove me wrong
  5. Experimental design — Plan experiments that can falsify the hypotheses
  6. Implementation — Reproducible, versioned, tested code
  7. Experimentation — Multi-seed, controls, ablation studies
  8. Validation/Falsification — Actively try to prove myself wrong
  9. Analysis and interpretation — Understand the why, not just the what
  10. Writing and publication — Communicate reproducibly

📄 The full protocol is available open source: VERTEX-RESEARCH Protocol v1.0

AI as a tool, not an author

I know what you're thinking: "He used ChatGPT to do the research." Yes and no. I used AI extensively, but under strict rules: every hypothesis, every experimental design, and every final verification remained mine.

AI was a speed multiplier, not a replacement for critical thinking. It allowed me to cover in weeks ground that would have taken months. But every scientific decision was mine — and every mistake too.

"Artificial intelligence is like having a tireless research assistant who knows all the literature. But intuition, critical judgment, the ability to say 'this doesn't add up' — that remains human."

The technical problem: all tokens are equal (but shouldn't be)

The most advanced language models — like Mistral's Mixtral, Alibaba's Qwen, or NVIDIA's Nemotron — use an architecture called Mixture-of-Experts (MoE). Imagine a team of specialized consultants: there's the math expert, the history one, the programming one.

The problem is that the current system always summons the same number of experts for every question: two for Mixtral, four for Qwen, six for Nemotron, regardless of the complexity of the request.
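
For context, here's a minimal sketch of that fixed top-k routing in PyTorch (my illustration, not any specific model's code):

import torch

def fixed_topk_route(router_logits, k=2):
    """Standard MoE routing: always activate the same k experts per token,
    no matter how easy or hard the token is."""
    probs = torch.softmax(router_logits, dim=-1)
    weights, experts = torch.topk(probs, k, dim=-1)  # top-k values and their expert indices
    return experts, weights / weights.sum(dim=-1, keepdim=True)  # renormalized weights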

How it works today vs. Adaptive-K

❌ Traditional system
Simple question ("Hello!") → 2 experts (wasteful!)
Complex question ("Explain relativity...") → 2 experts (insufficient?)

✅ Adaptive-K
Simple question ("Hello!") → 1 expert (50% savings!)
Complex question ("Explain relativity...") → 4 experts (when needed!)

The solution: measuring uncertainty

The key to the solution lies in an information theory concept: entropy. When the model is confident about its answer, entropy is low. When it's uncertain, entropy is high.

Adaptive-K uses this information to dynamically decide how many experts to activate:
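
Here's a minimal sketch of the decision rule in PyTorch (the threshold values and the one-to-four expert range are illustrative placeholders, not the calibrated values from the paper):

import torch

def adaptive_k_route(router_logits, k_max=4, thresholds=(0.5, 1.0, 1.5)):
    """Entropy-gated routing: confident (low-entropy) tokens get one expert,
    uncertain (high-entropy) tokens get up to k_max."""
    probs = torch.softmax(router_logits, dim=-1)
    entropy = float(-(probs * torch.log(probs + 1e-9)).sum())
    k = min(1 + sum(entropy > t for t in thresholds), k_max)  # 1..k_max experts
    weights, experts = torch.topk(probs, k)
    return experts, weights / weights.sum()  # renormalized routing weights

A near-one-hot router distribution has entropy close to zero and activates a single expert; a flat, undecided distribution activates all four.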

Results: validated on 4 industrial models

The research wasn't limited to theoretical simulations. Adaptive-K was tested on four of the most widely used MoE models in the world, with concrete results:

Computational savings by model

NVIDIA Nemotron 3 Nano: 33.3% savings (128 experts • Mamba2-Transformer hybrid architecture)
Alibaba Qwen-MoE: 32.4% savings (60 experts • 2.7B active parameters)
Mistral Mixtral 8×7B: 31.0% savings (8 experts • 46.7B total parameters)
Allen AI OLMoE-1B-7B: 24.7% savings (64 experts • 1B active parameters)

Source: Adaptive-K Technical Paper • DOI: 10.5281/zenodo.18282008

The multiplicative effect: up to 70% savings

One of the most surprising discoveries from my research concerns the combination with other optimization techniques. When Adaptive-K is used together with quantization (which reduces number precision) and speculative decoding (which "predicts" answers), savings multiply.

This wasn't obvious. One might have expected the techniques to "cannibalize" each other. Instead, I mathematically proved (and experimentally validated) that they're orthogonal: each acts on a different aspect of the computation.

Technique combination: multiplicative effect

Each technique leaves a fraction of the original compute in place: Adaptive-K leaves 69%, 4-bit quantization 67%, and speculative decoding 65%. Because the techniques are orthogonal, these fractions compose multiplicatively:

0.69 × 0.67 × 0.65 ≈ 0.30 → only 30% of the compute remains, roughly 70% overall savings
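
A quick sanity check of that arithmetic in plain Python (no assumptions beyond the figures above):

remaining = 1.0
for fraction in (0.69, 0.67, 0.65):  # per-technique remaining-compute ratios
    remaining *= fraction
print(f"{remaining:.2f} remaining -> {1 - remaining:.0%} total savings")
# 0.30 remaining -> 70% total savings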

Why this matters: my experience as an outsider

When I started sharing the first results, the most common reaction was skepticism. "Who are you to say that NVIDIA researchers missed something?" Fair question.

But here's the point: I didn't say they got something wrong. I said there's an opportunity that hadn't been explored. And I proved it with data, not opinions. When I opened the pull request on TensorRT-LLM, NVIDIA's reviewers didn't ask for my resume. They looked at the code, the tests, the benchmarks. And they assigned a reviewer.

🎯 The lesson: In the open source world and in research, results speak louder than credentials. If you follow the scientific method correctly, your data will be evaluated for what it is — not for who you are.

Economic implications

For companies managing large-scale AI infrastructure, the numbers are significant:

Scenario: 1 billion tokens per day

Current cost (cloud GPU): ~$300/day → $109,500/year
With Adaptive-K (31% savings): ~$207/day → $75,555/year (a saving of $33,945)

But beyond the economic savings, there's an environmental issue. Training a single large LLM is estimated to emit as much CO₂ as five cars over their entire lifetimes. And inference, the day-to-day use of these models, is rapidly surpassing training as the main source of AI energy consumption.

"Reducing AI inference energy consumption by 30% isn't just a cost issue. It's a necessity for the sector's sustainability."

How it works technically

For those who want to dive deeper, here's a more technical explanation. An MoE model's router produces a probability distribution over available experts. The entropy of this distribution measures how "undecided" the router is:

# Routing entropy: H = -Σ p_i × log(p_i),
# where p_i is the probability assigned to expert i
import torch
probs = torch.softmax(router_logits, dim=-1)  # router_logits: the router's raw scores
H = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)  # small epsilon avoids log(0)

Adaptive-K defines entropy thresholds that determine how many experts to activate: the lower the entropy, the fewer experts are summoned.

Thresholds are automatically calibrated on a small representative dataset, ensuring output quality doesn't degrade.
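
The exact calibration procedure isn't reproduced here; as one plausible sketch (an assumption on my part, not the published method), the thresholds could be placed at fixed quantiles of the routing entropies observed on that calibration set:

import torch

def calibrate_thresholds(calibration_entropies, quantiles=(0.25, 0.5, 0.75)):
    """Hypothetical calibration: set entropy thresholds at chosen quantiles
    of the entropies measured on a small representative dataset."""
    e = torch.as_tensor(calibration_entropies, dtype=torch.float32)
    return tuple(torch.quantile(e, q).item() for q in quantiles)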

Availability and next steps

The technology is already available as an open source package on PyPI:

$ pip install adaptive-k-routing

A collaboration with NVIDIA is also underway to integrate Adaptive-K into TensorRT-LLM, the optimized GPU inference framework. A pull request (#10672) is currently under review.

2026 Roadmap

January 2026: ✅ Validation on NVIDIA Nemotron 3 Nano (33.3% savings)
Q1 2026: TensorRT-LLM integration (PR #10672 under review)
Q2 2026: Integration with vLLM and HuggingFace Transformers
Q3 2026: Validation on DeepSeek-V3 (256 experts)

A message for fellow autodidacts

If you're reading this article and feeling discouraged because you don't have the "right credentials," I want to tell you something: the scientific method doesn't ask for passports.

I'm not saying it's easy. I spent nights studying papers I didn't understand. I threw away weeks of work when experiments failed. I had to learn to distinguish between "my code has a bug" and "my hypothesis is wrong" — two very different things.

But if you commit to following the rules of the game — falsifiable hypotheses, reproducible experiments, honest interpretation of results — your contributions will have value. Not because someone gave you a stamp, but because they work.

Conclusion: what I learned

Adaptive-K isn't just an optimization technique. For me, it's proof that the scientific method, applied with rigor, can matter more than credentials.

Artificial intelligence is becoming increasingly pervasive in our lives, and with it grow energy and environmental costs. Adaptive-K demonstrates that we don't have to choose between performance and efficiency: with smart approaches, we can have both.

And sometimes, these ideas come from where you least expect them.

Want to learn more?

Read the complete technical paper, explore the research protocol, or try the interactive demo.

Author
Gabriele Balsamo
Self-taught researcher • Vertex Data Founder

Not a computer engineer. An autodidact who developed the VERTEX-RESEARCH protocol for conducting rigorous ML research. Author of Adaptive-K routing, contributor to TensorRT-LLM (NVIDIA). Believes the scientific method doesn't ask for credentials, only rigor.
