Why Are AI Token Costs Rising? How to Fix It

There is a strange line item showing up in enterprise budgets this year. The unit price of intelligence is falling faster than almost any input in modern computing — and the bill is going up anyway. Finance teams keep asking the same question: if tokens are getting cheaper every quarter, why do our AI token costs keep climbing?

The answer is uncomfortable, because it can’t be fixed by switching providers or waiting for the next price cut. Gartner put numbers to it in March: by 2030, the cost to run inference on a trillion-parameter model is expected to fall by more than 90% from 2025 levels — yet total inference spend is still projected to rise, because consumption is climbing even faster than price is falling. Their framing is worth sitting with: don’t mistake the deflation of commodity tokens for cheap intelligence.
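The arithmetic behind that projection is simple enough to sketch. The snippet below is a minimal illustration, not Gartner's model: it treats total spend as price per token times tokens consumed, and the 12x consumption figure is a made-up example rather than a forecast.

```python
def spend_ratio(price_drop: float, volume_growth: float) -> float:
    """Future spend divided by current spend.

    price_drop: fractional fall in per-token price (0.90 means 90% cheaper).
    volume_growth: multiple by which token consumption grows (12.0 means 12x).
    """
    return (1.0 - price_drop) * volume_growth


def breakeven_growth(price_drop: float) -> float:
    """Consumption multiple at which total spend stays flat."""
    return 1.0 / (1.0 - price_drop)


if __name__ == "__main__":
    # Hypothetical scenario: prices fall 90%, consumption grows 12x.
    print(round(breakeven_growth(0.90), 2))
    print(round(spend_ratio(0.90, 12.0), 2))
```

Under these assumptions, a 90% price cut is wiped out once consumption grows by roughly ten times, and any growth beyond that shows up as a bigger bill.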

You can see the mechanism plainly in support operations, which is where I spend my time. The shift to agentic workflows means a single user task no longer fires one model call — it fires ten or twenty. Retrieval-augmented generation, the default pattern for grounding answers in your own data, quietly inflates the context on every one of those calls. The token meter doesn’t run once per question. It runs continuously, in the background, on data you’re paying to re-process again and again.

5–30×: more tokens per task for agentic models vs. a standard chatbot (Gartner, 2026)

3–5×: context inflation introduced by typical RAG pipelines, the “context tax”

90%+: projected drop in per-token inference cost by 2030, a drop that bills still outrun

Where the money actually goes

When teams try to control AI token costs, they reach for the obvious levers: a cheaper model, a volume discount, a prompt-caching tier. Those help at the margins. But they treat the symptom. The disease is what we feed the model, not what we pay per token.

In support specifically, the dominant cost driver is raw, unstructured text being shoveled into context windows on every interaction. A full ticket thread with its quoted replies, signatures, and forwarded chains can run thousands of tokens. Pull in the related case history and the CRM record, and a single “summarize this and tell me if the customer is at risk” request can carry a small novel into the model — most of it noise.

Multiply that by an agentic loop that re-sends the growing conversation on each step, across millions of interactions a year, and you arrive at the real bill. You are not paying for reasoning. You are paying, over and over, to transport the same unrefined data past the meter.
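To make the loop effect concrete, here is a rough sketch of how input tokens accumulate when each step re-sends the full history. The function name and the numbers (a 4,000-token starting context, 500 tokens appended per step, 10 steps) are illustrative assumptions, not measurements from any real deployment.

```python
def loop_input_tokens(base_context: int, tokens_added_per_step: int, steps: int) -> int:
    """Total input tokens billed when every step re-sends the whole history."""
    total = 0
    context = base_context
    for _ in range(steps):
        total += context                  # the full context is billed again on this call
        context += tokens_added_per_step  # tool output or model reply is appended
    return total


if __name__ == "__main__":
    single_call = loop_input_tokens(4000, 500, 1)
    ten_steps = loop_input_tokens(4000, 500, 10)
    print(single_call, ten_steps, ten_steps / single_call)
```

With these made-up inputs, the ten-step loop bills 62,500 input tokens against 4,000 for a single call. The growth is quadratic in the number of steps, which is why trimming the starting context pays off on every later call.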

The expensive part of enterprise AI isn’t the thinking. It’s the hauling — moving raw, redundant context past the meter on every single call.

The architectural answer: pay for meaning, not volume

SupportLogic was built on a premise that now looks like a cost strategy as much as a quality one: the model should receive distilled signal, not raw transcript. Our Cognitive AI Cloud reads the unstructured world of support — tickets, chat, voice, email — and extracts more than forty discrete signals from it: frustration, churn risk, urgency, unmet commitments, sentiment trajectory. That extraction happens once, and the output is small.

That single design decision cascades into a set of token-cost levers that compound. Here is how they work.

Lever 01 · Signal extraction

Send a structured signal, not a four-thousand-token thread

Instead of handing a model the full raw conversation and asking it to figure out what matters, our Sentiment Agent delivers the conclusion: the sentiment score, the risk flag, the extracted intent. A frontier model reasoning over a compact signal set consumes a fraction of the input tokens it would burn parsing the original text — and returns a tighter, cheaper output because the work is already half done.
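A small sketch shows the size difference this implies. Everything in it is hypothetical: the field names are not SupportLogic's actual schema, the thread is placeholder text, and the four-characters-per-token estimate is only a rough rule of thumb for English, not a real tokenizer.

```python
import json


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about four characters per token for English text."""
    return max(1, len(text) // 4)


# Placeholder thread: one message with a quoted reply and signature, repeated 60 times.
raw_thread = (
    "Customer: We are still waiting on the fix you promised last week.\n"
    "> On Monday you wrote: we will ship the fix by Friday.\n"
    "-- \nJane Doe, Example Corp\n"
) * 60

# Hypothetical signal payload; the keys are illustrative only.
signals = {
    "sentiment_score": -0.62,
    "churn_risk": "high",
    "intent": "escalation",
    "unmet_commitment": True,
}

compact = json.dumps(signals, separators=(",", ":"))
raw_tokens = estimate_tokens(raw_thread)
signal_tokens = estimate_tokens(compact)

if __name__ == "__main__":
    print(raw_tokens, signal_tokens)
```

The exact ratio depends entirely on the thread and the tokenizer, but the shape of the comparison holds: a fixed-size record stays small no matter how long the conversation behind it grows.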

Lever 02 · CRM-Less Architecture

Stop paying the round-trip payload tax

Bolting AI onto the CRM means re-hydrating heavy CRM objects into context to give the model something to work with. Our CRM-Less Architecture keeps the intelligence layer independent of that payload. The model is grounded in pre-computed insight rather than in a re-serialized copy of your records — so you outsmart the CRM instead of paying to drag it into every prompt.

Lever 03 · Precision-guided RAG

Retrieve the passage, not the library

Generic RAG is where the context tax lives: thousands of pages stuffed into the window in the hope that the relevant lines are somewhere inside. With precision-guided retrieval (the technology behind our xFind acquisition, now part of Resolve SX), the system returns the specific grounding the answer needs and nothing else. Fewer input tokens per call and a higher-quality answer come from the same move; they are not a trade-off.
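The general idea of retrieving under a token budget can be sketched in a few lines. This is a deliberately naive stand-in that ranks passages by keyword overlap; it is not how xFind or Resolve SX works, and the function names and the four-characters-per-token estimate are assumptions made for illustration.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Rough estimate: about four characters per token."""
    return max(1, len(text) // 4)


def select_passages(query: str, passages: List[str], token_budget: int) -> List[str]:
    """Keep the best-matching passages that fit inside the token budget."""
    query_terms = set(query.lower().split())

    def score(passage: str) -> int:
        # Naive relevance: number of query words that appear in the passage.
        return len(query_terms & set(passage.lower().split()))

    chosen: List[str] = []
    used = 0
    for passage in sorted(passages, key=score, reverse=True):
        cost = estimate_tokens(passage)
        if score(passage) == 0 or used + cost > token_budget:
            continue  # skip irrelevant passages and ones that would exceed the budget
        chosen.append(passage)
        used += cost
    return chosen


if __name__ == "__main__":
    docs = [
        "how to reset your password from the login page",
        "quarterly billing schedule and invoice format",
        "password rules and lockout policy for admins",
    ]
    print(select_passages("reset password", docs, token_budget=15))
```

The budget is the design choice that matters here: the caller decides how many tokens grounding is worth, and the retriever has to earn each passage it includes.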

Lever 04 · SupportLogic MCP Server

Compute once, serve many assistants

When Claude, ChatGPT, and Gemini each independently re-ingest and re-analyze the same support data, you pay the extraction cost three times. The SupportLogic MCP Server exposes grounded, pre-computed intelligence as a single source any assistant can draw on. The expensive understanding happens once; every downstream agent consumes the cheap, finished result.

Lever 05 · Right-sized models

Don’t spend frontier output tokens on a sentiment label

Our support-specific and business-specific NLP layers are tuned, efficient models that handle high-volume classification and scoring at a cost the headline frontier models can’t approach. The big, expensive reasoning models are reserved for the work that genuinely needs them. Matching the task to the right engine is one of the highest-impact cost levers available, and it’s baked into how the platform routes work.
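In its simplest form, right-sizing is a routing table. The sketch below assumes a two-tier setup with hypothetical task names and model labels; it is not the platform's actual routing logic, only the general pattern.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small-task-model"   # hypothetical label for a tuned, efficient model
    FRONTIER = "frontier-model"  # hypothetical label for a large reasoning model


# Illustrative set of high-volume tasks that a small model can handle.
LIGHT_TASKS = {"sentiment_label", "urgency_score", "language_detect"}


def route(task_type: str) -> Tier:
    """Send classification and scoring to the small tier, everything else to frontier."""
    return Tier.SMALL if task_type in LIGHT_TASKS else Tier.FRONTIER


if __name__ == "__main__":
    for task in ("sentiment_label", "root_cause_analysis"):
        print(task, "->", route(task).value)
```

Defaulting unknown tasks to the frontier tier is a conservative choice for quality. A cost-first deployment might invert that default and escalate only when the small model reports low confidence.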

Lever 06 · Pre-computation & reuse

Analyze the ticket once, not on every query

Signals are computed at ingestion and reused across agents, dashboards, and your enterprise Data Cloud. The same interaction isn’t re-tokenized every time someone asks a question about it. In a world where always-on monitoring agents consume compute around the clock, refusing to redo work is one of the largest savings available.
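The compute-once pattern can be shown with a small store that runs extraction at ingestion and serves cached results afterward. The class, its method names, and the stand-in extractor are all hypothetical; a real extractor would call a model, which is the expensive step being avoided on reads.

```python
from typing import Callable, Dict


class SignalStore:
    """Runs the extractor once per ticket at ingestion; reads never re-run it."""

    def __init__(self, extract: Callable[[str], dict]) -> None:
        self._extract = extract
        self._cache: Dict[str, dict] = {}
        self.extractions = 0  # counts how often the expensive step actually ran

    def ingest(self, ticket_id: str, text: str) -> None:
        self._cache[ticket_id] = self._extract(text)
        self.extractions += 1

    def get(self, ticket_id: str) -> dict:
        # Dashboards, agents, and exports all read the stored result.
        return self._cache[ticket_id]


def fake_extract(text: str) -> dict:
    """Stand-in for a model call; flags text that mentions cancellation."""
    return {"churn_risk": "high" if "cancel" in text.lower() else "low"}


if __name__ == "__main__":
    store = SignalStore(fake_extract)
    store.ingest("T-1", "We will cancel if this is not fixed.")
    for _ in range(3):
        store.get("T-1")
    print(store.extractions, store.get("T-1"))
```

Three reads cost one extraction here. The trade-off is freshness: if a ticket changes, it has to be ingested again before readers see updated signals.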

The same fight, one layer up

This isn’t a battle SupportLogic fights alone, and it shouldn’t be. We attack token cost at the data layer — making sure the model only ever sees distilled signal instead of raw transcript. But the agentic stack wastes tokens at the execution layer too, and a new wave of infrastructure is forming to reclaim them there.

A good example is Modiqo, the AI infrastructure company founded by Chetan Conikee, which this week launched its execution layer, Rote. The premise rhymes with ours. Today’s agents work from a fresh chat log on every run — re-reasoning their way through tasks they already solved yesterday, and paying full token freight to do it. Rote watches what an agent does when it succeeds and turns that run into deterministic, replayable code, so proven workflows execute as cheap, repeatable steps instead of expensive improvisation. Inference gets reserved for the parts that genuinely need a model, and the context bloat that balloons agent runs gets reined in.

Put the two approaches side by side and the pattern is hard to miss. SupportLogic keeps wasteful context out of the prompt; Modiqo keeps wasteful computation out of the loop. Different layers of the stack, the same conviction: in production AI, the cheapest token is the one you never had to spend twice.

The bottom line

Cheaper tokens won’t save you. Fewer, smarter tokens will.

The providers will keep cutting prices, and it won’t be enough — because the volume curve is steeper than the price curve, and agentic AI bends it further every quarter. The enterprises that win on AI economics won’t be the ones who negotiated the best rate. They’ll be the ones whose architecture refuses to send wasteful context in the first place. That discipline can’t be bought after the fact; it has to be designed in.

I’ve argued for years in The Support Experience that great support is fundamentally about signal — hearing what the customer is really telling you, beneath the noise of the transcript. It turns out the same principle that makes support more human also makes it dramatically more efficient to run. When you build a system that listens for meaning instead of swallowing volume, the experience gets better and the bill gets smaller. Those were never separate goals.

Frequently asked questions

Why are enterprise AI token costs rising if token prices are falling?

Per-token prices are dropping sharply, but consumption is climbing even faster. Agentic workflows fire many model calls per task, and retrieval inflates the context on each call, so total spend rises even as unit prices fall.

How does SupportLogic cut AI token costs?

SupportLogic sends models distilled signal instead of raw transcripts. By extracting customer signals once, grounding answers with precision-guided retrieval, serving pre-computed intelligence through its MCP Server, and right-sizing models to the task, it sharply reduces the input and output tokens consumed per interaction.

What is the RAG “context tax”?

The context tax is the extra token spend created when retrieval-augmented generation stuffs large volumes of documents into the model context window on every query, inflating the token count per call several times over.

Does a CRM-Less Architecture lower AI costs?

Yes. Bolting AI onto a CRM means re-hydrating heavy CRM records into context on every call. A CRM-Less Architecture grounds the model in pre-computed intelligence instead, removing the round-trip payload that drives up token consumption.

See what your support AI is really costing you

Model the savings of sending signal instead of raw data — then watch it run on your own tickets in 45 days.

