The Hidden Cost Problem in AI Agents (and What We Built to Fix It)
Token costs in production AI agents are invisible until they're painful. Here's why that happens, what it costs teams, and how TraceStack — our open-source LLM tracer — addresses the visibility gap.
Here is a pattern that plays out regularly on teams building AI agents: something gets deployed to production, usage grows, and then a month later someone opens the API bill and has to explain why it’s three times what they expected.
The problem isn’t that the agent was expensive. The problem is that no one knew it was expensive until after the fact. Token usage compounds across parallel sessions, multi-step reasoning chains, and retrieval-augmented prompts in ways that don’t have obvious signals. The cost of a single session might look fine. The cost of a thousand sessions, some of which hit expensive edge cases, is a different number entirely.
This is a visibility problem. And it’s one that most teams solve too late.
Why the current tools don’t fully address it
There are good LLM observability platforms. LangSmith, Langfuse, Helicone, and Braintrust all give you trace data. They’re worth knowing about, and some teams should use them.
But there are also real reasons why teams don’t. They require SDK-level lock-in — you structure your agent code around the observability framework, not the other way around. They have framework-specific integrations that break when you switch models or upgrade libraries. The hosted tiers start at $39–$79/month, which is fine at scale but hard to justify while you’re still figuring out what you’re building.
The zero-dependency option — reading structured logs and doing the math yourself — technically works but doesn’t scale past a toy project.
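For context, the "do the math yourself" baseline looks something like this: parse a JSON-lines log of LLM calls and total tokens per model. The log field names here are assumptions for illustration, not a real schema.

```python
import json

# Two example log records; in practice you'd read these from a file.
log_lines = [
    '{"model": "gpt-4o", "input_tokens": 800, "output_tokens": 200}',
    '{"model": "gpt-4o", "input_tokens": 1200, "output_tokens": 350}',
]

# Sum input/output tokens per model.
totals = {}
for line in log_lines:
    rec = json.loads(line)
    t = totals.setdefault(rec["model"], {"input": 0, "output": 0})
    t["input"] += rec["input_tokens"]
    t["output"] += rec["output_tokens"]

print(totals)  # {'gpt-4o': {'input': 2000, 'output': 550}}
```

This works for a weekend project; it falls over once you need per-session traces, latency, or multiple services writing logs in different shapes.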
What we wanted was something that fit into existing agent code without restructuring it, gave real-time cost and latency data per call, and could be self-hosted without ceremony. That’s TraceStack.
How it works
The SDK is pure Python standard library. There are no dependencies to manage, no version conflicts to debug, and no platform that needs to be running before you can instrument anything.
You instrument your agent code in one of three ways:
import agenttrace

agenttrace.init(api_key="at_...", project="my-agent")

# Option 1 — decorator. Timing is automatic.
@agenttrace.trace("call_claude")
def call_claude(prompt):
    response = anthropic.messages.create(model="claude-sonnet-4-6", ...)
    agenttrace.record_tokens(
        "claude-sonnet-4-6",
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
    )
    return response.content[0].text

# Option 2 — context manager for any span
with agenttrace.span("retrieve_context"):
    docs = vector_db.query(embedding, top_k=5)

# Option 3 — manual, for when you need full control
trace = agenttrace.start_trace("agent_run", tags=["prod"])
agenttrace.start_span("reasoning_step")
# ... do the thing
agenttrace.record_tokens("gpt-4o", input_tokens=800, output_tokens=200)
agenttrace.end_trace()
Traces are sent from a background daemon thread, so the agent never blocks on trace delivery. If the network is slow or the backend is down, the failure is silent and the agent keeps running.
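The fire-and-forget pattern is simple enough to sketch in the standard library. This is a minimal illustration of the idea, not TraceStack's actual implementation: enqueueing returns immediately, and a daemon thread drains the queue and swallows delivery failures.

```python
import json
import queue
import threading
import urllib.request

class TraceSender:
    """Fire-and-forget trace delivery: enqueue() never blocks the caller;
    a daemon thread drains the queue and drops traces it cannot deliver."""

    def __init__(self, endpoint):
        self.endpoint = endpoint
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def enqueue(self, trace):
        self._queue.put(trace)  # returns immediately, never blocks

    def _drain(self):
        while True:
            trace = self._queue.get()
            try:
                req = urllib.request.Request(
                    self.endpoint,
                    data=json.dumps(trace).encode(),
                    headers={"Content-Type": "application/json"},
                )
                urllib.request.urlopen(req, timeout=5)
            except Exception:
                pass  # backend down or network slow: drop silently
            finally:
                self._queue.task_done()
```

The daemon flag matters: the worker thread dies with the process instead of keeping it alive waiting on the queue, so a crashing or exiting agent is never held hostage by its own telemetry.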
What you get back
The backend — a FastAPI app that you can deploy to Render in about five minutes — stores trace and span data in Turso, a managed SQLite database that persists across deploys without disk management.
From there you can query:
GET /traces/stats?days=7 — total traces, total cost, average latency, and a breakdown by model over the last N days
GET /traces — paginated trace history filtered by project
GET /traces/{id} — full trace with all spans
The stats endpoint is the one that matters most in practice. It tells you what the last week cost by model, which is usually enough to identify whether something is running more expensively than expected and which part of the agent is responsible.
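To make that concrete, here is what working with a stats payload might look like. The field names below are assumptions about the response shape, not the documented schema, but the idea holds regardless: find the model driving cost this week.

```python
import json

# Hypothetical /traces/stats?days=7 response; field names are illustrative.
stats = json.loads("""
{
  "total_traces": 1240,
  "total_cost_usd": 18.42,
  "avg_latency_ms": 830,
  "by_model": {
    "gpt-4o": {"cost_usd": 14.10, "traces": 600},
    "claude-3-5-haiku": {"cost_usd": 4.32, "traces": 640}
  }
}
""")

# Which model accounts for the most spend?
top_model, top = max(
    stats["by_model"].items(), key=lambda kv: kv[1]["cost_usd"]
)
print(f"{top_model}: ${top['cost_usd']:.2f} across {top['traces']} traces")
```

In this example the two models handle a similar number of traces, but one accounts for three quarters of the spend; that asymmetry is exactly the kind of thing the weekly breakdown surfaces.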
Why self-hostable matters
A lot of agents run in sensitive environments — against internal documents, customer data, proprietary systems. Routing that trace data through a third-party platform raises legitimate questions about what leaves the network and who can see it.
Self-hosting TraceStack means the trace data never leaves your infrastructure. The SDK talks to your backend, your backend stores in your Turso database, and nothing touches an external observability platform. For teams where that matters, it’s a meaningful difference.
For teams where it doesn’t, the hosted tier at tracestack.dev gives you the same visibility without running any infrastructure.
The cost estimate accuracy question
One thing worth being honest about: the cost estimates are based on hardcoded pricing tables, not live API prices. Token pricing changes periodically, and the SDK’s estimates will drift over time when it does.
The tables in the current SDK cover GPT-4o, GPT-4o mini, Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus, Gemini 1.5 Pro, Gemini 1.5 Flash, Llama 3.1 70B and 405B, and Mistral Large. For models not in the table, the estimate returns zero rather than guessing.
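The lookup itself is straightforward. This is a sketch of the approach, with placeholder per-million-token rates rather than the SDK's actual numbers, including the return-zero behavior for unknown models.

```python
# Illustrative pricing table (USD per million tokens).
# These rates are placeholders, not the SDK's real table.
PRICES = {
    "gpt-4o": (2.50, 10.00),           # (input rate, output rate)
    "claude-3-5-haiku": (0.80, 4.00),
}

def estimate_cost(model, input_tokens, output_tokens):
    """Estimate call cost in USD; unknown models return 0.0, never a guess."""
    if model not in PRICES:
        return 0.0
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
```

The trade-off is deliberate: a zero shows up as an obvious gap in the stats, whereas a guessed rate would silently corrupt every aggregate built on top of it.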
The estimates are close enough to be useful for spotting problems and comparing sessions. They’re not billing-grade accurate. If you need to reconcile to the cent, you still need to check the provider’s actual usage dashboard — TraceStack tells you where to look, not what the final number is.
What’s next
The current version is read-only analytics — you see what happened but you can’t do anything about it. The next phase is alerts: trigger a webhook or Slack notification when session cost exceeds a threshold, when error rate spikes, or when latency on a specific model degrades. That’s the transition from observability to operability.
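None of this exists yet, but the core of an alerting check is small. Here is a sketch under assumed threshold names; the real feature would presumably run server-side against stored traces and fan out to a webhook or Slack.

```python
# Hypothetical per-session thresholds; names and values are assumptions.
ALERT_THRESHOLDS = {
    "session_cost_usd": 0.50,
    "error_rate": 0.05,
}

def check_alerts(session):
    """Return a list of alert messages for a finished session dict."""
    alerts = []
    if session["cost_usd"] > ALERT_THRESHOLDS["session_cost_usd"]:
        alerts.append(
            f"session cost ${session['cost_usd']:.2f} exceeded threshold"
        )
    error_rate = session["errors"] / max(session["calls"], 1)
    if error_rate > ALERT_THRESHOLDS["error_rate"]:
        alerts.append(f"error rate {error_rate:.0%} exceeded threshold")
    return alerts
```

Each returned message would map to one webhook or Slack notification; an empty list means the session stays quiet.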
A trace viewer UI is also on the roadmap — something you can share with a team to look at a specific session together without writing a query.
For now, the SDK is on PyPI and the backend is MIT-licensed on GitHub. If you’re building agents and want cost and latency data without restructuring your code around an observability platform, give it a try.
pip install agenttrace
Self-host instructions and the backend code are at github.com/ModologyStudiosLLC/agenttrace. Hosted tier at tracestack.dev.