Uncypher Engineering

Why watching your agent think changes everything

From blank screens to real-time intelligence.

April 2026 · 20 min read

#streaming #agents #infrastructure

The Problem

The Blank Screen Problem

An AI agent that takes several seconds to respond leaves a lot of silence to fill. Without streaming, you stare at nothing. With streaming, you watch the thinking itself. Same total time, completely different experience.

[Interactive demo: the same agent run, shown without streaming and with streaming]

Both agents took the same 4.2 seconds. What changed wasn’t the speed. It was whether you could see anything happening.

This is the thing most people miss about streaming. It doesn’t make the agent faster. It makes the agent legible. And once you can see what the agent is doing, you can start doing things with what you see.

Seeing is only half the gift. The other half is knowing what you’re seeing. Zoom in on that stream and you’ll notice something: it’s not characters flowing by. It’s typed events.

Under the Hood

Anatomy of a Stream

A user asks: “Show me last month’s revenue by region.” Watch the events flow.

thinking

Revenue data grouped by region… I’ll query sales with GROUP BY.

Agent reasons through the problem. You’d never see this without streaming.

text

Let me pull that data for you.

Response starts. User knows the agent is working.

tool_call

execute_sql("SELECT region, SUM(revenue)…")

Your frontend can start rendering a table skeleton now.

tool_result

APAC: $142K · EMEA: $98K · Americas: $215K

Data flows through the same event stream.

text

Americas led with $215,400, followed by APAC…

Agent synthesizes data into a natural language summary.

done

847 tokens · $0.003 · 4.2s

Metadata for monitoring and billing.

Every agent framework (OpenAI, Anthropic, open-source, custom) produces some variation of these five event types. The names change; the pattern doesn’t.
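To make “typed events” concrete, here is a minimal sketch in Python. The `AgentEvent` class and the `describe` helper are hypothetical names invented for illustration; the five type strings mirror the walkthrough above rather than any particular SDK.

```python
from dataclasses import dataclass
from typing import Literal

# The five event types from the walkthrough above. Real SDKs use different names.
EventType = Literal["thinking", "text", "tool_call", "tool_result", "done"]


@dataclass(frozen=True)
class AgentEvent:
    """One typed event on the stream (hypothetical shape, for illustration)."""
    type: EventType
    data: str


def describe(event: AgentEvent) -> str:
    """Route on the event type instead of parsing raw characters."""
    if event.type == "thinking":
        return f"[reasoning] {event.data}"
    if event.type == "text":
        return f"[reply] {event.data}"
    if event.type == "tool_call":
        return f"[calling tool] {event.data}"
    if event.type == "tool_result":
        return f"[tool returned] {event.data}"
    return f"[finished] {event.data}"


stream = [
    AgentEvent("thinking", "Revenue data grouped by region"),
    AgentEvent("text", "Let me pull that data for you."),
    AgentEvent("tool_call", "execute_sql"),
    AgentEvent("tool_result", "APAC: $142K"),
    AgentEvent("done", "847 tokens"),
]
labels = [describe(e) for e in stream]
```

Because every event announces its type, each consumer can pick the branch it cares about and ignore the rest.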

Now the interesting question. What can you do when you have a live feed of what an AI is thinking?

Capabilities

Three things you can do with a live stream

Real-Time Safety Net

Say you’re running a customer-facing AI. The model’s about to describe your internal pricing structure. Without streaming, you find out after the user already saw it. With streaming, you have a choice.

1. Agent (source): producing tokens

↓ each token flows down

2. Safety reader (the gate): checks each token

↓ only passing tokens reach the user

3. User chat: what the user sees

The same mechanic applies to human-in-the-loop (a person hits stop), cost control (a token threshold fires), policy enforcement (a rate limit hits). Different triggers, same shape.
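The gate can be sketched as a generator that sits between the agent and the user. This is a minimal illustration, not a production filter: `gate` and `no_internal_pricing` are hypothetical names, and the substring check stands in for whatever policy you actually run.

```python
from typing import Callable, Iterable, Iterator


def gate(tokens: Iterable[str], is_safe: Callable[[str], bool]) -> Iterator[str]:
    """Yield tokens to the user only while the text so far passes the check.

    Stops the stream at the first token that would make the output unsafe,
    so the offending token never reaches the user.
    """
    seen = ""
    for token in tokens:
        candidate = seen + token
        if not is_safe(candidate):
            return  # cut the stream; the caller can emit a fallback message
        seen = candidate
        yield token


# Hypothetical policy: never mention internal pricing.
def no_internal_pricing(text: str) -> bool:
    return "internal pricing" not in text.lower()


agent_tokens = ["Our ", "plans ", "start ", "at $10. ", "Our internal ", "pricing ", "is..."]
shown = list(gate(agent_tokens, no_internal_pricing))
```

Swap `is_safe` for a token budget or a stop flag and you get the other triggers with the same shape. Note that tokens already yielded have reached the user, so a real gate would probably hold back a small buffer before releasing text.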

Predictive UI

Here’s a trick. The moment an agent decides it needs to run a SQL query, it emits a tool_call event. That happens before the query runs. Which means your UI can start getting ready before the data comes back.

[Interactive demo: the event stream alongside the UI rendering it triggers, on a shared timeline]

You’re not waiting for data. You’re waiting with the right UI already loaded.
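Here is one way the idea could look as a small state reducer. It is written in Python for consistency with the other sketches, but the logic ports directly to a frontend. `UiState`, `on_event`, and the event dictionaries are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class UiState:
    view: str = "empty"  # what the frontend is currently showing
    rows: list = field(default_factory=list)


def on_event(state: UiState, event: dict) -> UiState:
    """Advance the UI on each event (hypothetical event dicts with a 'type' key)."""
    if event["type"] == "tool_call" and event.get("tool") == "execute_sql":
        # The query has not run yet, but we already know a table is coming.
        return UiState(view="table_skeleton")
    if event["type"] == "tool_result":
        return UiState(view="table", rows=list(event["rows"]))
    return state


state = UiState()
state = on_event(state, {"type": "text", "text": "Let me pull that data for you."})
state = on_event(state, {"type": "tool_call", "tool": "execute_sql"})
view_before_data = state.view
state = on_event(state, {"type": "tool_result", "rows": [("APAC", 142), ("EMEA", 98)]})
```

The skeleton appears on `tool_call`, before any rows exist, and is swapped for the real table when `tool_result` arrives.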

Stream Splitter

The stream isn’t just for your frontend. One agent produces events; three different systems can consume them. Same stream, different jobs.

[Interactive demo: one agent stream feeding three consumers: Chat UI, Fact Checker, and Slack Notifier]

One stream, many consumers

Pub/sub at the intelligence layer. Fan out to every system that cares.
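A minimal in-process sketch of the fan-out, assuming events are plain dictionaries. The three consumer functions are hypothetical stand-ins for the chat UI, the fact checker, and the Slack notifier; in a real system each would likely be a separate subscriber on the bus.

```python
from typing import Callable

Event = dict
Consumer = Callable[[Event], None]


def fan_out(events: list, consumers: list) -> None:
    """Deliver every event to every consumer; each decides what it cares about."""
    for event in events:
        for consume in consumers:
            consume(event)


chat_ui, fact_checker, slack = [], [], []


def to_chat(e: Event) -> None:
    if e["type"] == "text":  # user-visible text only
        chat_ui.append(e["data"])


def to_fact_checker(e: Event) -> None:
    if e["type"] == "tool_result":  # only tool results carry checkable facts
        fact_checker.append(e["data"])


def to_slack(e: Event) -> None:
    if e["type"] == "tool_call":
        slack.append(f"agent called {e['data']}")


fan_out(
    [
        {"type": "text", "data": "Let me pull that data for you."},
        {"type": "tool_call", "data": "execute_sql"},
        {"type": "tool_result", "data": "Americas: $215K"},
    ],
    [to_chat, to_fact_checker, to_slack],
)
```

The producer never learns who is listening, which is what lets you add a fourth consumer later without touching the agent.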

Three patterns. You’ve seen each one play out. What they have in common is trust. Trust that the stream is there, ordered, reliable, for every user, at every scale. That trust isn’t automatic. Here’s how to earn it.

These demos run in the browser, in a few dozen lines of React. Production doesn’t. Every pattern above assumes the stream reaches the user reliably, at any scale, which is not free. Let’s talk about how to actually build this.

Architecture

Getting events from there to here

Three stages. Start simple. Each one is still inside the next.

The Afternoon Prototype

Agent runs in your web server process. SDK events pipe directly into the SSE response. No queue, no broker, no infrastructure. Fine for prototypes, demos, and low-traffic internal tools.

Move to Stage 2 when → your second user’s request queues behind the first.
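For the prototype stage, the core of “pipe events into the SSE response” is just framing. The sketch below assumes each event is an `(id, type, payload)` tuple; `format_sse` and `sse_body` are hypothetical helper names. The `id:`, `event:`, and `data:` fields and the blank-line terminator come from the SSE wire format itself.

```python
import json


def format_sse(event_id: str, event_type: str, payload: dict) -> str:
    """Serialize one event as a Server-Sent Events frame.

    The payload is JSON-encoded onto a single data line (payload shape is assumed).
    """
    data = json.dumps(payload, separators=(",", ":"))
    return f"id: {event_id}\nevent: {event_type}\ndata: {data}\n\n"


def sse_body(events):
    """Generator a web framework could stream with Content-Type: text/event-stream."""
    for event_id, event_type, payload in events:
        yield format_sse(event_id, event_type, payload)


frames = list(sse_body([
    ("1", "text", {"text": "Let me pull that data for you."}),
    ("2", "done", {"tokens": 847}),
]))
```

Most Python web frameworks can stream a generator like `sse_body` as the response body; the exact response class depends on the framework you use.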

Choosing your stage

The right stage is the simplest one that doesn’t break for your use case. Start here.

Why these choices

Four things you figure out the second or third time you build this. They’re not obvious up front. Each one corresponds to a mistake we or someone we know has made.

1. You can’t staple an agent onto a request.

Here’s the thing nobody warns you about. Your first instinct will be: “Agents are slow, so I’ll just handle them like any other API endpoint. Spawn more workers. Done.”

Try it. You’ll notice something weird. Your p99 latency on unrelated endpoints, like GET /health or POST /settings, starts drifting up. You’ll check your CPU. You’ll check your database. You’ll check your network. Everything looks fine. Except your uvicorn workers are stuck 80% of the time on agent requests, and every other endpoint is queueing behind them.

The subtler issue is deploys. Every time you push a new version of your API, the orchestrator rolls your pods. That kills every in-flight agent run. Users see a broken conversation mid-sentence. You can’t ship without coordinating with the agent clock, which you don’t want to do.

The fix is to separate the two things. HTTP requests are short, so keep those in the API. Agent runs are long and stateful, so move those to their own process with their own lifecycle. Now your API can roll at will. The worker can recover from crashes on its own schedule. You’ve stopped stapling a marathon onto a sprint.

This isn’t about “using a queue.” It’s about noticing that two things you thought were the same thing are actually different things.
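The split can be sketched in a few lines. This is a toy: a thread stands in for the separate worker process, a `queue.Queue` stands in for whatever job queue you run, and `submit`, `worker`, and the `events` dictionary are hypothetical names. The point is only the shape: the request handler returns immediately, and the run lives elsewhere.

```python
import queue
import threading
import uuid

jobs = queue.Queue()
events = {}  # task_id -> events emitted so far (stand-in for a real stream)


def submit(prompt: str) -> str:
    """What the HTTP handler does: enqueue the run and return right away."""
    task_id = uuid.uuid4().hex
    events[task_id] = []
    jobs.put((task_id, prompt))
    return task_id


def worker() -> None:
    """What the worker does: run agents on its own lifecycle."""
    while True:
        item = jobs.get()
        if item is None:  # shutdown signal
            break
        task_id, prompt = item
        events[task_id].append({"type": "text", "data": f"working on: {prompt}"})
        events[task_id].append({"type": "done"})


t = threading.Thread(target=worker)
t.start()
task = submit("revenue by region")
jobs.put(None)  # let the worker drain the queue, then stop
t.join()
```

In a real deployment the worker would be its own process or pod, so an API rollout never touches a run in flight.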

2. The best message bus is the one your team already runs.

You’ll read a dozen articles about Kafka and come away thinking that’s what the big companies use, so maybe you should too. Fine. Go deploy a 3-node Kafka cluster. Set up KRaft. Configure partitioning. Write a consumer group. Come back and tell me what your weekend was like.

Now look at the actual shape of your problem. One producer per agent. A handful of consumers per conversation. Events that live for maybe a minute. Payloads of a few kilobytes of JSON. Kafka was built for the shape of LinkedIn’s activity feed: massive throughput, many producers, many consumers, durable forever. Your problem is a different problem. You’re paying enterprise operational cost for kindergarten-scale traffic.

Redis Streams is what you actually want. It’s append-only. XRANGE catches you up from any point, which is reconnection built in. XREAD BLOCK wakes up clients when new events arrive. Latency is typically well under a millisecond. None of it is free forever, but all of it is free right now.

And (this is the part most comparisons skip) Redis is probably already in your stack. You’re using it for sessions, for rate limits, for cache. Adding streams is one more command on a connection you already have. No new deploy, no new ops runbook, no new thing that might go down at 3am.

If you don’t have Redis yet and you’re still small, don’t add it yet. A Postgres table with LISTEN/NOTIFY does the same job up to a few hundred concurrent streams. It won’t scale forever, but by the time it doesn’t, you’ll know.

The general move: match the tool to the shape of your problem and the shape of your team. Then stop.

3. Two different guarantees need two different writes.

Here’s a thing that took me a while to get right.

You’ve built your worker. It’s producing events. You need to get them to connected clients and to clients that briefly disconnected. Both matter. If you pick Streams only, connected clients work fine (they XREAD BLOCK 0 and get notified), but each live client holds a Redis connection open, and your connection pool grows with your user count. If you pick Pub/Sub only, fan-out is cheap, but there’s no way to recover events a client missed during a 2-second wifi blip. They’re published, nobody was listening, they’re gone.

The obvious-in-retrospect move is to write to both. XADD for durability, so the event is there, replayable, until you delete it. PUBLISH for immediacy, so here it is, right now, to anyone listening. One event, two writes.

It costs one extra Redis command per event. In exchange, your reconnection path becomes trivial: client reconnects, sends Last-Event-ID: 42, server XRANGEs from 43 forward, then subscribes for live. No gaps, no special handling, no “reconnection logic” in the application layer.

There’s a subtle invariant worth naming. If XADD succeeds and PUBLISH fails (say, a Redis hiccup between the two commands), the event is durable but live subscribers miss it. That’s not a bug. Connected clients will see it when they reconnect and replay. The contract is durable, always; live, usually. That’s enough.

The principle: durability and liveness are two different properties of your system, and they don’t compose from a single primitive. When you need both, stop trying to be clever with one. Use two things for two jobs.
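The dual write and its invariant can be sketched without a Redis server. `InMemoryBus` below is a stand-in invented for this example: its `xadd` and `publish` methods play the roles of the XADD and PUBLISH commands, and `publish_ok` exists only to simulate a failure between the two writes.

```python
class InMemoryBus:
    """Stand-in for Redis: a durable log (XADD) plus fire-and-forget pub/sub (PUBLISH)."""

    def __init__(self):
        self.log = []           # plays the role of the stream
        self.subscribers = []   # callables invoked on publish
        self.publish_ok = True  # flip to simulate a hiccup between the two writes

    def xadd(self, event: dict) -> int:
        event_id = len(self.log) + 1
        self.log.append((event_id, event))
        return event_id

    def publish(self, event_id: int, event: dict) -> None:
        if not self.publish_ok:
            raise ConnectionError("simulated Redis hiccup")
        for deliver in self.subscribers:
            deliver(event_id, event)


def emit(bus: InMemoryBus, event: dict) -> int:
    """Durable first, live second. A failed publish must not lose the event."""
    event_id = bus.xadd(event)
    try:
        bus.publish(event_id, event)
    except ConnectionError:
        pass  # durable, always; live, usually
    return event_id


bus = InMemoryBus()
live = []
bus.subscribers.append(lambda event_id, event: live.append(event_id))
emit(bus, {"type": "text"})
bus.publish_ok = False
emit(bus, {"type": "done"})
```

After the second `emit`, the log holds both events while the live subscriber saw only the first. That is the contract in miniature: the missed event is still there to replay.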

4. The transport is not the record.

Redis is fast. It’s in memory. It has streams, pub/sub, low latency, cheap ops. You might be tempted to just… use it for everything. Let the Redis Stream be the permanent record of the conversation. Never delete it.

Don’t.

Here’s what happens. Your Redis hits its maxmemory limit, which it will, because you never expire anything. With the stock noeviction policy, writes start failing. With an LRU policy such as allkeys-lru, which cache-style deployments commonly set, eviction kicks in and Redis picks an approximately least-recently-used key to evict. That key is going to be somebody’s active conversation, the one they just walked away from for coffee, the one they’re about to come back to. That user sees: “Your conversation disappeared.”

This is the kind of failure that’s invisible in testing because your test conversations all fit in memory. It’s catastrophic in production once memory is saturated. And by the time you discover it, Redis is already in pain and you have no good options.

The fix isn’t “tune the eviction policy.” The fix is to notice what Redis is actually for. It’s transport: the fastest way to get an event from a worker to a browser. It is not the historical record of what the user’s AI said yesterday. That record belongs in Postgres, or any real database. Cheap to store. Predictable under memory pressure. Easy to query.

Once you split the two, everything gets simpler. Events flow through Redis during a turn (XADD for durability, PUBLISH for live delivery), and each one also lands in Postgres at write time. When the turn completes, you delete the Redis stream. Its job is done; the Postgres row persists. Cleanup is trivial: delete at completion, set a short marker key so late-arriving retries don’t resurrect the stream. The marker is a little ugly, I agree. It’s ugly in a bounded way.

The principle generalizes. Every piece of infrastructure in your system has a job. Caches cache. Queues queue. Databases persist. You will be tempted, regularly, to use one of them for something else because it’s fast or already there. Resist. The systems that scale are the systems that let each component do one thing well.

Architecture tells you where events live: the workers, the streams, the pub/sub channels. Now zoom in on the final hop, the wire between your API and the user’s browser.

The protocol you choose for that final hop decides what happens when the network misbehaves. Pick wrong and every reconnection becomes your problem to write, debug, and babysit in production.

Protocol

Why SSE, not WebSocket

When you’re picking a transport for streaming agent output, WebSocket is the flashier choice. It’s bidirectional, it’s binary-framed, it’s what every real-time product demo uses. But for this specific problem, it’s the wrong tool, and the reason is worth understanding.

[Interactive demo: SSE and WebSocket side by side across a connection drop]

What you just saw: SSE recovered automatically. WebSocket didn’t.

1. Reconnection comes free with SSE.
The browser retries automatically with Last-Event-ID. WebSocket has nothing built in: you write the retry loop, track sequence numbers, and deduplicate on return. The bugs you write doing this are subtle and only show up in flaky networks.

2. SSE is just HTTP. WebSocket is HTTP-plus.
Every CDN, proxy, and load balancer speaks HTTP. WebSocket requires an upgrade handshake, so every hop in your stack needs explicit WS support. Cloudflare, AWS ALB, nginx all have separate WS configs that can silently break.

3. SSE is visible in DevTools. WebSocket is buried.
When something breaks in production, you want to see the events. SSE shows up in the Network tab as a normal streaming response: inspect it, curl it, replay it. WebSocket frames live in a separate panel with less tooling.

4. The direction matches the problem.
Agent output is one-way: the model talks, the client listens. WebSocket’s bidirectional channel is capability you’re paying for and not using. Paying for the wrong tool makes the debugging harder when it goes wrong.

WebSocket gives you tools you don’t need and makes you build what you do need. SSE is the opposite trade.

So: SSE on the wire, Redis Streams + Pub/Sub in the middle, workers on the edges. The system is sound. And yet. In production, streams still break in ways nobody warns you about.

Production

What Goes Wrong

Streaming works beautifully in a demo. In production, networks are unreliable, users are impatient, and agents sometimes fail. Here’s what actually breaks and how to handle it.

The client disconnects mid-stream

The user’s browser loses connection for a few seconds — a tunnel, a tab switch on mobile, flaky wifi. When it reconnects, it needs to pick up where it left off without missing events or replaying ones it already saw.

Solution: SSE has a built-in mechanism for this. Each event gets an ID. The client sends Last-Event-ID on reconnect. Your server reads the durable log from that point forward via XRANGE, then subscribes to live delivery via pub/sub. The gap is seamless.
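The replay step can be sketched as a pure function over the durable log. One detail matters when you use Redis stream IDs as SSE event IDs: they look like `<milliseconds>-<sequence>`, so they should be compared as number pairs, not as strings. `parse_id` and `replay_after` are hypothetical helpers, and the list stands in for what XRANGE would return.

```python
def parse_id(stream_id: str) -> tuple:
    """Split a '<milliseconds>-<sequence>' ID into a numerically comparable pair."""
    ms, seq = stream_id.split("-")
    return int(ms), int(seq)


def replay_after(log: list, last_event_id) -> list:
    """Return events strictly after last_event_id, or everything if none was sent."""
    if last_event_id is None:
        return list(log)
    cutoff = parse_id(last_event_id)
    return [(eid, ev) for eid, ev in log if parse_id(eid) > cutoff]


log = [
    ("1700000000000-0", "thinking"),
    ("1700000000000-1", "text"),
    ("1700000000002-0", "tool_call"),
    ("1700000000010-0", "done"),
]
missed = replay_after(log, "1700000000000-1")
```

With a real server you would not filter in Python. In Redis 6.2 and later, XRANGE accepts an exclusive start written as `(` followed by the ID, which gives the same “strictly after” behavior on the server side.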

Stale events poison a fresh task

User asks a question. Agent fails halfway. Error event written to the stream. User asks a new question. New SSE connection opens. Replay reads the stream and replays the old error event into the new task’s UI.

Solution: Every event carries a task_id. On replay, skip any event whose task_id doesn’t match the current task. Simple filter, prevents a whole class of confusing UI bugs.
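The filter really is this small. A sketch, assuming events are dictionaries with a `task_id` key; `events_for_task` is a hypothetical helper name.

```python
def events_for_task(replayed: list, current_task_id: str) -> list:
    """Drop replayed events that belong to an earlier task on the same stream."""
    return [e for e in replayed if e.get("task_id") == current_task_id]


replayed = [
    {"task_id": "t1", "type": "error", "data": "agent failed"},
    {"task_id": "t2", "type": "thinking", "data": "new question"},
    {"task_id": "t2", "type": "text", "data": "Here is the answer."},
]
visible = events_for_task(replayed, "t2")
```

The stale error from `t1` never reaches the new task’s UI.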

Duplicate events during reconnection

The overlap between replay (reading the log) and live (subscribing to pub/sub) can produce duplicates. An event published at the switchover moment might appear in both sources.

Solution: The frontend keeps a Set<string> of seen event IDs. Before rendering, check the set. If seen, skip. O(1) lookup, scoped to the connection — no memory leak.
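The same idea as the frontend set, sketched in Python to match the other examples. `dedupe` is a hypothetical helper, and the set lives inside the generator, so it disappears when the connection’s generator does.

```python
from typing import Iterable, Iterator, Tuple


def dedupe(events: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield each (event_id, payload) once, in first-seen order."""
    seen = set()  # scoped to this generator, i.e. to one connection
    for event_id, payload in events:
        if event_id in seen:
            continue
        seen.add(event_id)
        yield event_id, payload


replayed = [("41", "a"), ("42", "b"), ("43", "c")]
live = [("43", "c"), ("44", "d")]  # 43 arrived on both paths at the switchover
rendered = list(dedupe(replayed + live))
```

Event 43 is rendered once even though both the replay and the live path delivered it.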

The agent never finishes

Agent hangs mid-execution — the LLM API timed out, a tool call is stuck, a dependency is unreachable. The user stares at a loading state that will never resolve.

Solution: A background reaper checks for tasks that haven’t sent an event in N minutes. It marks them failed, emits an error event to the stream, and cleans up. The user sees an explicit error instead of infinite loading.
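A sketch of the reaper’s core logic, with the clock passed in so it is easy to test. `find_stale`, `reap`, and the two dictionaries are hypothetical; a real reaper would read last-event timestamps from your task store and run on a schedule.

```python
def find_stale(last_event_at: dict, now: float, timeout_s: float) -> list:
    """Return IDs of tasks whose most recent event is older than timeout_s seconds."""
    return sorted(t for t, ts in last_event_at.items() if now - ts > timeout_s)


def reap(last_event_at: dict, streams: dict, now: float, timeout_s: float) -> list:
    """Mark stale tasks failed by emitting an error event, then stop tracking them."""
    stale = find_stale(last_event_at, now, timeout_s)
    for task_id in stale:
        streams.setdefault(task_id, []).append(
            {"type": "error", "data": "agent timed out"}
        )
        del last_event_at[task_id]
    return stale


last_event_at = {"t1": 1000.0, "t2": 1290.0}  # seconds, from any monotonic source
streams = {"t1": [], "t2": []}
reaped = reap(last_event_at, streams, now=1300.0, timeout_s=120.0)
```

Here `t1` has been silent for 300 seconds and gets an explicit error event; `t2` emitted 10 seconds ago and is left alone.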

When to delete the stream

The durable log served its purpose. But it’s still in memory. Leave it and reconnecting clients replay completed events. Delete too early and slow clients miss the tail end of the response.

Solution: Delete after task completion. Set a short-lived marker key so late writes know the stream was intentionally cleaned. Your database is the source of truth — the stream is ephemeral transport, not storage.
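The delete-plus-marker cleanup, sketched against an in-memory stand-in. `Transport` is invented for this example, and the 60-second marker lifetime is an arbitrary assumption. With Redis the two steps would roughly be a DEL on the stream and a SET with an expiry for the marker.

```python
class Transport:
    """In-memory stand-in for the Redis side: streams plus marker keys with a TTL."""

    def __init__(self):
        self.streams = {}  # task_id -> list of events
        self.closed = {}   # task_id -> time at which the marker expires

    def append(self, task_id: str, event: dict, now: float) -> bool:
        """Write an event unless the stream was deliberately cleaned up."""
        if self.closed.get(task_id, 0.0) > now:
            return False  # late retry: do not resurrect the stream
        self.streams.setdefault(task_id, []).append(event)
        return True

    def complete(self, task_id: str, now: float, marker_ttl_s: float = 60.0) -> None:
        """Delete the stream and leave a short-lived marker behind."""
        self.streams.pop(task_id, None)
        self.closed[task_id] = now + marker_ttl_s


transport = Transport()
transport.append("t1", {"type": "done"}, now=0.0)
transport.complete("t1", now=1.0)
late_write_accepted = transport.append("t1", {"type": "text"}, now=5.0)
```

The late write at `now=5.0` is refused because the marker is still live, so the deleted stream stays deleted. The sketch does not model the database write; it covers only the transport side.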

These five problems show up in every streaming system eventually. The fixes are straightforward once you know to look for them.

We’ve covered a lot of ground: the events, the patterns, the infrastructure, the wire, the failure modes. But the reason any of this matters is bigger than any single piece.

The Takeaway

The deeper shift

Without streaming, an agent is a black box. With it, the agent becomes a transparent collaborator: reasoning you can observe, intermediate states you can act on, mistakes you can catch before they reach the user.

The stream becomes an integration surface for your entire application. Start simple. Ship it. Then build the interesting stuff on top.
