The challenge
Building a voice-to-text app sounds simple. Capture audio, send to an API, display text. But making it feel instant while maintaining accuracy across every macOS app? That's where it gets interesting.
The pipeline
Cue's transcription pipeline has four stages:
1. Audio capture
We use macOS's Core Audio framework to capture microphone input with minimal latency. The audio is chunked into 100ms segments and streamed — we don't wait for you to stop speaking.
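Cue's capture layer isn't shown here, but a minimal sketch of the idea looks like this, using AVAudioEngine (a thin Swift layer over Core Audio). The 100ms buffer hint and the `onChunk` callback are illustrative assumptions, not Cue's actual interface.

```swift
import AVFoundation

// Minimal sketch: tap the default microphone and hand off ~100ms chunks.
// AVAudioEngine sits on top of Core Audio; `onChunk` is a hypothetical hook.
final class MicCapture {
    private let engine = AVAudioEngine()

    func start(onChunk: @escaping (AVAudioPCMBuffer) -> Void) throws {
        let input = engine.inputNode
        let format = input.outputFormat(forBus: 0)

        // Ask for roughly 100ms of samples per buffer (e.g. 4800 frames at 48kHz).
        // The OS treats this as a hint, so actual buffer sizes may vary slightly.
        let framesPer100ms = AVAudioFrameCount(format.sampleRate / 10)

        input.installTap(onBus: 0, bufferSize: framesPer100ms, format: format) { buffer, _ in
            onChunk(buffer)   // stream immediately; don't wait for silence
        }

        engine.prepare()
        try engine.start()
    }

    func stop() {
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()
    }
}
```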
2. Streaming transcription
Each chunk is sent to Deepgram's Nova-2 model via WebSocket. We get partial transcripts back in under 100ms. These partials are shown immediately so you see words appear as you speak.
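A sketch of that streaming leg, built on URLSessionWebSocketTask: audio frames go out, interim transcripts come back. The endpoint URL, query parameters, and JSON field names follow Deepgram's public streaming API docs and are assumptions here, not Cue's exact wiring.

```swift
import Foundation

// Sketch: one WebSocket per dictation session.
final class DeepgramStream {
    private var task: URLSessionWebSocketTask?

    func connect(apiKey: String, onPartial: @escaping (String, Bool) -> Void) {
        var request = URLRequest(url: URL(string:
            "wss://api.deepgram.com/v1/listen?model=nova-2&interim_results=true")!)
        request.setValue("Token \(apiKey)", forHTTPHeaderField: "Authorization")

        task = URLSession.shared.webSocketTask(with: request)
        task?.resume()
        listen(onPartial: onPartial)
    }

    func send(audioChunk: Data) {
        task?.send(.data(audioChunk)) { _ in }   // fire-and-forget per 100ms chunk
    }

    private func listen(onPartial: @escaping (String, Bool) -> Void) {
        task?.receive { [weak self] result in
            if case .success(.string(let json)) = result,
               let data = json.data(using: .utf8),
               let msg = try? JSONSerialization.jsonObject(with: data) as? [String: Any],
               let channel = msg["channel"] as? [String: Any],
               let alt = (channel["alternatives"] as? [[String: Any]])?.first,
               let transcript = alt["transcript"] as? String {
                let isFinal = msg["is_final"] as? Bool ?? false
                onPartial(transcript, isFinal)   // render partials immediately
            }
            self?.listen(onPartial: onPartial)   // keep the receive loop going
        }
    }
}
```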
3. Context-aware correction
Raw transcripts aren't perfect. "I'll send the poll request" should be "I'll send the pull request" if you're in a code editor. We run a lightweight LLM pass that considers three signals (sketched after the list):
- The active app (VS Code vs Mail vs Slack)
- Recent text in the field
- Your correction history
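Here's one way those signals could be bundled into a correction request. The struct, field names, and prompt shape are hypothetical, meant only to show how app identity, surrounding text, and past corrections feed the pass.

```swift
import Foundation

// Illustrative only: packaging the three context signals for the correction pass.
struct CorrectionContext {
    let activeAppBundleID: String          // e.g. "com.microsoft.VSCode"
    let recentFieldText: String            // last few hundred characters before the cursor
    let userCorrections: [String: String]  // learned fixes, e.g. "poll request" -> "pull request"
}

func buildCorrectionPrompt(transcript: String, context: CorrectionContext) -> String {
    """
    You are cleaning up a dictated transcript. Keep the meaning; fix only \
    likely mis-transcriptions given the context.

    Active app: \(context.activeAppBundleID)
    Text already in the field: \(context.recentFieldText)
    Known user corrections: \(context.userCorrections)

    Transcript: \(transcript)
    Corrected transcript:
    """
}
```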
4. Text injection
The final text is injected at the cursor position using macOS Accessibility APIs. This is what makes Cue work in every app: we're not pasting from the clipboard, we're simulating keystrokes.
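One common way to simulate typing on macOS is to post keyboard events carrying a Unicode payload, which requires the user to grant the Accessibility permission. The sketch below shows that technique; it isn't necessarily Cue's exact injection path, and the 20-character chunking is a conservative assumption.

```swift
import CoreGraphics

// Sketch: "type" arbitrary text at the current cursor by posting
// keyboard events with a Unicode string attached.
func injectText(_ text: String) {
    // Post the string in small chunks, since each event carries a short payload.
    let chunkSize = 20
    var remaining = Substring(text)

    while !remaining.isEmpty {
        let chunk = String(remaining.prefix(chunkSize))
        remaining = remaining.dropFirst(chunk.count)

        var utf16 = Array(chunk.utf16)
        let keyDown = CGEvent(keyboardEventSource: nil, virtualKey: 0, keyDown: true)
        keyDown?.keyboardSetUnicodeString(stringLength: utf16.count, unicodeString: &utf16)
        keyDown?.post(tap: .cghidEventTap)

        let keyUp = CGEvent(keyboardEventSource: nil, virtualKey: 0, keyDown: false)
        keyUp?.post(tap: .cghidEventTap)
    }
}
```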
The result
End-to-end, the pipeline averages 180ms from speech to text appearing on screen. Fast enough that it feels like the text is coming from your thoughts, not your voice.