Coding agents have revolutionized programming.
Dictare revolutionizes human-machine interaction.
That's the level. Let me explain.
## The problem nobody talks about
Every voice tool today lives in its own world. Wispr Flow does it one way. Superwhisper does it another. Every coding agent that adds voice support rebuilds the same stack: audio capture, model loading, GPU management, push-to-talk UX — each one different, each one heavy, each one locked to a single application.
The models themselves are extraordinary. Whisper, Parakeet, Kokoro — speech recognition and synthesis that was science fiction five years ago.
But every program downloads its own copy. Loads its own instance. Burns its own GPU memory. For ten apps that want voice, you get ten pipelines, ten configurations, ten different ways to press a button and talk.
Some go cloud. Now you're paying a subscription so your voice can travel to someone else's server and back. Your words, your code discussions, your private conversations — routed through infrastructure you don't control.
This is where we are. Fragmented. Inefficient. Expensive. Not private.
## The missing piece
Speech models have become a commodity. They run on a laptop. They run on a phone. They will run on a thermostat. The technology is ready. What's missing is the infrastructure to make it usable — across applications, consistently, privately, without every developer reinventing the same thing.
Dictare is that infrastructure.
## What Dictare is
Dictare is an open-source voice layer. It runs as a background service on your machine — like Ollama, but for voice. STT and TTS engines are loaded once, optimized for your hardware, and stay ready in memory. Any application connects through a simple open protocol and gets voice interaction for free.
No cloud. No subscription. No API keys. MIT licensed. 100% local. Your machine, your voice, your rules.
Want to see how it all comes together? This video shows everything.
*For the best experience, watch on YouTube in 1080p.*
## Why coding agents first
Coding agents have changed everything. Claude Code, Codex, Gemini CLI, Aider — they write code, refactor, debug, deploy. But you still type your instructions. You're having a conversation, but you're doing it through a keyboard.
Speech, the most natural way humans interact, deserved to be paired with the most revolutionary tool in programming. Put together, each multiplies the other's value.
With Dictare, you speak to your coding agent. Not paste-text-into-a-field speak. Real, continuous, natural speech — delivered directly to the agent as if you typed it. Your voice drives the agent.
And because Dictare, thanks to the OpenVIP protocol, talks directly to your agent, no window focus is required. Your agent can stay behind three other windows. It still gets your words.
## What makes it different
One interface for everything. Today you learn a different voice UX for every app. With Dictare, the interaction is the same everywhere. The same hotkey, the same voice commands, the same pipeline — whether you're talking to Claude Code, Codex, or a custom tool. OpenVIP standardizes human-machine voice interaction.
You choose the engine. Pick the STT model that understands your accent best. Pick the TTS voice you like. Configure silence thresholds for your speaking pace. Your machine, set up your way.
Multi-agent. Run Claude Code in one terminal, Codex in another, Gemini in a third. Switch between them with your voice: "agent jimi", "agent elvis". Each gets its own session. You're the conductor.
Subtle audio cues. Because you don't need window focus, you might be working in another app or walking around. Dictare gives you audio feedback — subtle sounds and TTS announcements when agents switch, when you mute, when something needs your attention. You always know what's happening without looking.
Unix philosophy. Dictare is composable. dictare transcribe outputs to stdout. dictare speak reads from stdin. Combine them:
`dictare transcribe | llm | dictare speak`
Voice in, LLM processing, voice out. Three commands, zero code.
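The same composition works from code, not just from a shell. Below is a minimal sketch of a generic pipe helper using Python's `subprocess`; it is demonstrated with standard tools (`echo`, `tr`) so it runs anywhere, but with Dictare installed you could substitute `["dictare", "transcribe"]` and `["dictare", "speak"]`.

```python
import subprocess

def pipeline(*commands: list[str]) -> str:
    """Run commands in sequence, feeding each one's stdout into the
    next one's stdin -- the programmatic equivalent of a shell pipe."""
    data = b""
    for cmd in commands:
        data = subprocess.run(cmd, input=data, capture_output=True, check=True).stdout
    return data.decode()

# Demonstrated with standard tools so it runs anywhere; with Dictare
# installed you could swap in ["dictare", "transcribe"] and ["dictare", "speak"].
print(pipeline(["echo", "refactor the parser"], ["tr", "a-z", "A-Z"]))
# REFACTOR THE PARSER
```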
## The protocol
Dictare is the reference implementation of OpenVIP — the Open Voice Interaction Protocol.
OpenVIP is intentionally simple: a small set of HTTP endpoints using Server-Sent Events. Simple enough that an ESP32 or an Arduino with a network shield can implement it — anything that speaks HTTP can interact with it. Simple enough that adding voice to your app takes an afternoon, not a quarter.
The protocol standardizes the operations that every voice-enabled application needs: subscribe to transcriptions, send text input, request speech, check status. That's it. The simplicity is the point.
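Since the transport is plain SSE, a client that wants transcriptions only needs to read an event stream off an HTTP response. Here is a minimal parser sketch for that stream; the `transcription` event name and the JSON payload shape are illustrative assumptions, not taken from the OpenVIP spec.

```python
import json

def parse_sse(stream: str):
    """Yield (event, data) pairs from a Server-Sent Events stream.

    Follows the SSE wire format: "event:" names the event, "data:"
    carries the payload, and a blank line terminates each event.
    """
    event, data_lines = "message", []
    for line in stream.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "":
            if data_lines:
                yield event, "\n".join(data_lines)
            event, data_lines = "message", []

# Hypothetical OpenVIP transcription event (field names are assumptions).
sample = (
    "event: transcription\n"
    'data: {"text": "refactor the parser", "final": true}\n'
    "\n"
)
for event, data in parse_sse(sample):
    print(event, json.loads(data)["text"])  # transcription refactor the parser
```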
## Install and try it
macOS:
`brew install dragfly/tap/dictare`
Grant Input Monitoring when prompted. Default hotkey: Right Cmd.
Linux:
`curl -fsSL https://raw.githubusercontent.com/dragfly/dictare/main/install.sh | bash`
First session:
`dictare agent freddie`
Tap Right Cmd. Speak. Double-tap to submit. That's it.
Full documentation at dictare.io/docs.
## This is only the beginning
What you see today is voice for coding agents — because this is how you spread the word.
But the protocol is open, the architecture is modular, and the use cases are endless. Voice-controlled home automation without vendor lock-in. Accessibility tools that work with any application. Voice interfaces for embedded devices.
We're building the foundation. And the roadmap is full.
Dictare is open source, MIT licensed, and free forever.
- Docs: dictare.io/docs
- GitHub: github.com/dragfly/dictare
- OpenVIP: openvip.dev
Try it. Break it. Star it if you like it. Tell me what's wrong with it.
