
Why I Built VoiceBridge: Taking Back Control of My Voice Workflow

I spent months fighting with paid tools and janky workflows just to turn my voice into text and text back into audio. After enough frustration with SuperWhisper’s paywalls, Whispering’s broken clipboard support, and ElevenLabs subscriptions, I built VoiceBridge. It’s a free, local, cross-platform CLI that runs Whisper and VibeVoice on your own hardware with proper workflow integration. This is the story of why that mattered and how I built it.

The Problem Started Simple Enough

I was messing around with OpenAI’s Whisper model[¹] and VibeVoice on my PC one weekend. Both worked beautifully. Fast transcription, clean audio generation, all running locally on my RTX 5090. No cloud dependencies, no subscription fees, no privacy concerns. Just me and the models.

Then I tried to use them for real work.

That’s when things got messy.

I wanted to dictate a quick email. Transcribe a podcast interview. Have my computer read back a draft I’d written. Basic stuff. The kind of workflow that should just work. On macOS, you hit a hotkey and dictate. Text appears under your cursor. Simple. But I wasn’t on macOS. And even if I were, the built-in dictation absolutely sucks.

So I went hunting for alternatives.

The Great Tool Hunt (And Why It Sucked)

First stop: SuperWhisper. Beautiful UI. Great reviews. Mac only. $20/month. Hard pass.

Next up: Whispering for Windows. Finally, something that ran local models. I installed it, tested it, and immediately hit a wall. The “copy to clipboard” feature didn’t work. The “insert under cursor” feature? Also broken. I’d transcribe something and then have to manually copy-paste it like some kind of cave person.

For text-to-speech, ElevenLabs was the gold standard. Incredible voice quality, simple API. Also $22/month for the starter plan. Also sending all my text to their servers.

Here’s the thing: I have an RTX 5090 sitting in my case doing basically nothing when I’m writing. I can run Whisper[¹] and VibeVoice locally. I get privacy. I get speed. I get to feel smug about not paying monthly fees. But none of that matters if the tooling sucks.

I didn’t want a fancy app. I wanted workflow integration. I wanted to:

  • Hit a hotkey, talk, and have text appear under my cursor

  • Copy text to my clipboard and have it read aloud

  • Select a text file and generate an audio file from it

  • Drag an audio file into a folder and get a transcript back

The tools could do the AI part. None of them could do the workflow part.

The Hacky Python Scripts Phase

I’m an engineer. I solve problems. So I wrote some Python scripts.

One script would listen to my microphone, run Whisper, and dump the result to a file. Another would read a file and pipe it to VibeVoice. A third would monitor a directory for new audio files and auto-transcribe them.

It worked. Sort of.

The problem was coordination. I’d be writing an email, want to dictate a sentence, switch to my terminal, run the script, wait for it to finish, copy the output, paste it into my email, and forget what I was going to say in the first place.

Or I’d want to listen to an article while cooking. So I’d select the text, copy it to a file, run the script, wait for the audio to generate, open the audio file, and by then the pasta was overcooked.

The individual pieces worked. The glue didn’t.

I needed a real tool.

Building VoiceBridge: The Plan

I knew what I wanted. A single CLI that could:

  1. Run Whisper[¹] and VibeVoice locally

  2. Integrate with my actual workflow (hotkeys, clipboard, file monitoring)

  3. Work on Linux, Windows, and macOS

  4. Be extensible enough to swap models later

The tech stack came together pretty fast. Python for the core. Typer[²] for the CLI. Pynput for global hotkeys. FFmpeg[³] for audio processing.

The hard part wasn’t the AI. The AI was already solved. The hard part was making it not suck to use.

Challenge 1: Hotkeys That Actually Work

Let’s talk about global hotkeys for a second. On paper, it’s simple. Listen for a key combination, trigger a function. In practice, it’s a nightmare of OS-specific quirks.

On Windows, you’ve got the Win32 API. On Linux, you’ve got X11 or Wayland (good luck). On macOS, you’ve got Accessibility permissions that users need to manually grant.

I went with pynput because it abstracts most of that mess. But even then, there were gotchas. Some key combinations are reserved by the OS. Some only work when your app has focus. Some work differently depending on your desktop environment.

The solution? Let users configure their own hotkeys. Don’t hardcode anything. Provide sane defaults, but make them overridable. And test on all three platforms.

I set up a listener that runs in the background. When you hit the configured hotkey, it starts recording from your microphone. When you release it, it stops, runs Whisper, and either copies the result to your clipboard or inserts it under your cursor.
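Here’s roughly what that listener looks like with pynput. This is a simplified sketch, not VoiceBridge’s actual code: the recording and transcription steps are stubbed out, and F9 stands in for whatever hotkey you configure.

from pynput import keyboard

RECORD_KEY = keyboard.Key.f9  # hypothetical default; the real hotkey is user-configurable

recording = False

def on_press(key):
    global recording
    if key == RECORD_KEY and not recording:
        recording = True
        print("recording...")  # stand-in for starting microphone capture

def on_release(key):
    global recording
    if key == RECORD_KEY and recording:
        recording = False
        # stand-in for stopping capture, running Whisper, then copying the
        # result to the clipboard or inserting it under the cursor
        print("transcribing...")

# The listener runs in its own thread, so the rest of the app stays responsive.
with keyboard.Listener(on_press=on_press, on_release=on_release) as listener:
    listener.join()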

That last part (insert under cursor) was the trickiest. On Linux, you can use xdotool. On macOS, you can use AppleScript. On Windows, you can use pyautogui. Each one has its own timing quirks and edge cases. But once it worked, it felt like magic.
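Stripped down, the platform dispatch looks something like this. Real code needs text escaping, error handling, and the timing tweaks I mentioned, but this is the basic shape:

import platform
import subprocess

def insert_under_cursor(text: str) -> None:
    # Type text into whatever window currently has focus.
    system = platform.system()
    if system == "Linux":
        # X11 only; Wayland needs a different tool entirely
        subprocess.run(["xdotool", "type", "--clearmodifiers", text], check=True)
    elif system == "Darwin":
        # Requires Accessibility permission; text must be AppleScript-escaped
        script = f'tell application "System Events" to keystroke "{text}"'
        subprocess.run(["osascript", "-e", script], check=True)
    elif system == "Windows":
        import pyautogui
        pyautogui.write(text, interval=0.01)  # tiny per-key delay avoids dropped keys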

Challenge 2: CLI Design That Doesn’t Require a PhD

I love a good CLI. I hate a bad one.

A bad CLI makes you memorize flags. A bad CLI has inconsistent naming. A bad CLI gives you cryptic error messages and no help text.

I wanted VoiceBridge to feel intuitive even if you’d never used it before. Good developer experience matters. Companies like Stripe and Twilio have proven that treating developers well pays off. Stripe’s DX is so good that engineers consistently name it as the company that treats their technical user base best[⁴]. The lesson? Be intentional with your words, make things easy to understand, and provide good guidance when someone makes a mistake.

Enter Typer[²]. It’s basically Click with type hints, which means you get automatic validation, help generation, and a clean syntax all in one. You define your commands as functions, add some decorators, and Typer does the rest.

Here’s what the STT command looks like:

import typer
from pathlib import Path
from typing import Optional

app = typer.Typer()

@app.command()
def stt(
    audio_file: Optional[Path] = typer.Argument(None),
    output: Optional[Path] = typer.Option(None, "--output", "-o"),
    insert_cursor: bool = typer.Option(False, "--insert-cursor"),
    copy_clipboard: bool = typer.Option(False, "--copy"),
):
    """Transcribe audio to text using Whisper."""
    # implementation

Clean. Readable. Self-documenting. Run voicebridge stt --help and you get a nice help screen. Pass the wrong type? You get a clear error message. No guesswork.

I organized the CLI into clear subcommands: stt for speech-to-text, tts for text-to-speech, daemon for background services. Each subcommand has its own flags and options. No sprawling mess of top-level flags.
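One way to wire that up is with Typer’s sub-apps. This is an illustrative sketch, not VoiceBridge’s exact command tree:

import typer

app = typer.Typer(help="Local speech-to-text and text-to-speech.")
daemon_app = typer.Typer(help="Background daemon commands.")
app.add_typer(daemon_app, name="daemon")

@daemon_app.command()
def start():
    """Start the background daemon."""
    ...

@daemon_app.command()
def stop():
    """Stop a running daemon."""
    ...

if __name__ == "__main__":
    app()

That gets you voicebridge daemon start and voicebridge daemon stop for free, each with its own help text.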

Challenge 3: Cross-Platform Without Losing Your Mind

Cross-platform Python is a special kind of hell.

File paths? Different on Windows. Audio backends? Different on macOS. Clipboard access? Different everywhere.

I needed abstractions. Clean ones.

For file paths, I used pathlib everywhere. It handles path separators, normalization, and all the other nonsense automatically.

For audio, I standardized on FFmpeg[³] as the preprocessing step. Every platform has FFmpeg. It handles format conversion, sample rate adjustment, channel mixing, all of it. VoiceBridge just shells out to FFmpeg and works with the normalized output.
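A minimal version of that preprocessing step looks like this. The flags are standard FFmpeg, and 16 kHz mono WAV is what Whisper expects, though VoiceBridge’s actual invocation may differ:

import subprocess
from pathlib import Path

def normalize_audio(src: Path, dst: Path) -> Path:
    # Convert any input to 16 kHz mono WAV, the format Whisper expects.
    subprocess.run(
        [
            "ffmpeg",
            "-y",            # overwrite the output file if it exists
            "-i", str(src),  # input in whatever format the user dropped in
            "-ar", "16000",  # resample to 16 kHz
            "-ac", "1",      # mix down to mono
            str(dst),
        ],
        check=True,
        capture_output=True,  # keep ffmpeg's log noise out of the CLI output
    )
    return dst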

For clipboard and keyboard automation, I built adapter classes. Each platform gets its own implementation, but they all implement the same interface. The core logic doesn’t care which OS it’s running on. It just calls clipboard.copy(text) and the adapter figures out the rest.

This is the Ports and Adapters pattern[⁵] in action. The core domain logic (run Whisper, generate audio, process text) doesn’t know or care about OS details. The adapters handle the dirty work. Alistair Cockburn introduced this pattern to create loosely coupled application components that can be easily connected to their software environment. The hexagon isn’t important because six is a magic number. It just gives you room to draw all your ports and adapters without being constrained by traditional layered diagrams.
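Here’s a stripped-down illustration of the clipboard port and two adapters. The class names are mine, not VoiceBridge’s, and a real implementation needs a Windows adapter and error handling:

import platform
import subprocess
from abc import ABC, abstractmethod

class ClipboardPort(ABC):
    # The port: all the core logic ever sees.
    @abstractmethod
    def copy(self, text: str) -> None: ...

class MacClipboard(ClipboardPort):
    def copy(self, text: str) -> None:
        subprocess.run(["pbcopy"], input=text.encode(), check=True)

class LinuxClipboard(ClipboardPort):
    def copy(self, text: str) -> None:
        # X11; a Wayland adapter would shell out to wl-copy instead
        subprocess.run(["xclip", "-selection", "clipboard"], input=text.encode(), check=True)

def make_clipboard() -> ClipboardPort:
    # The only place that knows which OS we're on (Windows adapter omitted here)
    return MacClipboard() if platform.system() == "Darwin" else LinuxClipboard()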

It meant more upfront design. It also meant I could test the core logic without spinning up a full Windows VM.

Challenge 4: Future-Proofing the Model Layer

Whisper[¹] and VibeVoice are great. But they won’t be the best forever.

Maybe in six months, someone releases a better speech-to-text model. The TTS landscape is evolving fast, with billion-parameter models[⁶] trained on 100K+ hours of data achieving new levels of naturalness. Maybe I want to add support for Coqui TTS or Bark or whatever comes next. I didn’t want to rewrite the whole tool every time.

So I built a model abstraction layer.

Every model implements a simple interface:

from abc import ABC, abstractmethod
from pathlib import Path

class STTModel(ABC):
    @abstractmethod
    def transcribe(self, audio_path: Path) -> str:
        pass

class TTSModel(ABC):
    @abstractmethod
    def generate(self, text: str, output_path: Path) -> None:
        pass

The CLI doesn’t call Whisper directly. It calls stt_model.transcribe(). The model implementation happens to be Whisper right now. But swapping it out is just a matter of writing a new adapter.

Same goes for TTS. Right now it’s VibeVoice. Tomorrow it could be something else.
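For illustration, here’s what a Whisper adapter could look like on top of the STTModel interface above, using the openai-whisper package. The class name is hypothetical:

from pathlib import Path

import whisper  # the openai-whisper package

class WhisperSTTModel(STTModel):
    def __init__(self, model_size: str = "base"):
        self._model = whisper.load_model(model_size)

    def transcribe(self, audio_path: Path) -> str:
        # whisper returns a dict; "text" holds the full transcript
        result = self._model.transcribe(str(audio_path))
        return result["text"]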

This is the kind of design decision that feels over-engineered when you’re writing it and brilliant when you need to extend it later.

Challenge 5: Handling the Daemon Process

One of the killer features I wanted was background monitoring. Drop an audio file into a folder, get a transcript. Copy text to your clipboard, hit a hotkey, hear it read aloud.

That required a daemon[⁷]. A long-running process that listens for events and reacts to them.

Python daemons are straightforward until you need to stop them cleanly. You’ve got signal handling, cleanup routines, state management. Get it wrong and you leak resources or corrupt files.

I used a simple event loop with graceful shutdown handling:

import signal
import time

def run_daemon():
    running = True

    def signal_handler(sig, frame):
        # Flip the flag instead of exiting immediately, so the current
        # iteration finishes cleanly before the loop stops.
        nonlocal running
        running = False

    signal.signal(signal.SIGINT, signal_handler)
    signal.signal(signal.SIGTERM, signal_handler)

    while running:
        # check for new files
        # process clipboard
        # handle hotkeys
        time.sleep(0.1)

When you send a SIGTERM or hit Ctrl+C, the daemon finishes its current task and shuts down cleanly. No orphaned processes. No half-written files.

I also added a PID file so you can check if the daemon is running and stop it from another terminal. Basic stuff, but it makes the tool feel polished.
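A POSIX-flavored sketch of that PID file handling. The file location is my assumption, and Windows needs a different stop mechanism:

import os
import signal
from pathlib import Path

PID_FILE = Path.home() / ".voicebridge" / "daemon.pid"  # hypothetical location

def write_pid() -> None:
    # Call this once at daemon startup.
    PID_FILE.parent.mkdir(parents=True, exist_ok=True)
    PID_FILE.write_text(str(os.getpid()))

def stop_daemon() -> None:
    # SIGTERM triggers the graceful-shutdown handler in the loop above.
    pid = int(PID_FILE.read_text())
    os.kill(pid, signal.SIGTERM)
    PID_FILE.unlink()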

The Result: A Tool That Feels Right

After a few weeks of evenings and weekends, I had something that worked. Not just technically, but ergonomically.

I could write an email, hit a hotkey, dictate a paragraph, and keep typing. No context switching.

I could select a blog post, copy it, and listen to it while I made dinner.

I could dump a podcast episode into a watched folder and get a transcript in the morning.

It felt like the tools I’d been using should have felt all along.

What I Learned

Building VoiceBridge taught me a few things.

First, workflow integration matters more than model quality. The best AI in the world is useless if it’s a pain to actually use. Netflix and Spotify both figured this out years ago. They have entire teams dedicated to developer productivity[⁸], measuring and improving engineering workflows. Netflix’s 80-person developer productivity engineering team owns everything from build to test to CI. They measure success by how fast developers can ship. Workflow matters.

Second, cross-platform support is worth the upfront investment. Yes, it’s harder to build. But it means your tool is useful to more people. And the abstractions you build to support multiple platforms make your codebase better.

Third, CLIs don’t have to suck. With tools like Typer[²], you can build interfaces that are both powerful and approachable. Good help text, sensible defaults, and clear error messages go a long way.

Finally, open source is a forcing function for good design. When you know other people might read your code, you write better abstractions. You document your decisions. You think harder about edge cases.

Try It Yourself

If you’ve got a GPU sitting idle and you’re tired of paying for transcription and TTS services, give VoiceBridge a shot.

It’s free. It’s local. It runs on Linux, Windows, and macOS.

Installation is simple:

pip install voicebridge
voicebridge stt --help

The repo has examples, docs, and a quickstart guide: github.com/PatrickKoss/VoiceBridge

And if you find a bug or want a feature, open an issue. Or better yet, send a PR. The abstractions are clean. The code is readable. You’ll figure it out.

Now if you’ll excuse me, I’ve got an email to dictate.

References

[¹]: Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. arXiv:2212.04356. https://arxiv.org/abs/2212.04356

[²]: Tiangolo, S. (n.d.). Typer: Python CLI framework. https://typer.tiangolo.com/

[³]: FFmpeg. (n.d.). FFmpeg Documentation. https://ffmpeg.org/documentation.html

[⁴]: Thoughtworks. (n.d.). Elevate developer experiences with CLI design guidelines. https://www.thoughtworks.com/insights/blog/engineering-effectiveness/elevate-developer-experiences-cli-design-guidelines

[⁵]: Cockburn, A. (2005). Hexagonal Architecture. https://alistair.cockburn.us/hexagonal-architecture

[⁶]: Zhang, Z., et al. (2024). BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data. arXiv:2402.08093. https://arxiv.org/abs/2402.08093

[⁷]: Wikipedia contributors. (2024). Daemon (computing). In Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Daemon_(computing)

[⁸]: Netflix Technology Blog. (n.d.). Developer Productivity Engineering at Netflix. https://netflixtechblog.com; Spotify Engineering. (2020). How We Improved Developer Productivity for Our DevOps Teams. https://engineering.atspotify.com/2020/08/how-we-improved-developer-productivity-for-our-devops-teams
