Building Voice into AI Applications: Streaming TTS and Multi-Voice Audio Production
What if your AI-generated story could sound like an audiobook with a full cast? What if each character had a distinct, recognizable voice—and when one picks up a guitar, you actually hear the song?
We’ve been deep in audio development at Kusog AI, integrating text-to-speech capabilities into the aiAgent platform. The result is a two-tier audio system: one optimized for real-time streaming of LLM responses, and another for production-quality multi-voice content with sound effects and music.
Two Modes for Two Use Cases
Voice in AI applications isn’t one-size-fits-all. A chatbot responding in real-time has fundamentally different requirements than an audiobook with multiple characters. We built for both.
Tier 1: Streaming Audio (AudioManager)
For conversational AI and simple narration, audio needs to start playing immediately. Users don’t want to wait for an entire response to generate before hearing it.
Our AudioManager handles this with:
- WebSocket streaming — Audio arrives sentence-by-sentence, almost instantly
- Progressive decoding — Each chunk is decoded and queued while the previous plays
- Sentence highlighting — The UI tracks which sentence is currently being spoken
- Immediate mode — For LLM voice responses, audio plays now or not at all (no interrupting ongoing speech)
We’ve tested 90 minutes of continuous book playback through this pipeline. It works seamlessly.
The architecture: text streams from the LLM → TTS model generates audio → ffmpeg in streaming mode applies EQ and effects → WebSocket delivers to browser → AudioManager decodes and plays.
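To make the browser side of that pipeline concrete, here is a minimal sketch of how a sentence-by-sentence player along the lines of our AudioManager can be put together with the Web Audio API. The class name, WebSocket message format, and highlighting callback are illustrative assumptions, not the platform's actual code.

```typescript
// Minimal sketch of a streaming sentence-by-sentence player (illustrative, not our AudioManager).
// Assumes the server sends one binary audio chunk per sentence over the WebSocket.

class StreamingAudioPlayer {
  private ctx = new AudioContext();
  private queue: AudioBuffer[] = [];
  private playing = false;
  private sentenceIndex = 0;

  constructor(wsUrl: string, private onSentence: (index: number) => void) {
    const ws = new WebSocket(wsUrl);
    ws.binaryType = "arraybuffer";
    ws.onmessage = async (event) => {
      // Progressive decoding: decode each chunk as it arrives, then queue it.
      const buffer = await this.ctx.decodeAudioData(event.data as ArrayBuffer);
      this.queue.push(buffer);
      if (!this.playing) this.playNext();
    };
  }

  private playNext() {
    const buffer = this.queue.shift();
    if (!buffer) {
      this.playing = false;
      return;
    }
    this.playing = true;
    this.onSentence(this.sentenceIndex++); // drives sentence highlighting in the UI
    const source = this.ctx.createBufferSource();
    source.buffer = buffer;
    source.connect(this.ctx.destination);
    source.onended = () => this.playNext();
    source.start();
  }
}
```

A production player also needs to schedule buffers on the AudioContext clock so consecutive sentences play back-to-back without audible gaps, but the queue-decode-play loop above is the core idea.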
Tier 2: Multi-Track Production (DAW)
When users create stories with multiple characters, we bring in heavier machinery: a full Digital Audio Workstation repurposed from our StringClub platform, the product of 11 years of audio development.
This enables:
- Multiple character voices on separate tracks
- Dialogue timing like music notation — characters can talk over each other naturally
- Sound effects layered on dedicated tracks
- Music integration — songs generated via Mureka from lyrics in the script
- Server-side mixdown to a single MP3
A scene can have three people conversing, one picks up a guitar, a song plays, then conversation resumes. The DAW treats dialogue timing like notes in a string quartet.
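For a mental model of that timeline, here is a hypothetical sketch of the kind of data structures a multi-track scene can rest on. The names and fields are illustrative, not our DAW's actual schema.

```typescript
// Hypothetical data model for a multi-track scene timeline (names are illustrative).

type TrackKind = "dialogue" | "narration" | "sfx" | "music";

interface Clip {
  clipId: string;
  startSec: number;   // absolute position on the timeline, like a note's onset in notation
  durationSec: number;
  audioUrl: string;   // rendered TTS line, sound effect, or generated song
  gainDb?: number;
}

interface Track {
  trackId: string;
  kind: TrackKind;
  character?: string; // e.g. "Dillon" or "Susan" for dialogue tracks
  voiceId?: string;   // which TTS voice renders this character
  clips: Clip[];
}

interface Scene {
  sceneId: string;
  tracks: Track[];    // overlapping clips on separate tracks let characters talk over each other
}
```

Because clips carry absolute start times rather than a fixed order, dialogue on different tracks can overlap naturally, and the server-side mixdown simply flattens every track into one MP3.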
The TTS Infrastructure
We evaluated several TTS models, each with trade-offs. Two made it to production:
Kokoro — Quality and Expressiveness
Our top pick for natural-sounding speech. Kokoro offers:
- 54 voices with excellent multilingual support
- Natural inflection that other models lack
- The French speakers actually sound French when speaking English
For scenes requiring emotional range or accent authenticity, Kokoro delivers.
Microsoft SpeechT5 — Scale and Variety
When you need 20 distinct background characters, variety matters. We expanded SpeechT5 with voices from research datasets:
- 357+ voices from VCTK, LibriTTS, and CMU datasets
- Reliable and consistent output
- Easy to fine-tune for new voices
A single scene might use voices from both models across multiple datasets. Our abstraction layer makes this seamless.
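As a sketch of what such an abstraction layer can look like (the interface and class names here are illustrative, not the platform's real API), a registry can map each logical voice to whichever engine renders it:

```typescript
// Hypothetical abstraction over multiple TTS back ends (names are illustrative).

interface TtsEngine {
  // Returns encoded audio (e.g. WAV bytes) for one line of text in the given voice.
  synthesize(text: string, voiceId: string): Promise<Uint8Array>;
}

// Each entry maps a logical voice to the engine that can render it.
interface VoiceEntry {
  engine: TtsEngine;
  engineVoiceId: string; // e.g. a Kokoro voice name or a SpeechT5 speaker id
}

class VoiceRegistry {
  private voices = new Map<string, VoiceEntry>();

  register(voiceId: string, entry: VoiceEntry) {
    this.voices.set(voiceId, entry);
  }

  synthesize(text: string, voiceId: string): Promise<Uint8Array> {
    const entry = this.voices.get(voiceId);
    if (!entry) throw new Error(`Unknown voice: ${voiceId}`);
    return entry.engine.synthesize(text, entry.engineVoiceId);
  }
}
```

Callers only ever name a voice; which model, dataset, or GPU node actually produces the audio stays behind the interface.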
Why Self-Hosted?
We considered services like ElevenLabs—the voice quality is excellent. But we chose self-hosted models on our NVIDIA CUDA Kubernetes cluster for several reasons:
- Control — We manage scalability, not a third party
- Cost — No per-character pricing at scale
- Security — Sensitive content never leaves our infrastructure
- Flexibility — Mix voices from different models in the same scene
The streaming pipeline runs through ffmpeg, which lets us apply EQ, control playback speed, and add effects while maintaining real-time delivery.
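A rough Node.js sketch of that stage: raw TTS audio goes into ffmpeg's stdin, audio filters are applied, and the encoded stream comes back out of stdout for the WebSocket. The sample rate, filter settings, and output codec below are placeholders rather than our production values.

```typescript
// Sketch of piping TTS output through ffmpeg for EQ, playback speed, and encoding.
// Assumes raw 16-bit mono PCM at 24 kHz in; the rates and filters are placeholders.
import { spawn } from "node:child_process";

export function createAudioFilterStage(tempo = 1.0) {
  const ffmpeg = spawn("ffmpeg", [
    "-f", "s16le", "-ar", "24000", "-ac", "1", "-i", "pipe:0", // raw PCM from the TTS model
    "-af", `equalizer=f=3000:t=q:w=1:g=2,atempo=${tempo}`,     // gentle presence boost + speed control
    "-f", "mp3", "pipe:1",                                     // encoded stream for delivery
  ]);

  // ffmpeg reads PCM on stdin and writes encoded audio on stdout as it goes,
  // so chunks can be forwarded to the browser without waiting for a whole file.
  return { input: ffmpeg.stdin, output: ffmpeg.stdout };
}
```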
Caching and Performance
Generated audio is stored in a media library. Replays don’t regenerate—they retrieve cached audio. But when content changes, the cache invalidates automatically.
This matters for iterative workflows. A user editing their story hears the new version without manual cache management, but listening to the same chapter twice doesn’t burn GPU cycles.
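One simple way to get this behavior, shown here as a sketch rather than our exact scheme, is to derive the cache key from everything that affects the rendered audio, so any edit naturally misses the old entry:

```typescript
// Sketch: content-addressed cache key for generated audio.
// Hashing the text plus voice/settings means any edit produces a new key,
// so stale entries are never served and unchanged content is reused.
import { createHash } from "node:crypto";

interface RenderRequest {
  text: string;
  voiceId: string;
  speed: number;
}

export function audioCacheKey(req: RenderRequest): string {
  return createHash("sha256")
    .update(JSON.stringify([req.text, req.voiceId, req.speed]))
    .digest("hex");
}
```

Checking the media library for that key before calling the TTS engine gives the replay-without-regenerating behavior essentially for free.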
A Real Example: The Awakening of Dillon Mercer
I’ve written a 122,000-word novel, The Awakening of Dillon Mercer. One chapter has the main character sing a song to his girlfriend Susan.
Using the aiAgent platform, I’m creating an audio version that’s more like a movie without the video:
- The app generates a screenplay from the prose
- Each character gets assigned a distinct voice
- The chapter renders with dialogue, narration, and the actual song embedded
- The listener hears the conversation leading into the song, the song itself, and the conversation that follows
This isn’t a reading of the book. It’s a production.
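In data terms, that production starts with the screenplay step: the prose becomes an ordered list of cues that can then be placed on character tracks. A hypothetical shape, purely for illustration:

```typescript
// Hypothetical screenplay representation produced from the prose (illustrative only).
type Cue =
  | { kind: "narration"; text: string }
  | { kind: "dialogue"; character: string; text: string }
  | { kind: "song"; performer: string; lyrics: string }  // rendered via music generation
  | { kind: "sfx"; description: string };

interface Screenplay {
  chapter: string;
  cues: Cue[];  // each cue becomes one or more clips on the matching track
}
```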
What’s Next
We’re actively developing:
- Full-screen playback with support for multiple languages within the same content
- Multilingual scenes — imagine a French class where teacher and students speak both English and French naturally
- Deeper sentence-level highlighting synchronized with audio playback
The Real Challenge
A comment on my recent LinkedIn post captured it well:
“Multi-voice synthesis is the step that makes AI content feel less robotic. The challenge isn’t the tech anymore—it’s making listeners genuinely care about the voice they’re hearing.”
400+ voices ready. Screenplay-style storytelling where every character sounds unique and recognizable. The infrastructure is in place. Now it’s about crafting experiences that connect.
Interested in adding voice capabilities to your AI application? Let’s talk about what’s possible.