Building Voice into AI Applications: Streaming TTS and Multi-Voice Audio Production
What if your AI-generated story could sound like an audiobook with a full cast? What if each character had a distinct, recognizable voice—and when one picks up a guitar, you actually hear the song?
We’ve been deep in audio development at Kusog AI, integrating text-to-speech capabilities into the aiAgent platform. The result is a two-tier audio system: one optimized for real-time streaming of LLM responses, and another for production-quality multi-voice content with sound effects and music.
Two Modes for Two Use Cases
Voice in AI applications isn’t one-size-fits-all. A chatbot responding in real-time has fundamentally different requirements than an audiobook with multiple characters. We built for both.
Tier 1: Streaming Audio (AudioManager)
For conversational AI and simple narration, audio needs to start playing immediately. Users don’t want to wait for an entire response to generate before hearing it.
Our AudioManager handles this with:
- WebSocket streaming — Audio arrives sentence-by-sentence, almost instantly
- Progressive decoding — Each chunk is decoded and queued while the previous plays
- Sentence highlighting — The UI tracks which sentence is currently being spoken
- Immediate mode — For LLM voice responses, audio plays now or not at all (no interrupting ongoing speech)
We’ve tested 90 minutes of continuous book playback through this pipeline. It works seamlessly.
The architecture: text streams from the LLM → TTS model generates audio → ffmpeg in streaming mode applies EQ and effects → WebSocket delivers to browser → AudioManager decodes and plays.
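To make the browser side of that pipeline concrete, here is a minimal sketch of how a sentence-by-sentence player along the lines of our AudioManager can be put together with the Web Audio API. The class name, WebSocket message format, and highlighting callback are illustrative assumptions, not the platform's actual code.

```typescript
// Minimal sketch of a streaming sentence-by-sentence player (illustrative, not our AudioManager).
// Assumes the server sends one binary audio chunk per sentence over the WebSocket.

class StreamingAudioPlayer {
  private ctx = new AudioContext();
  private queue: AudioBuffer[] = [];
  private playing = false;
  private sentenceIndex = 0;

  constructor(wsUrl: string, private onSentence: (index: number) => void) {
    const ws = new WebSocket(wsUrl);
    ws.binaryType = "arraybuffer";
    ws.onmessage = async (event) => {
      // Progressive decoding: decode each chunk as it arrives, then queue it.
      const buffer = await this.ctx.decodeAudioData(event.data as ArrayBuffer);
      this.queue.push(buffer);
      if (!this.playing) this.playNext();
    };
  }

  private playNext() {
    const buffer = this.queue.shift();
    if (!buffer) {
      this.playing = false;
      return;
    }
    this.playing = true;
    this.onSentence(this.sentenceIndex++); // drives sentence highlighting in the UI
    const source = this.ctx.createBufferSource();
    source.buffer = buffer;
    source.connect(this.ctx.destination);
    source.onended = () => this.playNext();
    source.start();
  }
}
```

A production player also needs to schedule buffers on the AudioContext clock so consecutive sentences play back-to-back without audible gaps, but the queue-decode-play loop above is the core idea.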
Tier 2: Multi-Track Production (DAW)
When users create stories with multiple characters, we bring in heavier machinery: a full Digital Audio Workstation repurposed from our StringClub platform, the product of 11 years of audio development.
This enables:
- Multiple character voices on separate tracks
- Dialogue timing like music notation — characters can talk over each other naturally
- Sound effects layered on dedicated tracks
- Music integration — songs generated via Mureka from lyrics in the script
- Server-side mixdown to a single MP3
A scene can have three people conversing, one picks up a guitar, a song plays, then conversation resumes. The DAW treats dialogue timing like notes in a string quartet.
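For a mental model of that timeline, here is a hypothetical sketch of the kind of data structures a multi-track scene can rest on. The names and fields are illustrative, not our DAW's actual schema.

```typescript
// Hypothetical data model for a multi-track scene timeline (names are illustrative).

type TrackKind = "dialogue" | "narration" | "sfx" | "music";

interface Clip {
  clipId: string;
  startSec: number;   // absolute position on the timeline, like a note's onset in notation
  durationSec: number;
  audioUrl: string;   // rendered TTS line, sound effect, or generated song
  gainDb?: number;
}

interface Track {
  trackId: string;
  kind: TrackKind;
  character?: string; // e.g. "Dillon" or "Susan" for dialogue tracks
  voiceId?: string;   // which TTS voice renders this character
  clips: Clip[];
}

interface Scene {
  sceneId: string;
  tracks: Track[];    // overlapping clips on separate tracks let characters talk over each other
}
```

Because clips carry absolute start times rather than a fixed order, dialogue on different tracks can overlap naturally, and the server-side mixdown simply flattens every track into one MP3.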
The TTS Infrastructure
We evaluated several TTS models, each with trade-offs. Two made it to production:
Kokoro — Quality and Expressiveness
Our top pick for natural-sounding speech. Kokoro offers:
- 54 voices with excellent multilingual support
- Natural inflection that other models lack
- The French speakers actually sound French when speaking English
For scenes requiring emotional range or accent authenticity, Kokoro delivers.
Microsoft SpeechT5 — Scale and Variety
When you need 20 distinct background characters, variety matters. We expanded SpeechT5 with voices from research datasets:
- 357+ voices from VCTK, LibriTTS, and CMU datasets
- Reliable and consistent output
- Easy to fine-tune for new voices
A single scene might use voices from both models across multiple datasets. Our abstraction layer makes this seamless.
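As a sketch of what such an abstraction layer can look like (the interface and class names here are illustrative, not the platform's real API), a registry can map each logical voice to whichever engine renders it:

```typescript
// Hypothetical abstraction over multiple TTS back ends (names are illustrative).

interface TtsEngine {
  // Returns encoded audio (e.g. WAV bytes) for one line of text in the given voice.
  synthesize(text: string, voiceId: string): Promise<Uint8Array>;
}

// Each entry maps a logical voice to the engine that can render it.
interface VoiceEntry {
  engine: TtsEngine;
  engineVoiceId: string; // e.g. a Kokoro voice name or a SpeechT5 speaker id
}

class VoiceRegistry {
  private voices = new Map<string, VoiceEntry>();

  register(voiceId: string, entry: VoiceEntry) {
    this.voices.set(voiceId, entry);
  }

  synthesize(text: string, voiceId: string): Promise<Uint8Array> {
    const entry = this.voices.get(voiceId);
    if (!entry) throw new Error(`Unknown voice: ${voiceId}`);
    return entry.engine.synthesize(text, entry.engineVoiceId);
  }
}
```

Callers only ever name a voice; which model, dataset, or GPU node actually produces the audio stays behind the interface.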
Why Self-Hosted?
We considered services like ElevenLabs—the voice quality is excellent. But we chose self-hosted models on our NVIDIA CUDA Kubernetes cluster for several reasons:
- Control — We manage scalability, not a third party
- Cost — No per-character pricing at scale
- Security — Sensitive content never leaves our infrastructure
- Flexibility — Mix voices from different models in the same scene
The streaming pipeline runs through ffmpeg, which lets us apply EQ, control playback speed, and add effects while maintaining real-time delivery.
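A rough Node.js sketch of that stage: raw TTS audio goes into ffmpeg's stdin, audio filters are applied, and the encoded stream comes back out of stdout for the WebSocket. The sample rate, filter settings, and output codec below are placeholders rather than our production values.

```typescript
// Sketch of piping TTS output through ffmpeg for EQ, playback speed, and encoding.
// Assumes raw 16-bit mono PCM at 24 kHz in; the rates and filters are placeholders.
import { spawn } from "node:child_process";

export function createAudioFilterStage(tempo = 1.0) {
  const ffmpeg = spawn("ffmpeg", [
    "-f", "s16le", "-ar", "24000", "-ac", "1", "-i", "pipe:0", // raw PCM from the TTS model
    "-af", `equalizer=f=3000:t=q:w=1:g=2,atempo=${tempo}`,     // gentle presence boost + speed control
    "-f", "mp3", "pipe:1",                                     // encoded stream for delivery
  ]);

  // ffmpeg reads PCM on stdin and writes encoded audio on stdout as it goes,
  // so chunks can be forwarded to the browser without waiting for a whole file.
  return { input: ffmpeg.stdin, output: ffmpeg.stdout };
}
```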
Caching and Performance
Generated audio is stored in a media library. Replays don’t regenerate—they retrieve cached audio. But when content changes, the cache invalidates automatically.
This matters for iterative workflows. A user editing their story hears the new version without manual cache management, but listening to the same chapter twice doesn’t burn GPU cycles.
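One simple way to get this behavior, shown here as a sketch rather than our exact scheme, is to derive the cache key from everything that affects the rendered audio, so any edit naturally misses the old entry:

```typescript
// Sketch: content-addressed cache key for generated audio.
// Hashing the text plus voice/settings means any edit produces a new key,
// so stale entries are never served and unchanged content is reused.
import { createHash } from "node:crypto";

interface RenderRequest {
  text: string;
  voiceId: string;
  speed: number;
}

export function audioCacheKey(req: RenderRequest): string {
  return createHash("sha256")
    .update(JSON.stringify([req.text, req.voiceId, req.speed]))
    .digest("hex");
}
```

Checking the media library for that key before calling the TTS engine gives the replay-without-regenerating behavior essentially for free.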
A Real Example: The Awakening of Dillon Mercer
I’ve written a 122,000-word novel, The Awakening of Dillon Mercer. One chapter has the main character sing a song to his girlfriend Susan.
Using the aiAgent platform, I’m creating an audio version that’s more like a movie without the video:
- The app generates a screenplay from the prose
- Each character gets assigned a distinct voice
- The chapter renders with dialogue, narration, and the actual song embedded
- The listener hears the conversation leading into the song, the song itself, and the conversation that follows
This isn’t a reading of the book. It’s a production.
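In data terms, that production starts with the screenplay step: the prose becomes an ordered list of cues that can then be placed on character tracks. A hypothetical shape, purely for illustration:

```typescript
// Hypothetical screenplay representation produced from the prose (illustrative only).
type Cue =
  | { kind: "narration"; text: string }
  | { kind: "dialogue"; character: string; text: string }
  | { kind: "song"; performer: string; lyrics: string }  // rendered via music generation
  | { kind: "sfx"; description: string };

interface Screenplay {
  chapter: string;
  cues: Cue[];  // each cue becomes one or more clips on the matching track
}
```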
What’s Next
We’re actively developing:
- Full-screen playback with support for multiple languages within the same content
- Multilingual scenes — imagine a French class where teacher and students speak both English and French naturally
- Deeper sentence-level highlighting synchronized with audio playback
The Real Challenge
A comment on my recent LinkedIn post captured it well:
“Multi-voice synthesis is the step that makes AI content feel less robotic. The challenge isn’t the tech anymore—it’s making listeners genuinely care about the voice they’re hearing.”
400+ voices ready. Screenplay-style storytelling where every character sounds unique and recognizable. The infrastructure is in place. Now it’s about crafting experiences that connect.
Interested in adding voice capabilities to your AI application? Let’s talk about what’s possible.