Building Voice into AI Applications: Streaming TTS and Multi-Voice Audio Production

What if your AI-generated story could sound like an audiobook with a full cast? What if each character had a distinct, recognizable voice—and when one picks up a guitar, you actually hear the song?

We’ve been deep in audio development at Kusog AI, integrating text-to-speech capabilities into the aiAgent platform. The result is a two-tier audio system: one optimized for real-time streaming of LLM responses, and another for production-quality multi-voice content with sound effects and music.


Two Modes for Two Use Cases

Voice in AI applications isn’t one-size-fits-all. A chatbot responding in real-time has fundamentally different requirements than an audiobook with multiple characters. We built for both.

Tier 1: Streaming Audio (AudioManager)

For conversational AI and simple narration, audio needs to start playing immediately. Users don’t want to wait for an entire response to generate before hearing it.

Our AudioManager handles this with:

  • WebSocket streaming — Audio arrives sentence-by-sentence, almost instantly
  • Progressive decoding — Each chunk is decoded and queued while the previous plays
  • Sentence highlighting — The UI tracks which sentence is currently being spoken
  • Immediate mode — For LLM voice responses, audio plays now or not at all (no interrupting ongoing speech)

We’ve tested 90 minutes of continuous book playback through this pipeline. It works seamlessly.

The architecture: text streams from the LLM → TTS model generates audio → ffmpeg in streaming mode applies EQ and effects → WebSocket delivers to browser → AudioManager decodes and plays.
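To make the last stage concrete, here is a minimal browser-side sketch, assuming the WebSocket delivers encoded sentence chunks prefixed with a 4-byte sentence index; the class name, framing, and callback are illustrative, not the platform's actual AudioManager API:

```typescript
// Sketch of streaming playback: each chunk is decoded while earlier chunks play,
// then scheduled back-to-back so sentences run gap-free. Names and the message
// framing (4-byte index + audio bytes) are assumptions for this example.
class StreamingAudioPlayer {
  private ctx = new AudioContext();   // may need ctx.resume() after a user gesture
  private nextStartTime = 0;

  constructor(url: string, private onSentenceStart: (index: number) => void) {
    const ws = new WebSocket(url);
    ws.binaryType = "arraybuffer";
    ws.onmessage = (ev) => {
      const data = ev.data as ArrayBuffer;
      const sentenceIndex = new DataView(data).getUint32(0);
      void this.enqueue(sentenceIndex, data.slice(4));
    };
  }

  private async enqueue(sentenceIndex: number, encoded: ArrayBuffer) {
    // Progressive decoding: this chunk decodes while the previous one is playing.
    const buffer = await this.ctx.decodeAudioData(encoded);
    const source = this.ctx.createBufferSource();
    source.buffer = buffer;
    source.connect(this.ctx.destination);

    // Queue seamlessly after whatever is already scheduled.
    const startAt = Math.max(this.ctx.currentTime, this.nextStartTime);
    source.start(startAt);
    this.nextStartTime = startAt + buffer.duration;

    // Drive sentence highlighting when this chunk actually begins to play.
    const delayMs = Math.max(0, (startAt - this.ctx.currentTime) * 1000);
    setTimeout(() => this.onSentenceStart(sentenceIndex), delayMs);
  }
}
```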

Tier 2: Multi-Track Production (DAW)

When users create stories with multiple characters, we bring in heavier machinery: a full Digital Audio Workstation repurposed from our StringClub platform, the product of 11 years of audio development.

This enables:

  • Multiple character voices on separate tracks
  • Dialogue timing like music notation — characters can talk over each other naturally
  • Sound effects layered on dedicated tracks
  • Music integration — songs generated via Mureka from lyrics in the script
  • Server-side mixdown to a single MP3

A scene can have three people conversing; one picks up a guitar, a song plays, and then the conversation resumes. The DAW treats dialogue timing like notes in a string quartet.
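As an illustration of the idea (not the DAW's actual data model), a scene can be represented as tracks of timed cues, so overlapping dialogue and a song on its own track fall out naturally before the server-side mixdown:

```typescript
// Hypothetical data shape for a multi-track scene: each track holds timed cues,
// so dialogue, effects, and music can overlap like parts in a score.
interface Cue {
  startSec: number;      // absolute position on the scene timeline
  durationSec: number;
  source: string;        // path or URL of the rendered audio clip
  gainDb?: number;       // optional per-cue level adjustment for the mixdown
}

interface Track {
  name: string;                         // e.g. a character, "SFX", or "Music"
  kind: "dialogue" | "sfx" | "music";
  cues: Cue[];
}

interface Scene {
  title: string;
  tracks: Track[];
}

// Two characters briefly overlapping, then a generated song on the music track.
const scene: Scene = {
  title: "Living Room, Evening",
  tracks: [
    { name: "Character A", kind: "dialogue", cues: [{ startSec: 0.0, durationSec: 4.2, source: "a-001.mp3" }] },
    { name: "Character B", kind: "dialogue", cues: [{ startSec: 3.6, durationSec: 2.8, source: "b-001.mp3" }] },
    { name: "Music", kind: "music", cues: [{ startSec: 8.0, durationSec: 90.0, source: "song.mp3", gainDb: -3 }] },
  ],
};
```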


The TTS Infrastructure

We evaluated several TTS models, each with trade-offs. Two made it to production:

Kokoro — Quality and Expressiveness

Our top pick for natural-sounding speech. Kokoro offers:

  • 54 voices with excellent multilingual support
  • Natural inflection that other models lack
  • The French speakers actually sound French when speaking English

For scenes requiring emotional range or accent authenticity, Kokoro delivers.

Microsoft SpeechT5 — Scale and Variety

When you need 20 distinct background characters, variety matters. We expanded SpeechT5 with voices from research datasets:

  • 357+ voices from VCTK, LibriTTS, and CMU datasets
  • Reliable and consistent output
  • Easy to fine-tune for new voices

A single scene might use voices from both models across multiple datasets. Our abstraction layer makes this seamless.
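As a hypothetical sketch of that layer (the interface and voice ids are invented for the example, not the platform's real API), a registry can map voice ids to whichever backend hosts them:

```typescript
// Voice abstraction sketch: callers ask for a voice by id; the registry routes
// the request to the TTS backend that hosts that voice.
interface TtsBackend {
  synthesize(text: string, voiceId: string): Promise<Uint8Array>; // encoded audio
}

class VoiceRegistry {
  private routes = new Map<string, TtsBackend>();

  register(voiceIds: string[], backend: TtsBackend) {
    for (const id of voiceIds) this.routes.set(id, backend);
  }

  synthesize(text: string, voiceId: string): Promise<Uint8Array> {
    const backend = this.routes.get(voiceId);
    if (!backend) throw new Error(`Unknown voice: ${voiceId}`);
    return backend.synthesize(text, voiceId);
  }
}

// A single scene can then mix voices from both model families transparently:
// registry.register(["kokoro/voice-1", "kokoro/voice-2"], kokoroBackend);
// registry.register(["speecht5/vctk-p234", "speecht5/libritts-1234"], speechT5Backend);
```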

Why Self-Hosted?

We considered services like ElevenLabs—the voice quality is excellent. But we chose self-hosted models on our NVIDIA CUDA Kubernetes cluster for several reasons:

  • Control — We manage scalability, not a third party
  • Cost — No per-character pricing at scale
  • Security — Sensitive content never leaves our infrastructure
  • Flexibility — Mix voices from different models in the same scene

The streaming pipeline runs through ffmpeg, which lets us apply EQ, control playback speed, and add effects while maintaining real-time delivery.
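As a rough sketch of that step (the filter values are placeholders, not our production settings, and the function is illustrative), a Node process can pipe TTS output through ffmpeg and forward processed chunks as soon as they appear, assuming the TTS model emits a container ffmpeg can autodetect, such as WAV:

```typescript
import { spawn } from "node:child_process";

// Streaming post-processing sketch: raw TTS audio goes in on stdin, ffmpeg applies
// EQ and a slight tempo change on the fly, and encoded chunks come out on stdout
// ready to be forwarded (for example over a WebSocket).
export function processTtsStream(
  ttsAudio: NodeJS.ReadableStream,
  onChunk: (chunk: Buffer) => void,
) {
  const ffmpeg = spawn("ffmpeg", [
    "-i", "pipe:0",                                       // read TTS audio from stdin
    "-af", "equalizer=f=3000:t=q:w=1:g=2,atempo=1.05",    // example EQ boost + speed-up
    "-f", "mp3",                                          // encode for the browser
    "pipe:1",                                             // write processed audio to stdout
  ]);

  ttsAudio.pipe(ffmpeg.stdin);
  ffmpeg.stdout.on("data", onChunk);   // each chunk can be delivered immediately
  ffmpeg.stderr.on("data", () => {});  // ffmpeg logs progress to stderr; ignore it here
}
```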


Caching and Performance

Generated audio is stored in a media library. Replays don’t regenerate—they retrieve cached audio. But when content changes, the cache invalidates automatically.

This matters for iterative workflows. A user editing their story hears the new version without manual cache management, but listening to the same chapter twice doesn’t burn GPU cycles.
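One straightforward way to get that behavior (an illustrative approach, not necessarily our exact mechanism) is a content-addressed cache: the key is a hash of everything that affects the rendered audio, so any edit produces a new key and stale audio is never served:

```typescript
import { createHash } from "node:crypto";

// Content-addressed cache key: derived from the text, the voice, and any render
// settings, so changed content automatically misses the cache.
function audioCacheKey(text: string, voiceId: string, settings: object): string {
  return createHash("sha256")
    .update(JSON.stringify({ text, voiceId, settings }))
    .digest("hex");
}

interface AudioCache {
  get(key: string): Promise<Buffer | null>;
  put(key: string, audio: Buffer): Promise<void>;
}

async function getOrRenderAudio(
  text: string,
  voiceId: string,
  settings: object,
  cache: AudioCache,
  render: () => Promise<Buffer>,   // calls the TTS pipeline on a cache miss
): Promise<Buffer> {
  const key = audioCacheKey(text, voiceId, settings);
  const cached = await cache.get(key);
  if (cached) return cached;       // replay: no GPU cycles spent
  const audio = await render();    // miss: synthesize once
  await cache.put(key, audio);
  return audio;
}
```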


A Real Example: The Awakening of Dillon Mercer

I’ve written a 122,000-word novel, The Awakening of Dillon Mercer. One chapter has the main character sing a song to his girlfriend Susan.

Using the aiAgent platform, I’m creating an audio version that’s more like a movie without the video:

  1. The app generates a screenplay from the prose (see the sketch after this list)
  2. Each character gets assigned a distinct voice
  3. The chapter renders with dialogue, narration, and the actual song embedded
  4. The listener hears the conversation before and after the song, and the song itself plays inside the scene
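The screenplay step is the hinge of that workflow. Purely as an illustration (the fields, voice ids, and lines below are invented for this example, not the actual generated format), the intermediate screenplay might look something like this:

```typescript
// Hypothetical screenplay shape: an ordered list of elements the renderer walks
// through, assigning each character a voice and routing song cues to the music track.
type ScreenplayElement =
  | { kind: "narration"; text: string }
  | { kind: "dialogue"; character: string; text: string }
  | { kind: "song"; title: string; lyricsRef: string };   // lyrics rendered via music generation

interface Screenplay {
  chapter: string;
  cast: Record<string, string>;    // character name -> voice id
  elements: ScreenplayElement[];
}

const chapter: Screenplay = {
  chapter: "The Song",
  cast: { Dillon: "kokoro/voice-1", Susan: "speecht5/vctk-p234", Narrator: "kokoro/voice-2" },
  elements: [
    { kind: "narration", text: "Dillon reached for the guitar by the couch." },
    { kind: "dialogue", character: "Susan", text: "You wrote something?" },
    { kind: "song", title: "Susan's Song", lyricsRef: "song-for-susan" },
    { kind: "dialogue", character: "Dillon", text: "Well? What did you think?" },
  ],
};
```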

This isn’t a reading of the book. It’s a production.


What’s Next

We’re actively developing:

  • Full-screen playback with support for multiple languages in the same piece of content
  • Multilingual scenes — imagine a French class where teacher and students speak both English and French naturally
  • Deeper sentence-level highlighting synchronized with audio playback

The Real Challenge

A comment on my recent LinkedIn post captured it well:

“Multi-voice synthesis is the step that makes AI content feel less robotic. The challenge isn’t the tech anymore—it’s making listeners genuinely care about the voice they’re hearing.”

400+ voices ready. Screenplay-style storytelling where every character sounds unique and recognizable. The infrastructure is in place. Now it’s about crafting experiences that connect.


Interested in adding voice capabilities to your AI application? Let’s talk about what’s possible.
