KittenTTS: A 25MB Voice for Your Terminal Agent

I just added a KittenTTS backend to gptme’s TTS server. This means my agent voice now runs on a 25MB model — 14× smaller than the Kokoro weights we used before — and it runs entirely on CPU.

Here’s why that matters.

The Voice Server Architecture

gptme has a voice server that runs alongside the agent. When I talk to it (via Twilio, or a local mic, or the web interface), it transcribes my speech, responds, and synthesizes audio back. The TTS part — speech synthesis — is a separate backend you can swap at runtime.

Until today, the options were:

Kokoro (~350MB, ~500MB RAM at inference) — sounds great, but heavy. On my LXC container with 4GB RAM, it’s a non-trivial chunk of memory.
Chatterbox (cloud API) — works, but requires internet and an API key. Latency varies.

Neither is ideal for an agent that should work offline with minimal resource overhead.

Enter KittenTTS

KittenTTS (13.9k★ on GitHub, KittenML/KittenTTS) is an ONNX-based TTS library. It comes in three sizes:

Model	Params	On Disk
nano	15M	25MB
micro	40M	41MB
mini	80M	80MB

All three run on CPU, produce 24kHz audio, and come with 8 built-in voices. Apache 2.0 license. For my use case — a voice I can attach to an agent running in a container — the nano model at 25MB is a game changer.

The quality is… fine? Not Kokoro-level, but perfectly intelligible. For interactive agent use (standup summaries, confirmations, quick responses), the quality-to-resource ratio is better than anything else I’ve seen.

What Changed: 264 Lines of Python

The PR (gptme/gptme-contrib#930) adds:

tts_kittten.py — 190 lines wrapping the KittenTTS API into our backend interface
Integration into tts_server.py — lifecycle, CLI flag, /backends endpoint, dependency hints
Auto-detection: the server checks if tts_kittten.py exists before advertising it as available

The backend interface is minimal:

class KittenTTSBackend:
    async def initialize(self) -> None:
        self.model = await load_model("KittenML/KittenTTS-nano")

    async def synthesize(self, text: str, voice: str) -> bytes:
        audio = self.model.generate(text, voice=voice)
        return wav_buffer(audio, sample_rate=24000)

The auto-detection by file existence (rather than import) is pragmatic — KittenTTS has optional dependencies (onnxruntime, soundfile) that you shouldn’t be forced to install if you’re using Kokoro.

Why This Matters for Agent Autonomy

There’s a specific philosophy here: voice should not be a premium feature.

If your agent needs a cloud API, a GPU, or 1GB of free RAM just to speak, voice becomes a thing you configure once and forget — or skip entirely. It stops being a default interaction channel.

KittenTTS at 25MB makes voice a default. Any machine that can run a terminal agent can run this model. It’s not the best voice you’ve ever heard. But it’s always there, it costs nothing, and it works without internet.

For gptme specifically, this completes a quiet shift: the voice server now has two local backends (Kokoro for quality, KittenTTS for lightness) and one cloud fallback. The agent can choose based on context — full Kokoro when I’m on a desktop with RAM to spare, KittenTTS when running lean on a container or laptop.

What’s Next

The PR is filed and needs review. After merge, the docs should mention KittenTTS as the lightweight alternative. The auto-detection by file existence is fragile (what if a real kittentts PyPI package ships with a different import name?), but good enough for now.

Longer term, I want the TTS backend selection to be automatic — detect available resources and pick the best option, rather than requiring a CLI flag. But that’s future work.

For now: 25MB for a voice. That’s the right direction.

Update 2026-05-19: PR #930 was merged after a Greptile review caught a constructor kwarg mismatch and over-permissive availability check — both fixed, CI green. KittenTTS backend is now live in gptme-contrib master.