June 6, 2026 · 4 min read

A Browser Voice Agent Built on Three Workers

Voxtral Mini for speech, Hermes-2-Pro-Llama-3-8B for planning, Kokoro for TTS — three models, three Web Workers, no server.

Voice Agent · WebGPU · Web Workers

Three AI models. No server. The voice agent on webml.ai runs Voxtral Mini for speech recognition, Hermes-2-Pro-Llama-3-8B for command planning, and Kokoro TTS for speech synthesis — all in the same browser tab, simultaneously.

The architecture that makes this possible is straightforward: each model runs in its own Web Worker. Voxtral gets one, the LLM planner gets one, Kokoro gets one. Each worker runs on its own thread, isolated from the page and from the other workers. Running inference inside them means model loading, WebGPU command submission, and token generation all happen off the main thread, so the page stays responsive.

Audio capture starts with an AudioWorklet rather than ScriptProcessorNode, which is deprecated. The worklet runs on the audio rendering thread and captures 16 kHz mono PCM. Every chunk is posted to the Voxtral Worker with its ArrayBuffer in the transfer list, so ownership of the memory moves to the worker and the samples are never copied or serialized. Voxtral processes each chunk as it arrives and streams partial transcriptions back to the main thread.
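The chunk handoff described above can be sketched roughly as follows. This is a minimal illustration, not the site's actual worklet code: `PortLike`, `ChunkAccumulator`, the message shape, and the chunk size are all hypothetical names and values chosen for the example. It assumes the worklet receives small blocks of samples and batches them before transfer.

```typescript
// Minimal stand-in for a MessagePort or Worker (hypothetical interface,
// so the sketch also runs outside a browser).
interface PortLike {
  postMessage(message: unknown, transfer: ArrayBuffer[]): void;
}

// Collects small sample blocks into fixed-size chunks and hands each
// finished chunk to the port with its buffer in the transfer list.
class ChunkAccumulator {
  private readonly port: PortLike;
  private readonly chunkSize: number;
  private chunk: Float32Array;
  private filled = 0;

  constructor(port: PortLike, chunkSize: number) {
    this.port = port;
    this.chunkSize = chunkSize;
    this.chunk = new Float32Array(chunkSize);
  }

  push(samples: Float32Array): void {
    let offset = 0;
    while (offset < samples.length) {
      const n = Math.min(samples.length - offset, this.chunkSize - this.filled);
      this.chunk.set(samples.subarray(offset, offset + n), this.filled);
      this.filled += n;
      offset += n;
      if (this.filled === this.chunkSize) {
        const buffer = this.chunk.buffer as ArrayBuffer;
        // Listing the buffer as transferable moves ownership instead of copying.
        this.port.postMessage({ type: "audio", buffer }, [buffer]);
        // A transferred buffer is detached on the sending side, so start a fresh one.
        this.chunk = new Float32Array(this.chunkSize);
        this.filled = 0;
      }
    }
  }
}
```

Inside a real AudioWorkletProcessor, `push` would be called from `process()` with the first input channel, and `port` would be the processor's own `this.port`.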

Command detection uses an idle timeout rather than a “stop speaking” button. When 1.8 seconds pass without a new transcription token, the current transcript is treated as a complete utterance and sent to the planning worker. That threshold is a tradeoff — shorter means faster response but more false triggers on natural speech pauses.
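A minimal sketch of that idle-timeout logic, assuming the caller supplies timestamps in milliseconds (for example from `performance.now()`). The class and method names are hypothetical; only the 1.8 second threshold comes from the description above.

```typescript
// Treats the transcript as a complete utterance once no token has
// arrived for `idleMs` milliseconds.
class IdleUtteranceDetector {
  private transcript = "";
  private lastTokenAt: number | null = null;
  private readonly idleMs: number;

  constructor(idleMs: number = 1800) {
    this.idleMs = idleMs;
  }

  // Call for every partial transcription token.
  addToken(token: string, nowMs: number): void {
    this.transcript += token;
    this.lastTokenAt = nowMs;
  }

  // Call periodically. Returns the finished utterance once the idle
  // window has elapsed, otherwise null.
  poll(nowMs: number): string | null {
    if (this.lastTokenAt === null) return null;
    if (nowMs - this.lastTokenAt < this.idleMs) return null;
    const text = this.transcript.trim();
    this.transcript = "";
    this.lastTokenAt = null;
    return text.length > 0 ? text : null;
  }
}
```

Passing time in explicitly keeps the detector easy to test. In the page it could be polled from a short interval timer, with any non-null result forwarded to the planning worker.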

The planning worker runs Hermes-2-Pro-Llama-3-8B via @mlc-ai/web-llm, loaded as Hermes-2-Pro-Llama-3-8B-q4f16_1-MLC. This model handles structured output reliably, which matters here. The agent needs to emit valid JSON matching a specific tool call schema on every response, because a hallucinated key or malformed object would make the tool call fail silently. Temperature is set to 0, max_tokens is capped at 180, and JSON Schema constraints enforce the output shape.
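As a rough sketch, the request for such a constrained call might be assembled like this. The temperature and token cap come from the text above. The schema, the system prompt, and the function name are hypothetical, and the `response_format` shape (a `json_object` type with the schema passed as a string) is my understanding of WebLLM's OpenAI-style API and should be checked against its documentation.

```typescript
// Hypothetical tool-call schema. The real one is not shown in the article.
const TOOL_CALL_SCHEMA = {
  type: "object",
  properties: {
    tool: { type: "string", enum: ["navigate", "search", "read", "scroll"] },
    args: { type: "object" },
  },
  required: ["tool", "args"],
  additionalProperties: false,
};

interface PlannerRequest {
  messages: { role: "system" | "user"; content: string }[];
  temperature: number;
  max_tokens: number;
  // Assumed shape for schema-constrained JSON output.
  response_format: { type: "json_object"; schema: string };
}

// Builds the request for one finished utterance.
function buildPlannerRequest(transcript: string): PlannerRequest {
  return {
    messages: [
      { role: "system", content: "Reply with one JSON tool call and nothing else." },
      { role: "user", content: transcript },
    ],
    temperature: 0, // deterministic decoding, as described above
    max_tokens: 180, // a tool call is short, so cap generation early
    response_format: { type: "json_object", schema: JSON.stringify(TOOL_CALL_SCHEMA) },
  };
}
```

The resulting object would then be handed to the engine's chat completion call inside the planning worker.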

The tools themselves are registered through @webmcp-js/core. Navigate, search, read, scroll — each tool has a typed input schema. The planner’s output is validated against that schema before execution. A model that decides to call a nonexistent tool or pass a wrong argument type gets caught at the validation layer, not at execution time.
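The validation step can be illustrated with a small hand-rolled checker. This is not the @webmcp-js/core API, which the article does not show. The registry, argument names, and result type below are all hypothetical stand-ins for whatever typed schemas the real tools declare.

```typescript
type ArgType = "string" | "number" | "boolean";
type ToolSchema = Record<string, ArgType>;

// Hypothetical registry: tool name -> expected argument types.
const TOOLS: Record<string, ToolSchema> = {
  navigate: { url: "string" },
  search: { query: "string" },
  read: { selector: "string" },
  scroll: { pixels: "number" },
};

type Validation =
  | { ok: true; tool: string; args: Record<string, unknown> }
  | { ok: false; error: string };

const has = (obj: object, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(obj, key);

// Checks raw planner output before anything is executed.
function validateToolCall(raw: string): Validation {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, error: "not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { ok: false, error: "not a JSON object" };
  }
  const { tool, args } = parsed as { tool?: unknown; args?: unknown };
  if (typeof tool !== "string" || !has(TOOLS, tool)) {
    return { ok: false, error: `unknown tool: ${String(tool)}` };
  }
  if (typeof args !== "object" || args === null || Array.isArray(args)) {
    return { ok: false, error: "args must be an object" };
  }
  const schema = TOOLS[tool];
  const given = args as Record<string, unknown>;
  for (const key of Object.keys(given)) {
    if (!has(schema, key)) return { ok: false, error: `unexpected argument: ${key}` };
  }
  for (const [key, type] of Object.entries(schema)) {
    if (typeof given[key] !== type) return { ok: false, error: `${key} must be a ${type}` };
  }
  return { ok: true, tool, args: given };
}
```

Returning a result object instead of throwing lets the caller decide what a rejected call means, for example asking the user to repeat the command.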

Kokoro TTS is the lightest piece: 82 million parameters, WASM-based, no WebGPU required. Text from the planner’s response goes to the Kokoro Worker, which synthesizes audio and streams it back for playback. The small parameter count means synthesis starts fast enough to begin playing before the full sentence is done.

The constraint you can’t route around: WebGPU. Voxtral Mini and Hermes both require it. That means desktop Chrome or Edge only, today. Firefox has a partial WebGPU implementation behind a flag, but WebLLM does not support it. Safari has WebGPU but lacks features these models require. Mobile is out entirely.
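A page can check for this up front before offering the agent. The sketch below assumes a standard check: `navigator.gpu` must exist and `requestAdapter()` must resolve to a non-null adapter. The `NavigatorLike` and `GpuLike` interfaces are minimal hypothetical stand-ins so the example does not depend on WebGPU type definitions, and it does not test for the specific features the models need.

```typescript
// Minimal stand-ins for the parts of Navigator and GPU used here.
interface GpuLike {
  requestAdapter(): Promise<object | null>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

// Resolves to true only when a WebGPU adapter is actually available.
async function hasWebGPU(nav: NavigatorLike): Promise<boolean> {
  // Browsers without WebGPU do not define navigator.gpu at all.
  if (!nav.gpu) return false;
  try {
    // The property can exist while no usable adapter does, in which
    // case requestAdapter() resolves to null.
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null && adapter !== undefined;
  } catch {
    return false;
  }
}
```

In the browser this would be called as `hasWebGPU(navigator as NavigatorLike)`, with the result deciding whether to load the models or show an unsupported-browser message.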

First load requires downloading both models. Voxtral Mini at q4f16 and Hermes-2-Pro at q4f16_1-MLC are several gigabytes combined. The Cache API stores them after first download, so subsequent visits are fast. But that first load needs user patience and a good connection.
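The cache-first pattern behind that behavior looks roughly like this. The model libraries manage their own caching, so this is only an illustration of the mechanism, not their code. `CacheLike` and `ResponseLike` are hypothetical stand-ins mirroring the `match` and `put` methods of the browser Cache API, and the function name is made up for the example.

```typescript
// Minimal stand-ins for the parts of Response and Cache used here.
interface ResponseLike {
  ok: boolean;
  clone(): ResponseLike;
}
interface CacheLike {
  match(url: string): Promise<ResponseLike | undefined>;
  put(url: string, response: ResponseLike): Promise<void>;
}

// Returns the cached file when present, otherwise downloads and stores it.
async function fetchModelFile(
  url: string,
  cache: CacheLike,
  fetchFn: (url: string) => Promise<ResponseLike>,
): Promise<ResponseLike> {
  const hit = await cache.match(url);
  if (hit) return hit; // repeat visit: no network request

  const response = await fetchFn(url);
  if (!response.ok) throw new Error(`download failed: ${url}`);

  // A response body can be read only once, so store a clone and
  // return the original to the caller.
  await cache.put(url, response.clone());
  return response;
}
```

In a browser the cache argument would come from `caches.open(...)` and `fetchFn` would be the global `fetch`.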

Model choice at each layer is independent. Swap Voxtral for Xenova/whisper-base via Transformers.js if you need a WASM fallback for speech. Swap Hermes for Llama-3.2-3B-Instruct-q4f32_1-MLC if you want a smaller planner. Kokoro is already the smallest viable TTS option running in the browser, so that layer is harder to shrink.

The three-worker design is the part worth reusing regardless of which models you pick. One worker per model, transferable buffers for audio, validated structured output for tool calls.