Engineering · 8 min read

How We Designed the Jinn Wake Word System

Inside the Jinn HoloBox wake word pipeline: why we chose openWakeWord, how we tuned for <5% false reject rates, and what it takes to listen 24/7 on ARM.

The Jinn HoloBox uses openWakeWord, an open-source framework, to listen for "Hey Jinn" entirely on-device — no audio leaves the hardware until the wake phrase is detected. We chose it over commercial alternatives because it is Apache 2.0 licensed, runs at under 5% CPU on our Cortex-A55, and delivers a false acceptance rate below 0.5 per hour with threshold tuning — meeting our targets for a device that listens 24/7.

Why does wake word detection matter so much?

Wake word detection is the first interaction a user has with a voice assistant. If it fails — either by not recognizing a legitimate command (false reject) or by activating when no one said the wake phrase (false accept) — the entire product feels broken.

Consider the math of always-on listening. A device that is active 16 hours per day processes 57,600 seconds of audio daily. At a false acceptance rate of even 1 per hour, that is 16 spurious activations per day, enough to make users unplug the device. Conversely, a false reject rate above 10% means the user has to repeat every tenth command, which destroys the conversational flow.
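The arithmetic above is easy to sanity-check in a few lines (the 16-hour duty cycle and per-hour rates are this section's figures; the helper names are ours):

```python
HOURS_ACTIVE = 16                       # device listening 16 h/day
SECONDS_PER_DAY = HOURS_ACTIVE * 3600   # 57,600 s of audio processed daily

def spurious_activations_per_day(far_per_hour: float) -> float:
    """Expected false activations per day at a given false-accept rate."""
    return far_per_hour * HOURS_ACTIVE

def expected_repeats(frr: float, commands_per_day: int) -> float:
    """Commands the user must repeat, given a false-reject rate."""
    return frr * commands_per_day
```

At 1 false accept per hour, `spurious_activations_per_day(1.0)` gives 16 per day; at our 0.5/hour target it drops to 8.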

Our targets:

False reject rate: <5% (miss no more than 1 in 20 legitimate commands)
False acceptance rate: <0.5 per hour (fewer than 8 spurious activations in a 16-hour day)
Detection latency: <200 ms from end of wake phrase to system response
CPU usage: <10% of a single core (to leave headroom for other tasks)

How did we evaluate wake word engines?

We tested three engines head-to-head on the RK3566 hardware:

| Feature | openWakeWord | Picovoice Porcupine | Snowboy |
| --- | --- | --- | --- |
| License | Apache 2.0 (fully open) | Free for ≤3 users; commercial from $6,000 | Apache 2.0 (unmaintained) |
| Last updated | 2025 (active) | 2026 (active) | 2020 (archived) |
| Custom wake word | Yes (synthetic data training) | Yes (console + fine-tuning) | Yes (user recordings) |
| Model size | ~1.3 MB (ONNX) | ~2 MB (proprietary) | ~3 MB |
| CPU usage (RK3566) | 3-5% single core | 2-3% single core | 8-12% single core |
| False reject rate (our testing) | 4.2% at tuned threshold | 3.1% at default | 11.8% at default |
| False accept rate (our testing) | 0.3/hour at tuned threshold | 0.2/hour at default | 1.4/hour at default |
| Platform support | Python, C (via ONNX) | Python, C, Java, JS, Go, Swift | Python, C++ |
| Noise robustness | Good (trained on diverse audio) | Excellent (proprietary noise augmentation) | Fair |

Why not Porcupine?

Porcupine had the best raw accuracy in our testing. Its false reject rate of 3.1% and false acceptance rate of 0.2/hour were slightly better than openWakeWord. So why did we not use it?

Licensing. Picovoice's free tier is limited to projects with no more than three active users. For a consumer product shipping thousands of units, commercial licensing starts at $6,000 and scales per device. For an open-source hardware project targeting a $299 price point, a per-device licensing fee on the wake word engine directly conflicts with our goal of keeping the software stack free.

Additionally, Porcupine's models are proprietary binary blobs. We cannot inspect, modify, or audit them — a problem for a device that is always listening in people's homes. With openWakeWord, every layer of the model is inspectable.

Why not Snowboy?

Snowboy was a popular open-source option, but it has been unmaintained since Kitt.ai was acquired by Baidu in 2020. In our testing, its false reject rate of 11.8% was more than double our target. According to Picovoice's open-source wake word benchmark, Porcupine achieves 11x better accuracy and 6.5x faster inference than Snowboy on equivalent hardware. Snowboy is no longer a viable choice for production use.

How does the openWakeWord pipeline work?

The detection pipeline processes audio in four stages:

1. Audio capture and preprocessing

Audio comes from our dual-MEMS PDM microphone array through ALSA at 16 kHz, 16-bit mono. Before reaching the wake word model, the signal passes through:

Acoustic echo cancellation (AEC): Removes the device's own speaker output from the microphone signal using speexdsp. Without this, the wake word engine would trigger on TTS playback.
Automatic gain control (AGC): Normalizes volume levels so the model sees consistent input regardless of whether the user is 2 feet or 10 feet away.
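As an illustration of what the AGC stage does, here is a minimal per-frame RMS normalizer in NumPy. This is a sketch, not our production code: real AGC (such as the speexdsp implementation) adapts gain smoothly over time rather than per frame, and the target level and gain cap below are arbitrary values chosen for the example.

```python
import numpy as np

def apply_agc(frame: np.ndarray, target_rms: float = 0.1, max_gain: float = 10.0) -> np.ndarray:
    """Scale a float audio frame toward a target RMS level, capping the gain
    so near-silent frames are not amplified into noise."""
    rms = float(np.sqrt(np.mean(frame ** 2)))
    if rms < 1e-6:                      # treat near-silence as-is
        return frame
    gain = min(target_rms / rms, max_gain)
    return frame * gain
```

The gain cap matters: without it, the quiet noise floor between utterances would be boosted to full scale, feeding the wake word model pure amplified hiss.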

2. Feature extraction

openWakeWord converts raw audio into mel spectrograms, a frequency-domain representation that mirrors human auditory perception. The mel spectrogram computation uses an ONNX implementation of torchaudio's melspectrogram transform with fixed parameters (80 mel bands, 25 ms window, 10 ms hop).

Each inference processes an 80 ms audio frame, producing a feature vector that captures the spectral characteristics of that moment in time.
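The fixed parameters translate into sample counts as follows. This is back-of-envelope bookkeeping only; openWakeWord's streaming implementation carries context across chunks, so its effective frame accounting differs from an isolated chunk.

```python
SAMPLE_RATE = 16_000                 # Hz, 16-bit mono per the capture stage

def ms_to_samples(ms: int) -> int:
    return SAMPLE_RATE * ms // 1000

WINDOW = ms_to_samples(25)           # 400 samples per analysis window
HOP = ms_to_samples(10)              # 160 samples between window starts
CHUNK = ms_to_samples(80)            # 1,280 samples handed to each inference

# Full analysis windows that fit inside one isolated 80 ms chunk:
frames_per_chunk = 1 + (CHUNK - WINDOW) // HOP
```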

3. Neural network inference

The core model is a small neural network (~1.3 MB) that takes a sliding window of mel spectrogram frames and outputs a probability that the wake phrase was spoken. The model architecture uses a combination of convolutional and recurrent layers optimized for streaming audio — it processes frames sequentially without needing to buffer the entire utterance.

On the RK3566, inference takes approximately 2-3 ms per frame using ONNX Runtime on the CPU. We evaluated running on the 0.8 TOPS NPU via Rockchip's RKNN SDK, but the CPU path was already fast enough that the added complexity of NPU integration was not justified.

4. Threshold and smoothing

The raw model output is a probability between 0 and 1. We apply a tuned threshold (calibrated per model) and temporal smoothing to convert this into a binary detection decision.
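A minimal sketch of this stage might look like the following. It is illustrative only: the smoothing window and refractory length are invented for the example, and openWakeWord's own prediction API handles scoring internally; only the 0.72 threshold is our tuned value.

```python
from collections import deque

class WakeDetector:
    """Convert raw per-frame probabilities into a debounced binary decision
    via a moving average and a refractory period after each trigger."""

    def __init__(self, threshold: float = 0.72, window: int = 3, refractory: int = 10):
        self.threshold = threshold
        self.scores = deque(maxlen=window)   # recent raw probabilities
        self.refractory = refractory         # frames to ignore after a hit
        self.cooldown = 0

    def update(self, score: float) -> bool:
        self.scores.append(score)
        if self.cooldown > 0:                # suppress re-triggering
            self.cooldown -= 1
            return False
        smoothed = sum(self.scores) / len(self.scores)
        if smoothed >= self.threshold:
            self.cooldown = self.refractory
            self.scores.clear()
            return True
        return False
```

Smoothing over a few frames rejects single-frame probability spikes from impulsive noise, at the cost of a few tens of milliseconds of added latency.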

Threshold tuning is the most important step for production quality. openWakeWord's documentation targets <5% false reject rates and <0.5/hour false accept rates with appropriate threshold tuning. We spent two weeks tuning thresholds using a test corpus of:

500 positive samples ("Hey Jinn" spoken by 25 different speakers, various distances and noise levels)
200 hours of negative audio (TV shows, music, household conversations, kitchen noise)

The final threshold was set at 0.72 for our "Hey Jinn" model — high enough to reject most environmental noise, low enough to catch natural variations in how people say the phrase.
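The tuning loop itself reduces to a sweep over candidate thresholds against the corpus scores. A simplified sketch, assuming one max score per positive clip and one score per candidate event in the negative audio (our actual tooling is more involved):

```python
def tune_threshold(pos_scores, neg_scores, neg_hours, max_far_per_hour=0.5):
    """Return (threshold, frr, far) for the lowest threshold meeting the
    false-accept budget, or None if no candidate qualifies."""
    for t in (i / 100 for i in range(50, 100)):     # sweep 0.50..0.99 ascending
        far = sum(s >= t for s in neg_scores) / neg_hours
        if far <= max_far_per_hour:
            frr = sum(s < t for s in pos_scores) / len(pos_scores)
            return (t, frr, far)                    # first pass = lowest threshold
    return None
```

Picking the lowest threshold that satisfies the false-accept budget minimizes the false reject rate, since raising the threshold only ever converts detections into misses.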

How did we train a custom "Hey Jinn" model?

openWakeWord supports training custom wake word models using synthetic speech data. The process:

1. Generate synthetic utterances: Using text-to-speech engines (Google TTS, Azure TTS, and others), we generated approximately 5,000 synthetic recordings of "Hey Jinn" with varying speaker characteristics, accents, speeds, and emphasis patterns.
2. Collect negative samples: We assembled a negative dataset from publicly available audio corpora — LibriSpeech, Common Voice, and AudioSet — representing the range of non-wake-word audio the model will encounter in daily use.
3. Data augmentation: Each synthetic sample was augmented with room impulse responses (simulating different room acoustics), background noise at various SNR levels, and pitch/speed variations. This expanded our effective training set to ~50,000 samples.
4. Model training: The model was trained using openWakeWord's training pipeline, which handles architecture selection, hyperparameter optimization, and validation against held-out test sets.
5. On-device validation: The trained model was tested on the actual RK3566 hardware with real microphones in realistic environments (kitchen, living room, bedroom with fan noise).

The entire training pipeline runs on a standard development machine with a GPU — no specialized hardware needed. Training a new wake word model takes approximately 4-6 hours.
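The SNR-controlled noise mixing in step 3 is worth making concrete. A minimal NumPy version (a sketch: our pipeline also applies room impulse responses and pitch/speed shifts, which are not shown here):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise scaled so the mixture has the requested signal-to-noise ratio.
    Solves SNR_dB = 10*log10(P_speech / (g^2 * P_noise)) for the noise gain g."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise
```

Sweeping `snr_db` from clean (20 dB) down to harsh (0 dB) during augmentation is what teaches the model to fire on "Hey Jinn" over a running dishwasher.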

What are the hardest real-world challenges?

TV and smart speaker interference

The most common false acceptance trigger is not random noise — it is human speech from television. TV dialogue contains a much wider range of phonemes and speech patterns than environmental noise, and occasionally a character will say something that sounds vaguely like "Hey Jinn."

We mitigate this with the AEC pipeline (which removes audio playing from the HoloBox's own speaker) and by training the model on negative samples that include TV and podcast audio. For external audio sources (a TV across the room), the model relies on the spectral differences between live speech directed at the device and broadcast audio arriving from a distance.

Cocktail party problem

When multiple people are talking simultaneously, the wake word engine must detect "Hey Jinn" spoken by one person through the voices of others. This is fundamentally hard with a dual-mic setup — beamforming with two microphones provides only limited spatial filtering.

Our current approach: we tune the model to be slightly more sensitive (lower threshold) in detected multi-speaker environments, accepting a marginally higher false acceptance rate in exchange for fewer missed detections. A three-microphone array in a future hardware revision would significantly improve this.

Accents and speech patterns

"Hey Jinn" is phonetically simple, but speakers vary enormously. Some pronounce "Jinn" with a hard J, others soften it. Some pause between "Hey" and "Jinn," others run them together. Children's voices have fundamentally different spectral characteristics than adult voices.

Our synthetic training data covers many of these variations, but we continue to collect anonymized (opt-in) detection metrics from beta testers to identify weak spots. The model has been retrained twice since initial deployment based on this feedback.

How do we measure production performance?

We track three metrics in production (with user consent):

Detection rate: Percentage of intended activations that are detected (target: >95%)
False activations per day: Tracked via a lightweight counter that increments each time the wake word pipeline activates without subsequent speech input (target: <8 per 16-hour day)
Detection latency: Time from end of wake phrase audio to system acknowledgment (target: <200 ms, measured: 80-120 ms typically)

These metrics are aggregated and anonymized — we never record or transmit raw audio. The detection latency of 80-120 ms is well within our 200 ms target, leaving comfortable margin for the audio pipeline to hand off to the speech-to-text service.
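The false-activation counter can be sketched as a tiny state machine. The class name, injectable clock, and 5-second confirmation window below are invented for illustration; the production counter presumably hooks into the voice-activity stage downstream of the wake word pipeline.

```python
class FalseActivationCounter:
    """Count wake activations that are not followed by speech input
    within a confirmation window."""

    def __init__(self, timeout_s: float = 5.0):
        self.timeout_s = timeout_s
        self.pending = None            # timestamp of an unconfirmed activation
        self.false_activations = 0

    def on_wake(self, now: float):
        self._expire(now)
        self.pending = now

    def on_speech(self, now: float):
        if self.pending is not None and now - self.pending <= self.timeout_s:
            self.pending = None        # activation confirmed as legitimate

    def count(self, now: float) -> int:
        self._expire(now)
        return self.false_activations

    def _expire(self, now: float):
        if self.pending is not None and now - self.pending > self.timeout_s:
            self.false_activations += 1
            self.pending = None
```

Because only a counter (never audio) leaves the device, this metric is cheap to aggregate across the fleet while staying within the privacy constraints above.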

Key takeaways

1. openWakeWord delivers <5% false reject rate and <0.5/hour false acceptance rate with proper threshold tuning — competitive with commercial solutions at zero licensing cost.
2. On the RK3566 Cortex-A55, wake word inference takes 2-3 ms per 80 ms audio frame, using only 3-5% of a single CPU core — lightweight enough to run 24/7 without impact on other workloads.
3. Acoustic echo cancellation is not optional for a device with a built-in speaker — without it, the wake word engine triggers on the device's own TTS output.
4. Threshold tuning against a representative test corpus (500+ positive samples, 200+ hours of negative audio) is the single highest-impact step for production wake word quality.
5. Synthetic speech training data works well for custom wake words, but real-world edge cases (TV interference, accents, multi-speaker environments) require ongoing model refinement based on production telemetry.

Want an AI agent on your counter?

Jinn HoloBox is available for pre-order at $299 ($150 off retail).

Pre-Order Now