How We Designed the Jinn Wake Word System
Inside the Jinn HoloBox wake word pipeline: why we chose openWakeWord, how we tuned for <5% false reject rates, and what it takes to listen 24/7 on ARM.
The Jinn HoloBox uses openWakeWord, an open-source framework, to listen for "Hey Jinn" entirely on-device — no audio leaves the hardware until the wake phrase is detected. We chose it over commercial alternatives because it is Apache 2.0 licensed, runs at under 5% CPU on our Cortex-A55, and delivers a false acceptance rate below 0.5 per hour with threshold tuning — meeting our targets for a device that listens 24/7.
Why does wake word detection matter so much?
Wake word detection is the first interaction a user has with a voice assistant. If it fails — either by not recognizing a legitimate command (false reject) or by activating when no one said the wake phrase (false accept) — the entire product feels broken.
Consider the math of always-on listening. A device that is active 16 hours per day processes roughly 57,600 seconds of audio daily. At a false acceptance rate of even 1 per hour, that is 16 spurious activations per day — enough to make users unplug the device. Conversely, a false reject rate above 10% means the user has to repeat themselves every tenth command, which destroys the conversational flow.
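The arithmetic above can be sketched directly; the 16-hour active day and the per-hour false accept rates are the figures from this article:

```python
# Back-of-envelope check on the always-on numbers above.
ACTIVE_HOURS_PER_DAY = 16

def seconds_processed_per_day(active_hours: float = ACTIVE_HOURS_PER_DAY) -> int:
    """Audio seconds the engine must classify each day."""
    return int(active_hours * 3600)

def spurious_activations_per_day(false_accepts_per_hour: float,
                                 active_hours: float = ACTIVE_HOURS_PER_DAY) -> float:
    """Expected false wakes per day at a given false-accept rate."""
    return false_accepts_per_hour * active_hours

print(seconds_processed_per_day())        # 57600
print(spurious_activations_per_day(1.0))  # 16.0 at 1/hour -- unusable
print(spurious_activations_per_day(0.3))  # 4.8 at the tuned 0.3/hour
```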
Our targets:
- False reject rate below 5% at the tuned threshold
- False accept rate below 0.5 per hour
- Detection latency under 200 ms
How did we evaluate wake word engines?
We tested three engines head-to-head on the RK3566 hardware:
| Feature | openWakeWord | Picovoice Porcupine | Snowboy |
|---|---|---|---|
| License | Apache 2.0 (fully open) | Free for ≤3 users; commercial from $6,000 | Apache 2.0 (unmaintained) |
| Last updated | 2025 (active) | 2026 (active) | 2020 (archived) |
| Custom wake word | Yes (synthetic data training) | Yes (console + fine-tuning) | Yes (user recordings) |
| Model size | ~1.3 MB (ONNX) | ~2 MB (proprietary) | ~3 MB |
| CPU usage (RK3566) | 3-5% single core | 2-3% single core | 8-12% single core |
| False reject rate (our testing) | 4.2% at tuned threshold | 3.1% at default | 11.8% at default |
| False accept rate (our testing) | 0.3/hour at tuned threshold | 0.2/hour at default | 1.4/hour at default |
| Platform support | Python, C (via ONNX) | Python, C, Java, JS, Go, Swift | Python, C++ |
| Noise robustness | Good (trained on diverse audio) | Excellent (proprietary noise augmentation) | Fair |
Why not Porcupine?
Porcupine had the best raw accuracy in our testing. Its false reject rate of 3.1% and false acceptance rate of 0.2/hour were slightly better than openWakeWord. So why did we not use it?
Licensing. Picovoice's free tier is limited to projects with no more than three active users. For a consumer product shipping thousands of units, commercial licensing starts at $6,000 and scales per device. For an open-source hardware project targeting a $299 price point, a per-device licensing fee on the wake word engine directly conflicts with our goal of keeping the software stack free.
Additionally, Porcupine's models are proprietary binary blobs. We cannot inspect, modify, or audit them — a problem for a device that is always listening in people's homes. With openWakeWord, every layer of the model is inspectable.
Why not Snowboy?
Snowboy was a popular open-source option, but it has been unmaintained since Kitt.ai was acquired by Baidu in 2020. In our testing, its false reject rate of 11.8% was more than double our target. According to Picovoice's open-source wake word benchmark, Porcupine achieves 11x better accuracy and 6.5x faster inference than Snowboy on equivalent hardware. Snowboy is no longer a viable choice for production use.
How does the openWakeWord pipeline work?
The detection pipeline processes audio in four stages:
1. Audio capture and preprocessing
Audio comes from our dual-MEMS PDM microphone array through ALSA at 16 kHz, 16-bit mono. Before reaching the wake word model, the signal passes through:
- Acoustic echo cancellation (AEC), which removes audio played from the HoloBox's own speaker
- The limited beamforming available from the dual-microphone array
2. Feature extraction
openWakeWord converts raw audio into mel spectrograms — a frequency-domain representation that mirrors human auditory perception. The mel spectrogram computation uses an ONNX port of torchaudio's melspectrogram transform with fixed parameters (80 mel bands, 25 ms window, 10 ms hop).
Each inference processes an 80 ms audio frame, producing a feature vector that captures the spectral characteristics of that moment in time.
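The buffering math behind those parameters is easy to verify. The mel computation itself lives inside openWakeWord's ONNX graph; this sketch only shows how an 80 ms chunk decomposes into analysis windows at the stated 16 kHz / 25 ms / 10 ms settings:

```python
import numpy as np

SAMPLE_RATE = 16_000      # Hz, as captured from the mic array
WIN_MS, HOP_MS = 25, 10   # mel spectrogram window and hop from the article
CHUNK_MS = 80             # audio delivered to the model per inference

win = SAMPLE_RATE * WIN_MS // 1000      # 400 samples
hop = SAMPLE_RATE * HOP_MS // 1000      # 160 samples
chunk = SAMPLE_RATE * CHUNK_MS // 1000  # 1280 samples

def frame_signal(x: np.ndarray, win: int, hop: int) -> np.ndarray:
    """Slice a 1-D signal into overlapping analysis frames (one per row)."""
    n_frames = 1 + (len(x) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

# One 80 ms chunk yields floor((1280 - 400) / 160) + 1 = 6 full windows;
# streaming implementations carry leftover samples into the next chunk.
frames = frame_signal(np.zeros(chunk, dtype=np.int16), win, hop)
print(frames.shape)  # (6, 400)
```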
3. Neural network inference
The core model is a small neural network (~1.3 MB) that takes a sliding window of mel spectrogram frames and outputs a probability that the wake phrase was spoken. The model architecture uses a combination of convolutional and recurrent layers optimized for streaming audio — it processes frames sequentially without needing to buffer the entire utterance.
On the RK3566, inference takes approximately 2-3 ms per frame using ONNX Runtime on the CPU. We evaluated running on the 0.8 TOPS NPU via Rockchip's RKNN SDK, but the CPU path was already fast enough that the added complexity of NPU integration was not justified.
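Those timings explain the CPU figures in the comparison table: each inference must only finish before the next 80 ms chunk arrives, so the duty cycle bounds single-core usage. A rough model (ignoring feature extraction and audio I/O overhead, which push the real number toward the measured 3-5%):

```python
FRAME_MS = 80  # audio consumed per inference

def cpu_duty_cycle(inference_ms: float, frame_ms: float = FRAME_MS) -> float:
    """Fraction of one core spent on wake word inference alone."""
    return inference_ms / frame_ms

print(cpu_duty_cycle(3.0))  # 0.0375 -> just under 4% of a core at 3 ms/frame
```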
4. Threshold and smoothing
The raw model output is a probability between 0 and 1. We apply a tuned threshold (calibrated per model) and temporal smoothing to convert this into a binary detection decision.
Threshold tuning is the most important step for production quality. openWakeWord's documentation targets <5% false reject rates and <0.5/hour false accept rates with appropriate threshold tuning. We spent two weeks tuning thresholds against a test corpus of recorded wake-phrase utterances and many hours of negative audio.
The final threshold was set at 0.72 for our "Hey Jinn" model — high enough to reject most environmental noise, low enough to catch natural variations in how people say the phrase.
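A minimal sketch of this stage follows. The 0.72 threshold is the article's tuned value, but the moving-average window and refractory period here are illustrative assumptions, not the shipped smoothing logic:

```python
from collections import deque

class WakeDetector:
    """Turn per-frame probabilities into debounced wake decisions.

    Illustrative smoothing: fire when the moving average of the last
    `window` scores crosses `threshold`, then hold off for `refractory`
    frames so one utterance cannot trigger twice.
    """

    def __init__(self, threshold: float = 0.72, window: int = 3,
                 refractory: int = 25):
        self.threshold = threshold
        self.scores = deque(maxlen=window)
        self.refractory = refractory
        self.holdoff = 0

    def update(self, score: float) -> bool:
        self.scores.append(score)
        if self.holdoff > 0:
            self.holdoff -= 1
            return False
        if sum(self.scores) / len(self.scores) >= self.threshold:
            self.holdoff = self.refractory
            self.scores.clear()
            return True
        return False

det = WakeDetector()
stream = [0.1, 0.2, 0.9, 0.95, 0.9, 0.9, 0.1]
print([det.update(s) for s in stream])
# [False, False, False, False, True, False, False] -- one debounced detection
```

A brief score spike (a single noisy frame) never fires, because the three-frame average stays below threshold; a sustained spike fires exactly once.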
How did we train a custom "Hey Jinn" model?
openWakeWord supports training custom wake word models using synthetic speech data. The process:
1. Generate thousands of synthetic "Hey Jinn" utterances with text-to-speech models, varying voice, speaking rate, and pitch
2. Augment the clips with background noise and room impulse responses to simulate real acoustic conditions
3. Combine them with a large corpus of negative audio — speech and noise that does not contain the phrase
4. Train a small classifier on top of openWakeWord's pretrained feature embeddings
The entire training pipeline runs on a standard development machine with a GPU — no specialized hardware needed. Training a new wake word model takes approximately 4-6 hours.
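The noise-augmentation step can be sketched as a simple SNR-controlled mix. This is a generic illustration, not openWakeWord's augmentation code; the function name and SNR value are assumptions:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a clean synthetic utterance with background noise at a target SNR.

    Both inputs are float arrays at the same sample rate; noise is tiled or
    truncated to the speech length before scaling.
    """
    noise = np.resize(noise, speech.shape)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale noise so 10*log10(speech_power / scaled_noise_power) == snr_db
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # stand-in utterance
noisy = mix_at_snr(clean, rng.standard_normal(8000), snr_db=10.0)

# Verify the achieved SNR matches the request
achieved = 10 * np.log10(np.mean(clean**2) / np.mean((noisy - clean)**2))
print(round(achieved, 1))  # 10.0
```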
What are the hardest real-world challenges?
TV and smart speaker interference
The most common false acceptance trigger is not random noise — it is human speech from television. TV dialogue contains a much wider range of phonemes and speech patterns than environmental noise, and occasionally a character will say something that sounds vaguely like "Hey Jinn."
We mitigate this with the AEC pipeline (which removes audio playing from the HoloBox's own speaker) and by training the model on negative samples that include TV and podcast audio. For external audio sources (a TV across the room), the model relies on the spectral differences between live speech directed at the device and broadcast audio arriving from a distance.
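The core idea of AEC is an adaptive filter that estimates the speaker-to-microphone echo path and subtracts it. The sketch below is a toy normalized-LMS canceller, not our production pipeline, which adds delay estimation, double-talk detection, and a nonlinear post-filter:

```python
import numpy as np

def nlms_echo_cancel(mic: np.ndarray, ref: np.ndarray,
                     taps: int = 128, mu: float = 0.5) -> np.ndarray:
    """Minimal NLMS echo canceller: subtract an adaptive estimate of the
    played-back reference signal (ref) from the microphone signal (mic)."""
    w = np.zeros(taps)        # adaptive FIR estimate of the echo path
    buf = np.zeros(taps)      # most recent reference samples
    out = np.zeros_like(mic)
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n]
        echo_est = w @ buf
        e = mic[n] - echo_est                    # residual after cancellation
        out[n] = e
        w += mu * e * buf / (buf @ buf + 1e-8)   # normalized LMS update
    return out

# Synthetic check: mic is a delayed, attenuated copy of ref (pure echo).
rng = np.random.default_rng(1)
ref = rng.standard_normal(4000)
mic = 0.6 * np.concatenate([np.zeros(10), ref[:-10]])
residual = nlms_echo_cancel(mic, ref)
# After convergence the residual power is far below the echo power.
print(np.mean(residual[2000:] ** 2) < 0.01 * np.mean(mic ** 2))  # True
```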
Cocktail party problem
When multiple people are talking simultaneously, the wake word engine must detect "Hey Jinn" spoken by one person through the voices of others. This is fundamentally hard with a dual-mic setup — beamforming with two microphones provides only limited spatial filtering.
Our current approach: we tune the model to be slightly more sensitive (lower threshold) in detected multi-speaker environments, accepting a marginally higher false acceptance rate in exchange for fewer missed detections. A three-microphone array in a future hardware revision would significantly improve this.
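As a sketch, the adaptation is just a context-dependent threshold. The base value 0.72 is from this article; the 0.05 offset and the upstream multi-speaker detector are hypothetical:

```python
def effective_threshold(base: float = 0.72, multi_speaker: bool = False,
                        offset: float = 0.05) -> float:
    """Lower the wake threshold slightly when overlapping speech is detected,
    trading a marginally higher false accept rate for fewer missed wakes."""
    return base - offset if multi_speaker else base

print(effective_threshold())                    # quiet room: base threshold
print(effective_threshold(multi_speaker=True))  # cocktail party: more sensitive
```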
Accents and speech patterns
"Hey Jinn" is phonetically simple, but speakers vary enormously. Some pronounce "Jinn" with a hard J, others soften it. Some pause between "Hey" and "Jinn," others run them together. Children's voices have fundamentally different spectral characteristics than adult voices.
Our synthetic training data covers many of these variations, but we continue to collect anonymized (opt-in) detection metrics from beta testers to identify weak spots. The model has been retrained twice since initial deployment based on this feedback.
How do we measure production performance?
We track three metrics in production (with user consent): detection latency, false accept rate, and false reject rate.
These metrics are aggregated and anonymized — we never record or transmit raw audio. The detection latency of 80-120 ms is well within our 200 ms target, leaving comfortable margin for the audio pipeline to hand off to the speech-to-text service.
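A sketch of what that telemetry layer looks like: only scalar counters and latency statistics are retained, never audio. All names here are hypothetical, not our production schema:

```python
from dataclasses import dataclass, field
from statistics import median

@dataclass
class WakeMetrics:
    """Aggregate detection telemetry without retaining any audio."""
    activations: int = 0
    user_cancellations: int = 0        # proxy signal for false accepts
    latencies_ms: list = field(default_factory=list)

    def record(self, latency_ms: float, cancelled: bool = False) -> None:
        self.activations += 1
        self.user_cancellations += cancelled
        self.latencies_ms.append(latency_ms)

    def summary(self) -> dict:
        """Scalar summary suitable for anonymized upload."""
        return {
            "activations": self.activations,
            "cancel_rate": self.user_cancellations / max(self.activations, 1),
            "median_latency_ms": median(self.latencies_ms) if self.latencies_ms else None,
        }

m = WakeMetrics()
for lat in (85, 95, 110):
    m.record(lat)
m.record(120, cancelled=True)
print(m.summary())
```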
Key takeaways
- openWakeWord meets our accuracy and CPU targets while staying fully open and inspectable — which matters for a device that listens 24/7 in people's homes
- Porcupine is slightly more accurate, but per-device licensing and proprietary model blobs ruled it out for a $299 open-source product
- Threshold tuning, not engine choice, was the biggest lever for production quality
- All detection runs on-device; no audio leaves the HoloBox until the wake phrase is heard
Want an AI agent on your counter?
Jinn HoloBox is available for pre-order at $299 ($150 off retail).
Pre-Order Now