Running AI on ARM: Performance Lessons from the RK3566
What we learned about running wake word detection, LLM inference, and WebGL rendering on a quad-core Cortex-A55 with 0.8 TOPS NPU.
The RK3566's quad-core Cortex-A55 at 1.8 GHz with a 0.8 TOPS NPU can run on-device wake word detection at under 10% CPU load and process audio in real time — but it cannot run LLM inference locally. The key to AI on low-power ARM is being ruthless about what runs on-device versus in the cloud, and optimizing the boundary between them for perceived latency.
What AI workloads does the HoloBox actually run on-device?
Not everything labeled "AI" needs a GPU cluster. The HoloBox runs three AI-adjacent workloads locally:

- Wake word detection (openWakeWord, running on the CPU)
- Voice activity detection (VAD), to decide when the user has started and finished speaking
- Acoustic echo cancellation (AEC), so the device can listen while its own speaker is playing
Everything else — LLM inference, speech-to-text, text-to-speech — runs in the cloud via the user's own API keys. This is a deliberate architectural choice, not a compromise.
How does the Cortex-A55 perform for audio AI?
The Cortex-A55 is ARM's efficiency core, designed for low power rather than peak throughput. In Geekbench 5, the RK3566 scores approximately 155 single-core and 452 multi-core — roughly equivalent to a 2014-era smartphone. That sounds dire, but audio AI workloads have different requirements than the benchmarks measure.
Wake word detection processes 16 kHz mono audio in 80 ms frames. Each frame requires:

- Melspectrogram feature extraction from the raw samples
- A shared audio embedding model run over the spectrogram features
- A small per-wake-word classifier on the resulting embeddings
Total: approximately 3-4 ms per 80 ms frame, or about 4-5% of a single core's capacity. The openWakeWord project documentation confirms that a single Raspberry Pi 3 core (a Cortex-A53, the A55's predecessor) can run 15-20 models simultaneously in real time.
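The frame budget above is simple duty-cycle arithmetic; this sketch just formalizes the math from the numbers in the text (frame and inference times are the figures quoted above, not new measurements):

```python
# Back-of-envelope check of the wake word CPU budget. The 80 ms frame and
# 3-4 ms inference figures come from the text; everything else is arithmetic.

FRAME_MS = 80.0   # openWakeWord processes 16 kHz audio in 80 ms frames
INFER_MS = 4.0    # worst-case per-frame inference time on one A55 core

def core_load(infer_ms: float, frame_ms: float = FRAME_MS) -> float:
    """Fraction of one core consumed by real-time per-frame inference."""
    return infer_ms / frame_ms

def models_per_core(infer_ms: float, frame_ms: float = FRAME_MS) -> int:
    """How many such models one core can service while staying real time."""
    return int(frame_ms // infer_ms)

print(f"single-model load: {core_load(INFER_MS):.0%}")    # 5% of one core
print(f"models per core:   {models_per_core(INFER_MS)}")  # 20 models
```

At the 3 ms best case, the same math yields roughly 26 models per core, bracketing the 15-20 figure from the openWakeWord documentation.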
CPU budget breakdown
We profile CPU usage continuously during development. Here is a representative snapshot during an active conversation:
| Process | CPU usage (% of 4 cores) |
|---|---|
| Chromium (avatar rendering) | 25-40% |
| Node.js runtime worker | 8-15% |
| openWakeWord | 3-5% |
| Go gateway | 1-3% |
| Audio pipeline (ALSA + AEC) | 2-4% |
| System (kernel, systemd, Xorg) | 5-8% |
| **Total** | **44-75%** |
This leaves 25-56% headroom depending on the conversation phase. Idle (wake word listening only) drops total CPU usage to roughly 15-20%.
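The table above comes from per-process profiling tooling that is not shown here; as a minimal sketch, whole-system utilization can be sampled from the same kernel counters `top` reads. This assumes a Linux `/proc` filesystem and returns `None` elsewhere:

```python
# Minimal whole-system CPU sampler via /proc/stat (aggregate "cpu" line).
# Illustrative sketch, not the HoloBox's actual profiling tooling.
import time

def _cpu_times():
    with open("/proc/stat") as f:
        vals = list(map(int, f.readline().split()[1:]))
    idle = vals[3] + vals[4]          # idle + iowait jiffies
    return idle, sum(vals)

def cpu_utilization(interval: float = 0.5):
    """Percent of total CPU (all cores) busy over `interval` seconds."""
    try:
        idle0, total0 = _cpu_times()
        time.sleep(interval)
        idle1, total1 = _cpu_times()
    except OSError:
        return None                   # no /proc/stat on this host
    dt = total1 - total0
    return 100.0 * (1 - (idle1 - idle0) / dt) if dt else 0.0
```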
What about the 0.8 TOPS NPU?
The RK3566 includes Rockchip's RKNN NPU rated at 0.8 TOPS (tera-operations per second). For context, here is how that compares with other edge AI accelerators:
| Device | NPU / AI accelerator | TOPS |
|---|---|---|
| RK3566 | RKNN | 0.8 |
| RK3588 | RKNN | 6.0 |
| Google Coral | Edge TPU | 4.0 |
| Apple A17 Pro | Neural Engine | 35.0 |
| Nvidia Jetson Orin Nano | CUDA + DLA | 40.0 |
At 0.8 TOPS, the NPU is useful for lightweight classification and detection models — think object recognition on camera frames or keyword spotting — but not for running transformer-based language models. A 7-billion parameter LLM quantized to INT4 requires roughly 10-15 TOPS for acceptable token generation speed (5+ tokens/second). The RK3566 NPU is an order of magnitude short.
We currently do not use the NPU in production. Our wake word model runs on the CPU via ONNX Runtime because the CPU path is fast enough (3-4 ms per frame) and avoids the complexity of the RKNN SDK. We are evaluating NPU-accelerated VAD models for a future release, where offloading audio classification to the NPU could free 3-5% of CPU headroom.
Why not run a small language model locally?
We tested this. We ran TinyLlama 1.1B (INT4 quantized) on the RK3566 CPU using llama.cpp:

- Generation speed: 1.4 tokens/second
- Resident memory: 890 MB
At 1.4 tokens/second, generating a 50-word response takes roughly 25 seconds. Compare that to GPT-4o via API at 50-80 tokens/second — the cloud path delivers a complete response before the local model has generated the first sentence.
More critically, the 890 MB RAM usage would consume most of our working headroom, leaving the system unstable. We concluded that local LLM inference on the RK3566 is technically possible but practically unusable for conversational AI.
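The TinyLlama result is consistent with a back-of-envelope, memory-bandwidth-bound model of CPU decoding: each generated token must stream the full weight set from DRAM, so tokens/s cannot exceed bandwidth divided by model size. The effective-bandwidth figure below is an illustrative assumption, not a HoloBox measurement:

```python
# Memory-bandwidth ceiling on local decode speed. tokens/s <= bw / weights.
# The 1 GB/s effective bandwidth is an illustrative assumption for a
# low-end LPDDR4 CPU path, not a measured HoloBox number.

def max_tokens_per_sec(params_billion: float, bits_per_weight: int,
                       eff_bandwidth_gbps: float) -> float:
    model_gb = params_billion * bits_per_weight / 8   # GB of weights
    return eff_bandwidth_gbps / model_gb

# TinyLlama 1.1B at INT4 (~0.55 GB of weights):
print(f"{max_tokens_per_sec(1.1, 4, 1.0):.1f} tok/s ceiling")
# A 7B model under the same assumptions:
print(f"{max_tokens_per_sec(7.0, 4, 1.0):.2f} tok/s ceiling")
```

Under these assumptions the 1.1B ceiling lands near 1.8 tok/s, bracketing our measured 1.4 tok/s, and a 7B model would be far below one token per second.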
The hybrid architecture
Instead of running everything locally or everything in the cloud, we split workloads by latency sensitivity:
| Workload | Where it runs | Why |
|---|---|---|
| Wake word detection | On-device (CPU) | Must be always-on, <100 ms response |
| Voice activity detection | On-device (CPU) | Real-time audio processing |
| Echo cancellation | On-device (CPU) | Hardware-coupled, latency-critical |
| Speech-to-text | Cloud (API) | Requires large acoustic models |
| LLM reasoning | Cloud (API) | Requires 7B+ parameter models |
| Text-to-speech | Cloud (API) | Neural TTS models too large to run locally |
| Avatar rendering | On-device (GPU) | Visual feedback must be immediate |
| Smart home commands | On-device (gateway) | Local network, low latency |
This split means the HoloBox is always responsive to voice (local wake word + VAD) even when the internet is slow, while leveraging cloud compute for the heavy AI workloads.
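As a sketch, the split in the table can be expressed as a simple routing table. The workload names and the `route` helper here are illustrative, not the shipped gateway API:

```python
# On-device vs. cloud placement, mirroring the table above.
# Illustrative names only; not the HoloBox's actual routing code.

PLACEMENT = {
    "wake_word":      "device",  # always-on, <100 ms response
    "vad":            "device",  # real-time audio processing
    "echo_cancel":    "device",  # hardware-coupled, latency-critical
    "avatar_render":  "device",  # immediate visual feedback
    "smart_home":     "device",  # local network, low latency
    "speech_to_text": "cloud",   # large acoustic models
    "llm":            "cloud",   # 7B+ parameter models
    "text_to_speech": "cloud",   # neural TTS too large locally
}

def route(workload: str) -> str:
    """Return where a workload executes; unknown workloads default to cloud."""
    return PLACEMENT.get(workload, "cloud")
```

Defaulting unknown workloads to the cloud keeps the on-device CPU budget predictable: only explicitly whitelisted work runs locally.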
How does the Mali-G52 GPU handle avatar rendering?
The Mali-G52 2EE in the RK3566 supports OpenGL ES 3.2, Vulkan 1.1, and has a theoretical fill rate of 6.8 Gpix/s. In practice, rendering our VRM avatar is the most demanding GPU task.
Our avatar pipeline:

- An optimized avatar model, reduced from the 98-mesh, 55,000-triangle original to 2,130 triangles
- Unlit materials in place of the stock lit VRM shaders
- Reduced render resolution
- A frame rate cap below the stock 60 fps target
With these optimizations, we achieve 28-35 fps on the HoloBox. Without them — using the stock VRM materials with lighting, at full resolution, at 60 fps — we measured 4-8 fps. The 98-mesh, 55,000-triangle original avatar model rendered at 1.6 fps before we switched to an optimized 2,130-triangle model.
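A quick sanity check on those numbers: frame rate scaled roughly linearly with triangle count, which is what you would expect if the original avatar was vertex-bound on the Mali-G52. This is our interpretation of the measurements, not a profiled fact:

```python
# Ratio check on the triangle reduction described above. If rendering is
# vertex-bound, fps should scale about linearly with triangle count.

ORIG_TRIS, OPT_TRIS = 55_000, 2_130
ORIG_FPS = 1.6                         # measured with the stock model

tri_ratio = ORIG_TRIS / OPT_TRIS
predicted_fps = ORIG_FPS * tri_ratio

print(f"triangle reduction: {tri_ratio:.1f}x")   # ~25.8x
print(f"vertex-bound prediction: {predicted_fps:.0f} fps vs 28-35 measured")
```

The prediction (about 41 fps) overshoots the measured 28-35 fps slightly, which is plausible: the optimized build also pays for compositing and the rest of the Chromium pipeline.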
GPU optimization lessons

- Triangle count dominates on a small GPU: the 55,000-triangle stock avatar managed 1.6 fps, while the 2,130-triangle version runs at a usable frame rate.
- Lighting is the next biggest cost: unlit materials eliminate per-pixel shading work the Mali-G52 cannot spare.
- Cap the frame rate: a stable ~30 fps looks better than a 60 fps target the GPU cannot hold, and leaves thermal headroom.
What are the thermal implications of sustained AI workloads?
Running wake word detection, audio processing, and avatar rendering simultaneously puts the RK3566 under sustained ~50-60% CPU + moderate GPU load. In our thermal testing, the SoC stabilized at approximately 72 degrees C under that sustained load with passive cooling.
The Cortex-A55 throttles at 85 degrees C, so even under stress we maintain a 13-degree margin in a 25 degrees C ambient environment. The key insight: the A55's efficiency means sustained AI workloads are thermally feasible without a fan, which would not be true with higher-performance A76 cores drawing 2-3x the power.
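For monitoring that margin in the field, the standard Linux thermal sysfs interface is sufficient. This is a sketch; thermal zone numbering varies by board, so zone 0 is an assumption, and the functions return `None` where the node is absent:

```python
# Throttle-margin check via the Linux thermal sysfs interface.
# Zone 0 is an assumption; actual zone mapping is board-specific.

THROTTLE_C = 85.0   # Cortex-A55 throttle point cited above

def soc_temp_c(zone: int = 0):
    """SoC temperature in degrees C, or None if the sysfs node is absent."""
    try:
        with open(f"/sys/class/thermal/thermal_zone{zone}/temp") as f:
            return int(f.read().strip()) / 1000.0   # reported in millidegrees
    except (OSError, ValueError):
        return None

def throttle_margin_c(zone: int = 0):
    """Degrees of headroom below the throttle point, or None if unreadable."""
    t = soc_temp_c(zone)
    return None if t is None else THROTTLE_C - t
```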
What would more compute buy us?
If we had the RK3588's 6 TOPS NPU and A76 cores, we could realistically run:

- On-device speech-to-text with a compact acoustic model
- NPU-accelerated vision, such as object and person detection on camera frames
- A small local LLM fast enough for short offline fallback responses
These are features we want for a future high-end variant. But for the $299 HoloBox, the RK3566 handles the workloads that matter most — always-on listening and responsive visual feedback — while the cloud handles the heavy lifting.
Key takeaways

- Audio AI is cheap: wake word detection costs 3-5% CPU on the quad-core A55, leaving headroom for everything else.
- A 0.8 TOPS NPU is an order of magnitude short of LLM inference; a 7B INT4 model needs roughly 10-15 TOPS for usable speed.
- Local LLM inference is possible but unusable: TinyLlama 1.1B managed 1.4 tokens/second and consumed 890 MB of RAM.
- Split workloads by latency sensitivity: always-on audio and rendering stay on-device; STT, LLM reasoning, and TTS go to the cloud.
- Aggressive GPU optimization matters: cutting the avatar from 55,000 to 2,130 triangles took rendering from 1.6 fps to a usable 28-35 fps.
- Efficiency cores make fanless AI feasible: sustained load leaves a 13-degree margin below the 85 degrees C throttle point.
Want an AI agent on your counter?
Jinn HoloBox is available for pre-order at $299 ($150 off retail).
Pre-Order Now