Engineering · 9 min read

Running AI on ARM: Performance Lessons from the RK3566

What we learned about running wake word detection, LLM inference, and WebGL rendering on a quad-core Cortex-A55 with 0.8 TOPS NPU.

The RK3566's quad-core Cortex-A55 at 1.8 GHz with a 0.8 TOPS NPU can run on-device wake word detection at under 10% CPU load and process audio in real time — but it cannot run LLM inference locally. The key to AI on low-power ARM is being ruthless about what runs on-device versus in the cloud, and optimizing the boundary between them for perceived latency.

What AI workloads does the HoloBox actually run on-device?

Not everything labeled "AI" needs a GPU cluster. The HoloBox runs three AI-adjacent workloads locally:

1. Wake word detection (openWakeWord): Listens continuously for "Hey Jinn" using a ~1.3 MB ONNX model. This is the most latency-sensitive task — it must respond within 100 ms of the user finishing the wake phrase.
2. Audio preprocessing: Voice activity detection (VAD), acoustic echo cancellation (AEC), and audio feature extraction. These are lightweight DSP operations but must run in real time on every audio frame.
3. WebGL avatar rendering: A Three.js-based VRM avatar that responds to conversation state. This is GPU-bound, running on the Mali-G52.

Everything else — LLM inference, speech-to-text, text-to-speech — runs in the cloud via the user's own API keys. This is a deliberate architectural choice, not a compromise.

How does the Cortex-A55 perform for audio AI?

The Cortex-A55 is ARM's efficiency core, designed for low power rather than peak throughput. In Geekbench 5, the RK3566 scores approximately 155 single-core and 452 multi-core — roughly equivalent to a 2014-era smartphone. That sounds dire, but audio AI workloads make different demands than those benchmarks measure.

Wake word detection processes 16 kHz mono audio in 80 ms frames. Each frame requires:

- Mel spectrogram computation (~0.5 ms on one A55 core)
- Neural network inference through the wake word model (~2-3 ms)
- Post-processing and threshold comparison (~0.1 ms)

Total: approximately 3-4 ms per 80 ms frame, or about 4-5% of a single core's capacity. The openWakeWord project documentation confirms that a single Raspberry Pi 3 core (a Cortex-A53, the predecessor of the A55) can run 15-20 models simultaneously in real time.
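
These per-frame numbers translate directly into a CPU budget. A minimal sketch of the arithmetic, using the frame length and worst-case per-frame costs measured above:

```python
# Real-time audio budget for wake word detection on one Cortex-A55 core.
FRAME_MS = 80.0                        # 16 kHz mono audio, processed in 80 ms frames
PER_FRAME_COST_MS = 0.5 + 3.0 + 0.1    # mel spectrogram + inference + post-processing (worst case)

def core_load(frame_ms: float, cost_ms: float) -> float:
    """Fraction of one core consumed by a task costing `cost_ms` every `frame_ms`."""
    return cost_ms / frame_ms

load = core_load(FRAME_MS, PER_FRAME_COST_MS)
print(f"single-core load: {load:.1%}")        # ~4.5% of one core

# How many such models fit on one core before it saturates?
print(f"models per core: {int(1.0 / load)}")  # ~22 in theory; openWakeWord reports 15-20 in practice
```

The gap between the theoretical 22 and the reported 15-20 is scheduling and memory overhead, which this back-of-envelope ignores.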

CPU budget breakdown

We profile CPU usage continuously during development. Here is a representative snapshot during an active conversation:

| Process | CPU usage (% of 4 cores) |
| --- | --- |
| Chromium (avatar rendering) | 25-40% |
| Node.js runtime worker | 8-15% |
| openWakeWord | 3-5% |
| Go gateway | 1-3% |
| Audio pipeline (ALSA + AEC) | 2-4% |
| System (kernel, systemd, Xorg) | 5-8% |
| **Total** | **44-75%** |

This leaves 25-56% headroom depending on the conversation phase. Idle (wake word listening only) drops total CPU usage to roughly 15-20%.
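
The totals are just the sums of the per-process ranges. A quick check, with the numbers copied from the snapshot above:

```python
# Per-process CPU ranges (% of 4 cores) from the profiling snapshot: (low, high)
usage = {
    "chromium_avatar": (25, 40),
    "node_worker": (8, 15),
    "openwakeword": (3, 5),
    "go_gateway": (1, 3),
    "audio_pipeline": (2, 4),
    "system": (5, 8),
}

low = sum(lo for lo, _ in usage.values())
high = sum(hi for _, hi in usage.values())
print(f"total: {low}-{high}%")                  # 44-75%
print(f"headroom: {100 - high}-{100 - low}%")   # 25-56%
```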

What about the 0.8 TOPS NPU?

The RK3566 includes Rockchip's RKNN NPU rated at 0.8 TOPS (tera operations per second). For context, that is approximately:

| Device | NPU / AI accelerator | TOPS |
| --- | --- | --- |
| RK3566 | RKNN | 0.8 |
| RK3588 | RKNN | 6.0 |
| Google Coral | Edge TPU | 4.0 |
| Apple A17 Pro | Neural Engine | 35.0 |
| Nvidia Jetson Orin Nano | CUDA + DLA | 40.0 |

At 0.8 TOPS, the NPU is useful for lightweight classification and detection models — think object recognition on camera frames or keyword spotting — but not for running transformer-based language models. A 7-billion parameter LLM quantized to INT4 requires roughly 10-15 TOPS for acceptable token generation speed (5+ tokens/second). The RK3566 NPU is an order of magnitude short.

We currently do not use the NPU in production. Our wake word model runs on the CPU via ONNX Runtime because the CPU path is fast enough (3-4 ms per frame) and avoids the complexity of the RKNN SDK. We are evaluating NPU-accelerated VAD models for a future release, where offloading audio classification to the NPU could free 3-5% of CPU headroom.

Why not run a small language model locally?

We tested this. We ran TinyLlama 1.1B (INT4 quantized) on the RK3566 CPU using llama.cpp:

- Model load time: 8.2 seconds
- Prompt processing: 2.1 tokens/second (for a 100-token prompt)
- Token generation: 1.4 tokens/second
- RAM usage: 890 MB

At 1.4 tokens/second, generating a 50-word response (roughly 65 tokens) takes about 45 seconds. Compare that to GPT-4o via API at 50-80 tokens/second — the cloud path delivers a complete response before the local model has generated the first sentence.
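
The gap is easiest to see as wall-clock time for the same response length. A back-of-envelope comparison, where the response length is a hypothetical example and the throughput figures are the measurements above:

```python
def generation_time(tokens: int, tokens_per_sec: float) -> float:
    """Seconds to generate `tokens` at a sustained decode rate."""
    return tokens / tokens_per_sec

RESPONSE_TOKENS = 60  # a short conversational reply (hypothetical length)

local = generation_time(RESPONSE_TOKENS, 1.4)   # TinyLlama 1.1B on the RK3566 CPU
cloud = generation_time(RESPONSE_TOKENS, 50.0)  # GPT-4o API, low end of 50-80 tok/s

print(f"local: {local:.0f} s, cloud: {cloud:.1f} s, ratio: {local / cloud:.0f}x")
```

Even against the slow end of the cloud range, the local path is roughly 36x slower, before accounting for model load time.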

More critically, the 890 MB RAM usage would consume most of our working headroom, leaving the system unstable. We concluded that local LLM inference on the RK3566 is technically possible but practically unusable for conversational AI.

The hybrid architecture

Instead of running everything locally or everything in the cloud, we split workloads by latency sensitivity:

| Workload | Where it runs | Why |
| --- | --- | --- |
| Wake word detection | On-device (CPU) | Must be always-on, <100 ms response |
| Voice activity detection | On-device (CPU) | Real-time audio processing |
| Echo cancellation | On-device (CPU) | Hardware-coupled, latency-critical |
| Speech-to-text | Cloud (API) | Requires large acoustic models |
| LLM reasoning | Cloud (API) | Requires 7B+ parameter models |
| Text-to-speech | Cloud (API) | Neural TTS models are too large for local |
| Avatar rendering | On-device (GPU) | Visual feedback must be immediate |
| Smart home commands | On-device (gateway) | Local network, low latency |

This split means the HoloBox is always responsive to voice (local wake word + VAD) even when the internet is slow, while leveraging cloud compute for the heavy AI workloads.
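
One way to read the split is as a routing rule: anything with a deadline tighter than a network round trip stays on-device, and anything needing a model too large for local RAM goes to the cloud. A minimal illustrative sketch — the thresholds, workload names, and `place` function are hypothetical, not our actual scheduler:

```python
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    deadline_ms: float     # hard response deadline
    model_params_m: float  # model size, millions of parameters

# Assumed thresholds for illustration: a ~150 ms network round trip,
# and ~100M parameters as the largest model worth running locally.
NETWORK_RTT_MS = 150
MAX_LOCAL_PARAMS_M = 100

def place(w: Workload) -> str:
    if w.deadline_ms < NETWORK_RTT_MS:
        return "on-device"   # can't afford the round trip
    if w.model_params_m > MAX_LOCAL_PARAMS_M:
        return "cloud"       # model won't fit or run locally
    return "on-device"

print(place(Workload("wake word", deadline_ms=100, model_params_m=0.4)))       # on-device
print(place(Workload("LLM reasoning", deadline_ms=2000, model_params_m=7000))) # cloud
```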

How does the Mali-G52 GPU handle avatar rendering?

The Mali-G52 2EE in the RK3566 supports OpenGL ES 3.2, Vulkan 1.1, and has a theoretical fill rate of 6.8 Gpix/s. In practice, rendering our VRM avatar is the most demanding GPU task.

Our avatar pipeline:

1. Load a VRM model (optimized to ~2,100 triangles, single mesh, single material)
2. Replace standard MToon materials with MeshBasicMaterial (unlit) to skip lighting calculations
3. Render at 0.75x device pixel ratio (effectively 540x960) and upscale
4. Cap the frame rate at 30 fps via requestAnimationFrame throttling

With these optimizations, we achieve 28-35 fps on the HoloBox. Without them — using the stock VRM materials with lighting, at full resolution, at 60 fps — we measured 4-8 fps. The 98-mesh, 55,000-triangle original avatar model rendered at 1.6 fps before we switched to an optimized 2,130-triangle model.

GPU optimization lessons

- **Draw calls matter more than triangle count on mobile GPUs.** Going from 98 draw calls (one per mesh) to 1 draw call improved fps by 10x, even though triangle count only dropped 25x.
- **Unlit materials are essential.** MeshBasicMaterial skips the fragment shader lighting calculations that dominate GPU time on the Mali-G52.
- **Resolution scaling is free performance.** Rendering at 0.75x DPR saves 44% of fragment shader work. On a 5-inch screen at arm's length, the quality difference is imperceptible.
- **Texture size matters.** We cap all textures at 512x512 pixels. The Mali-G52 has limited texture cache, and larger textures cause cache thrashing that tanks performance.
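
The resolution-scaling saving is pure arithmetic: fragment shader cost scales with pixel count, which scales with the square of the device-pixel-ratio multiplier.

```python
def fragment_savings(dpr_scale: float) -> float:
    """Fraction of fragment shader work saved by rendering at `dpr_scale` x resolution."""
    return 1.0 - dpr_scale ** 2

print(f"{fragment_savings(0.75):.1%} saved at 0.75x DPR")  # 43.8%, the ~44% quoted above

# On a native 720x1280 panel, 0.75x rendering gives the 540x960 figure quoted earlier.
print(int(720 * 0.75), int(1280 * 0.75))
```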

What are the thermal implications of sustained AI workloads?

Running wake word detection, audio processing, and avatar rendering simultaneously puts the RK3566 under sustained ~50-60% CPU + moderate GPU load. In our thermal testing:

- Idle (wake word only): SoC junction ~45 degrees C, power draw ~1.8W
- Active conversation (full stack): SoC junction ~65 degrees C, power draw ~3.5W
- Stress test (max CPU + GPU): SoC junction ~72 degrees C, power draw ~4.8W

The Cortex-A55 throttles at 85 degrees C, so even under stress we maintain a 13-degree margin in a 25 degrees C ambient environment. The key insight: the A55's efficiency means sustained AI workloads are thermally feasible without a fan, which would not be true with higher-performance A76 cores drawing 2-3x the power.
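
Expressed against the 85 degrees C throttle point, the measured junction temperatures give the margins directly:

```python
THROTTLE_C = 85  # Cortex-A55 thermal throttle threshold on the RK3566

# Measured SoC junction temperatures at 25 C ambient (from the tests above)
junction_c = {"idle": 45, "conversation": 65, "stress": 72}

margins = {state: THROTTLE_C - temp for state, temp in junction_c.items()}
print(margins)                                           # idle: 40, conversation: 20, stress: 13
print(f"worst-case margin: {min(margins.values())} C")   # the 13-degree figure above
```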

What would more compute buy us?

If we had the RK3588's 6 TOPS NPU and A76 cores, we could realistically run:

- On-device speech-to-text (Whisper tiny/base) at near real-time speed
- Small language models (1-3B parameters) for offline fallback responses
- More complex avatar rendering with real-time lip sync and physics

These are features we want for a future high-end variant. But for the $299 HoloBox, the RK3566 handles the workloads that matter most — always-on listening and responsive visual feedback — while the cloud handles the heavy lifting.

Key takeaways

1. The RK3566's Cortex-A55 cores handle wake word detection at 3-5% CPU load per model — audio AI is computationally light compared to language model inference.
2. The 0.8 TOPS NPU is an order of magnitude too small for LLM inference (which needs 10-15+ TOPS for usable speed), but adequate for lightweight classification tasks.
3. Local LLM inference on the RK3566 produces only 1.4 tokens/second with TinyLlama 1.1B — unusable for conversation but a useful benchmark for future hardware planning.
4. GPU draw calls, not triangle count, are the primary performance bottleneck on mobile GPUs. Reducing from 98 to 1 draw call improved avatar rendering by 10x.
5. The hybrid architecture — latency-sensitive tasks on-device, compute-intensive tasks in the cloud — is the pragmatic approach for sub-$300 AI hardware.
6. Thermal headroom of 13+ degrees C under stress means sustained AI workloads are feasible in a fanless enclosure with the A55's efficiency cores.

Want an AI agent on your counter?

Jinn HoloBox is available for pre-order at $299 ($150 off retail).
