Engineering · 9 min read

Running AI on ARM: Performance Lessons from the RK3566

What we learned about running wake word detection, LLM inference, and WebGL rendering on a quad-core Cortex-A55 with 0.8 TOPS NPU.

The RK3566's quad-core Cortex-A55 at 1.8 GHz with a 0.8 TOPS NPU can run on-device wake word detection at under 10% CPU load and process audio in real time — but it cannot run LLM inference locally. The key to AI on low-power ARM is being ruthless about what runs on-device versus in the cloud, and optimizing the boundary between them for perceived latency.

What AI workloads does the HoloBox actually run on-device?

Not everything labeled "AI" needs a GPU cluster. The HoloBox runs three AI-adjacent workloads locally:

1. Wake word detection (openWakeWord): Listens continuously for "Hey Jinn" using a ~1.3 MB ONNX model. This is the most latency-sensitive task — it must respond within 100 ms of the user finishing the wake phrase.
2. Audio preprocessing: Voice activity detection (VAD), acoustic echo cancellation (AEC), and audio feature extraction. These are lightweight DSP operations but must run in real time on every audio frame.
3. WebGL avatar rendering: A Three.js-based VRM avatar that responds to conversation state. This is GPU-bound, running on the Mali-G52.

Everything else — LLM inference, speech-to-text, text-to-speech — runs in the cloud via the user's own API keys. This is a deliberate architectural choice, not a compromise.

How does the Cortex-A55 perform for audio AI?

The Cortex-A55 is ARM's efficiency core, designed for low power rather than peak throughput. In Geekbench 5, the RK3566 scores approximately 155 single-core and 452 multi-core — roughly equivalent to a 2014-era smartphone. That sounds dire, but audio AI workloads make different demands than those benchmarks measure.

Wake word detection processes 16 kHz mono audio in 80 ms frames. Each frame requires:

- Mel spectrogram computation (~0.5 ms on one A55 core)
- Neural network inference through the wake word model (~2-3 ms)
- Post-processing and threshold comparison (~0.1 ms)

Total: approximately 3-4 ms per 80 ms frame, or about 4-5% of a single core's capacity. The openWakeWord project documentation confirms that a single Raspberry Pi 3 core (a Cortex-A53, the predecessor of the A55) can run 15-20 models simultaneously in real time.
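
These per-frame numbers translate directly into a CPU budget. A minimal sketch of the arithmetic, using the frame length and worst-case per-frame costs measured above:

```python
# Real-time audio budget for wake word detection on one Cortex-A55 core.
FRAME_MS = 80.0                        # 16 kHz mono audio, processed in 80 ms frames
PER_FRAME_COST_MS = 0.5 + 3.0 + 0.1    # mel spectrogram + inference + post-processing (worst case)

def core_load(frame_ms: float, cost_ms: float) -> float:
    """Fraction of one core consumed by a task costing `cost_ms` every `frame_ms`."""
    return cost_ms / frame_ms

load = core_load(FRAME_MS, PER_FRAME_COST_MS)
print(f"single-core load: {load:.1%}")        # ~4.5% of one core

# How many such models fit on one core before it saturates?
print(f"models per core: {int(1.0 / load)}")  # ~22 in theory; openWakeWord reports 15-20 in practice
```

The gap between the theoretical 22 and the reported 15-20 is scheduling and memory overhead, which this back-of-envelope ignores.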

CPU budget breakdown

We profile CPU usage continuously during development. Here is a representative snapshot during an active conversation:

| Process | CPU usage (% of 4 cores) |
| --- | --- |
| Chromium (avatar rendering) | 25-40% |
| Node.js runtime worker | 8-15% |
| openWakeWord | 3-5% |
| Go gateway | 1-3% |
| Audio pipeline (ALSA + AEC) | 2-4% |
| System (kernel, systemd, Xorg) | 5-8% |
| **Total** | **44-75%** |

This leaves 25-56% headroom depending on the conversation phase. Idle (wake word listening only) drops total CPU usage to roughly 15-20%.
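
The totals are just the sums of the per-process ranges. A quick check, with the numbers copied from the snapshot above:

```python
# Per-process CPU ranges (% of 4 cores) from the profiling snapshot: (low, high)
usage = {
    "chromium_avatar": (25, 40),
    "node_worker": (8, 15),
    "openwakeword": (3, 5),
    "go_gateway": (1, 3),
    "audio_pipeline": (2, 4),
    "system": (5, 8),
}

low = sum(lo for lo, _ in usage.values())
high = sum(hi for _, hi in usage.values())
print(f"total: {low}-{high}%")                  # 44-75%
print(f"headroom: {100 - high}-{100 - low}%")   # 25-56%
```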

What about the 0.8 TOPS NPU?

The RK3566 includes Rockchip's RKNN NPU rated at 0.8 TOPS (tera operations per second). For context, that is approximately:

| Device | NPU / AI accelerator | TOPS |
| --- | --- | --- |
| RK3566 | RKNN | 0.8 |
| RK3588 | RKNN | 6.0 |
| Google Coral | Edge TPU | 4.0 |
| Apple A17 Pro | Neural Engine | 35.0 |
| Nvidia Jetson Orin Nano | CUDA + DLA | 40.0 |

At 0.8 TOPS, the NPU is useful for lightweight classification and detection models — think object recognition on camera frames or keyword spotting — but not for running transformer-based language models. A 7-billion parameter LLM quantized to INT4 requires roughly 10-15 TOPS for acceptable token generation speed (5+ tokens/second). The RK3566 NPU is an order of magnitude short.

We currently do not use the NPU in production. Our wake word model runs on the CPU via ONNX Runtime because the CPU path is fast enough (3-4 ms per frame) and avoids the complexity of the RKNN SDK. We are evaluating NPU-accelerated VAD models for a future release, where offloading audio classification to the NPU could free 3-5% of CPU headroom.

Why not run a small language model locally?

We tested this. We ran TinyLlama 1.1B (INT4 quantized) on the RK3566 CPU using llama.cpp:

- Model load time: 8.2 seconds
- Prompt processing: 2.1 tokens/second (for a 100-token prompt)
- Token generation: 1.4 tokens/second
- RAM usage: 890 MB

At 1.4 tokens/second, generating a 50-word response (roughly 65 tokens) takes about 45 seconds. Compare that to GPT-4o via API at 50-80 tokens/second — the cloud path delivers a complete response before the local model has generated the first sentence.
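
The gap is easiest to see as wall-clock time for the same response length. A back-of-envelope comparison, where the response length is a hypothetical example and the throughput figures are the measurements above:

```python
def generation_time(tokens: int, tokens_per_sec: float) -> float:
    """Seconds to generate `tokens` at a sustained decode rate."""
    return tokens / tokens_per_sec

RESPONSE_TOKENS = 60  # a short conversational reply (hypothetical length)

local = generation_time(RESPONSE_TOKENS, 1.4)   # TinyLlama 1.1B on the RK3566 CPU
cloud = generation_time(RESPONSE_TOKENS, 50.0)  # GPT-4o API, low end of 50-80 tok/s

print(f"local: {local:.0f} s, cloud: {cloud:.1f} s, ratio: {local / cloud:.0f}x")
```

Even against the slow end of the cloud range, the local path is roughly 36x slower, before accounting for model load time.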

More critically, the 890 MB RAM usage would consume most of our working headroom, leaving the system unstable. We concluded that local LLM inference on the RK3566 is technically possible but practically unusable for conversational AI.

The hybrid architecture

Instead of running everything locally or everything in the cloud, we split workloads by latency sensitivity:

| Workload | Where it runs | Why |
| --- | --- | --- |
| Wake word detection | On-device (CPU) | Must be always-on, <100 ms response |
| Voice activity detection | On-device (CPU) | Real-time audio processing |
| Echo cancellation | On-device (CPU) | Hardware-coupled, latency-critical |
| Speech-to-text | Cloud (API) | Requires large acoustic models |
| LLM reasoning | Cloud (API) | Requires 7B+ parameter models |
| Text-to-speech | Cloud (API) | Neural TTS models are too large for local |
| Avatar rendering | On-device (GPU) | Visual feedback must be immediate |
| Smart home commands | On-device (gateway) | Local network, low latency |

This split means the HoloBox is always responsive to voice (local wake word + VAD) even when the internet is slow, while leveraging cloud compute for the heavy AI workloads.
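
One way to read the split is as a routing rule: anything with a deadline tighter than a network round trip stays on-device, and anything needing a model too large for local RAM goes to the cloud. A minimal illustrative sketch — the thresholds, workload names, and `place` function are hypothetical, not our actual scheduler:

```python
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    deadline_ms: float     # hard response deadline
    model_params_m: float  # model size, millions of parameters

# Assumed thresholds for illustration: a ~150 ms network round trip,
# and ~100M parameters as the largest model worth running locally.
NETWORK_RTT_MS = 150
MAX_LOCAL_PARAMS_M = 100

def place(w: Workload) -> str:
    if w.deadline_ms < NETWORK_RTT_MS:
        return "on-device"   # can't afford the round trip
    if w.model_params_m > MAX_LOCAL_PARAMS_M:
        return "cloud"       # model won't fit or run locally
    return "on-device"

print(place(Workload("wake word", deadline_ms=100, model_params_m=0.4)))       # on-device
print(place(Workload("LLM reasoning", deadline_ms=2000, model_params_m=7000))) # cloud
```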

How does the Mali-G52 GPU handle avatar rendering?

The Mali-G52 2EE in the RK3566 supports OpenGL ES 3.2, Vulkan 1.1, and has a theoretical fill rate of 6.8 Gpix/s. In practice, rendering our VRM avatar is the most demanding GPU task.

Our avatar pipeline:

1. Load a VRM model (optimized to ~2,100 triangles, single mesh, single material)
2. Replace standard MToon materials with MeshBasicMaterial (unlit) to skip lighting calculations
3. Render at 0.75x device pixel ratio (effectively 540x960) and upscale
4. Cap the frame rate at 30 fps via requestAnimationFrame throttling

With these optimizations, we achieve 28-35 fps on the HoloBox. Without them — using the stock VRM materials with lighting, at full resolution, at 60 fps — we measured 4-8 fps. The 98-mesh, 55,000-triangle original avatar model rendered at 1.6 fps before we switched to an optimized 2,130-triangle model.

GPU optimization lessons

- **Draw calls matter more than triangle count on mobile GPUs.** Going from 98 draw calls (one per mesh) to 1 draw call improved fps by 10x, even though triangle count only dropped 25x.
- **Unlit materials are essential.** MeshBasicMaterial skips the fragment shader lighting calculations that dominate GPU time on the Mali-G52.
- **Resolution scaling is free performance.** Rendering at 0.75x DPR saves 44% of fragment shader work. On a 5-inch screen at arm's length, the quality difference is imperceptible.
- **Texture size matters.** We cap all textures at 512x512 pixels. The Mali-G52 has limited texture cache, and larger textures cause cache thrashing that tanks performance.
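
The resolution-scaling saving is pure arithmetic: fragment shader cost scales with pixel count, which scales with the square of the device-pixel-ratio multiplier.

```python
def fragment_savings(dpr_scale: float) -> float:
    """Fraction of fragment shader work saved by rendering at `dpr_scale` x resolution."""
    return 1.0 - dpr_scale ** 2

print(f"{fragment_savings(0.75):.1%} saved at 0.75x DPR")  # 43.8%, the ~44% quoted above

# On a native 720x1280 panel, 0.75x rendering gives the 540x960 figure quoted earlier.
print(int(720 * 0.75), int(1280 * 0.75))
```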

What are the thermal implications of sustained AI workloads?

Running wake word detection, audio processing, and avatar rendering simultaneously puts the RK3566 under sustained ~50-60% CPU + moderate GPU load. In our thermal testing:

- Idle (wake word only): SoC junction ~45 degrees C, power draw ~1.8W
- Active conversation (full stack): SoC junction ~65 degrees C, power draw ~3.5W
- Stress test (max CPU + GPU): SoC junction ~72 degrees C, power draw ~4.8W

The Cortex-A55 throttles at 85 degrees C, so even under stress we maintain a 13-degree margin in a 25 degrees C ambient environment. The key insight: the A55's efficiency means sustained AI workloads are thermally feasible without a fan, which would not be true with higher-performance A76 cores drawing 2-3x the power.
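
Expressed against the 85 degrees C throttle point, the measured junction temperatures give the margins directly:

```python
THROTTLE_C = 85  # Cortex-A55 thermal throttle threshold on the RK3566

# Measured SoC junction temperatures at 25 C ambient (from the tests above)
junction_c = {"idle": 45, "conversation": 65, "stress": 72}

margins = {state: THROTTLE_C - temp for state, temp in junction_c.items()}
print(margins)                                           # idle: 40, conversation: 20, stress: 13
print(f"worst-case margin: {min(margins.values())} C")   # the 13-degree figure above
```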

What would more compute buy us?

If we had the RK3588's 6 TOPS NPU and A76 cores, we could realistically run:

- On-device speech-to-text (Whisper tiny/base) at near real-time speed
- Small language models (1-3B parameters) for offline fallback responses
- More complex avatar rendering with real-time lip sync and physics

These are features we want for a future high-end variant. But for the $299 HoloBox, the RK3566 handles the workloads that matter most — always-on listening and responsive visual feedback — while the cloud handles the heavy lifting.

Key takeaways

1. The RK3566's Cortex-A55 cores handle wake word detection at 3-5% CPU load per model — audio AI is computationally light compared to language model inference.
2. The 0.8 TOPS NPU is an order of magnitude too small for LLM inference (which needs 10-15+ TOPS for usable speed), but adequate for lightweight classification tasks.
3. Local LLM inference on the RK3566 produces only 1.4 tokens/second with TinyLlama 1.1B — unusable for conversation but a useful benchmark for future hardware planning.
4. GPU draw calls, not triangle count, are the primary performance bottleneck on mobile GPUs. Reducing from 98 to 1 draw call improved avatar rendering by 10x.
5. The hybrid architecture — latency-sensitive tasks on-device, compute-intensive tasks in the cloud — is the pragmatic approach for sub-$300 AI hardware.
6. Thermal headroom of 13+ degrees C under stress means sustained AI workloads are feasible in a fanless enclosure with the A55's efficiency cores.

Want an AI agent on your counter?

Jinn HoloBox is available for pre-order at $299 ($150 off retail).
