As edge-AI applications spread, getting models to run fast and efficiently on-device has become a top concern for developers. RK1820, an SoC that combines a RISC-V coprocessor with a high-performance AI accelerator, offers a complete, easy-to-use end-to-end AI workflow. This article walks through the RK1820 framework, from PC-side model creation to board-side deployment, and shows how Neardi makes the AI coprocessor truly production-ready.
In embedded systems and AI-acceleration scenarios, the host CPU handles general control and resource management, while the coprocessor performs compute-intensive, specialized or real-time tasks. Efficient cooperation through shared memory, FIFOs or RPC can markedly boost system performance, cut power consumption, and preserve design flexibility, which is why modern SoCs increasingly adopt coprocessor architectures. RK1820 is an AI coprocessor: paired with an RK3588 or RK3576 host and linked over PCIe, it adds AI compute to the system flexibly and efficiently.
The CPU’s job in an embedded system is to make sure the entire chain—sensing, computing, communicating, controlling—happens in the right order, at the right time, and with the lowest possible energy. Specialized engines (NPU, DSP, FPGA coprocessors) only “calculate fast”; when to run, what to run, and where to deliver the result are all directed by the CPU.
The CPU is the embedded system’s “all-round butler”: it fetches instructions, decodes them into real work, and tells the ALU to execute, while also acting like a finance director who decides how memory, cache, and clocks are shared. Externally, GPIO, UART, I²C, SPI, USB, and Ethernet are its “mouth” and “hands”; sensors, displays, and network modules all talk through it. When tasks pile up it becomes the project manager, slicing time in the RTOS so that time-critical work such as motor or brake control meets its deadlines. At power-on it runs a self-check, loads the bootloader to verify identity, and only then lets the system run. When it sees idle time, it lowers frequency or voltage, or simply sleeps, to extend battery life. Finally, it is the “conductor”: when DSPs, NPUs, and other accelerators start, what data they move, and where they place results are all decided at its command, so the whole pipeline of compute, storage, peripherals, and accelerators stays coordinated.
Concept and Working Principle of a Coprocessor
A coprocessor is the CPU’s “plug-in skill pack”: in its classic form it does not fetch its own instruction stream or run an OS. Instead, it accepts micro-commands or data blocks from the host CPU, executes specific operators (MAC, convolution, floating-point, FFT, AES, CRC, trig, etc.) on dedicated hardware at high performance and low power, then writes the result back into shared memory and interrupts the CPU.
1. The compiler “tags” heavy functions (matrix ops, convolutions) as “off-load to coprocessor”.
2. When the CPU reaches these tags it skips the calculation, instead pushing parameters into the coprocessor’s “inbox”—a few registers, a FIFO, or a DMA buffer—and hits “START”.
3. The coprocessor runs at full speed in its own clock domain while the CPU continues with other work; both sides work in parallel.
4. On completion the coprocessor says “done” by raising a status bit, issuing an interrupt, or writing results to shared memory. The CPU picks up the result and continues execution.
5. If an error occurs (overflow, illegal instruction) the coprocessor records an error code in its status register and signals an exception; the CPU reads the error code and handles it in its exception handler.
The whole flow is like a boss handing a blueprint to a dedicated machine: the machine roars through the job while the boss keeps taking calls. When finished, it rings the bell and delivers the goods; if something breaks, it flashes a red light so the boss can step in.
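The hand-off in steps 2 to 5 can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not RK1820 or Rockchip code: the class ToyCoprocessor, the "dot" operator, the offload helper, and the error code are hypothetical names chosen for illustration. A thread stands in for the coprocessor, a queue for its inbox, and a threading.Event for the "done" interrupt.

```python
import queue
import threading


class ToyCoprocessor:
    """Hypothetical stand-in for a coprocessor (illustration only)."""

    OK, ERR_BAD_OP = 0, 1

    def __init__(self):
        self.inbox = queue.Queue()     # plays the role of the FIFO / DMA buffer
        self.done = threading.Event()  # plays the role of the status bit / interrupt
        self.result = None             # plays the role of shared memory
        self.error_code = self.OK      # plays the role of the status register
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        # Runs independently of the host, like a separate clock domain.
        while True:
            op, a, b = self.inbox.get()
            if op == "dot":
                self.result = sum(x * y for x, y in zip(a, b))
                self.error_code = self.OK
            else:
                self.result = None
                self.error_code = self.ERR_BAD_OP
            self.done.set()            # step 4 (or step 5 on error): signal the host

    def start(self, op, a, b):
        self.done.clear()
        self.inbox.put((op, a, b))     # step 2: push parameters and hit START


def offload(cop, op, a, b):
    cop.start(op, a, b)
    host_work = sum(range(1000))       # step 3: the host keeps working in parallel
    if not cop.done.wait(timeout=5):   # step 4: wait for the "done" signal
        raise TimeoutError("coprocessor did not respond")
    if cop.error_code != cop.OK:       # step 5: read the error code and handle it
        raise RuntimeError(f"coprocessor error code {cop.error_code}")
    return cop.result, host_work


if __name__ == "__main__":
    cop = ToyCoprocessor()
    print(offload(cop, "dot", [1, 2, 3], [4, 5, 6]))
```

On real hardware the inbox would be memory-mapped registers or a DMA descriptor ring and the event would be an interrupt, but the division of labor is the same: the host submits, carries on, and only synchronizes when it needs the result.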
How does the host CPU communicate with the coprocessor?
To enable efficient cooperation, the two sides are usually linked by a high-speed interconnect; the most common choice is PCIe.
PCIe is a point-to-point serial bus standard offering high bandwidth, low latency and good scalability, so it has become the first-choice interface between CPUs, GPUs, NPUs and other high-performance devices.
High data bandwidth: each lane (x1) delivers roughly 1 GB/s per direction on PCIe 3.0 and roughly 2 GB/s per direction on PCIe 4.0; wider links (x2, x4 and up) scale proportionally.
Low latency: the point-to-point architecture cuts arbitration and wait time, making it suitable for real-time AI data exchange.
Bidirectional transfer: each lane is full-duplex, so the host can send input data while the coprocessor returns results at the same time.
Hot-plug capable: some designs allow module-level hot-plug for easier maintenance and expansion.
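Per-lane PCIe bandwidth follows directly from the transfer rate and the line encoding, so it is easy to check by hand. The short calculation below is a sketch using the published figures for PCIe 3.0 and 4.0 (8 GT/s and 16 GT/s, both with 128b/130b encoding); the function name and the x4 link width are illustrative assumptions, not RK1820 specifications, and protocol overhead above the encoding layer is ignored.

```python
def lane_bandwidth_gb_s(transfer_rate_gt_s, payload_bits, total_bits):
    """Usable bandwidth of one lane in one direction, in GB/s.

    transfer_rate_gt_s: raw line rate in gigatransfers per second
    payload_bits / total_bits: line-encoding efficiency (e.g. 128/130)
    """
    return transfer_rate_gt_s * payload_bits / total_bits / 8  # 8 bits per byte


# (raw rate in GT/s, payload bits, total bits) per generation
PCIE_GENERATIONS = {
    "PCIe 3.0": (8.0, 128, 130),
    "PCIe 4.0": (16.0, 128, 130),
}

if __name__ == "__main__":
    for name, (rate, payload, total) in PCIE_GENERATIONS.items():
        x1 = lane_bandwidth_gb_s(rate, payload, total)
        print(f"{name}: x1 = {x1:.2f} GB/s, x4 = {4 * x1:.2f} GB/s per direction")
```

This gives about 0.98 GB/s per lane for PCIe 3.0 and about 1.97 GB/s for PCIe 4.0, in each direction. Because the link is full-duplex, counting both directions together doubles these figures.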