PUBLIC ARCHITECTURE OVERVIEW

NeuraEdge NPU.
Architecture at a glance.

Configurable, energy-efficient neural network inference accelerator for edge computing. Tiled systolic array with adaptive sparsity, packet-switched on-chip interconnect, and fine-grained power management. Delivered as portable, process-agnostic RTL.

Weight-stationary dataflow | 2D mesh NoC | INT8 compute · INT32 accum | 5 power domains · 8 states | IEEE 1149.1 JTAG

BLOCK DIAGRAM

+----------------------------------------------------------------------------------------+
|                              NeuraEdge NPU                                             |
|                                                                                        |
|  +-------------------+    +-------------------+    +-------------------+               |
|  |    Compute Tile   |    |    Compute Tile   |    |    Compute Tile   |    ...        |
|  |  +-------------+  |    |  +-------------+  |    |  +-------------+  |               |
|  |  | PE Array    |  |    |  | PE Array    |  |    |  | PE Array    |  |               |
|  |  | (Systolic)  |  |    |  | (Systolic)  |  |    |  | (Systolic)  |  |               |
|  |  +------+------+  |    |  +------+------+  |    |  +------+------+  |               |
|  |         | NoC     |    |         | NoC     |    |         | NoC     |               |
|  |  +------v------+  |    |  +------v------+  |    |  +------v------+  |               |
|  |  | NoC Router  |  |    |  | NoC Router  |  |    |  | NoC Router  |  |               |
|  |  +-------------+  |    |  +-------------+  |    |  +-------------+  |               |
|  +--------+----------+    +--------+----------+    +--------+----------+               |
|           |                        |                        |                          |
|  +--------+------------------------------------------------+-------------------------+ |
|  |                    2D Mesh Network-on-Chip               |                         | |
|  |              (Packet-switched, credit-based flow)        |                         | |
|  +---------------------------------------------------------+-------------------------+ |
|                                                                                        |
|  +----------------------------------------------------------------------------------+  |
|  |                           Control & Configuration                                 |  |
|  |  +------------------+    +------------------+    +---------------------------+   |  |
|  |  | Host Interface   |--->| Layer Sequencer  |--->| Tile Configuration Fabric |   |  |
|  |  | (APB/AXI-Lite)   |    | (Descriptor Eng) |    | (Broadcast/Multicast)     |   |  |
|  |  +------------------+    +--------+---------+    +---------------------------+   |  |
|  |                                   |                                              |  |
|  |                          +--------v---------+                                    |  |
|  |                          |   DMA Engine     |                                    |  |
|  |                          | (AXI4 Master)    |                                    |  |
|  |                          +------------------+                                    |  |
|  +----------------------------------------------------------------------------------+  |
|                                                                                        |
|  +----------------------------------------------------------------------------------+  |
|  |                           Memory Subsystem                                         |  |
|  |  +------------------+    +------------------+    +---------------------------+   |  |
|  |  | Local Scratchpad |    | Buffer Hierarchy |    | External Memory Interface |   |  |
|  |  | (per-PE)         |    | (per-tile)       |    | (AXI4)                    |   |  |
|  |  +------------------+    +------------------+    +---------------------------+   |  |
|  +----------------------------------------------------------------------------------+  |
|                                                                                        |
|  +----------------------------------------------------------------------------------+  |
|  |                           Power Management                                         |  |
|  |  +------------------+    +------------------+    +---------------------------+   |  |
|  |  | Clock Gating     |    | Power Gating     |    | Sparsity-Driven           |   |  |
|  |  | (per-lane)       |    | (per-tile)       |    | Power Optimization        |   |  |
|  |  +------------------+    +------------------+    +---------------------------+   |  |
|  +----------------------------------------------------------------------------------+  |
|                                                                                        |
|  +----------------------------------------------------------------------------------+  |
|  |                           DFT / Test Infrastructure                                |  |
|  |  +------------------+    +------------------+    +---------------------------+   |  |
|  |  | JTAG TAP         |    | Scan Chains      |    | Memory BIST               |   |  |
|  |  | (IEEE 1149.1)    |    | (per-domain)     |    | (March algorithm)         |   |  |
|  |  +------------------+    +------------------+    +---------------------------+   |  |
|  +----------------------------------------------------------------------------------+  |
|                                                                                        |
|  External Interfaces:                                                                  |
|    AXI4-Lite (CSR) ◄──► Host CPU    |    AXI4 (DMA) ◄──► External Memory              |
|    IRQ ───► Host CPU                  |    JTAG / Scan ◄──► ATE                        |
+----------------------------------------------------------------------------------------+

SECTION 3.1

NPU Compute Core

The compute fabric is organized as a configurable 2D mesh of compute tiles. Each tile contains a systolic array of Processing Elements (PEs), with each PE implementing multiple parallel MAC (Multiply-Accumulate) lanes.

Feature	Description
Dataflow	Weight-stationary systolic array — weights loaded once and held locally while activations stream through
Precision	INT8 inputs (weights and activations), INT32 accumulators with overflow protection
Scalability	Tile count and PE array dimensions are parameterizable, from a single tile to large multi-tile arrays
Pipeline	Multi-stage MAC pipeline with registered intermediate results for timing closure

Each PE integrates:

- Multiple parallel INT8 MAC lanes
- Local weight buffer (SRAM-based, 512 B per PE)
- INT32 accumulator bank
- Adaptive sparsity detection unit
- Per-lane clock gating logic
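
The weight-stationary dataflow and the INT8/INT32 arithmetic described above can be illustrated with a small software model. This is a minimal sketch, not a description of the RTL: the function names are hypothetical, "overflow protection" is assumed here to mean saturation (the overview does not say), and the zero-operand skip is only a stand-in for the sparsity detection unit.

```python
INT8_MIN, INT8_MAX = -128, 127
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def saturate_int32(value):
    """Clamp to the INT32 range (assumed meaning of 'overflow protection')."""
    return max(INT32_MIN, min(INT32_MAX, value))


def weight_stationary_matvec(weights, activation_stream):
    """Model resident weights with activations streaming past them.

    weights: list of rows, each a list of INT8 values, loaded once.
    activation_stream: iterable of INT8 activation vectors.
    Returns (results, skipped): one list of INT32 accumulator values per
    activation vector, and the count of MACs skipped on a zero operand.
    """
    for row in weights:
        for w in row:
            if not INT8_MIN <= w <= INT8_MAX:
                raise ValueError("weight out of INT8 range")

    results, skipped = [], 0
    for acts in activation_stream:
        accs = []
        for row in weights:  # the same resident weights serve every vector
            acc = 0
            for w, a in zip(row, acts):
                if w == 0 or a == 0:
                    skipped += 1  # a zero operand contributes nothing
                    continue
                acc = saturate_int32(acc + w * a)
            accs.append(acc)
        results.append(accs)
    return results, skipped
```

Calling `weight_stationary_matvec([[1, 0, -2], [3, 4, 0]], [[10, 20, 30], [0, 5, -1]])` reuses the same two weight rows for both activation vectors and reports how many of the twelve MACs had a zero operand and could have been gated.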

SECTION 3.2

Memory Subsystem

The memory architecture follows a distributed, hierarchical model designed to minimize off-chip bandwidth and maximize data locality.

Level	Scope	Purpose
Local Scratchpad	Per-PE	Holds resident weights for the systolic array; eliminates redundant memory fetches
Tile Buffer	Per-tile	Intermediate storage for activations and partial results shared across PEs within a tile
External Interface	Chip-level	AXI4 master interface for access to off-chip system memory (LPDDR, SRAM, etc.)

The DMA engine manages data movement between external memory and on-chip buffers through dedicated channels for weights, activations, and output results.
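
To get a feel for how the per-PE weight storage bounds weight residency, the following rough estimate may help. It is a sketch under stated assumptions: the 512 B figure comes from the PE description in Section 3.1, while the array dimensions, the function name, and the assumption that every buffer byte holds a distinct weight (no replication, no double buffering) are illustrative only.

```python
import math

WEIGHT_BUFFER_BYTES_PER_PE = 512  # from the PE description in Section 3.1
BYTES_PER_INT8_WEIGHT = 1


def weight_load_passes(layer_weight_count, pe_rows, pe_cols, tiles=1):
    """Estimate how many weight-load passes one layer needs.

    Assumes each PE buffer is filled with distinct weights; a real
    mapping may replicate weights or reserve space, so treat the
    result as a lower bound.
    """
    resident_weights = (
        tiles * pe_rows * pe_cols
        * WEIGHT_BUFFER_BYTES_PER_PE // BYTES_PER_INT8_WEIGHT
    )
    return math.ceil(layer_weight_count / resident_weights)
```

For example, a 3x3 convolution with 64 input and 64 output channels has 36,864 weights. A hypothetical single tile of 8x8 PEs holds 32,768 weights, so the layer needs two passes; two such tiles hold it in one.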

SECTION 3.3

Network-on-Chip (NoC)

A packet-switched 2D mesh interconnect provides deterministic, deadlock-free communication between compute tiles and the control plane.

Feature	Description
Topology	2D mesh with configurable dimensions
Router Ports	5 ports per router (North, East, South, West, Local)
Routing	Dimension-order (XY) routing — deterministic and deadlock-free
Flow Control	Credit-based, with backpressure to prevent buffer overflow
Arbitration	Round-robin per output port
Flit Width	Configurable; sized to match system bandwidth requirements

The NoC supports both unicast and multicast traffic patterns, enabling efficient weight broadcasting and result collection across the tile array.
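
Dimension-order routing and credit-based flow control are both simple enough to sketch in a few lines. The model below is illustrative only: the coordinate convention (increasing y is "North"), the function and class names, and the buffer depth are assumptions, not details of the NeuraEdge router.

```python
def xy_route(src, dst):
    """Return the output ports taken from src to dst under XY routing.

    Coordinates are (x, y) tuples. The packet travels along X until the
    column matches, then along Y, and never turns from Y back to X.
    """
    x, y = src
    dst_x, dst_y = dst
    hops = []
    while x != dst_x:
        if dst_x > x:
            x += 1
            hops.append("East")
        else:
            x -= 1
            hops.append("West")
    while y != dst_y:
        if dst_y > y:
            y += 1
            hops.append("North")
        else:
            y -= 1
            hops.append("South")
    hops.append("Local")  # eject at the destination router
    return hops


class CreditCounter:
    """Sender-side credit tracking for one output port."""

    def __init__(self, downstream_buffer_depth):
        self.credits = downstream_buffer_depth

    def can_send(self):
        return self.credits > 0

    def send_flit(self):
        if not self.can_send():
            raise RuntimeError("no credits: sender must stall")
        self.credits -= 1  # one downstream buffer slot is now in use

    def credit_returned(self):
        self.credits += 1  # downstream freed a slot
```

Because a packet never turns from the Y dimension back to X, the channel dependencies cannot form a cycle, which is why XY routing on a mesh is deadlock-free. The credit counter shows the other half: a sender stalls as soon as the downstream buffer is full, so flits are never dropped.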

SECTION 3.4

Control & Configuration

Host CPU ──[APB/AXI-Lite]──► CSR Bridge ──► Layer Sequencer ──► Tile Config Fabric ──► Compute Tiles
                                      │
                                      └──► DMA Engine ──[AXI4]──► External Memory

Component	Function
Host Interface	Standard bus interface (APB or AXI-Lite) for register access from the host processor
Layer Sequencer	Descriptor-based execution engine that parses layer configurations, orchestrates DMA transfers, and manages tile synchronization
Tile Configuration Fabric	Broadcast and multicast distribution of enable signals, PE masks, sparsity modes, and precision settings to all tiles
Descriptor Format	Fixed-width layer descriptors encoding operation type, tensor dimensions, memory addresses, and configuration flags

An optional embedded RISC-V core can be integrated to handle control-plane tasks autonomously, offloading the host processor.
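
As an illustration of what a fixed-width layer descriptor might look like from the host side, the sketch below packs and unpacks one with Python's `struct` module. The 32-byte layout, the field names, and the field widths are entirely hypothetical; the actual descriptor format is part of the full specification and is not described here.

```python
import struct
from collections import namedtuple

# Hypothetical 32-byte little-endian layout (NOT the real format):
#   op_type, flags, height, width, in_channels, out_channels : uint16 each
#   weight_addr, input_addr, output_addr                     : uint32 each
#   8 reserved padding bytes
_LAYOUT = "<HHHHHHIII8x"
DESCRIPTOR_SIZE = struct.calcsize(_LAYOUT)

LayerDescriptor = namedtuple(
    "LayerDescriptor",
    "op_type flags height width in_channels out_channels "
    "weight_addr input_addr output_addr",
)


def pack_descriptor(desc):
    """Serialize a descriptor into its fixed-width byte form."""
    return struct.pack(_LAYOUT, *desc)


def unpack_descriptor(raw):
    """Parse a fixed-width descriptor, as a sequencer model would."""
    if len(raw) != DESCRIPTOR_SIZE:
        raise ValueError("descriptor must be exactly DESCRIPTOR_SIZE bytes")
    return LayerDescriptor(*struct.unpack(_LAYOUT, raw))
```

A fixed width means the sequencer can fetch descriptor N from `base + N * DESCRIPTOR_SIZE` without parsing the ones before it, which is one common reason to prefer this style over variable-length command streams.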

SECTION 3.5

External Interfaces

Interface	Protocol	Direction	Purpose
CSR Bus	APB / AXI-Lite	Slave	Host CPU access to control and status registers
DMA Bus	AXI4	Master	High-bandwidth data transfers to/from external memory
Interrupt	Single-line	Output	Layer completion, error, and status notifications to host
JTAG	IEEE 1149.1	Bidirectional	Debug access and boundary scan
Scan	Standard scan	Bidirectional	Production test access via ATE

SECTION 3.6

Power Management

NeuraEdge implements a multi-tier power management architecture. Power state transitions follow a sequenced protocol: save state → isolate outputs → power down → power up → restore state → release isolation.

Technique	Granularity	Description
Per-Lane Clock Gating	Individual MAC lane	Independent clock gate per MAC lane, driven by activity and sparsity signals
Per-Tile Power Gating	Entire compute tile	Full power domain isolation with state retention; tiles can be independently powered down
Sparsity-Driven Optimization	Per-PE	Adaptive detection of data sparsity reduces the number of active MAC lanes, lowering dynamic power in proportion to the lanes gated
Idle-State Gating	Chip-level	All clocks gated when no work is pending; leakage minimized through retention flops
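
The sequenced power-transition protocol above (save state, isolate outputs, power down, power up, restore state, release isolation) can be modelled as a strict step sequence. The class below is a minimal sketch for reasoning about ordering; the class name and step identifiers are hypothetical and do not correspond to documented registers or signals.

```python
SEQUENCE = (
    "save_state",
    "isolate_outputs",
    "power_down",
    "power_up",
    "restore_state",
    "release_isolation",
)


class TilePowerSequencer:
    """Software model of one tile's power-state ordering."""

    def __init__(self):
        self.next_index = 0  # index into SEQUENCE of the next legal step
        self.powered = True
        self.isolated = False

    def step(self, name):
        """Apply one step, rejecting anything out of order."""
        expected = SEQUENCE[self.next_index]
        if name != expected:
            raise RuntimeError(f"illegal step {name!r}; expected {expected!r}")
        if name == "isolate_outputs":
            self.isolated = True
        elif name == "power_down":
            self.powered = False
        elif name == "power_up":
            self.powered = True
        elif name == "release_isolation":
            self.isolated = False
        self.next_index = (self.next_index + 1) % len(SEQUENCE)
```

The ordering matters: outputs are isolated before power is removed so that floating signals from the gated tile cannot reach powered logic, and isolation is released only after state has been restored.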

SECTION 3.7

DFT / Test Infrastructure

Feature	Standard	Description
JTAG TAP	IEEE 1149.1	Full TAP controller with standard instructions (EXTEST, SAMPLE/PRELOAD, IDCODE, BYPASS, RUNBIST, CLAMP, HIGHZ)
Scan Chains	Industry-standard	Partitioned scan chains respecting tile and domain boundaries; scan-enabled flops throughout
Memory BIST	March algorithm	Built-in self-test for all on-chip SRAMs with pass/fail reporting and failure address capture
At-Speed Testing	OPCG	On-chip clock generation for transition fault testing at functional frequency
Test Modes	Multiple	Functional, scan shift, scan capture (slow and at-speed), MBIST, and JTAG modes
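
The table names a March algorithm for memory BIST without saying which variant is used. As an illustration, the sketch below runs March C-, a widely used variant, against a simple bit-addressable memory model with an injectable stuck-at fault. The choice of March C-, the memory model, and all names are assumptions made for the example.

```python
class FaultyMemory:
    """Bit-addressable memory model with optional stuck-at faults."""

    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        self.stuck_at = stuck_at or {}  # {address: forced bit value}

    def write(self, addr, bit):
        self.cells[addr] = self.stuck_at.get(addr, bit)

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])


def march_c_minus(mem, size):
    """Run March C- and return (passed, first_failing_address)."""
    up = range(size)
    down = range(size - 1, -1, -1)
    # Each element: (address order, expected read or None, write or None)
    elements = [
        (up, None, 0),    # write 0 everywhere (order is arbitrary)
        (up, 0, 1),       # ascending: read 0, write 1
        (up, 1, 0),       # ascending: read 1, write 0
        (down, 0, 1),     # descending: read 0, write 1
        (down, 1, 0),     # descending: read 1, write 0
        (up, 0, None),    # final read 0 (order is arbitrary)
    ]
    for order, expect, write_value in elements:
        for addr in order:
            if expect is not None and mem.read(addr) != expect:
                return False, addr  # pass/fail plus failing address
            if write_value is not None:
                mem.write(addr, write_value)
    return True, None
```

A fault-free memory passes, while a cell stuck at 1 fails on the first read-0 pass and a cell stuck at 0 fails on the first read-1 pass. In both cases the failing address is returned, mirroring the pass/fail reporting and failure address capture listed in the table.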

ARCHITECTURE STYLE SUMMARY

Design decisions at a glance.

Aspect	Approach
Compute Model	Tiled systolic array with weight-stationary dataflow
Interconnect	2D mesh NoC, packet-switched, dimension-order routing
Memory	Distributed hierarchical: per-PE scratchpad → per-tile buffer → external via AXI4
Control	Descriptor-based layer sequencer with DMA orchestration
Precision	INT8 compute, INT32 accumulation
Sparsity	Adaptive structured sparsity with hardware skip logic
Power	Multi-granularity: per-lane clock gating, per-tile power gating, sparsity-driven optimization
Testability	IEEE 1149.1 JTAG, partitioned scan chains, memory BIST, at-speed test support
Integration	Standard AMBA interfaces (APB/AXI-Lite for control, AXI4 for data); process-agnostic RTL

TARGET APPLICATIONS

Edge AI inference workloads.

Always-on sensing: Keyword spotting, voice activity detection, wake-word recognition

Vision at the edge: Image classification, object detection, anomaly detection, visual wake word

Industrial IoT: Predictive maintenance, quality inspection, sensor fusion

Robotics and drones: On-board perception, navigation, obstacle avoidance

Smart infrastructure: Occupancy detection, energy management, traffic monitoring

Evaluate the full architecture.

This page covers the public architecture overview. The complete technical specification — including register maps, timing diagrams, and integration constraints — is available post-NDA.
