PUBLIC ARCHITECTURE OVERVIEW
NeuraEdge NPU.
Architecture at a glance.
Configurable, energy-efficient neural network inference accelerator for edge computing. Tiled systolic array with adaptive sparsity, packet-switched on-chip interconnect, and fine-grained power management. Delivered as portable, process-agnostic RTL.
BLOCK DIAGRAM
+----------------------------------------------------------------------------------------+
|                                     NeuraEdge NPU                                      |
|                                                                                        |
|  +-------------------+   +-------------------+   +-------------------+                 |
|  |   Compute Tile    |   |   Compute Tile    |   |   Compute Tile    |  ...            |
|  |  +-------------+  |   |  +-------------+  |   |  +-------------+  |                 |
|  |  |  PE Array   |  |   |  |  PE Array   |  |   |  |  PE Array   |  |                 |
|  |  | (Systolic)  |  |   |  | (Systolic)  |  |   |  | (Systolic)  |  |                 |
|  |  +------+------+  |   |  +------+------+  |   |  +------+------+  |                 |
|  |         | NoC     |   |         | NoC     |   |         | NoC     |                 |
|  |  +------v------+  |   |  +------v------+  |   |  +------v------+  |                 |
|  |  | NoC Router  |  |   |  | NoC Router  |  |   |  | NoC Router  |  |                 |
|  |  +-------------+  |   |  +-------------+  |   |  +-------------+  |                 |
|  +--------+----------+   +--------+----------+   +--------+----------+                 |
|           |                       |                       |                            |
|  +--------+-----------------------------------------------+-------------------------+  |
|  |                             2D Mesh Network-on-Chip                              |  |
|  |                       (Packet-switched, credit-based flow)                       |  |
|  +--------------------------------------------------------+-------------------------+  |
|                                                                                        |
|  +----------------------------------------------------------------------------------+  |
|  |                             Control & Configuration                              |  |
|  |  +------------------+    +------------------+    +---------------------------+   |  |
|  |  | Host Interface   |--->| Layer Sequencer  |--->| Tile Configuration Fabric |   |  |
|  |  | (APB/AXI-Lite)   |    | (Descriptor Eng) |    | (Broadcast/Multicast)     |   |  |
|  |  +------------------+    +--------+---------+    +---------------------------+   |  |
|  |                                   |                                              |  |
|  |                          +--------v---------+                                    |  |
|  |                          | DMA Engine       |                                    |  |
|  |                          | (AXI4 Master)    |                                    |  |
|  |                          +------------------+                                    |  |
|  +----------------------------------------------------------------------------------+  |
|                                                                                        |
|  +----------------------------------------------------------------------------------+  |
|  |                                 Memory Subsystem                                 |  |
|  |  +------------------+    +------------------+    +---------------------------+   |  |
|  |  | Local Scratchpad |    | Buffer Hierarchy |    | External Memory Interface |   |  |
|  |  | (per-PE)         |    | (per-tile)       |    | (AXI4)                    |   |  |
|  |  +------------------+    +------------------+    +---------------------------+   |  |
|  +----------------------------------------------------------------------------------+  |
|                                                                                        |
|  +----------------------------------------------------------------------------------+  |
|  |                                 Power Management                                 |  |
|  |  +------------------+    +------------------+    +---------------------------+   |  |
|  |  | Clock Gating     |    | Power Gating     |    | Sparsity-Driven           |   |  |
|  |  | (per-lane)       |    | (per-tile)       |    | Power Optimization        |   |  |
|  |  +------------------+    +------------------+    +---------------------------+   |  |
|  +----------------------------------------------------------------------------------+  |
|                                                                                        |
|  +----------------------------------------------------------------------------------+  |
|  |                            DFT / Test Infrastructure                             |  |
|  |  +------------------+    +------------------+    +---------------------------+   |  |
|  |  | JTAG TAP         |    | Scan Chains      |    | Memory BIST               |   |  |
|  |  | (IEEE 1149.1)    |    | (per-domain)     |    | (March algorithm)         |   |  |
|  |  +------------------+    +------------------+    +---------------------------+   |  |
|  +----------------------------------------------------------------------------------+  |
|                                                                                        |
|  External Interfaces:                                                                  |
|    AXI4-Lite (CSR) ◄──► Host CPU   |   AXI4 (DMA)  ◄──► External Memory                |
|    IRQ             ◄──► Host CPU   |   JTAG / Scan ◄──► ATE                            |
+----------------------------------------------------------------------------------------+
SECTION 3.1
NPU Compute Core
The compute fabric is organized as a configurable 2D mesh of compute tiles. Each tile contains a systolic array of Processing Elements (PEs), with each PE implementing multiple parallel MAC (Multiply-Accumulate) lanes.
| Feature | Description |
|---|---|
| Dataflow | Weight-stationary systolic array — weights loaded once and held locally while activations stream through |
| Precision | INT8 inputs (weights and activations), INT32 accumulators with overflow protection |
| Scalability | Tile count and PE array dimensions are parameterizable; single-tile to large multi-tile arrays |
| Pipeline | Multi-stage MAC pipeline with registered intermediate results for timing closure |
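
As a rough illustration of the weight-stationary dataflow listed in the table above, the Python sketch below models it functionally. It is not cycle-accurate and says nothing about the actual RTL; the function name `weight_stationary_matmul` and the array shapes are assumptions made for this example.

```python
def weight_stationary_matmul(activations, weights):
    """Functional model: weights[k][n] stays resident in PE (k, n) while
    activation vectors stream through the array one at a time."""
    rows, cols = len(weights), len(weights[0])

    # Load phase: each PE latches its weight exactly once.
    resident = [[weights[k][n] for n in range(cols)] for k in range(rows)]
    weight_loads = rows * cols

    outputs = []
    for act in activations:  # stream phase: one activation vector per step
        partial_sums = [0] * cols
        for k in range(rows):
            for n in range(cols):
                # Partial sums move down column n, gaining one product per PE.
                partial_sums[n] += act[k] * resident[k][n]
        outputs.append(partial_sums)
    return outputs, weight_loads


outputs, loads = weight_stationary_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(outputs, loads)
```

The weight-load count depends only on the array size, not on how many activation vectors are streamed, which is the property the weight-stationary approach relies on.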
Each PE integrates:
- Multiple parallel INT8 MAC lanes
- Local weight buffer (SRAM-based, 512 B per PE)
- INT32 accumulator bank
- Adaptive sparsity detection unit
- Per-lane clock gating logic
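The PE components listed above can be sketched as a small behavioral model in Python. Two points are assumptions, not statements from this overview: "overflow protection" is modeled here as saturation, and the sparsity unit is modeled as skipping any lane with a zero operand. The names `saturate_int32` and `pe_mac_step` are hypothetical.

```python
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def saturate_int32(value):
    """Clamp to the INT32 range (assumed form of overflow protection)."""
    return max(INT32_MIN, min(INT32_MAX, value))


def pe_mac_step(weights, activations, accumulators):
    """One step of a hypothetical PE with len(weights) parallel MAC lanes.

    Each lane multiplies a resident INT8 weight by a streamed INT8
    activation and adds the product to its INT32 accumulator (in place).
    Lanes with a zero operand are skipped, standing in for a gated clock.
    Returns the number of lanes that actually toggled.
    """
    active_lanes = 0
    for lane, (w, a) in enumerate(zip(weights, activations)):
        if not (-128 <= w <= 127 and -128 <= a <= 127):
            raise ValueError("operands must fit in INT8")
        if w == 0 or a == 0:
            continue  # product is zero: leave this lane idle
        accumulators[lane] = saturate_int32(accumulators[lane] + w * a)
        active_lanes += 1
    return active_lanes


acc = [0, 0, INT32_MAX - 10, 0]
active = pe_mac_step([1, 0, -128, 127], [5, 7, -128, 0], acc)
print(active, acc)
```

In this example only two of the four lanes do work, and the third accumulator clamps at the INT32 maximum instead of wrapping.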
SECTION 3.2
Memory Subsystem
The memory architecture follows a distributed, hierarchical model designed to minimize off-chip bandwidth and maximize data locality.
| Level | Scope | Purpose |
|---|---|---|
| Local Scratchpad | Per-PE | Holds resident weights for the systolic array; eliminates redundant memory fetches |
| Tile Buffer | Per-tile | Intermediate storage for activations and partial results shared across PEs within a tile |
| External Interface | Chip-level | AXI4 master interface for access to off-chip system memory (LPDDR, SRAM, etc.) |
The DMA engine manages data movement between external memory and on-chip buffers through dedicated channels for weights, activations, and output results.
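To give a feel for how the per-PE scratchpad bounds resident weights, the sketch below does the arithmetic in Python. The 512 B figure comes from the PE description earlier in this overview; the tile count and array dimensions in the example are made up, and the reload estimate assumes weights pack perfectly into the buffers, which a real mapping may not achieve.

```python
import math

WEIGHT_BUFFER_BYTES_PER_PE = 512  # per-PE figure from the overview; INT8 = 1 byte per weight


def resident_weight_capacity(tiles, pe_rows, pe_cols):
    """Total INT8 weights that can be held on-chip at once."""
    return tiles * pe_rows * pe_cols * WEIGHT_BUFFER_BYTES_PER_PE


def weight_reload_passes(layer_weight_count, tiles, pe_rows, pe_cols):
    """Lower bound on weight-load passes for one layer (ideal packing assumed)."""
    capacity = resident_weight_capacity(tiles, pe_rows, pe_cols)
    return math.ceil(layer_weight_count / capacity)


# Hypothetical configuration: 4 tiles, each an 8x8 PE array.
print(resident_weight_capacity(4, 8, 8))              # bytes of resident weights
print(weight_reload_passes(3 * 3 * 64 * 64, 4, 8, 8)) # a 3x3 conv, 64 -> 64 channels
```

A layer whose weights fit in one pass needs no weight traffic from external memory while its activations stream through.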
SECTION 3.3
Network-on-Chip (NoC)
A packet-switched 2D mesh interconnect provides deterministic, deadlock-free communication between compute tiles and the control plane.
| Feature | Description |
|---|---|
| Topology | 2D mesh with configurable dimensions |
| Router Ports | 5 ports per router (North, East, South, West, Local) |
| Routing | Dimension-order (XY) routing — deterministic and deadlock-free |
| Flow Control | Credit-based, with backpressure to prevent buffer overflow |
| Arbitration | Round-robin per output port |
| Flit Width | Configurable; sized to match system bandwidth requirements |
The NoC supports both unicast and multicast traffic patterns, enabling efficient weight broadcasting and result collection across the tile array.
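The routing and flow-control rows of the table can be sketched in Python as follows. This is a minimal behavioral model under stated assumptions: x is taken to grow eastward and y northward, each link's credit count equals the downstream buffer depth, and the names `output_port`, `xy_route`, and `CreditLink` are invented for the example.

```python
STEP = {"East": (1, 0), "West": (-1, 0), "North": (0, 1), "South": (0, -1)}


def output_port(current, dest):
    """Dimension-order (XY) routing: resolve the X offset first, then Y."""
    (cx, cy), (dx, dy) = current, dest
    if dx != cx:
        return "East" if dx > cx else "West"
    if dy != cy:
        return "North" if dy > cy else "South"
    return "Local"  # arrived: eject to the attached tile


def xy_route(src, dest):
    """List every router coordinate a packet visits from src to dest."""
    path, current = [src], src
    while True:
        port = output_port(current, dest)
        if port == "Local":
            return path
        sx, sy = STEP[port]
        current = (current[0] + sx, current[1] + sy)
        path.append(current)


class CreditLink:
    """Sender-side view of one link under credit-based flow control."""

    def __init__(self, buffer_depth):
        self.credits = buffer_depth  # one credit per free downstream slot

    def can_send(self):
        return self.credits > 0

    def send_flit(self):
        if not self.can_send():
            raise RuntimeError("no credits: sender must stall")
        self.credits -= 1

    def return_credit(self):
        self.credits += 1  # downstream router freed a buffer slot


print(xy_route((0, 0), (2, 1)))
```

Because a packet never turns from the Y dimension back into X, the turns that could close a dependency cycle never occur, which is why XY routing on a mesh is deadlock-free.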
SECTION 3.4
Control & Configuration
Host CPU ──[APB/AXI-Lite]──► CSR Bridge ──► Layer Sequencer ──► Tile Config Fabric ──► Compute Tiles
                                                   │
                                                   └──► DMA Engine ──[AXI4]──► External Memory

| Component | Function |
|---|---|
| Host Interface | Standard bus interface (APB or AXI-Lite) for register access from the host processor |
| Layer Sequencer | Descriptor-based execution engine that parses layer configurations, orchestrates DMA transfers, and manages tile synchronization |
| Tile Configuration Fabric | Broadcast and multicast distribution of enable signals, PE masks, sparsity modes, and precision settings to all tiles |
| Descriptor Format | Fixed-width layer descriptors encoding operation type, tensor dimensions, memory addresses, and configuration flags |
An optional embedded RISC-V core can be integrated to handle control-plane tasks autonomously, offloading the host processor.
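To make the idea of a fixed-width layer descriptor concrete, here is a Python sketch using the standard `struct` module. The field list, field widths, ordering, and the 24-byte size are all hypothetical; the actual descriptor layout is not given in this overview.

```python
import struct
from collections import namedtuple

# Hypothetical layout, little-endian, no padding:
#   B op_type, B flags, H height, H width, H in_channels, H out_channels,
#   H reserved, I input_addr, I weight_addr, I output_addr
LAYER_DESC_FORMAT = "<BBHHHHHIII"
LAYER_DESC_SIZE = struct.calcsize(LAYER_DESC_FORMAT)

LayerDescriptor = namedtuple(
    "LayerDescriptor",
    "op_type flags height width in_channels out_channels reserved "
    "input_addr weight_addr output_addr",
)


def pack_descriptor(desc):
    """Serialize a descriptor into its fixed-width byte form."""
    return struct.pack(LAYER_DESC_FORMAT, *desc)


def unpack_descriptor(raw):
    """Parse a fixed-width descriptor, as a sequencer model might."""
    if len(raw) != LAYER_DESC_SIZE:
        raise ValueError("descriptor has the wrong length")
    return LayerDescriptor(*struct.unpack(LAYER_DESC_FORMAT, raw))


conv = LayerDescriptor(
    op_type=1, flags=0b01, height=32, width=32, in_channels=16,
    out_channels=32, reserved=0, input_addr=0x8000_0000,
    weight_addr=0x8001_0000, output_addr=0x8002_0000,
)
raw = pack_descriptor(conv)
print(LAYER_DESC_SIZE, unpack_descriptor(raw) == conv)
```

A fixed width means the sequencer can fetch the next descriptor at a known offset without parsing the current one first.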
SECTION 3.5
External Interfaces
| Interface | Protocol | Direction | Purpose |
|---|---|---|---|
| CSR Bus | APB / AXI-Lite | Slave | Host CPU access to control and status registers |
| DMA Bus | AXI4 | Master | High-bandwidth data transfers to/from external memory |
| Interrupt | Single-line | Output | Layer completion, error, and status notifications to host |
| JTAG | IEEE 1149.1 | Bidirectional | Debug access and boundary scan |
| Scan | Standard scan | Bidirectional | Production test access via ATE |
SECTION 3.6
Power Management
NeuraEdge implements a multi-tier power management architecture. Power state transitions follow a sequenced protocol: save state → isolate outputs → power down → power up → restore state → release isolation.
| Technique | Granularity | Description |
|---|---|---|
| Per-Lane Clock Gating | Individual MAC lane | Independent clock gate per MAC lane, driven by activity and sparsity signals |
| Per-Tile Power Gating | Entire compute tile | Full power domain isolation with state retention; tiles can be independently powered down |
| Sparsity-Driven Optimization | Per-PE | Adaptive detection of data sparsity reduces the number of active MAC lanes, lowering dynamic power roughly in proportion to the lanes idled |
| Idle-State Gating | Chip-level | All clocks gated when no work is pending; leakage minimized through retention flops |
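
The sequenced power-state protocol described at the start of this section (save state, isolate outputs, power down, then power up, restore state, release isolation) can be modeled in a few lines of Python. The class `TilePowerDomain` and its attributes are hypothetical, and the model only captures ordering, not timing or electrical behavior.

```python
import copy


class TilePowerDomain:
    """Ordering-only model of one power-gated compute tile."""

    def __init__(self):
        self.powered = True
        self.isolated = False
        self.live_state = {"accumulators": [0, 0, 0, 0]}
        self.retained_state = None
        self.trace = []  # records the order in which steps ran

    def enter_sleep(self):
        if not self.powered:
            raise RuntimeError("tile is already powered down")
        self.retained_state = copy.deepcopy(self.live_state)
        self.trace.append("save state")
        self.isolated = True  # clamp outputs before the rail drops
        self.trace.append("isolate outputs")
        self.powered = False
        self.live_state = None  # non-retained logic loses its contents
        self.trace.append("power down")

    def exit_sleep(self):
        if self.powered:
            raise RuntimeError("tile is already powered up")
        self.powered = True
        self.trace.append("power up")
        self.live_state = copy.deepcopy(self.retained_state)
        self.trace.append("restore state")
        self.isolated = False  # outputs are valid again
        self.trace.append("release isolation")


tile = TilePowerDomain()
tile.live_state = {"accumulators": [1, 2, 3, 4]}
tile.enter_sleep()
tile.exit_sleep()
print(tile.trace)
```

Isolation brackets the whole powered-down interval, so neighbouring logic never observes outputs from a tile whose supply is off or whose state is not yet restored.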
SECTION 3.7
DFT / Test Infrastructure
| Feature | Standard | Description |
|---|---|---|
| JTAG TAP | IEEE 1149.1 | Full TAP controller with standard instructions (EXTEST, SAMPLE/PRELOAD, IDCODE, BYPASS, RUNBIST, CLAMP, HIGHZ) |
| Scan Chains | Industry-standard | Partitioned scan chains respecting tile and domain boundaries; scan-enabled flops throughout |
| Memory BIST | March algorithm | Built-in self-test for all on-chip SRAMs with pass/fail reporting and failure address capture |
| At-Speed Testing | OPCG | On-chip clock generation for transition fault testing at functional frequency |
| Test Modes | Multiple | Functional, scan shift, scan capture (slow and at-speed), MBIST, and JTAG modes |
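
The table names "March algorithm" without saying which variant the MBIST controller runs. As an illustration only, the Python sketch below uses March C-, one widely known variant, on a simulated one-bit-wide memory with optional stuck-at faults. `FaultyMemory` and `run_march` are hypothetical names, and the pass/fail flag plus failing-address list loosely mirror the reporting described above.

```python
class FaultyMemory:
    """Bit-wide memory model; stuck_at maps address -> forced value."""

    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        self.stuck_at = stuck_at or {}

    def write(self, addr, bit):
        self.cells[addr] = self.stuck_at.get(addr, bit)

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])


# March C-: (w0); up(r0,w1); up(r1,w0); down(r0,w1); down(r1,w0); (r0)
MARCH_C_MINUS = [
    ("up", [("w", 0)]),
    ("up", [("r", 0), ("w", 1)]),
    ("up", [("r", 1), ("w", 0)]),
    ("down", [("r", 0), ("w", 1)]),
    ("down", [("r", 1), ("w", 0)]),
    ("up", [("r", 0)]),
]


def run_march(memory, size, elements=MARCH_C_MINUS):
    """Run the march elements; return (passed, sorted failing addresses)."""
    failing = set()
    for direction, ops in elements:
        addrs = range(size) if direction == "up" else range(size - 1, -1, -1)
        for addr in addrs:
            for op, bit in ops:
                if op == "w":
                    memory.write(addr, bit)
                elif memory.read(addr) != bit:
                    failing.add(addr)  # capture the failing address
    return (not failing, sorted(failing))


print(run_march(FaultyMemory(16), 16))
print(run_march(FaultyMemory(16, stuck_at={5: 1}), 16))
```

Each march element visits every address in a fixed order, so the test length grows linearly with memory size.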
ARCHITECTURE STYLE SUMMARY
Design decisions at a glance.
| Aspect | Approach |
|---|---|
| Compute Model | Tiled systolic array with weight-stationary dataflow |
| Interconnect | 2D mesh NoC, packet-switched, dimension-order routing |
| Memory | Distributed hierarchical: per-PE scratchpad → per-tile buffer → external via AXI4 |
| Control | Descriptor-based layer sequencer with DMA orchestration |
| Precision | INT8 compute, INT32 accumulation |
| Sparsity | Adaptive structured sparsity with hardware skip logic |
| Power | Multi-granularity: per-lane clock gating, per-tile power gating, sparsity-driven optimization |
| Testability | IEEE 1149.1 JTAG, partitioned scan chains, memory BIST, at-speed test support |
| Integration | Standard AMBA interfaces (APB/AXI-Lite for control, AXI4 for data); process-agnostic RTL |
TARGET APPLICATIONS
Edge AI inference workloads.
Always-on sensing: Keyword spotting, voice activity detection, wake-word recognition
Vision at the edge: Image classification, object detection, anomaly detection, visual wake word
Industrial IoT: Predictive maintenance, quality inspection, sensor fusion
Robotics and drones: On-board perception, navigation, obstacle avoidance
Smart infrastructure: Occupancy detection, energy management, traffic monitoring
Evaluate the full architecture.
This page covers the public architecture overview. The complete technical specification, including register maps, timing diagrams, and integration constraints, is available under NDA.