PUBLIC ARCHITECTURE OVERVIEW

NeuraEdge NPU.
Architecture at a glance.

Configurable, energy-efficient neural network inference accelerator for edge computing. Tiled systolic array with adaptive sparsity, packet-switched on-chip interconnect, and fine-grained power management. Delivered as portable, process-agnostic RTL.

Weight-stationary dataflow | 2D mesh NoC | INT8 compute · INT32 accum | 5 power domains · 8 states | IEEE 1149.1 JTAG

BLOCK DIAGRAM

Architecture at a glance.

+----------------------------------------------------------------------------------------+
|                              NeuraEdge NPU                                             |
|                                                                                        |
|  +-------------------+    +-------------------+    +-------------------+               |
|  |    Compute Tile   |    |    Compute Tile   |    |    Compute Tile   |    ...        |
|  |  +-------------+  |    |  +-------------+  |    |  +-------------+  |               |
|  |  | PE Array    |  |    |  | PE Array    |  |    |  | PE Array    |  |               |
|  |  | (Systolic)  |  |    |  | (Systolic)  |  |    |  | (Systolic)  |  |               |
|  |  +------+------+  |    |  +------+------+  |    |  +------+------+  |               |
|  |         | NoC     |    |         | NoC     |    |         | NoC     |               |
|  |  +------v------+  |    |  +------v------+  |    |  +------v------+  |               |
|  |  | NoC Router  |  |    |  | NoC Router  |  |    |  | NoC Router  |  |               |
|  |  +-------------+  |    |  +-------------+  |    |  +-------------+  |               |
|  +--------+----------+    +--------+----------+    +--------+----------+               |
|           |                        |                        |                          |
|  +--------+------------------------------------------------+-------------------------+ |
|  |                    2D Mesh Network-on-Chip               |                         | |
|  |              (Packet-switched, credit-based flow)        |                         | |
|  +---------------------------------------------------------+-------------------------+ |
|                                                                                        |
|  +----------------------------------------------------------------------------------+  |
|  |                           Control & Configuration                                 |  |
|  |  +------------------+    +------------------+    +---------------------------+   |  |
|  |  | Host Interface   |--->| Layer Sequencer  |--->| Tile Configuration Fabric |   |  |
|  |  | (APB/AXI-Lite)   |    | (Descriptor Eng) |    | (Broadcast/Multicast)     |   |  |
|  |  +------------------+    +--------+---------+    +---------------------------+   |  |
|  |                            |                                                    |  |
|  |                   +--------v---------+                                          |  |
|  |                   |   DMA Engine     |                                          |  |
|  |                   | (AXI4 Master)    |                                          |  |
|  |                   +------------------+                                          |  |
|  +----------------------------------------------------------------------------------+  |
|                                                                                        |
|  +----------------------------------------------------------------------------------+  |
|  |                           Memory Subsystem                                         |  |
|  |  +------------------+    +------------------+    +---------------------------+   |  |
|  |  | Local Scratchpad |    | Buffer Hierarchy |    | External Memory Interface |   |  |
|  |  | (per-PE)         |    | (per-tile)       |    | (AXI4)                    |   |  |
|  |  +------------------+    +------------------+    +---------------------------+   |  |
|  +----------------------------------------------------------------------------------+  |
|                                                                                        |
|  +----------------------------------------------------------------------------------+  |
|  |                           Power Management                                         |  |
|  |  +------------------+    +------------------+    +---------------------------+   |  |
|  |  | Clock Gating     |    | Power Gating     |    | Sparsity-Driven           |   |  |
|  |  | (per-lane)       |    | (per-tile)       |    | Power Optimization        |   |  |
|  |  +------------------+    +------------------+    +---------------------------+   |  |
|  +----------------------------------------------------------------------------------+  |
|                                                                                        |
|  +----------------------------------------------------------------------------------+  |
|  |                           DFT / Test Infrastructure                                |  |
|  |  +------------------+    +------------------+    +---------------------------+   |  |
|  |  | JTAG TAP         |    | Scan Chains      |    | Memory BIST               |   |  |
|  |  | (IEEE 1149.1)    |    | (per-domain)     |    | (March algorithm)         |   |  |
|  |  +------------------+    +------------------+    +---------------------------+   |  |
|  +----------------------------------------------------------------------------------+  |
|                                                                                        |
|  External Interfaces:                                                                  |
|    AXI4-Lite (CSR) ◄──► Host CPU    |    AXI4 (DMA) ◄──► External Memory              |
|    IRQ ───► Host CPU                  |    JTAG / Scan ◄──► ATE                        |
+----------------------------------------------------------------------------------------+

SECTION 3.1

NPU Compute Core

The compute fabric is organized as a configurable 2D mesh of compute tiles. Each tile contains a systolic array of Processing Elements (PEs), with each PE implementing multiple parallel MAC (Multiply-Accumulate) lanes.

Feature | Description
Dataflow | Weight-stationary systolic array: weights are loaded once and held locally while activations stream through
Precision | INT8 inputs (weights and activations), INT32 accumulators with overflow protection
Scalability | Tile count and PE array dimensions are parameterizable, from a single tile to large multi-tile arrays
Pipeline | Multi-stage MAC pipeline with registered intermediate results for timing closure

Each PE integrates:

- Multiple parallel INT8 MAC lanes
- Local weight buffer (SRAM-based, 512 B per PE)
- INT32 accumulator bank
- Adaptive sparsity detection unit
- Per-lane clock gating logic
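
The weight-stationary dataflow described above can be modeled in a few lines of software. The sketch below is illustrative only and is not the RTL behavior: the array shape, the function names, and the use of saturation as the form of "overflow protection" are all assumptions, since this overview does not specify them.

```python
# Illustrative model only; names and the saturation policy are assumptions.
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def saturate_int32(value):
    """Clamp to the INT32 range (one possible form of overflow protection)."""
    return max(INT32_MIN, min(INT32_MAX, value))


def check_int8(values):
    """Reject operands that do not fit in a signed 8-bit integer."""
    for v in values:
        if not -128 <= v <= 127:
            raise ValueError(f"{v} is outside the INT8 range")


def weight_stationary_matvec(weights, activation_stream):
    """weights[r][c] stays resident in PE (r, c); activations stream through.

    Each activation vector carries one entry per PE row. Column c
    accumulates sum_r weights[r][c] * act[r] in an INT32 accumulator.
    """
    rows, cols = len(weights), len(weights[0])
    for row in weights:
        check_int8(row)
    outputs = []
    for act in activation_stream:  # weights are loaded once and reused
        check_int8(act)
        acc = [0] * cols
        for r in range(rows):
            for c in range(cols):
                acc[c] = saturate_int32(acc[c] + weights[r][c] * act[r])
        outputs.append(acc)
    return outputs


print(weight_stationary_matvec([[1, 2], [3, 4]], [[1, 1], [0, 2]]))
# [[4, 6], [6, 8]]
```

The point of the model is that the weight matrix is passed once and reused for every activation vector in the stream, which is where the saving in memory fetches comes from.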

SECTION 3.2

Memory Subsystem

The memory architecture follows a distributed, hierarchical model designed to minimize off-chip bandwidth and maximize data locality.

Level | Scope | Purpose
Local Scratchpad | Per-PE | Holds resident weights for the systolic array; eliminates redundant memory fetches
Tile Buffer | Per-tile | Intermediate storage for activations and partial results shared across PEs within a tile
External Interface | Chip-level | AXI4 master interface for access to off-chip system memory (LPDDR, SRAM, etc.)

The DMA engine manages data movement between external memory and on-chip buffers through dedicated channels for weights, activations, and output results.

SECTION 3.3

Network-on-Chip (NoC)

A packet-switched 2D mesh interconnect provides deterministic, deadlock-free communication between compute tiles and the control plane.

Feature | Description
Topology | 2D mesh with configurable dimensions
Router Ports | 5 ports per router (North, East, South, West, Local)
Routing | Dimension-order (XY) routing, deterministic and deadlock-free
Flow Control | Credit-based; backpressure stalls the sender instead of overflowing downstream buffers
Arbitration | Round-robin per output port
Flit Width | Configurable; sized to match system bandwidth requirements
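
Dimension-order (XY) routing, as listed in the table, resolves the X offset completely before moving in Y. A minimal sketch follows; the coordinate convention (x grows eastward, y grows northward) and the function names are assumptions made for illustration.

```python
def output_port(current, destination):
    """Pick the router output port for a flit under XY routing.

    Assumption: x increases toward East and y increases toward North.
    """
    cx, cy = current
    dx, dy = destination
    if dx > cx:
        return "East"
    if dx < cx:
        return "West"
    if dy > cy:
        return "North"
    if dy < cy:
        return "South"
    return "Local"  # already at the destination tile


def xy_route(source, destination):
    """Return every router coordinate visited, X dimension first, then Y."""
    x, y = source
    dx, dy = destination
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))
# [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because every packet between the same pair of tiles takes the same path and never turns from Y back to X, the route is deterministic and cannot form a cyclic channel dependency.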

The NoC supports both unicast and multicast traffic patterns, enabling efficient weight broadcasting and result collection across the tile array.
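
Credit-based flow control on a single router-to-router link can be sketched as below. This is a generic model of the technique, not the NeuraEdge implementation: the class name, the buffer depth, and the one-credit-per-flit accounting are assumptions.

```python
from collections import deque


class CreditLink:
    """One direction of a link with credit-based flow control (illustrative)."""

    def __init__(self, buffer_depth):
        self.credits = buffer_depth  # sender's count of free downstream slots
        self.rx_buffer = deque()     # receiver's input buffer

    def try_send(self, flit):
        """Send only if a credit is available; otherwise the sender stalls."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.rx_buffer.append(flit)
        return True

    def drain_one(self):
        """Receiver forwards one flit and returns a credit to the sender."""
        if not self.rx_buffer:
            return None
        flit = self.rx_buffer.popleft()
        self.credits += 1
        return flit


link = CreditLink(buffer_depth=2)
print(link.try_send("a"), link.try_send("b"), link.try_send("c"))
# True True False
```

The third send is refused because both downstream slots are occupied; once the receiver drains a flit, the returned credit lets the sender proceed.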

SECTION 3.4

Control & Configuration

Host CPU ──[APB/AXI-Lite]──► CSR Bridge ──► Layer Sequencer ──► Tile Config Fabric ──► Compute Tiles
                                                   │
                                                   └──► DMA Engine ──[AXI4]──► External Memory

Component | Function
Host Interface | Standard bus interface (APB or AXI-Lite) for register access from the host processor
Layer Sequencer | Descriptor-based execution engine that parses layer configurations, orchestrates DMA transfers, and manages tile synchronization
Tile Configuration Fabric | Broadcast and multicast distribution of enable signals, PE masks, sparsity modes, and precision settings to all tiles
Descriptor Format | Fixed-width layer descriptors encoding operation type, tensor dimensions, memory addresses, and configuration flags
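
To make the idea of a fixed-width layer descriptor concrete, the sketch below packs one into 32 bytes. The layout is entirely hypothetical: the field names, widths, ordering, byte order, and the 32-byte size are assumptions for illustration, and the actual format is defined only in the full specification.

```python
import struct
from dataclasses import dataclass, astuple

# Hypothetical layout (little-endian, 32 bytes including 10 pad bytes):
# op_type:u8 flags:u8 in_height:u16 in_width:u16 in_channels:u16
# out_channels:u16 weight_addr:u32 input_addr:u32 output_addr:u32
DESCRIPTOR_FORMAT = "<BBHHHHIII10x"
DESCRIPTOR_SIZE = struct.calcsize(DESCRIPTOR_FORMAT)


@dataclass
class LayerDescriptor:
    op_type: int        # operation type code (values are illustrative)
    flags: int          # configuration flags bitfield
    in_height: int      # tensor dimensions
    in_width: int
    in_channels: int
    out_channels: int
    weight_addr: int    # memory addresses for the DMA engine
    input_addr: int
    output_addr: int

    def pack(self):
        """Serialize to the fixed-width binary form."""
        return struct.pack(DESCRIPTOR_FORMAT, *astuple(self))

    @classmethod
    def unpack(cls, raw):
        """Parse a fixed-width descriptor; pad bytes are skipped."""
        return cls(*struct.unpack(DESCRIPTOR_FORMAT, raw))


desc = LayerDescriptor(1, 0b11, 32, 32, 3, 16, 0x1000, 0x2000, 0x3000)
raw = desc.pack()
print(len(raw), LayerDescriptor.unpack(raw) == desc)
# 32 True
```

A fixed width lets a sequencer fetch descriptor N at base address + N × size without parsing the preceding entries.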

An optional embedded RISC-V core can be integrated to handle control-plane tasks autonomously, offloading the host processor.

SECTION 3.5

External Interfaces

Interface | Protocol | Direction | Purpose
CSR Bus | APB / AXI-Lite | Slave | Host CPU access to control and status registers
DMA Bus | AXI4 | Master | High-bandwidth data transfers to/from external memory
Interrupt | Single-line | Output | Layer completion, error, and status notifications to host
JTAG | IEEE 1149.1 | Bidirectional | Debug access and boundary scan
Scan | Standard scan | Bidirectional | Production test access via ATE

SECTION 3.6

Power Management

NeuraEdge implements a multi-tier power management architecture. Power state transitions follow a sequenced protocol: save state → isolate outputs → power down → power up → restore state → release isolation.
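
The sequenced protocol above can be expressed as a small ordering check. The sketch below only enforces the step order stated in the text; the class name, step identifiers, and error handling are assumptions, and it says nothing about timing or the hardware handshake.

```python
# Step order taken from the text; identifiers are illustrative.
SEQUENCE = (
    "save_state",
    "isolate_outputs",
    "power_down",
    "power_up",
    "restore_state",
    "release_isolation",
)


class TilePowerSequencer:
    """Accepts power-transition steps only in the documented order."""

    def __init__(self):
        self.next_index = 0

    @property
    def powered(self):
        # The tile is off only between "power_down" and "power_up".
        return SEQUENCE[self.next_index] != "power_up"

    def step(self, name):
        expected = SEQUENCE[self.next_index]
        if name != expected:
            raise RuntimeError(
                f"out-of-order step {name!r}; expected {expected!r}"
            )
        # Wrap around so the tile can be cycled again later.
        self.next_index = (self.next_index + 1) % len(SEQUENCE)


seq = TilePowerSequencer()
for name in SEQUENCE[:3]:
    seq.step(name)
print(seq.powered)
# False
```

Attempting, for example, `restore_state` before `power_up` raises an error, which mirrors the requirement that state is restored only after the domain is powered again.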

Technique | Granularity | Description
Per-Lane Clock Gating | Individual MAC lane | Independent clock gate per MAC lane, driven by activity and sparsity signals
Per-Tile Power Gating | Entire compute tile | Full power domain isolation with state retention; tiles can be independently powered down
Sparsity-Driven Optimization | Per-PE | Adaptive detection of data sparsity reduces active MAC lanes, lowering dynamic power proportionally
Idle-State Gating | Chip-level | All clocks gated when no work is pending; leakage minimized through retention flops
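
The interaction between sparsity detection and per-lane clock gating can be illustrated with a simple model. The gating criterion used here (a lane is gated when either operand is zero) and the function names are assumptions; the overview does not state the exact detection rule.

```python
def lane_clock_enables(activations, weights):
    """Per-lane enable: gate the lane when either operand is zero (assumed rule)."""
    return [a != 0 and w != 0 for a, w in zip(activations, weights)]


def gated_mac(activations, weights):
    """Accumulate only on enabled lanes; report the fraction of active lanes."""
    enables = lane_clock_enables(activations, weights)
    if not enables:
        return 0, 0.0
    acc = sum(
        a * w for a, w, en in zip(activations, weights, enables) if en
    )
    return acc, sum(enables) / len(enables)


print(gated_mac([0, 3, 5, 0], [2, 0, 4, 7]))
# (20, 0.25)
```

Skipping a lane with a zero operand does not change the result, since its product is zero, so the active-lane fraction is a rough proxy for the dynamic power the MAC lanes would draw.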

SECTION 3.7

DFT / Test Infrastructure

Feature | Standard | Description
JTAG TAP | IEEE 1149.1 | Full TAP controller with standard instructions (EXTEST, SAMPLE/PRELOAD, IDCODE, BYPASS, RUNBIST, CLAMP, HIGHZ)
Scan Chains | Industry-standard | Partitioned scan chains respecting tile and domain boundaries; scan-enabled flops throughout
Memory BIST | March algorithm | Built-in self-test for all on-chip SRAMs with pass/fail reporting and failure address capture
At-Speed Testing | OPCG | On-chip clock generation for transition fault testing at functional frequency
Test Modes | Multiple | Functional, scan shift, scan capture (slow and at-speed), MBIST, and JTAG modes
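
The table does not say which March variant the memory BIST uses. As an illustration of the general approach, the sketch below runs March C-, one widely used variant, against a software memory model with an injected stuck-at fault. The memory model, fault injection, and single-bit cells are assumptions made for the example.

```python
class FaultyMemory:
    """Bit-per-cell memory model with optional stuck-at faults (illustrative)."""

    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        self.stuck_at = stuck_at or {}  # address -> value the cell is stuck at

    def write(self, addr, value):
        self.cells[addr] = self.stuck_at.get(addr, value)

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])


# March C-: (w0); up(r0,w1); up(r1,w0); down(r0,w1); down(r1,w0); (r0)
MARCH_C_MINUS = [
    ("up", [("w", 0)]),
    ("up", [("r", 0), ("w", 1)]),
    ("up", [("r", 1), ("w", 0)]),
    ("down", [("r", 0), ("w", 1)]),
    ("down", [("r", 1), ("w", 0)]),
    ("up", [("r", 0)]),
]


def run_march(memory, size):
    """Return (passed, first_failing_address)."""
    for direction, ops in MARCH_C_MINUS:
        addrs = range(size) if direction == "up" else range(size - 1, -1, -1)
        for addr in addrs:
            for op, value in ops:
                if op == "w":
                    memory.write(addr, value)
                elif memory.read(addr) != value:
                    return False, addr  # failure address capture
    return True, None


print(run_march(FaultyMemory(8), 8))
# (True, None)
print(run_march(FaultyMemory(8, stuck_at={5: 1}), 8))
# (False, 5)
```

The return value mirrors the pass/fail reporting and failure address capture mentioned in the table.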

ARCHITECTURE STYLE SUMMARY

Design decisions at a glance.

Aspect | Approach
Compute Model | Tiled systolic array with weight-stationary dataflow
Interconnect | 2D mesh NoC, packet-switched, dimension-order routing
Memory | Distributed hierarchical: per-PE scratchpad → per-tile buffer → external via AXI4
Control | Descriptor-based layer sequencer with DMA orchestration
Precision | INT8 compute, INT32 accumulation
Sparsity | Adaptive structured sparsity with hardware skip logic
Power | Multi-granularity: per-lane clock gating, per-tile power gating, sparsity-driven optimization
Testability | IEEE 1149.1 JTAG, partitioned scan chains, memory BIST, at-speed test support
Integration | Standard AMBA interfaces (APB/AXI-Lite for control, AXI4 for data); process-agnostic RTL

TARGET APPLICATIONS

Edge AI inference workloads.

- Always-on sensing: Keyword spotting, voice activity detection, wake-word recognition
- Vision at the edge: Image classification, object detection, anomaly detection, visual wake word
- Industrial IoT: Predictive maintenance, quality inspection, sensor fusion
- Robotics and drones: On-board perception, navigation, obstacle avoidance
- Smart infrastructure: Occupancy detection, energy management, traffic monitoring

Evaluate the full architecture.

This page covers the public architecture overview. The complete technical specification, including register maps, timing diagrams, and integration constraints, is available under NDA.