# **RipTide:** A Programmable, Energy-Minimal Dataflow Compiler and Architecture

#### Smart devices at the extreme edge are rapidly emerging with huge industrial impact



Tiny, smart sensor devices that enable advanced processing or inference, for example nano satellites, medical wearables, and wildlife monitoring.

### **Trillions of devices coming!**<sup>1</sup>

#### Must sustainably & efficiently **compute** at the edge. <u>How</u>?

1. Run a variety of apps at ultra-low power (ULP), on the order of µWs

2. More compute on-device, less communication<sup>2</sup>



<sup>1</sup>Arm, "How to build a trillion connected things."

<sup>2</sup>Gobieski et al., "Intelligence Beyond the Edge: Inference on Intermittent Embedded Systems" (ASPLOS '19).

Graham Gobieski, Souradip Ghosh, Tony Nowatzki\*, Todd C. Mowry, Nathan Beckmann, Brandon Lucia

Carnegie Mellon University, \*UCLA

#### **Goal:** develop **highly flexible** and energy-efficient compute



1. Wastes up to 90% energy on non-compute<sup>3</sup>

2. **Inflexible** by design, limited to a **single app**

# **CGRAs are flexible & efficient!**

#### What is a coarse-grained reconfigurable array (CGRA)?



Grid of processing elements (**PE**) connected by a **NoC**.

**1.** Each PE supports a "coarse" op type (add, load, shift, etc.).

x: add ...

y: add ...

z: mul x,y

**2.** Compilers extract code (often small loops) into a *dataflow* graph and map that graph onto the CGRA.

**3.** CGRA *execution* can be statically scheduled or use "dataflow firing": a PE "fires" once its inputs arrive via the NoC (no fetch/decode).
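As a rough illustration of "dataflow firing", the sketch below simulates the three-node graph from the example above (`x` and `y` are adds, `z` multiplies their results). It is a toy software model, not RipTide's implementation: the `PE` class, the `run` helper, the token format, and the input values are all assumptions made for the example, and the NoC is reduced to a simple queue.

```python
from collections import deque

# Hypothetical op table; a real PE would implement these in hardware.
OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}


class PE:
    """Toy processing element: one op, fires only when every input slot is filled."""

    def __init__(self, name, op, num_inputs=2):
        self.name = name
        self.op = OPS[op]
        self.inputs = [None] * num_inputs
        self.consumers = []  # (destination PE, input slot) pairs standing in for NoC routes

    def ready(self):
        return all(v is not None for v in self.inputs)

    def fire(self):
        result = self.op(*self.inputs)
        self.inputs = [None] * len(self.inputs)  # tokens are consumed on firing
        return result


def run(tokens):
    """Deliver (pe_name, slot, value) tokens in order; return firing order and results."""
    x, y, z = PE("x", "add"), PE("y", "add"), PE("z", "mul")
    x.consumers.append((z, 0))
    y.consumers.append((z, 1))
    pes = {"x": x, "y": y, "z": z}

    queue = deque((pes[name], slot, value) for name, slot, value in tokens)
    fired, results = [], {}
    while queue:
        pe, slot, value = queue.popleft()
        pe.inputs[slot] = value
        if pe.ready():  # no fetch/decode step: arrival of the last input triggers the op
            out = pe.fire()
            fired.append(pe.name)
            results[pe.name] = out
            for dest, dest_slot in pe.consumers:
                queue.append((dest, dest_slot, out))
    return fired, results


# Tokens arrive out of program order: y's operands are complete before x's.
order, values = run([("x", 0, 1), ("y", 0, 3), ("y", 1, 4), ("x", 1, 2)])
print(order, values)
```

In this model `y` fires before `x` because its operands arrive first, and `z` fires only once both results have been delivered; execution order follows data availability rather than a program counter.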




<sup>3</sup>Horowitz, "Computing's energy problem (and what we can do about it)." (ISSCC '14).

#### **Prior ULP CGRAs are limited**

void foo (...) { for (i = 0...n)

CGRA code in assembly<sup>4</sup>:

vlh v1, a + i

**vadd** v3, v1, v2

vsh b + i, v3

Runs only affine inner loops. No irregular control flow or irregular memory accesses.
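To make the restriction concrete, the sketch below contrasts the two kinds of loop. It is an illustration written for this text, not code from the poster: `affine_loop` mirrors the load/add/store pattern of the assembly above (assuming it computes `b[i] = a[i] + c`), while `irregular_loop` is a hypothetical example of what such a design could not run, combining an indirect load, a data-dependent branch, and a data-dependent loop exit.

```python
def affine_loop(a, b, c):
    """Affine inner loop: every address is a linear function of i, no branches."""
    for i in range(len(a)):
        b[i] = a[i] + c
    return b


def irregular_loop(values, index, limit):
    """Irregular loop: addresses and control flow both depend on loaded data."""
    total = 0
    i = 0
    while i < len(index) and total < limit:  # data-dependent exit condition
        v = values[index[i]]                 # indirect (irregular) memory access
        if v % 2 == 0:                       # data-dependent branch
            total += v
        i += 1
    return total


print(affine_loop([1, 2, 3], [0, 0, 0], 10))
print(irregular_loop([5, 4, 8, 6], [1, 3, 0, 2], 100))
```

The first loop has a trip count and access pattern known before it runs; the second does not, which is why it needs control-flow and memory-ordering support beyond what an affine-only design provides.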

**Insight:** To achieve efficiency, CGRAs need to run entire apps and support common PL idioms

# **RipTide is a new ULP CGRA compiler & arch.**






Handles arbitrary code via:

1. Complex **control flow**

2. Irregular memory accesses

3. Enforced **memory ordering**

Full compiler in LLVM → SAT/ILP mapper (to CGRA)

**Optimizes** away ops, reduces op subgraphs

reusing routers to





