

### Managing MPSoCs beyond their Thermal Design Power\*

### Luca Benini Università di Bologna & STMicroelectronics Luca.benini@unibo.it

\*Work supported by INTEL, FP7 THERMINATOR, FP7 Artist-Design

ALMA MATER STUDIORUM ~ UNIVERSITÀ DI BOLOGNA. Reserved for the personnel of the Università di Bologna; by law it may not be used by other persons or for non-institutional purposes.



### **Thermal Power Wall**

# Transistor count increases exponentially, but...

Can no longer power the entire chip (voltages, cooling do not scale)

Traditional HW power-aware techniques are insufficient (e.g., voltage-freq. scaling)





## Mobile SoCs are cool...right?

### Wrong!

So.. got myself a HTC (T-Mobile) HD2 [...].

As i found out the problem is pretty common: damn thing restart itself **thermally related - the old CPU overheat problem**. By searching the net I found out that it's pretty common with some HTC models. HD2 has it, Desire has it, Nexus One has it, hell even some xperia models have it.. about half of the devices powered by anything from the Snapdragon series could have it.

June 2011 - http://forum.xda-developers.com/showthread.php?t=982454

### Why?

ARM has unveiled its next generation "Eagle" (Cortex A15) processor, pitching it at **everything from smartphones to energy-efficient servers**. The A15 will be initially produced at 32nm or 28nm, although ARM claims the roadmap stretches down to 20nm. It will **deliver clock speeds of up to 2.5GHz**.

Aug. 2011 - http://www.pcpro.co.uk/news/360994/arm-preys-on-smartphones-and-servers-with-eagle



### A 2011 Mobile SoC



- Tegra II
  - TSMC 40nm (LP/G)
  - A9 1GHz (G)
  - GPU, etc. 330MHz (LP)
  - GeForce ULV (8 shaders)
  - 2 separate Vdd rails
  - 1MB L2\$
  - 32b LPDDR2 (600MHz DR)
- Tegra II 3D
  - A9 1.2GHz
  - GPU 400MHz





### 3D-SoCs are even worse





### **Rushing to Many-Core**

- Hardware trends $\rightarrow$ 1000+ core systems
- Software trends $\rightarrow$ concurrency (1000x+)
- [Figure: roadmap from uniprocessors (single core) through multiprocessors, the Intel 80-core Tera-Scale chip and the SCC (48 cores), to massively parallel large-scale SoCs (1,000s of cores)]
- **Research Challenge:** power management for a 1,000-core heterogeneous SoC $\rightarrow$ Extreme MIMO!





### Outline

- Introduction
- Scalable Control
- Scalable model learning
- Experimental Environment
- Challenges ahead







### **DRM - General Architecture**

- System (Chip Scale)
- Sensors
  - Performance counter
    - PMU
  - Core temperature
- Actuator Knobs
  - ACPI states
    - P-State → DVFS
    - C-State  $\rightarrow$  P<sub>GATING</sub>
  - Task allocation
- Controller
  - Reactive
    - Threshold/Heuristic
    - Control theory
  - Proactive
    - Predictors
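
The reactive threshold/heuristic option above can be sketched in a few lines. This is a minimal illustration, not the controller used in this work: the P-state table, the two thresholds and the function name `next_pstate` are all assumptions chosen for the example.

```python
# Hypothetical P-state table: (frequency in MHz, supply voltage in V).
# Index 0 is the fastest operating point.
P_STATES = [(1000, 1.10), (800, 1.00), (600, 0.90), (400, 0.80)]


def next_pstate(index, temp_c, t_high=80.0, t_low=70.0):
    """Reactive threshold policy with hysteresis (illustrative values).

    index  : current position in P_STATES
    temp_c : core temperature read from the sensor, in Celsius
    """
    if temp_c > t_high and index < len(P_STATES) - 1:
        return index + 1  # too hot: step down to a slower P-state
    if temp_c < t_low and index > 0:
        return index - 1  # cool enough: step back up
    return index          # inside the hysteresis band: hold


# Walk the policy over a made-up temperature trace.
trace = [65.0, 78.0, 83.0, 85.0, 79.0, 68.0]
state = 0
history = []
for temperature in trace:
    state = next_pstate(state, temperature)
    history.append(state)
```

The policy only reacts after the threshold is crossed, which is the limitation that the proactive (predictor-based) branch of the list is meant to address.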





### **Energy Controller**








### Thermal transient

O.S time scale

Neighborhood

Thermal locality (Direct Fourier law in

- Continuous model:
  - Thermal neighborhood = Physical
- Discrete model:
  - Thermal neighborhood depends on sample time
- HotSpot simulation of an 'Intel SCC-like' 48-core chip
  - Each core: Area = 11.82 mm2, P<sub>max</sub> = 2.6W
  - We powered on only Core(5,3)
  - Thermal neighborhood: temperature rise > +0.1°C
- Thermal transient Model Order
  - Different materials result in different time constants [1]
    - Silicon die, heat spreader, heat sink
    - Second order model

[1] W. Huang et al. Differentiating the roles of IR measurement and simulation for power and temperature-aware design. 2009.
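
To illustrate why the discrete-model thermal neighborhood depends on the sample time, here is a small sketch under strong assumptions: a 1-D chain of identical cores with one first-order RC node per core (not the second-order model, and not HotSpot). The heat capacity, the conductances and the function name `thermal_neighborhood` are made up; only the 2.6 W power and the 0.1 degree threshold come from the slide.

```python
def thermal_neighborhood(sample_time_s, n_cores=21, p_watt=2.6,
                         c=1.0, g_lat=1.0, g_vert=0.5,
                         dt=0.001, threshold=0.1):
    """Count cores whose temperature rise exceeds `threshold` after one sample.

    Each core has heat capacity c [J/K], lateral conductance g_lat [W/K]
    to each neighbor and vertical conductance g_vert [W/K] to ambient.
    Only the middle core is powered. All values are illustrative.
    """
    temps = [0.0] * n_cores            # temperature rise over ambient [K]
    hot = n_cores // 2
    steps = int(round(sample_time_s / dt))
    for _ in range(steps):
        new = temps[:]
        for i in range(n_cores):
            flow = -g_vert * temps[i]                      # loss to ambient
            if i > 0:
                flow += g_lat * (temps[i - 1] - temps[i])  # left neighbor
            if i < n_cores - 1:
                flow += g_lat * (temps[i + 1] - temps[i])  # right neighbor
            if i == hot:
                flow += p_watt                             # powered core
            new[i] = temps[i] + dt * flow / c              # forward Euler
        temps = new
    return sum(1 for t in temps if t > threshold)


short = thermal_neighborhood(0.05)   # fast sampling: heat has not spread yet
long_ = thermal_neighborhood(20.0)   # slow sampling: close to steady state
```

With a short sample time only the powered core crosses the threshold, so a per-core model needs few neighbor terms; with a long sample time the neighborhood grows and more coupling terms are needed.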







### **Thermal Controller**





### **MPC Scalability**





### Addressing Scalability











#### **Distributed Controller**


# **Distributed Thermal Controller**




## **Explicit Distributed Controller**



Our aim is to minimize the difference between the input $P_{1,TC}$ (the manipulated variable, MV) and the reference $P_{1,EC}$. Our controller can only take a constant reference into account. To overcome this limitation, we reformulate the tracking problem as a regulation problem that drives $\Delta P_1$ (the new MV) to 0. The regulated power $P_{1,TC}$ is:

$$P_{1,TC} = P_{1,EC} + \Delta P_1$$

# Explicit Distributed Controller





At each time instant the system belongs to a region according to its current state. In each region the explicit controller executes the following linear control law:

$$u(x) = F_i \cdot x(t) + G_i$$
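
A minimal sketch of how such a piecewise-affine law can be evaluated at run-time: find the region that contains the current state, then apply that region's gains. The region description as half-spaces $h \cdot x \le k$, the numeric values, and the names `REGIONS` and `explicit_control` are assumptions for illustration, not the controller computed in this work.

```python
def dot(a, b):
    """Inner product of two equally sized sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


# Each region: half-spaces (h, k) meaning h.x <= k, plus the affine law (F, G).
# State x = [core temperature, second thermal state]; numbers are placeholders.
REGIONS = [
    {"halfspaces": [([1.0, 0.0], 70.0)],                         # x[0] <= 70
     "F": [0.0, 0.0], "G": 0.0},
    {"halfspaces": [([-1.0, 0.0], -70.0), ([1.0, 0.0], 90.0)],   # 70 <= x[0] <= 90
     "F": [-0.1, 0.0], "G": 7.0},
    {"halfspaces": [([-1.0, 0.0], -90.0)],                       # x[0] >= 90
     "F": [0.0, 0.0], "G": -2.0},
]


def explicit_control(x):
    """Return u = F_i . x + G_i for the first region i that contains x."""
    for region in REGIONS:
        if all(dot(h, x) <= k for h, k in region["halfspaces"]):
            return dot(region["F"], x) + region["G"]
    raise ValueError("state outside the explored region set")


u_cool = explicit_control([60.0, 55.0])
u_warm = explicit_control([80.0, 55.0])
u_hot = explicit_control([95.0, 60.0])
```

The run-time cost is a region search plus one affine evaluation, with no optimization problem solved online; that is the reason an explicit formulation is attractive for a per-core controller.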

# Explicit Distributed Controller



The prediction evaluated by our explicit controller cannot take into account the measured disturbances ($u_{MD} = [P_{1,EC} \ T_{ENV} \ T_{neigh}]$). We therefore exploit the superposition principle of linear systems:

$$x(k+1) = f(x(k), u_{MV}(k), u_{MD}(k)) \rightarrow x(k+1) = f(x(k), u_{MV}(k), 0) + f(0, 0, u_{MD}(k))$$

To remap the effect of the measured disturbances (MDs), we use the model to modify the state ($x(k) \rightarrow x_{SHIFTED}(k)$), projecting the MDs' effect one step forward.

$$x(k+1) = A_1 \cdot x(k) + B'_1 \cdot \Delta P_1 + B''_1 \cdot [P_{1,EC} \ T_{ENV} \ T_{neigh}] \rightarrow x(k+1) = A_1 \cdot x_{SHIFTED}(k) + B'_1 \cdot \Delta P_1$$

$$A_1 \cdot x_{SHIFTED}(k) = A_1 \cdot x(k) + B''_1 \cdot [P_{1,EC} \ T_{ENV} \ T_{neigh}] \rightarrow x_{SHIFTED}(k) = x(k) + A_1^{-1} \cdot B''_1 \cdot [P_{1,EC} \ T_{ENV} \ T_{neigh}]$$
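
The state shift can be checked numerically. The sketch below uses a made-up two-state model (the matrices `A1`, `B1_mv`, `B1_md` and all numeric values are placeholders, not identified parameters) and verifies that stepping the shifted state without disturbances gives the same next state as stepping the original state with them.

```python
def matvec(m, v):
    """Matrix-vector product for lists of rows."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def add(u, v):
    return [ui + vi for ui, vi in zip(u, v)]


# Placeholder second-order model (two states), chosen only for illustration.
A1 = [[0.90, 0.05], [0.10, 0.80]]
B1_mv = [0.02, 0.00]                 # B'_1: effect of Delta P_1
B1_md = [[0.02, 0.01, 0.03],         # B''_1: effect of [P_1,EC, T_ENV, T_neigh]
         [0.00, 0.05, 0.02]]


def shift_state(x, md):
    """x_SHIFTED(k) = x(k) + A_1^{-1} B''_1 md."""
    return add(x, matvec(inv2(A1), matvec(B1_md, md)))


def step_full(x, dp, md):
    """x(k+1) = A_1 x(k) + B'_1 dp + B''_1 md."""
    return add(add(matvec(A1, x), [b * dp for b in B1_mv]), matvec(B1_md, md))


def step_shifted(x_shifted, dp):
    """x(k+1) = A_1 x_SHIFTED(k) + B'_1 dp."""
    return add(matvec(A1, x_shifted), [b * dp for b in B1_mv])


x = [60.0, 50.0]
md = [2.0, 25.0, 58.0]               # [P_1,EC, T_ENV, T_neigh], illustrative
dp = -0.5
full = step_full(x, dp, md)
shifted = step_shifted(shift_state(x, md), dp)
```

The shift requires $A_1$ to be invertible, which this sketch simply assumes for the placeholder matrix.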



### MPC trade-off

### Trace Driven Simulation (Matlab) – gold model

- Parsec trace obtained on real
- Power Model: Nonlinear vs. linear
- Thermal Model: one vs. two
- Centralized vs. Distributed








### Outline

- Introduction
- Scalable Control
- Scalable model learning
- Experimental Environment
- Challenges Ahead

### **Model Identification**





### **LS System Identification**





### **Workload & Temperature**



Pseudorandom workload pattern



### **Black-box Identification**

#### Identification based on pure LS fitting

#### **MEASURED vs. SIMULATED TEMPERATURE**

Tcore2





### Validation






### **Gray Box identification**

LS model must be constrained by physical properties to avoid over-fitting





### **Physical Constraints**



## Model learning Scalability







### **Distributed Model learning**






### Distributed model-learning



System identification:

- Data collection:
  - Input: PRBS signals to all cores (persistently exciting input sequence)
  - Output: temperatures of all cores ($T_i^o$, with $i$ = core number)
- ARX model parameter computation:
  - $T^{p}(a_{i}, b_{i,i}) = T(k+1|k)$, computed with the previous equation
  - $T^o$: output temperature (measured)
  - Least-squares algorithm: $\min \frac{\left\|T^o - T^p\right\|^2}{2}$
- Results:
  1. Mean absolute error between $T^o$ and $T^p$: $\frac{1}{L}\sum_{k=3}^{L+2} \left|T^o - T^p\right|$
  2. Temperature response of core 1: original and identified models

[Figure: Mean Absolute Error (axis 0 to 0.2) for Fluidanimate, Facesim, Dedup, Bodytrack and Raytracing, on 4 and 8 cores]

[Figure: Temperature Comparison (Core1) in °K (axis 335 to 345), real model vs. identified model]
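
The identification flow can be sketched on a toy case. The assumptions are strong: a single core, a first-order ARX model $T(k+1) = a \cdot T(k) + b \cdot P(k)$ with no neighbor terms, noise-free data, and made-up "true" parameters. The helper names `prbs` and `identify_arx` are hypothetical.

```python
def prbs(n, seed=0b1010101):
    """Pseudorandom binary sequence from a 7-bit LFSR (x^7 + x^6 + 1)."""
    state = seed
    out = []
    for _ in range(n):
        bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | bit) & 0x7F
        out.append(float(bit))
    return out


def identify_arx(temps, powers):
    """Least-squares fit of T(k+1) = a*T(k) + b*P(k) via 2x2 normal equations."""
    s_tt = s_tp = s_pp = s_ty = s_py = 0.0
    for k in range(len(powers)):
        t, p, y = temps[k], powers[k], temps[k + 1]
        s_tt += t * t
        s_tp += t * p
        s_pp += p * p
        s_ty += t * y
        s_py += p * y
    det = s_tt * s_pp - s_tp * s_tp
    a = (s_ty * s_pp - s_tp * s_py) / det
    b = (s_tt * s_py - s_tp * s_ty) / det
    return a, b


# Generate data from a made-up "real" model excited by a PRBS power input.
A_TRUE, B_TRUE = 0.95, 0.5
powers = [1.0 + 2.0 * u for u in prbs(500)]       # core power in W
temps = [0.0]                                     # temperature rise over ambient
for p in powers:
    temps.append(A_TRUE * temps[-1] + B_TRUE * p)

a_hat, b_hat = identify_arx(temps, powers)

# One-step-ahead prediction and its mean absolute error.
pred = [a_hat * temps[k] + b_hat * powers[k] for k in range(len(powers))]
mae = sum(abs(temps[k + 1] - pred[k]) for k in range(len(powers))) / len(powers)
```

Because each core fits only its own small parameter set, the same routine can be run per core, which is the idea behind distributing the model learning.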




### Outline

- Introduction
- Scalable Control
- Scalable model learning
- Experimental Environment
- Challenges ahead



### **Simulation Strategy**

### Trace driven Simulator [1]:

- Not suitable for full system simulation (How to simulate O.S.?)
- loses information on cross-dependencies

→ resulting in degraded simulation accuracy

### Closed loop simulator:

- Cycle accurate simulators [2] :
  - High modeling accuracy
  - Support well-established power and temperature co-simulation based on analytical models and system micro-architectural knowledge
  - Low simulation speed
  - Not suitable for full-system simulation
- Functional and instruction set simulators:
  - Allow full system simulation
  - Lower internal precision
  - Less detailed data  $\rightarrow$  no micro-architectural model
  - Introduces the challenge of having accurate power and temperature physical models



[1] P. Chaparro et al. Understanding the thermal implications of multi-core architectures. 2007.

[2] L. Benini et al. MPARM: Exploring the multi-processor SoC design space with SystemC. 2005.





#### Simics by Virtutech:

- full system functional simulator
- models the entire system: peripherals, BIOS, network interfaces, cores, memories
- allows booting full OS, such as Linux SMP
- supports different target CPUs (ARM, SPARC, x86)
- x86 model:
  - in-order
  - all instructions are retired in 1 cycle
  - does not account for memory latency

[1] M. M. K. Martin et al. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset. 2005.



#### RUBY – GEMS (University of Wisconsin) [1]:

[Figure: Virtual Platform. SW: App.1 ... App.N running on the O.S.; HW: CPU1 ... CPUN, L2, Network, DRAM]

- Public cycle-accurate memory timing model

- Different target memory architectures
- fully integrated with Virtutech Simics
- written in C++
- we use it as a skeleton for our add-ons (as C++ objects)



#### Performance counter module / DVFS module:

- We add a Performance Counter module
  - tracks the clock cycles and stall cycles expired
- We add a DVFS module
  - ensures that L2 and DRAM have a constant clock frequency
  - L1 latency scales with the Simics processor clock frequency




#### Power model module:

- At run-time estimate the power consumption of the target architecture
- Core model: $P_T = [P_D(f, CPI) + P_S(T, V_{DD})] \cdot (1 - idleness) + idleness \cdot P_{IDLE}$
- $P_D$: experimentally calibrated analytical power model
- Cache and memory power access cost estimated with CACTI [1]



[1] Thoziyoor Shyamkumar et al. A comprehensive memory modeling tool and its application to the design and analysis of future memory hierarchies. 2008



### **Power Model**



### Modeling Real Platform – Power



 $P_D = k_A \cdot V_{DD}^2 \cdot f_{CK} + k_B + (k_C + k_D \cdot f_{CK}) \cdot CPI^{k_E}$ 

- We relate the static power to the operating point using an analytical model
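
As a worked illustration of the two power formulas, the sketch below evaluates $P_D$ and then the idleness-weighted core power $P_T$. The coefficients $k_A \dots k_E$, the static and idle power values, and the units (GHz, V, W) are placeholders; the calibrated values from the real platform are not given in these slides.

```python
def dynamic_power(vdd, f_ck, cpi, k):
    """P_D = kA * Vdd^2 * f_ck + kB + (kC + kD * f_ck) * CPI^kE."""
    return (k["A"] * vdd ** 2 * f_ck + k["B"]
            + (k["C"] + k["D"] * f_ck) * cpi ** k["E"])


def total_power(p_dyn, p_static, p_idle, idleness):
    """P_T = (P_D + P_S) * (1 - idleness) + idleness * P_IDLE."""
    return (p_dyn + p_static) * (1.0 - idleness) + idleness * p_idle


# Hypothetical coefficients, not the experimentally calibrated ones.
K = {"A": 2.0, "B": 1.0, "C": 0.5, "D": 0.25, "E": -1.0}

p_d = dynamic_power(vdd=1.0, f_ck=2.0, cpi=1.0, k=K)
p_t = total_power(p_dyn=p_d, p_static=2.0, p_idle=1.0, idleness=0.25)
```

With these placeholder numbers, `p_d` evaluates to 6.0 and `p_t` to 6.25; a fully idle core (`idleness=1.0`) collapses to `p_idle`.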



#### Temperature model module:

- We integrate our virtual platform with a thermal simulator [1]
- Input: power dissipated by the main functional units composing the target platform
- Output: temperature distribution across the simulated multicore die area



[1] Paci G. et al. Exploring "temperature-aware" design in low-power MPSoCs



### **Thermal Model**





- Thermal Model Calibration :
  - Derived from Intel<sup>®</sup> Core<sup>™</sup> 2 Duo layout
  - We calibrate the model parameters to reproduce the real HW transient
  - High accuracy (error < 1%) and same transient behavior






## **Virtual Platform Performance**





## Mathworks Matlab/Simulink

- Numerical computing environment developed to design, implement and test numerical algorithms
- Mathworks Simulink for simulation of dynamic systems: simplifies and speeds up the development cycle of control systems
- Can be called as a computational engine by writing C and Fortran programs that use Mathworks Matlab's engine library
- Controller design takes two steps:
  - developing the control algorithm that optimizes the system performance
  - implementing it in the system

We allow a Mathworks Matlab/Simulink description of the controller to directly drive at run-time the performance knobs of the emulated system



#### Mathworks Matlab interface:

- New module named Controller in RUBY
- Initialization: starts the Mathworks Matlab engine as a concurrent process
- Wake-up every N cycles:
  - send the current performance monitor output to the Mathworks Simulink model
  - execute one step of the controller Mathworks Simulink model
  - propagate the Mathworks Simulink controller decision to the DVFS module
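
The wake-up sequence above can be sketched as a plain loop. This is only an illustration of the three steps: the real module is a C++ object in RUBY that talks to the Matlab engine, while the classes `PerformanceMonitor` and `DvfsModule`, the function `controller_step`, and the thresholds below are hypothetical stand-ins written in Python.

```python
class PerformanceMonitor:
    """Stand-in for the performance monitor output (hypothetical)."""

    def __init__(self, temperatures):
        self._temperatures = iter(temperatures)

    def read(self):
        return {"temperature_c": next(self._temperatures)}


class DvfsModule:
    """Stand-in for the DVFS module; it just records the decisions."""

    def __init__(self, freq_mhz=1000):
        self.freq_mhz = freq_mhz
        self.log = []

    def set_frequency(self, freq_mhz):
        self.freq_mhz = freq_mhz
        self.log.append(freq_mhz)


def controller_step(sample, t_max=80.0):
    """Stand-in for one step of the controller model (made-up policy)."""
    return 600 if sample["temperature_c"] > t_max else 1000


def simulate(total_cycles, n, monitor, dvfs):
    """Wake the controller every n cycles and apply its decision."""
    for cycle in range(1, total_cycles + 1):
        if cycle % n == 0:
            sample = monitor.read()              # 1. send monitor output
            decision = controller_step(sample)   # 2. execute one controller step
            dvfs.set_frequency(decision)         # 3. propagate to DVFS


monitor = PerformanceMonitor([70.0, 85.0, 90.0, 75.0])
dvfs = DvfsModule()
simulate(total_cycles=40, n=10, monitor=monitor, dvfs=dvfs)
```

Keeping the controller behind a single step function is what lets the same design be exercised first against a simplified plant model and then against the virtual platform.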





CONTROL-STRATEGIES DEVELOPMENT CYCLE

- 1. Controller design in Mathworks Matlab/Simulink framework
  - system represented by a simplified model
  - obtained by physical considerations and identification techniques
- 2. Set of simulation tests and design adjustments done in Simulink
- 3. Tuned controller evaluation with an accurate model of the plant done in the virtual platform
- 4. Performance analysis, by simulating the overall system

[Figure: control loop. In Simulink, the CONTROLLER reads T and drives the PLANT MODEL with f; in the virtual platform, the controller reads T, Tmax, P* and #BUS<sub>ACCESS</sub> and drives Virtutech Simics with f]

### Results





### Working on Real Chips (Intel)








### Working on Real Chips (Intel)

### Single Chip Cloud (45nm)



- 567.1 mm2
- 48 cores @ 1GHz
- 2GHz NoC
- 25-125W
- 27 frequency (f) islands, 8 voltage (V) islands



### **Thermal Sensor Variability**



#### Sensor map @ 533 MHz power virus

|           |           |           |           |           |           |
|-----------|-----------|-----------|-----------|-----------|-----------|
| 4271.5454 | 3137.1818 | 4358      | 4451.1818 | 4925.3636 | 4296.7273 |
| 4008.1818 | 2776.9091 | 4615.6364 | 4997.4545 | 4241.9091 | 3307.2727 |
| 4854.4546 | 4843.2727 | 4600.9091 | 4245.3636 | 3509.0909 | 5601      |
| 3630.6364 | 3057.6364 | 4113.2727 | 5135.7273 | 4710.9091 | 3210.9091 |
| 3999.3636 | 3582.2727 | 3663.9091 | 4168.9091 | 4162.7273 | 4295.4545 |
| 2846      | 3501.0909 | 3652.6364 | 3471.3636 |           | 3566.0909 |
| 3994.5455 | 3432.6364 | 3138.6364 | 4182.5455 | 2669      | 3492      |
| 3531.4545 | 4185      | 4897.4545 | 3881.5455 | 3936.7273 | 3647.4545 |

Color bar scale: 2500 to 6000.

#### Sensor map @ 100 MHz idle



### Outline

- Introduction
- Scalable Control
- Scalable model learning
- Experimental Environment
- Challenges ahead



# The 1,000 Cores Chip

- STM-CEA Platform 2012 project
  - Die with four 16-core tiles with L1 & L2 → few tens of mm2 (28nm)
  - SCC die  $\rightarrow$  20 of these dies: 1,280 cores
  - Thousands of Vdd, f domains
- 3D stacking is currently the only technology that can provide sufficient L3 bandwidth
  - Vertical thermal dissipation!
  - Heterogeneous requirements (DRAM≠LOGIC)
- Major static and dynamic variability

# Power management Challenges

- Truly scalable algorithms  $\rightarrow$  O(NlogN)
- Hardware support needed (e.g. DPM NoC)
- Cross-layer algos are needed
  - Real-time intra+inter layer communication
  - Abstraction and filtering
  - Multi-scale
- The threat of non-linearity
  - Hybrid control complexity (MILP is NP-HARD)
  - Lack of robustness (ill-conditioning) and stability proofs

### SoCs as complex systems (societies/markets) → DPM as political sociology/finance?




