From Atomic Bombs to Supercomputers and Autonomous Vehicles: a Year of Research in the Los Alamos National Labs

(1)

From Atomic Bombs to

Supercomputers and Autonomous Vehicles: a Year of Research in the

Los Alamos National Labs

Paolo Rech

INF UFRGS, 12 August 2020

(2)

Outline

- Los Alamos National Lab - Rosen Scholar Fellow - Research on Reliability - Neutron-induced Errors

- Some (interesting) results on self-driven cars - Some (interesting) efficient solutions

- Conclusions and Future Work

(3)

Outline

- Los Alamos National Lab - Rosen Scholar Fellow

- Research on Reliability - Neutron-induced Errors

(4)

Los Alamos

(5)

Los Alamos National Laboratory

Manhattan Project Site Y

(6)

LANL – Department Of Energy

(7)

Los Alamos National Lab

Los Alamos, NM: 11,000 habitants. 2,500 mt. of altitude Los Alamos National Lab: 10,100 employees.

Annual budget: 2.2 billion USD

(8)

Los Alamos National Lab

>80 instable nuclear reactors.

Some of them are in unknown locations

Los Alamos, NM: 11,000 habitants. 2,500 mt. of altitude Los Alamos National Lab: 10,100 employees.

Annual budget: 2.2 billion USD

(9)

Los Alamos National Lab

(10)

Life in Los Alamos

(11)

Life in Los Alamos

(12)

Life in Los Alamos

(13)

Life in Los Alamos

(14)

Life in Los Alamos

George R. R. Martin lives in Santa Fe

(15)

Trinity Site, 1

^st

Sunday of April

(16)

Atomic Bombs

(17)

Atomic Bombs

(18)

Los Alamos Nat. Lab. Discoveries

Enrico Fermi

(19)

John von Neumann

Los Alamos Nat. Lab. Discoveries

(20)

Los Alamos National Laboratory

(21)

Los Alamos National Laboratory

(22)

43,000 square feet (4,000 square meters) of computing in the classified (about the size of an American football field)

~10 production HPC supercomputers

Visualization theater and virtual reality cave 9.6MW available to the computing floor

Images extracted from LA-UR-12-20056

Los Alamos National Laboratory

(23)

Outline

- Los Alamos National Lab - Rosen Scholar Fellow - Research on Reliability - Neutron-induced Errors

(24)

Los Alamos National Laboratory

Rosen Scholar award: assigned every year to the most promising researcher for LANL mission.

Reliability of autonomous vehicles and supercomputers

(25)

Rosen Scholar Fellow

Luis Rosen

Manhattan project co-director with Oppenheimer and Von Neumann.

Creator of the first proton accelerator First director of LANSCE

Rosen Scholar Fellowship: $ for realizing a project,

(26)

Rosen Scholar Fellow

Gus Sinnis (lab director)

Rosen Scholar Fellow (Paolo)

ISR (Intelligence and Space Research) WNR (Weapon Nuclear Research)

USRC (Ultrascale Systems Research Center)

(27)

Summer School

Radiation Effects Summer School July - August 2019

- computing system reliability

- approximate computing and ANNs reliability - HPC reliability

- fault tolerance and efficiency - radiation experiments

Open to students from any country

(28)

Summer School

Radiation Effects Summer School

(29)

Outline

- Los Alamos National Lab - Rosen Scholar Fellow - Research on Reliability - Neutron-induced Errors

(30)

Research Philosophy

Radiation tests

International collaborations Fault Injection

Efficient SW/HW hardening strategies

architecture and algorithm analyses

(31)

Reliability importance

(32)

Reliability importance

(33)

Reliability importance

neutron detectors inside Trinity Hall and ran Monte- Carlo simulations to characterize the environment.

@LANL supercomputers occupy more than 4,000mq

(34)

Supercomputer Environment

Water cooling system increases the probability of errors

(35)

Supercomputer Environment

We have modeled and simulated with MCNP the water cooling

system of supercomputers.

Thermal neutron flux can be increased up to 40%-80%.

(36)

Reliability importance

(37)

Reliability importance

Reliability is a critical issue!

(38)

One device, different reliability requirements

Cloud

Good High Very High

Reliability Requirements

Consumer Data Center HPC

automotive

(39)

Cloud

ARM CPU co-processor GPU FPGA/SoC

automotive

One device, different reliability requirements

(40)

Cloud

ARM CPU co-processor GPU FPGA/SoC

automotive

Reliability Requirements

One device, different reliability requirements

(41)

Outline

- Los Alamos National Lab - Rosen Scholar Fellow - Research on Reliability - Neutron-induced Errors

(42)

Terrestrial Radiation Environment

Galactic cosmic rays interact with atmosphere shower of energetic particles:

Muons, Pions, Protons, Gamma rays, Neutrons

13 n/(cm

²

h) @sea level*

*JEDEC JESD89A Standard

Cosmic Ray

Cosmic rays could be so energetic to pass the Van Allen belts

(43)

Humans vs Electronics

Chernobyl - 1986 Joker – radhard robot

lasted few seconds

Humans (not just Russians) are safe for ~10mins

(44)

Radiation Effects - Soft Errors

• One or more bit-flips

Single Event Upset (SEU) Multiple Bit Upset (MBU)

Soft Errors: the device is not permanently damaged, but the particle may generate:

0110010010010011 1101001101001001 0010010010010010 1000100010000010

(45)

Radiation Effects - Soft Errors

0110010010010011 1101001011001001 0010010100010010 1000100010000010

IONIZING PARTICLE

(46)

Radiation Effects - Soft Errors

• Transient voltage pulse

Single Event Transient (SET) 5 + 6

IONIZING PARTICLE

0110010010010011 1101001011001001 0010010100010010 1000100010000010

IONIZING PARTICLE

4

(47)

Silent Data Corruption vs Crash

Silent Data Corruption: the application provides

wrong answers. Silent = no flag/no indication of error.

(48)

Silent Data Corruption vs Crash

(49)

Silent Data Corruption vs Crash

(50)

Silent Data Corruption vs Crash

$27.00

(51)

Silent Data Corruption vs Crash

$2,700

(52)

Silent Data Corruption vs Crash

Neutron-induced faults can also induce Application Crash or Device Reboot

(53)

Silent Data Corruption vs Crash

Neutron-induced faults can also induce Application Crash or Device Reboot

Don’t (always) blame Microsoft/Apple

(54)

Radiation Test Facilities

ChipIR @ Rutherford Appleton Lab., UK

LANSCE @ Los Alamos National Lab., USA

(55)

Experiment @ChipIR

(56)

Experiment @ChipIR

(57)

Experiment @ChipIR

(58)

Experiment @ChipIR

AMD

(59)

Experiment @ChipIR

(60)

Radiation Sensitivity

A

x

B M

Cross Section flux (@sea level) = Error Rate

2048 double

Example: Matrix Multiplication on “Device A”

(61)

Radiation Sensitivity

A

x

B M

Cross Section flux (@sea level) = Error Rate

2048 double

Probability for 1 neutron to generate an error

(62)

Radiation Sensitivity

A

x

B M

3.4610⁴ FIT

1 error every 3,2 years Cross Section flux (@sea level) = Error Rate

2.6610^-6 cm² 13 n/cm²/h =

2048 double

Probability for 1 neutron to generate an error

(63)

TITAN error rate

TITAN supercomputer has 18,000 devices:

18,000 errors every 3,2 years:

14 errors per day!

(64)

Cars in the EU

There are 268 millions cars in the European Union.

On the average, according to the EU’s motor vehicle fleet, 4% of cars are on the move: ~10.7 millions cars

(65)

Cars in the EU

1.7 x 10⁷errors every 3.2 years… ~380 errors per hour There are 268 millions cars in

the European Union.

(66)

Cars in the EU

...there are ~465 millions mobile phones in the EU

1.7 x 10⁷errors every 3.2 years… ~380 errors per hour There are 268 millions cars in

the European Union.

(67)

January 2012

Simulation of environmental effects in the atmosphere over time

Expert scientist (Daniel Duffy @ NASA GSFC) visually identified a high amount of sea salt aerosol in the atmosphere in the simulation

Supercomputer SDC example

(68)

January 2012

Simulation of environmental effects in the atmosphere over time

Expert scientist (Daniel Duffy @ NASA GSFC) visually identified a high amount of sea salt aerosol in the atmosphere in the simulation

Supercomputer SDC example

*courtesy Sean Blanchard, LANL

(69)

HPC bad stories

Virginia Tech’s Advanced Computing facility built a supercomputer called Big Mac in 2003

● 1,100 Apple Power Mac G5

● Couldn't boot because of the failure rate

● Power Mac G5 did not have error-correcting code (ECC) memory

● Big Mac was broken apart and sold on-line

Jaguar – (2009 #1 Top500 list) ● 360 terabytes of main memory ● 350 ECC errors per minute

ASCI Q – (2002 #2 in Top500 list)

● Built with AlphaServers

● 7 Teraflops

● Couldn't run more than 1h without crash

● After putting metal side it could last 6h before crash

(70)

Aerospace Applications

Radiation is certainly an issue for those developing aerospace and military applications

Levels of radiation in space and very high altitudes are much higher than at earth surface

(71)

Mars Rover Landing

(72)

14 mins of boringness

Each step of Curiosity is controlled from the Earth…

…it takes 14 mins for the signal to reach Mars!

(73)

Autonomous Vehicles for Space Exploration

autonomous helicopter to guide next Mars curiosity

(74)

Autonomous Vehicles for Space Exploration

autonomous helicopter to guide next Mars curiosity

In 2018 I was at JPL to (very very marginally)

collaborate to improve the reliability of the AI system of the Mars helicopter

(75)

Outline

- Some (interesting) results on self-driven cars - Some (interesting) efficient solutions

(76)

Automotive Applications

Soft Error leading to a single bit flip caused a fatal error in 2007

(77)

Automotive Applications

Soft Error leading to a single bit flip caused a fatal error in 2007

What’s the new trend in the

Automotive Market????

(78)

Self Driving Car

The new trend for automotive market is Self Driving Car!

(79)

Self Driving Car

The new trend for automotive market is Self Driving Car!

(80)

Functional Safety

ISO26262 - Automotive Safety Integrity (ASIL)

level D, which is the highest classification of injury risk 1 – Detect 99% of permanent and transient faults

2 – Error rate < 10 FIT (10 errors in 10⁹h of operation)

(81)

Functional Safety

2 – Error rate < 10 FIT (10 errors in 10⁹h of operation) 1 system error (Feb. 2016) in 1.5x10⁶ miles driven

(60,000 h - 30,000 h driven)

(82)

Functional Safety

2 – Error rate < 10 FIT (10 errors in 10⁹h of operation) 5,657,000 crashes caused by human driver “error” in 2013 in USA (6x10¹⁰ h driven).

human driver error rate:

28,582 FIT!

(83)

1 10 100 1000 10000 100000

Functional Safety - review

FIT [errors in 109 h]

It is really challenging to design ASIL-D compliant self-driving systems

(84)

1 10 100 1000 10000 100000

Functional Safety - review

Google Car Humans

(USA citizens)

ASIL-D compliant self-driving car FIT [errors in 109 h]

It is really challenging to design ASIL-D compliant self-driving systems

ASIL-D compliant self-driving cars could save a lot of lives

(85)

Radiation Issue for self-driven cars

FIT [errors in 109 h]

Soft Error Rate*

*Oliveira et al.,

Trans. Comp. 2016

Computing devices architecture is designed to improve performances, not reliability.

(86)

Neutron-induced errors

Objects Detection System:

embedded GPUs detects objects running CNNs

(87)

ChipIR Experimental Setup

neutrons

K40 – Kepler 28nm, Planar ECC available

TX1 – Maxwell

V100– Pascal 16nm, FinFET

(88)

Neutron-induced errors

Objects Detection System:

embedded GPUs detects objects running CNNs

Observed error

(89)

Examples of observed errors

99% person

Expected

(90)

Examples of observed errors

99% person 99% person

Expected Tolerable

Slight modification of detection

(91)

Examples of observed errors

99% person 99% person Nothing’s there

Expected Tolerable

Slight modification

Critical

Missing an object

(92)

Examples of observed errors

(93)

Examples of observed errors

99% elephant

False positive

(94)

Examples of observed errors

99% elephant

False positive

Unnecessary stops

(95)

Examples of observed errors

99% elephant

False positive

*SC17 paper by BCU

Classification Error

wrong object detects

(96)

Results – FIT

1 10 100 1000

Tegra X1 K40 Unhardened K40 ECC Titan X Unhardened

SDC FIT SDC Wrong Detection Crash

Normalized FIT [a.u.]

FIT: errors in 10⁹ h of operation

(97)

Results – FIT

1 10 100 1000

Crashes are always more probable than SDC (dissimilar to HPC codes).

(98)

Results – FIT

1 10 100 1000

Tegra X1 K40 Unhardened K40 ECC Titan X Unhardened

Not all SDCs affect detection!

(99)

Results – FIT

1 10 100 1000

ECC does not reduce critical SDCs (faults in unprotected resources have a great impact on CNN output)

(100)

Self-Driven Cars

Naïve (expensive) solutions in today’s self-driven cars

(101)

Self-Driven Cars

2x

4x

(102)

Self-Driven Cars

2x

4x Replication is very costly!

And it might not work always!

We need to find smarter ways to detect neutron-induced errors.

(103)

Outline

- Some (interesting) results on self-driven cars - Some (interesting) efficient solutions

(104)

Convolutional Neural Networks

(105)

Convolutional Neural Networks

Convolution: Matrix-Multiply related operations to extract features from frames

(106)

Convolutional Neural Networks

Maxpool: only the max element is propagated

(107)

Convolutional Neural Networks

Maxpool: only the max

¾ of faults won’t propagate

(108)

Algorithm-Based Fault Tolerance

70% of CNN operations are GEMM-related 10% are the other kernels

20% CPUxGPU operations.

(109)

Algorithm-Based Fault Tolerance

70% of CNN operations are GEMM-related 10% are the other kernels

20% CPUxGPU operations.

Proposed hardening: ABFT for Matrix multiplication*

x

A B

checksum

∑

=

M

col-check

row-check row-sum

X X

*Huang and Abraham ’84 Rech et al., TNS ‘13

(110)

ABFT works!

1 10 100 1000

Tegra X1 K40

Unhardened

K40 ABFT K40 ECC Titan X Unhardened

Titan X ABFT SDC FIT SDC Wrong Detection Crash

(111)

10 100

1000 SDC FIT SDC Wrong Detection Crash

ABFT works!

Our ABFT corrects 87% of Critical SDC, outperforming NVIDIA ECC

(112)

L_8:

ISCADD R7, R5, R3, 0x2;

IADD R1, R1,-0x4;

STL [R1], R4;

IADD R4, R7, 0x0;

JCAL `(_users_function);

LDL R4, [R1];

IADD R1, R1, 0x4;

STS [R7], R2;

BAR.SYNC 0x0;

...

L_8:

ISCADD R7, R5, R3, 0x2;

STS [R7], R2;

BAR.SYNC 0x0;

MOV R0, c[0x0][0x28];

SHF.R R0, R0, 0x1, RZ;

ISETP.EQ.AND P0, PT, …

@P0 BRA `(.L_12);

Fault-Injection CAROL-FI

(113)

Smart-Pool

4,5E+6

If the value of the element to propagate is

not reasonable (10x max value of a fault-free

execution) we detect the error and discard the frame.

4 additional variables, detection in O(1)

(114)

Smart-Pool

4,5E+6

If the value of the element to propagate is

not reasonable (10x max value of a fault-free

execution) we detect the error and discard the frame.

4 additional variables, detection in O(1)

Smart-pool detects more than

90% of critical SDCs

(115)

Tolerable vs Critical SDCs

analyze SDC criticality: are there “acceptable” SDCs?

example: CLAMR (DOE workload) experimental result SDC causes a

single pixel error

SDC causes a huge error

Do we really need to protect everything?

Duplicate only what REALLY matters

(116)

Selective Hardening

- Register-level

- Source code-level - Kernel-level

Protect only the resources (or code portions)

whose corruption (significantly) affects computation

We work at different levels of abstraction:

1. Beam experiments (realistic error model) 2. Fault-Injection to detect critical resources

ARM – GPU – x86 - FPGA

3. Evaluate the efficiency of protecting resources (overhead vs benefit)

(117)

Selective Hardening

- Register-level

- Source code-level

Protect only the resources (or code portions)

whose corruption (significantly) affects computation

We work at different levels of abstraction:

1. Beam experiments (realistic error model) 2. Fault-Injection to detect critical resources

ARM – GPU – x86 - FPGA

3. Evaluate the efficiency of protecting resources (overhead vs benefit)

Better efficiency

(118)

0 25 50 75 100

0 5 10 15

Number of variables protected

Percentage [%]

Error Coverage Overhead

LavaMD

overhead:

# instructions that update the protected

variable

Error Coverage:

achieved duplicating the code portions

that update the variables to protect

(119)

0 25 50 75 100

0 5 10 15

Percentage [%]

LavaMD

50% of SDCs detected with 6 variables protected

(~1% overhead)

(120)

Mixed-Precision Data and Operations

Double (64 bit)

Single (32 bit)

Half (16 bit)

Short INT (8 bit)

(1 bit)

Mixed-precision architectures can improve performance, efficiency and reliability.

(121)

Mixed-Precision Data and Operations

Half (16 bit)

Short INT (8 bit)

(1 bit)

Reduced precision operations are particularly interesting for Neural Networks training and execution

(122)

Mixed-Precision Hardening

GPUs have dedicated functional units to execute FP64, FP32, FP16 operations and Tensor Core

(123)

Mixed-Precision Hardening

When a FP64 application is

executed, the other units are idle.

used idle

(124)

Mixed-Precision Hardening

Our idea is to run the same

code, in parallel, in the available FP32 cores.

used DWC

(125)

Mixed-Precision Hardening

Our idea is to run the same

code, in parallel, in the available FP32 cores.

Reduced-Precision RP-DWC

used DWC

(126)

Mixed-Precision Hardening

(127)

Mixed-Precision Hardening

Detection goes from 57% to 76%. As expected, lower than traditional DWC (~80-90%)

(128)

Mixed-Precision Hardening

Overhead goes from ~35% for fine grain (a lot of comparisons) to ~0.1% for coarse grain!

Detection goes from 57% to 76%. As expected, lower than traditional DWC (~80-90%)

(129)

Mixed-Precision Hardening

More than 50% of detected errors changes of at least 1% the output value

99.9% of undetected errors have a negligible impact on output

value (~0.01%)

Undetectable errors are less impactful than detected errors

(130)

Design better IPs

ARM IP

(131)

Design better IPs

ARM IP Silicon beam:

realistic fault model

(132)

Design better IPs

realistic fault model FPGA

injection

improved IPs evaluation

(133)

Design better IPs

injection uArch

architectural

(134)

Design better IPs

injection uArch

simulator

architectural vulnerabilities

(135)

Design better IPs

injection uArch

architectural

(136)

Design better IPs

injection uArch

simulator

architectural vulnerabilities

2.67 3.84 1.26

15.73 -1.06

1.22 2.81 -6.24

-1.77

1.60 -1.55

-1.65

8.39

-15 -10 -5 0 5 10 15 20 25

Susan S Susan E Susan C StringSearch Rijndael D Rijndael E Qsort MatMul Jpeg D Jpeg C FFT Dijkstra CRC32

(137)

Design better IPs

injection uArch

architectural

2.67 3.84 1.26

15.73 -1.06

1.22 2.81 -6.24

-1.77

1.60 -1.55

-1.65

8.39

-15 -10 -5 0 5 10 15 20 25

Susan S Susan E Susan C StringSearch Rijndael D Rijndael E Qsort MatMul Jpeg D Jpeg C FFT Dijkstra CRC32

(138)

Outline

- Conclusions and Future Work

(139)

Conclusions and Future Work

-Reliability is a serious issue for supercomputers and safety-critical applications

-Novel architectural solutions impact the error rate of devices.

-We need to focus on critical errors, critical variables, critical resources to have efficient hardening

-Future work: find the best trade-off between

performances and reliability. Can we go faster in a

(140)

Impact on Society Award 2020

Rutherford Appleton Laboratory, UK

(141)

Marie Curie Fellow 2020

Pursuing Efficient Reliability of Object Detection for automotive and aerospace applications” (PERIOD)

(142)

Acknowledgments

Caio Lunardi Daniel Oliveira Fernando Santos Lucas Klein

Pedro Basso

Chris Frost

Carlo Cazzaniga

Gus Sinnis

Nathan DeBardeleben Sean Blanchard

Heather Quinn Steve Wender

Timothy Tsai Siva Hari

Steve Keckler

Matteo Sonza Reorda Luca Sterpone

Pete Harrold

Reiley Jeyapaul Steve Guertin Gregory Allen Andrew Daniel

AMD

Vilas Sridharan

Sudhanva Gurumurthi

Marcelo Brandalero

(143)

From Atomic Bombs to Supercomputers and Autonomous Vehicles: a Year of Research in the Los Alamos National Labs