From Atomic Bombs to
Supercomputers and Autonomous Vehicles: a Year of Research in the
Los Alamos National Labs
Paolo Rech
INF UFRGS, 12 August 2020
Outline
- Los Alamos National Lab - Rosen Scholar Fellow - Research on Reliability - Neutron-induced Errors
- Some (interesting) results on self-driven cars - Some (interesting) efficient solutions
- Conclusions and Future Work
Outline
- Los Alamos National Lab - Rosen Scholar Fellow
- Research on Reliability - Neutron-induced Errors
- Some (interesting) results on self-driven cars - Some (interesting) efficient solutions
- Conclusions and Future Work
Los Alamos
Los Alamos
Los Alamos National Laboratory
Manhattan Project Site Y
LANL – Department Of Energy
Los Alamos National Lab
Los Alamos, NM: 11,000 habitants. 2,500 mt. of altitude Los Alamos National Lab: 10,100 employees.
Annual budget: 2.2 billion USD
Los Alamos National Lab
>80 instable nuclear reactors.
Some of them are in unknown locations
Los Alamos, NM: 11,000 habitants. 2,500 mt. of altitude Los Alamos National Lab: 10,100 employees.
Annual budget: 2.2 billion USD
Los Alamos National Lab
Life in Los Alamos
Life in Los Alamos
Life in Los Alamos
Life in Los Alamos
Life in Los Alamos
George R. R. Martin lives in Santa Fe
Trinity Site, 1
stSunday of April
Atomic Bombs
Atomic Bombs
Los Alamos Nat. Lab. Discoveries
Enrico Fermi
John von Neumann
Los Alamos Nat. Lab. Discoveries
Los Alamos National Laboratory
Los Alamos National Laboratory
43,000 square feet (4,000 square meters) of computing in the classified (about the size of an American football field)
~10 production HPC supercomputers
Visualization theater and virtual reality cave 9.6MW available to the computing floor
Images extracted from LA-UR-12-20056
Los Alamos National Laboratory
Outline
- Los Alamos National Lab - Rosen Scholar Fellow - Research on Reliability - Neutron-induced Errors
- Some (interesting) results on self-driven cars - Some (interesting) efficient solutions
- Conclusions and Future Work
Los Alamos National Laboratory
Rosen Scholar award: assigned every year to the most promising researcher for LANL mission.
Reliability of autonomous vehicles and supercomputers
Rosen Scholar Fellow
Luis Rosen
Manhattan project co-director with Oppenheimer and Von Neumann.
Creator of the first proton accelerator First director of LANSCE
Rosen Scholar Fellowship: $ for realizing a project,
Rosen Scholar Fellow
Gus Sinnis (lab director)
Rosen Scholar Fellow (Paolo)
ISR (Intelligence and Space Research) WNR (Weapon Nuclear Research)
USRC (Ultrascale Systems Research Center)
Summer School
Radiation Effects Summer School July - August 2019
- computing system reliability
- approximate computing and ANNs reliability - HPC reliability
- fault tolerance and efficiency - radiation experiments
Open to students from any country
Summer School
Radiation Effects Summer School
Outline
- Los Alamos National Lab - Rosen Scholar Fellow - Research on Reliability - Neutron-induced Errors
- Some (interesting) results on self-driven cars - Some (interesting) efficient solutions
- Conclusions and Future Work
Research Philosophy
Radiation tests
International collaborations Fault Injection
Efficient SW/HW hardening strategies
architecture and algorithm analyses
Reliability importance
Reliability importance
Reliability importance
neutron detectors inside Trinity Hall and ran Monte- Carlo simulations to characterize the environment.
@LANL supercomputers occupy more than 4,000mq
Supercomputer Environment
Water cooling system increases the probability of errors
Supercomputer Environment
We have modeled and simulated with MCNP the water cooling
system of supercomputers.
Thermal neutron flux can be increased up to 40%-80%.
Reliability importance
Reliability importance
Reliability is a critical issue!
One device, different reliability requirements
Cloud
Good High Very High
Reliability Requirements
Consumer Data Center HPC
automotive
Cloud
Good High Very High
ARM CPU co-processor GPU FPGA/SoC
Consumer Data Center HPC
automotive
One device, different reliability requirements
Cloud
Good High Very High
ARM CPU co-processor GPU FPGA/SoC
Consumer Data Center HPC
automotive
Reliability Requirements
One device, different reliability requirements
Outline
- Los Alamos National Lab - Rosen Scholar Fellow - Research on Reliability - Neutron-induced Errors
- Some (interesting) results on self-driven cars - Some (interesting) efficient solutions
- Conclusions and Future Work
Terrestrial Radiation Environment
Galactic cosmic rays interact with atmosphere shower of energetic particles:
Muons, Pions, Protons, Gamma rays, Neutrons
13 n/(cm
2h) @sea level*
*JEDEC JESD89A Standard
Cosmic Ray
Cosmic rays could be so energetic to pass the Van Allen belts
Humans vs Electronics
Chernobyl - 1986 Joker – radhard robot
lasted few seconds
Humans (not just Russians) are safe for ~10mins
Radiation Effects - Soft Errors
• One or more bit-flips
Single Event Upset (SEU) Multiple Bit Upset (MBU)
Soft Errors: the device is not permanently damaged, but the particle may generate:
0110010010010011 1101001101001001 0010010010010010 1000100010000010
Radiation Effects - Soft Errors
• One or more bit-flips
Single Event Upset (SEU) Multiple Bit Upset (MBU)
0110010010010011 1101001011001001 0010010100010010 1000100010000010
IONIZING PARTICLE
Soft Errors: the device is not permanently damaged, but the particle may generate:
Radiation Effects - Soft Errors
• One or more bit-flips
Single Event Upset (SEU) Multiple Bit Upset (MBU)
Soft Errors: the device is not permanently damaged, but the particle may generate:
• Transient voltage pulse
Single Event Transient (SET) 5 + 6
IONIZING PARTICLE
0110010010010011 1101001011001001 0010010100010010 1000100010000010
IONIZING PARTICLE
4
Silent Data Corruption vs Crash
Silent Data Corruption: the application provides
wrong answers. Silent = no flag/no indication of error.
Silent Data Corruption vs Crash
Silent Data Corruption: the application provides
wrong answers. Silent = no flag/no indication of error.
Silent Data Corruption vs Crash
Silent Data Corruption: the application provides
wrong answers. Silent = no flag/no indication of error.
Silent Data Corruption vs Crash
$27.00
Silent Data Corruption: the application provides
wrong answers. Silent = no flag/no indication of error.
Silent Data Corruption vs Crash
$2,700
Silent Data Corruption: the application provides
wrong answers. Silent = no flag/no indication of error.
Silent Data Corruption vs Crash
Neutron-induced faults can also induce Application Crash or Device Reboot
Silent Data Corruption vs Crash
Neutron-induced faults can also induce Application Crash or Device Reboot
Don’t (always) blame Microsoft/Apple
Radiation Test Facilities
ChipIR @ Rutherford Appleton Lab., UK
LANSCE @ Los Alamos National Lab., USA
Experiment @ChipIR
Experiment @ChipIR
Experiment @ChipIR
Experiment @ChipIR
AMD
Experiment @ChipIR
Radiation Sensitivity
A
x
B MCross Section flux (@sea level) = Error Rate
2048 double
2048 double
Example: Matrix Multiplication on “Device A”
Radiation Sensitivity
A
x
B MCross Section flux (@sea level) = Error Rate
2048 double
2048 double
Probability for 1 neutron to generate an error
Example: Matrix Multiplication on “Device A”
Radiation Sensitivity
A
x
B M3.46104 FIT
1 error every 3,2 years Cross Section flux (@sea level) = Error Rate
2.6610-6 cm2 13 n/cm2/h =
2048 double
2048 double
Example: Matrix Multiplication on “Device A”
Probability for 1 neutron to generate an error
TITAN error rate
TITAN supercomputer has 18,000 devices:
18,000 errors every 3,2 years:
14 errors per day!
Cars in the EU
There are 268 millions cars in the European Union.
On the average, according to the EU’s motor vehicle fleet, 4% of cars are on the move: ~10.7 millions cars
Cars in the EU
1.7 x 107 errors every 3.2 years… ~380 errors per hour There are 268 millions cars in
the European Union.
On the average, according to the EU’s motor vehicle fleet, 4% of cars are on the move: ~10.7 millions cars
Cars in the EU
...there are ~465 millions mobile phones in the EU
1.7 x 107 errors every 3.2 years… ~380 errors per hour There are 268 millions cars in
the European Union.
On the average, according to the EU’s motor vehicle fleet, 4% of cars are on the move: ~10.7 millions cars
January 2012
Simulation of environmental effects in the atmosphere over time
Expert scientist (Daniel Duffy @ NASA GSFC) visually identified a high amount of sea salt aerosol in the atmosphere in the simulation
Supercomputer SDC example
January 2012
Simulation of environmental effects in the atmosphere over time
Expert scientist (Daniel Duffy @ NASA GSFC) visually identified a high amount of sea salt aerosol in the atmosphere in the simulation
Supercomputer SDC example
*courtesy Sean Blanchard, LANL
HPC bad stories
Virginia Tech’s Advanced Computing facility built a supercomputer called Big Mac in 2003
● 1,100 Apple Power Mac G5
● Couldn't boot because of the failure rate
● Power Mac G5 did not have error-correcting code (ECC) memory
● Big Mac was broken apart and sold on-line
Jaguar – (2009 #1 Top500 list) ● 360 terabytes of main memory ● 350 ECC errors per minute
ASCI Q – (2002 #2 in Top500 list)
● Built with AlphaServers
● 7 Teraflops
● Couldn't run more than 1h without crash
● After putting metal side it could last 6h before crash
Aerospace Applications
Radiation is certainly an issue for those developing aerospace and military applications
Levels of radiation in space and very high altitudes are much higher than at earth surface
Mars Rover Landing
14 mins of boringness
Each step of Curiosity is controlled from the Earth…
…it takes 14 mins for the signal to reach Mars!
Autonomous Vehicles for Space Exploration
autonomous helicopter to guide next Mars curiosity
Autonomous Vehicles for Space Exploration
autonomous helicopter to guide next Mars curiosity
In 2018 I was at JPL to (very very marginally)
collaborate to improve the reliability of the AI system of the Mars helicopter
Outline
- Los Alamos National Lab - Rosen Scholar Fellow - Research on Reliability - Neutron-induced Errors
- Some (interesting) results on self-driven cars - Some (interesting) efficient solutions
- Conclusions and Future Work
Automotive Applications
Soft Error leading to a single bit flip caused a fatal error in 2007
Automotive Applications
Soft Error leading to a single bit flip caused a fatal error in 2007
What’s the new trend in the
Automotive Market????
Self Driving Car
The new trend for automotive market is Self Driving Car!
Self Driving Car
The new trend for automotive market is Self Driving Car!
Functional Safety
ISO26262 - Automotive Safety Integrity (ASIL)
level D, which is the highest classification of injury risk 1 – Detect 99% of permanent and transient faults
2 – Error rate < 10 FIT (10 errors in 109h of operation)
Functional Safety
ISO26262 - Automotive Safety Integrity (ASIL)
level D, which is the highest classification of injury risk 1 – Detect 99% of permanent and transient faults
2 – Error rate < 10 FIT (10 errors in 109h of operation) 1 system error (Feb. 2016) in 1.5x106 miles driven
(60,000 h - 30,000 h driven)
Functional Safety
ISO26262 - Automotive Safety Integrity (ASIL)
level D, which is the highest classification of injury risk 1 – Detect 99% of permanent and transient faults
2 – Error rate < 10 FIT (10 errors in 109h of operation) 5,657,000 crashes caused by human driver “error” in 2013 in USA (6x1010 h driven).
human driver error rate:
28,582 FIT!
1 10 100 1000 10000 100000
Functional Safety - review
FIT [errors in 109 h]
It is really challenging to design ASIL-D compliant self-driving systems
1 10 100 1000 10000 100000
Functional Safety - review
Google Car Humans
(USA citizens)
ASIL-D compliant self-driving car FIT [errors in 109 h]
It is really challenging to design ASIL-D compliant self-driving systems
ASIL-D compliant self-driving cars could save a lot of lives
Radiation Issue for self-driven cars
FIT [errors in 109 h]
Soft Error Rate*
*Oliveira et al.,
Trans. Comp. 2016
Computing devices architecture is designed to improve performances, not reliability.
Neutron-induced errors
Objects Detection System:
embedded GPUs detects objects running CNNs
ChipIR Experimental Setup
neutrons
K40 – Kepler 28nm, Planar ECC available
TX1 – Maxwell
V100– Pascal 16nm, FinFET
Neutron-induced errors
Objects Detection System:
embedded GPUs detects objects running CNNs
Observed error
Examples of observed errors
99% person
Expected
Examples of observed errors
99% person 99% person
Expected Tolerable
Slight modification of detection
Examples of observed errors
99% person 99% person Nothing’s there
Expected Tolerable
Slight modification
Critical
Missing an object
Examples of observed errors
Examples of observed errors
99% elephant
False positive
Examples of observed errors
99% elephant
False positive
Unnecessary stops
Examples of observed errors
99% elephant
False positive
*SC17 paper by BCU
Classification Error
wrong object detects
Results – FIT
1 10 100 1000
Tegra X1 K40 Unhardened K40 ECC Titan X Unhardened
SDC FIT SDC Wrong Detection Crash
Normalized FIT [a.u.]
FIT: errors in 109 h of operation
Results – FIT
1 10 100 1000
SDC FIT SDC Wrong Detection Crash
Normalized FIT [a.u.]
Crashes are always more probable than SDC (dissimilar to HPC codes).
Results – FIT
1 10 100 1000
Tegra X1 K40 Unhardened K40 ECC Titan X Unhardened
SDC FIT SDC Wrong Detection Crash
Normalized FIT [a.u.]
Not all SDCs affect detection!
Results – FIT
1 10 100 1000
SDC FIT SDC Wrong Detection Crash
Normalized FIT [a.u.]
ECC does not reduce critical SDCs (faults in unprotected resources have a great impact on CNN output)
Self-Driven Cars
Naïve (expensive) solutions in today’s self-driven cars
Self-Driven Cars
Naïve (expensive) solutions in today’s self-driven cars
2x
4x
Self-Driven Cars
Naïve (expensive) solutions in today’s self-driven cars
2x
4x Replication is very costly!
And it might not work always!
We need to find smarter ways to detect neutron-induced errors.
Outline
- Los Alamos National Lab - Rosen Scholar Fellow - Research on Reliability - Neutron-induced Errors
- Some (interesting) results on self-driven cars - Some (interesting) efficient solutions
- Conclusions and Future Work
Convolutional Neural Networks
Convolutional Neural Networks
Convolution: Matrix-Multiply related operations to extract features from frames
Convolutional Neural Networks
Convolution: Matrix-Multiply related operations to extract features from frames
Maxpool: only the max element is propagated
Convolutional Neural Networks
Convolution: Matrix-Multiply related operations to extract features from frames
Maxpool: only the max
¾ of faults won’t propagate
Algorithm-Based Fault Tolerance
70% of CNN operations are GEMM-related 10% are the other kernels
20% CPUxGPU operations.
Algorithm-Based Fault Tolerance
70% of CNN operations are GEMM-related 10% are the other kernels
20% CPUxGPU operations.
Proposed hardening: ABFT for Matrix multiplication*
x
A B
checksum
checksum
∑
=
Mcol-check
row-check row-sum
X X
*Huang and Abraham ’84 Rech et al., TNS ‘13
ABFT works!
Normalized FIT [a.u.]
1 10 100 1000
Tegra X1 K40
Unhardened
K40 ABFT K40 ECC Titan X Unhardened
Titan X ABFT SDC FIT SDC Wrong Detection Crash
Normalized FIT [a.u.]
10 100
1000 SDC FIT SDC Wrong Detection Crash
ABFT works!
Our ABFT corrects 87% of Critical SDC, outperforming NVIDIA ECC
L_8:
ISCADD R7, R5, R3, 0x2;
IADD R1, R1,-0x4;
STL [R1], R4;
IADD R4, R7, 0x0;
JCAL `(_users_function);
LDL R4, [R1];
IADD R1, R1, 0x4;
STS [R7], R2;
BAR.SYNC 0x0;
...
L_8:
ISCADD R7, R5, R3, 0x2;
STS [R7], R2;
BAR.SYNC 0x0;
MOV R0, c[0x0][0x28];
SHF.R R0, R0, 0x1, RZ;
ISETP.EQ.AND P0, PT, …
@P0 BRA `(.L_12);
Fault-Injection CAROL-FI
Smart-Pool
4,5E+6
If the value of the element to propagate is
not reasonable (10x max value of a fault-free
execution) we detect the error and discard the frame.
4 additional variables, detection in O(1)
Smart-Pool
4,5E+6
If the value of the element to propagate is
not reasonable (10x max value of a fault-free
execution) we detect the error and discard the frame.
4 additional variables, detection in O(1)
Smart-pool detects more than
90% of critical SDCs
Tolerable vs Critical SDCs
analyze SDC criticality: are there “acceptable” SDCs?
example: CLAMR (DOE workload) experimental result SDC causes a
single pixel error
SDC causes a huge error
Do we really need to protect everything?
Duplicate only what REALLY matters
Selective Hardening
- Register-level
- Source code-level - Kernel-level
Protect only the resources (or code portions)
whose corruption (significantly) affects computation
We work at different levels of abstraction:
1. Beam experiments (realistic error model) 2. Fault-Injection to detect critical resources
ARM – GPU – x86 - FPGA
3. Evaluate the efficiency of protecting resources (overhead vs benefit)
Selective Hardening
- Register-level
- Source code-level
Protect only the resources (or code portions)
whose corruption (significantly) affects computation
We work at different levels of abstraction:
1. Beam experiments (realistic error model) 2. Fault-Injection to detect critical resources
ARM – GPU – x86 - FPGA
3. Evaluate the efficiency of protecting resources (overhead vs benefit)
Better efficiency
0 25 50 75 100
0 5 10 15
Number of variables protected
Percentage [%]
Error Coverage Overhead
LavaMD
overhead:
# instructions that update the protected
variable
Error Coverage:
achieved duplicating the code portions
that update the variables to protect
0 25 50 75 100
0 5 10 15
Percentage [%]
LavaMD
50% of SDCs detected with 6 variables protected
(~1% overhead)
Mixed-Precision Data and Operations
Double (64 bit)
Single (32 bit)
Half (16 bit)
Short INT (8 bit)
(1 bit)
Mixed-precision architectures can improve performance, efficiency and reliability.
Mixed-Precision Data and Operations
Half (16 bit)
Short INT (8 bit)
(1 bit)
Reduced precision operations are particularly interesting for Neural Networks training and execution
Mixed-Precision Hardening
GPUs have dedicated functional units to execute FP64, FP32, FP16 operations and Tensor Core
Mixed-Precision Hardening
GPUs have dedicated functional units to execute FP64, FP32, FP16 operations and Tensor Core
When a FP64 application is
executed, the other units are idle.
used idle
Mixed-Precision Hardening
GPUs have dedicated functional units to execute FP64, FP32, FP16 operations and Tensor Core
When a FP64 application is
executed, the other units are idle.
Our idea is to run the same
code, in parallel, in the available FP32 cores.
used DWC
Mixed-Precision Hardening
GPUs have dedicated functional units to execute FP64, FP32, FP16 operations and Tensor Core
When a FP64 application is
executed, the other units are idle.
Our idea is to run the same
code, in parallel, in the available FP32 cores.
Reduced-Precision RP-DWC
used DWC
Mixed-Precision Hardening
Mixed-Precision Hardening
Detection goes from 57% to 76%. As expected, lower than traditional DWC (~80-90%)
Mixed-Precision Hardening
Overhead goes from ~35% for fine grain (a lot of comparisons) to ~0.1% for coarse grain!
Detection goes from 57% to 76%. As expected, lower than traditional DWC (~80-90%)
Mixed-Precision Hardening
More than 50% of detected errors changes of at least 1% the output value
99.9% of undetected errors have a negligible impact on output
value (~0.01%)
Undetectable errors are less impactful than detected errors
Design better IPs
ARM IP
Design better IPs
ARM IP Silicon beam:
realistic fault model
Design better IPs
ARM IP Silicon beam:
realistic fault model FPGA
injection
improved IPs evaluation
Design better IPs
ARM IP Silicon beam:
realistic fault model FPGA
injection uArch
improved IPs evaluation
architectural
Design better IPs
ARM IP Silicon beam:
realistic fault model FPGA
injection uArch
simulator
improved IPs evaluation
architectural vulnerabilities
Design better IPs
ARM IP Silicon beam:
realistic fault model FPGA
injection uArch
improved IPs evaluation
architectural
Design better IPs
ARM IP Silicon beam:
realistic fault model FPGA
injection uArch
simulator
improved IPs evaluation
architectural vulnerabilities
2.67 3.84 1.26
15.73 -1.06
1.22 2.81 -6.24
-1.77
1.60 -1.55
-1.65
8.39
-15 -10 -5 0 5 10 15 20 25
Susan S Susan E Susan C StringSearch Rijndael D Rijndael E Qsort MatMul Jpeg D Jpeg C FFT Dijkstra CRC32
Design better IPs
ARM IP Silicon beam:
realistic fault model FPGA
injection uArch
improved IPs evaluation
architectural
2.67 3.84 1.26
15.73 -1.06
1.22 2.81 -6.24
-1.77
1.60 -1.55
-1.65
8.39
-15 -10 -5 0 5 10 15 20 25
Susan S Susan E Susan C StringSearch Rijndael D Rijndael E Qsort MatMul Jpeg D Jpeg C FFT Dijkstra CRC32
Outline
- Los Alamos National Lab - Rosen Scholar Fellow - Research on Reliability - Neutron-induced Errors
- Some (interesting) results on self-driven cars - Some (interesting) efficient solutions
- Conclusions and Future Work
Conclusions and Future Work
-Reliability is a serious issue for supercomputers and safety-critical applications
-Novel architectural solutions impact the error rate of devices.
-We need to focus on critical errors, critical variables, critical resources to have efficient hardening
-Future work: find the best trade-off between
performances and reliability. Can we go faster in a
Impact on Society Award 2020
Rutherford Appleton Laboratory, UK
Marie Curie Fellow 2020
Pursuing Efficient Reliability of Object Detection for automotive and aerospace applications” (PERIOD)
Acknowledgments
Caio Lunardi Daniel Oliveira Fernando Santos Lucas Klein
Pedro Basso
Chris Frost
Carlo Cazzaniga
Gus Sinnis
Nathan DeBardeleben Sean Blanchard
Heather Quinn Steve Wender
Timothy Tsai Siva Hari
Steve Keckler
Matteo Sonza Reorda Luca Sterpone
Pete Harrold
Reiley Jeyapaul Steve Guertin Gregory Allen Andrew Daniel
AMD
Vilas Sridharan
Sudhanva Gurumurthi
Marcelo Brandalero