Designing Programmable Platforms:
From ASIC to ASIP
MPSoC 2005
Heinrich Meyr CoWare Inc., San Jose
and
Integrated Signal Processing Systems (ISS),
Aachen University of Technology, Germany
Agenda
Facts & Conclusions
Heterogeneous MPSoC
» Energy Efficiency vs. Flexibility
» How to explore the Design Space?
ASIP Design
Economics of SoC Development
Conclusions
Facts & Conclusions
Core Proposition
ASIP based Platforms (heterogeneous MPSoC)
Trade-off between Flexibility and Energy Efficiency
Heterogeneous MPSoC
Architectural Objectives
Need more MOPS/Watt and MOPS/mm² to minimize the global performance measure for battery driven devices
Energy / decoded Bit = (Joule/Bit)
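The energy-per-decoded-bit measure can be made concrete as power divided by decoded bit rate. The following is a minimal sketch, assuming hypothetical power and throughput numbers that are not taken from the slides; the function names are illustrative only.

```python
def energy_per_bit(power_watt: float, throughput_bit_per_s: float) -> float:
    """Energy per decoded bit in joule/bit: power divided by decoded bit rate."""
    if throughput_bit_per_s <= 0:
        raise ValueError("throughput must be positive")
    return power_watt / throughput_bit_per_s


def mops_per_watt(mops: float, power_watt: float) -> float:
    """Computational efficiency: million operations per second per watt."""
    if power_watt <= 0:
        raise ValueError("power must be positive")
    return mops / power_watt


# Hypothetical decoder: 0.5 W while decoding 10 Mbit/s and sustaining 2000 MOPS.
joule_per_bit = energy_per_bit(0.5, 10e6)
efficiency = mops_per_watt(2000.0, 0.5)
print(f"{joule_per_bit:.1e} J/bit, {efficiency:.0f} MOPS/W")
```

For a fixed workload per bit, raising MOPS/Watt lowers Joule/Bit, which is why both appear as objectives for battery driven devices.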
Computational Efficiency vs. Flexibility
Source: T. Noll, RWTH Aachen
Enabling MP-SoC Design
System Level Tools I: Application & IP Creation
• System application design (algorithm domain, algorithmic exploration)
  » Matlab
  » SPW
  » System Studio
• High-level IP block design
  » Block specification (Architecture Description Language): LISATek Processor Synthesis, ConvergenSC Buscompiler
  » Block implementation (micro architecture domain): RTL Synthesis
System Level Tools II: MP-SoC Platform Design
• MP-SoC platform design
  » Abstract architecture: MP-SoC Intermediate Representation
  » Virtual prototype (SystemC Transaction Level Modeling): ConvergenSC Platform Creator
Processor Design Space
Design steps: Instruction Set Design, Micro Architecture Design, RTL Design, SoC Integration
[Figure: processor core with pipeline FE, DC, EX, WB and units butterfly 0, butterfly 1, load/store; core connected to cache, MMU, memory, and peripheral]
• Bypass?
• Pipeline length?
• Shared resources?
• Parallel execution units?
• Which cache required?
• Bus fast enough?
• Communication?
• Exploit regularity/parallelism in data flow/data storage
• VLIW, SIMD, ?
• Which instructions for compiler support?
• Instruction encoding?
• How many general purpose registers?
• Area constraints met?
• Clock frequency?
Design tasks:
- Instruction-Set Design
- Compiler Design
- Micro Architecture Design
- RTL Design
- RTL ISS Co-verification
- System Integration
- Embedded Software Simulation
Optimal design requires powerful tools and automation!
MESCAL 2: Inclusively identify the architectural space
The purpose of an architecture description language (e.g. LISA) is:
» To allow for an iterative design to efficiently explore architecture alternatives
» To jointly design "Architecture - Compiler" and on-chip communication
» To automatically generate hardware (path to implementation)
» To automatically generate tools: assembler, linker, compiler, simulator, co-simulation interfaces
All of this from a single model at various levels of temporal and spatial abstraction
Architecture Description Language based Processor Design
MESCAL 3: Efficiently describe the ASIP
LISA 2.0 - Abstraction Levels
[Figure: model accuracy along two axes, time and architecture, each ranging from "no details" to "very detailed"]
• Time accuracy: pseudo instructions, processor instructions, cycles, phases
• Architecture accuracy: pseudo resources (e.g. C variables); functional units, registers, memories; + pipelines; + IRQ, etc.
• Models: high level model, instruction accurate model, cycle accurate model, phase accurate model
LISATek Processor Designer (example: FFT Processor)
• LISATek IP Samples: DSP sample, VLIW sample, RISC sample, empty model
• Describe/adopt processor model: custom processor model (LISA 2.0 language)
• Generate tools: software tool chain for the application
• Function and instruction level profiling reveals hot spots → special purpose instructions
• Generate...: RTL, executable software platform, SoC integration kit (e.g. SystemC)
Rapid modeling and re-targetable simulation + code generation allow for joint optimization of application and architecture
MESCAL 3: Efficiently describe and evaluate the ASIP
MESCAL 5: Successfully deploy the ASIP
Current Work
[Figure: LISA-based design flow]
• Target Architecture → LISA Description → LISA Compiler
• Exploration: C-Compiler, Assembler, Linker, Simulator
  » Evaluation results: profile information, application performance
• Implementation: HDL Generator with optimization → SystemC, VHDL, Verilog output → gate level synthesis
  » Evaluation results: chip area, clock speed, power consumption
• Model verification & evaluation
Current work topics:
• Instruction Set Synthesis
• Memory architecture
• Verification
MESCAL 3: ... evaluate the ASIP
June 10, 2004
A Novel Approach for Flexible and Consistent ADL-driven ASIP Design
Gunnar Braun, Achim Nohl
CoWare, Inc.
DAC Booth #1844
www.CoWare.com
Weihua Sheng, Jianjiang Ceng, Manuel Hohenauer, Hanno Scharwächter, Rainer Leupers, Heinrich Meyr
Integrated Signal Processing Systems (ISS)
Aachen University of Technology, Germany
Introduction
Architecture Description Languages (ADL)
• Automatic generation of Software Toolkit (Compiler, Assembler, Linker, IS-Simulator)
• Architecture Exploration
• SystemC models, RTL code, verification tools, ...
Challenges:
• Different tools need different information
• Unambiguous, redundancy-free architecture model (rather than tools description)
• Multiple abstraction levels (instruction-accurate and/or cycle-accurate)
Tool Requirements: Compiler
Instruction patterns (operation on operands → instruction):
• rs + rt → rd: add rd = rs, rt
• rs * rt → rd: mul rd = rs, rt
• LD @ → rd: ld rd = @
• ST rs → @: st @ = rs
C Compiler:
• C: a = b + c;
• Assembly: add a = b, c
Tool Requirements: Simulator
Instruction behavior (instruction → sequence of operations):
add rd = rs, rt
    ALU_read (rs, rt);
    ALU_add ();
    Update_flags ();
    writeback (rd);
mul rd = rs, rt
    MUL_read (rs, rt);
    MUL_add ();
    Update_flags ();
    writeback (rd);
ld rd = @
    LSU_addrgen();
    data_bus.req();
    data_bus.read();
    writeback (rd);
st @ = rs
    LSU_addrgen();
    LSU_read(rs);
    data_bus.req();
    data_bus.write(rs);
Simulator:
• Machine Code: add r5 = r2, r1
• Simulation Code (C):
    ALU_read (r2, r1);
    ALU_add ();
    Update_flags ();
    writeback (r5);
ADL Model
One ADL model of the add instruction (rs + rt → rd; add rd = rs, rt) serves both tools:
SYNTAX {
  "ADD" dst, src1, src2 }
CODING {
  0b0010 dst src1 src2 }
BEHAVIOR {
  ALU_read (src1, src2);
  ALU_add ();
  Update_flags ();
  writeback (dst);
}
SEMANTICS {
  src1 + src2 → dst;
}
• C Compiler: a = b + c; → add a = b, c
• Simulator: add r5 = r2, r1 →
    ALU_read (r2, r1);
    ALU_add ();
    Update_flags ();
    writeback (r5);
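The idea of one model feeding two tools can be sketched in a few lines. This is an illustration only, not LISA or LISATek code: the `Instruction` record, `select_instruction`, and `expand_behavior` are hypothetical names, and the model holds just the ADD instruction shown on the slide.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Instruction:
    mnemonic: str          # SYNTAX: assembly mnemonic
    opcode: int            # CODING: opcode bits
    behavior: List[str]    # BEHAVIOR: micro-operations (HOW it executes)
    semantics: str         # SEMANTICS: C-level operator (WHAT it computes)


ADD = Instruction(
    mnemonic="ADD",
    opcode=0b0010,
    behavior=[
        "ALU_read ({src1}, {src2});",
        "ALU_add ();",
        "Update_flags ();",
        "writeback ({dst});",
    ],
    semantics="+",
)

# Two indexes over the same model: one per tool.
BY_SEMANTICS: Dict[str, Instruction] = {ADD.semantics: ADD}
BY_MNEMONIC: Dict[str, Instruction] = {ADD.mnemonic: ADD}


def select_instruction(dst: str, op: str, a: str, b: str) -> str:
    """Compiler view: map a C operation 'dst = a op b' to an instruction."""
    insn = BY_SEMANTICS[op]
    return f"{insn.mnemonic} {dst}, {a}, {b}"


def expand_behavior(asm: str) -> List[str]:
    """Simulator view: map an instruction to its sequence of operations."""
    mnemonic, operands = asm.split(" ", 1)
    dst, src1, src2 = (o.strip() for o in operands.split(","))
    insn = BY_MNEMONIC[mnemonic]
    return [step.format(dst=dst, src1=src1, src2=src2) for step in insn.behavior]


asm = select_instruction("r5", "+", "r2", "r1")
print(asm)
print("\n".join(expand_behavior(asm)))
```

Because both lookups read the same record, the compiler and the simulator cannot drift apart, which is the consistency argument made for a single ADL model.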
Problem Statement
• Compiler and Simulator need different information:
  » Compiler: C operation to instruction(s). WHAT is the instruction good for? Purpose?
  » Simulator: instructions to sequence of operations. HOW is the instruction executed? What actions to perform?
• Architecture Designer's Perspective:
[Figure: the semantics description "src1 + src2 → dst;" and the behavior description "ALU_read (src1, src2); ALU_add (); Update_flags (); writeback (dst);" side by side, linked by question marks]
Examples
ASDSP Core Design and FPGA Implementation
• Supports a special instruction set for FFT operation and the BMU instruction to improve the performance of OFDM communication
✓ FPGA implementation: iProve, Xilinx xc2v6000
✓ SEC 0.18 µm synthesis
• Gate count: 77,000
• Program memory: 4 Kbyte, data memory: 8 Kbyte
• Frequency: 290 MHz
• Power consumption: 0.87 W (3 mW/MHz)
Source: Myung Sunwoo, Ajou University
The ICORE
A low-power ASIP for Infineon DVB-T 2nd generation Single-Chip Receiver:
• ASIP for DVB-T acquisition and tracking algorithms
(sampling-clock-synchronization, interpolation / decimation, carrier frequency offset estimation)
• Harvard Architecture
• 60 mostly RISC-like instructions and special instructions for the CORDIC algorithm
• 8x32-Bit General Purpose Registers, 4x9-Bit Address Registers
• 2048x20-Bit Instruction ROM, 512x32-Bit Data Memory
• I2C Registers and dedicated interfaces for external communication
Increasing SW Content - but How?
The Motorola M68HC11 Architecture
Architecture Overview
M68HC11 CPU Architecture:
» 8-bit micro-controller
» Harvard Architecture
» 7 CPU registers
» 6 different addressing modes
» Shared data and program bus
» Instruction width: 8, 16, 24, 32, 40
» 8-bit opcode: 181 instructions
» Clock speed: ~200 MHz
» Performance:
» Area: 15K to 30K (DesignWare® Library)
Hot spots: stalled data access, multi-cycle fetch, non-pipelined
Architecture Development with LISA
[Figure: LISA 6811 architecture with pipeline stages FE, DC, EX; registers Accu A, Accu B (ACCU), Index X, Index Y, Stack Pointer, Condition; memory map from 0x0000 to 0x10000 with 512 Bytes int. RAM, 64 Bytes Conf. Reg., 3.5K ext. RAM, 61K ext. RAM; 16 and 32 bit buses]
+ pipelined architecture
+ separate program and data bus
Results
• Area: < 23k gates
• Clock speed: ~ 200 MHz
• Execution time speed up: 62 % for spanning tree application
• Mapped onto Xilinx FPGA
Architecture Development with LISA
Task                                        Effort
• Studying the architecture                 4 days
• Basic architecture modifications          2 days
• Grouping and coding of the instructions   1 day
• Writing the LISA model
  - basic syntax and coding                 4 days
  - behavior section                        6 days
• Validation                                4 days
• HDL Generation                            2 days
Total                                       23 days
Institute for Integrated Signal Processing Systems
Design of Application Specific Processor Architectures
Rainer Leupers
RWTH Aachen University
Software for Systems on Silicon leupers@iss.rwth-aachen.de
2005 © R. Leupers
Overview
1. Introduction
2. ASIP design methodologies
3. Software tools
4. ASIP architecture design
5. Case study
6. Advanced research topics
1. Introduction
Embedded system design automation
• Embedded systems
  – Special-purpose electronic devices
  – Very different from desktop computers
• Strength of European IT market
  – Telecom, consumer, automotive, medical, ...
  – Siemens, Nokia, Bosch, Infineon, ...
• New design requirements
  – Low NRE cost, high efficiency requirements
  – Real-time operation, dependability
  – Keep pace with Moore's Law
What to do with chip area?
Example: wireless multimedia terminals
• Multistandard radio
  – UMTS
  – GSM/GPRS/EDGE
  – WLAN
  – Bluetooth
  – UWB
  – …
• Multimedia standards
  – MPEG-4
  – MP3
  – AAC
  – GPS
  – DVB-H
  – …
Key issues:
• Time to market (≤ 12 months)
• Flexibility (ongoing standard updates)
• Efficiency (battery operation)
Application specific processors (ASIPs)
"As the performance of conventional microprocessors improves, they first meet and then exceed the requirements of most computing applications. Initially, performance is key. But eventually, other factors, like customization, become more important to the customer..."
[M.J. Bass, C.M. Christensen: The Future of the Microprocessor Business, IEEE Spectrum 2002]
design budget = (semiconductor revenue) × (% for R&D)
    semiconductor revenue: growth ≈ 15%; % for R&D: ≈ 10%
# IC designs = (design budget) / (design cost per IC)
    design budget: growth ≈ 15%; design cost per IC: growth ≈ 50-100%
[Keutzer05]
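The two relations above imply a shrinking number of affordable IC designs. The sketch below works through the arithmetic under one reading of the slide annotations, stated here as an assumption: the design budget grows by about 15% per year and the design cost per IC by 50-100% per year. The function name is illustrative.

```python
def design_count_ratio(budget_growth: float, cost_growth: float, years: int = 1) -> float:
    """Ratio of affordable IC designs after `years` relative to today.

    # IC designs = design budget / design cost per IC, so the ratio per year
    # is (1 + budget_growth) / (1 + cost_growth).
    """
    if years < 0:
        raise ValueError("years must be non-negative")
    return ((1.0 + budget_growth) / (1.0 + cost_growth)) ** years


# Assumed reading of the slide: budget +15% per year, cost per design +50% to +100%.
for cost_growth in (0.5, 1.0):
    ratio = design_count_ratio(0.15, cost_growth)
    print(f"cost growth {cost_growth:.0%}: {ratio:.3f}x designs per year")
```

Under these assumptions the number of designs drops to roughly 58-77% of the previous year's count, which motivates reusable, programmable platforms.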
→ Customizable application specific processors as reusable, programmable platforms
Efficiency and flexibility
[Figure: log flexibility versus log performance and log power dissipation for different implementation styles: physically optimized ICs, application specific ICs, field programmable devices, application specific instruction set processors, digital signal processors, general purpose processors; axis spans annotated 10^3 ... 10^4 and 10^5 ... 10^6; HW design versus SW design. Source: T. Noll, RWTH Aachen]
Why use ASIPs?
• Higher efficiency for given range of applications
• IP protection
• Cost reduction (no royalties)
• Product differentiation
2. ASIP design methodologies
ASIP architecture exploration
[Figure: iterative exploration loop from the initial processor architecture to the optimized processor architecture; in each iteration the application runs through compiler, assembler, linker, simulator, and profiler]
Expression (UC Irvine)
Tensilica Xtensa/XPRES
Source: Tensilica Inc.
MIPS CorExtend/CoWare CorXpert
[Figure: MIPS core + CorExtend module holding user defined instructions]
1. Profile and identify custom instructions (hot spot)
2. Replace critical code with special instruction (user defined instruction)
3. Synthesize HW and profile with MIPSsim and extensions
CoWare LISATek ASIP architecture exploration
• Integrated embedded processor development environment
• Unified processor model in LISA 2.0 architecture description language (ADL)
• Automatic generation of:
  – SW tools
  – HW models
LISA operation hierarchy
[Figure: operation tree rooted at main and decode; instruction fields addr, cond, opcode, opnds; operation groups imm, linear, cycl, control, arithm, move, short, long; leaf operations add, sub, mul, and, or]
Reflects hierarchical organization of ISAs
LISA operations structure
Sections of a LISA operation:
• DECLARE: references to other operations
• CODING: binary coding
• SYNTAX: assembly syntax
• BEHAVIOR: computation and processor state update
• EXPRESSION: resource access, e.g. registers
• ACTIVATION: initiate "downstream" operations in pipe
• SEMANTICS: C compiler generation
LISA operation example
OPERATION ADD {
  DECLARE { GROUP src1, src2, dest = { Register } }
  CODING { 0b1011 src1 src2 dest }
  SYNTAX { "ADD" dest "," src1 "," src2 }
  BEHAVIOR { dest = src1 + src2; }
}
OPERATION Register {
  DECLARE { LABEL index; }
  CODING { index }
  SYNTAX { "R" index }
  EXPRESSION { R[index] }
}
[Figure: the BEHAVIOR section holds C/C++ code; operation ADD references three Register operations: src1, src2, dest]
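How the SYNTAX and CODING sections of the ADD example relate can be sketched with a tiny assembler and disassembler. This is an illustration, not generated LISATek output. The slide does not give the width of the register index field, so `REG_BITS = 4` is an assumption, as are the function names.

```python
import re

OPCODE_ADD = 0b1011   # from CODING { 0b1011 src1 src2 dest }
REG_BITS = 4          # assumption: the slide does not specify the index width


def assemble_add(line: str) -> int:
    """Encode 'ADD Rd, Rs1, Rs2' (SYNTAX order: dest, src1, src2)."""
    m = re.fullmatch(r"\s*ADD\s+R(\d+)\s*,\s*R(\d+)\s*,\s*R(\d+)\s*", line)
    if m is None:
        raise ValueError(f"not an ADD instruction: {line!r}")
    dest, src1, src2 = (int(g) for g in m.groups())
    for reg in (dest, src1, src2):
        if reg >= (1 << REG_BITS):
            raise ValueError(f"register index out of range: R{reg}")
    word = OPCODE_ADD
    # CODING order differs from SYNTAX order: opcode, src1, src2, dest.
    for field in (src1, src2, dest):
        word = (word << REG_BITS) | field
    return word


def disassemble_add(word: int) -> str:
    """Decode an instruction word back into assembly syntax."""
    mask = (1 << REG_BITS) - 1
    dest = word & mask
    src2 = (word >> REG_BITS) & mask
    src1 = (word >> (2 * REG_BITS)) & mask
    opcode = word >> (3 * REG_BITS)
    if opcode != OPCODE_ADD:
        raise ValueError(f"unknown opcode: {opcode:#06b}")
    return f"ADD R{dest}, R{src1}, R{src2}"


word = assemble_add("ADD R3, R1, R2")
print(hex(word), disassemble_add(word))
```

Both directions are derived from the same two sections, which is what lets an ADL tool generate assembler and disassembler from one operation description.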
Exploration/debugger GUI
• Application simulation
• Debugging
• Profiling
• Resource utilization analysis
• Pipeline analysis
• Processor model debugging
• Memory hierarchy exploration
• Code coverage analysis
• ...
Some available LISA 2.0 models
• DSP:
  – Texas Instruments TMS320C54x
  – Analog Devices ADSP21xx
  – Motorola 56000
• RISC:
  – MIPS32 4K
  – ESA LEON SPARC 8
  – ARM7100
  – ARM926
• VLIW:
  – Texas Instruments TMS320C6x
  – STMicroelectronics ST220
• µC:
  – MHS80C51
• ASIP:
  – Infineon PP32 NPU
  – Infineon ICore
  – MorphICs DSP
3. Software tools
Tools generated from processor ADL model
[Figure: compiler, assembler, linker, simulator, and profiler applied to the application]
Instruction set simulation
Interpretive:
• flexible
• slow (~ 100 KIPS)
• [Figure: at run-time, each application instruction is read from memory, decoded, and executed]
Compiled:
• fast (> 10 MIPS)
• inflexible
• high memory consumption
• [Figure: at compile-time, a simulation compiler translates the application in program memory into instruction behaviors; at run-time, these are only executed]
JIT-CCS™:
• "just-in-time" compiled
• SW simulation cache
• fast and flexible
• [Figure: at run-time, instructions from program memory are decoded, their instruction behavior is stored in the compiled simulation cache, and executed]
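The simulation-cache idea can be sketched as an interpreter that stores decoded instruction behavior per program address and reuses it on later visits. This is a toy model under stated assumptions, not the JIT-CCS implementation: the two-instruction ISA, the direct-mapped cache layout, and the check against the current instruction word are all choices made for this illustration.

```python
from typing import Callable, List, Optional, Tuple

Instruction = Tuple[str, int, int, int]     # (mnemonic, dst, src1, src2)
Behavior = Callable[[List[int]], None]      # mutates the register file


def decode(insn: Instruction) -> Behavior:
    """Expensive step: turn an instruction into an executable behavior."""
    op, dst, a, b = insn
    if op == "add":
        def behavior(regs: List[int]) -> None:
            regs[dst] = regs[a] + regs[b]
    elif op == "mul":
        def behavior(regs: List[int]) -> None:
            regs[dst] = regs[a] * regs[b]
    else:
        raise ValueError(f"unknown instruction: {op}")
    return behavior


class JitSimulator:
    def __init__(self, program: List[Instruction], cache_size: int = 4096, n_regs: int = 8):
        self.program = list(program)
        self.regs = [0] * n_regs
        # Direct-mapped cache of (address, instruction word, behavior) records.
        self.cache: List[Optional[Tuple[int, Instruction, Behavior]]] = [None] * cache_size
        self.hits = 0
        self.misses = 0

    def step(self, pc: int) -> None:
        insn = self.program[pc]
        slot = pc % len(self.cache)
        entry = self.cache[slot]
        if entry is not None and entry[0] == pc and entry[1] == insn:
            self.hits += 1                   # reuse the decoded behavior
            behavior = entry[2]
        else:
            self.misses += 1                 # decode just in time and cache it
            behavior = decode(insn)
            self.cache[slot] = (pc, insn, behavior)
        behavior(self.regs)

    def run(self, iterations: int) -> None:
        """Execute the whole program `iterations` times, like a loop body."""
        for _ in range(iterations):
            for pc in range(len(self.program)):
                self.step(pc)


sim = JitSimulator([("add", 0, 0, 1), ("mul", 2, 0, 0)], cache_size=4)
sim.regs[1] = 1
sim.run(3)
print(sim.regs[:3], "hits:", sim.hits, "misses:", sim.misses)
```

Comparing the stored instruction word on each hit keeps the simulator correct when program memory changes at run-time, which is the flexibility that a statically compiled simulation lacks. A cache that is too small for the loop body evicts on every step and falls back to interpretive speed.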
JIT-CC simulation performance
[Figure: simulation performance [MIPS] (axis 0 to 9) and cache miss ratio [%] (axis 0 to 100) versus cache size [records] (8 to 32768), with compiled and interpretive simulation as reference points]
• Dependent on simulation cache size
• 95% of compiled simulation performance @ 4096 cache blocks (10% memory consumption of compiled sim.)
• Example: ST200 VLIW DSP
Why care about C compilers?
• Embedded SW design becoming predominant manpower factor in system design
• Cannot develop/maintain millions of code lines in assembly language
• Move to high-level programming languages
Why care about compilers?
• Trend towards heterogeneous multiprocessor systems-on-chip (MPSoC)
• Customized application specific instruction set processors (ASIPs) are key MPSoC components
• How to achieve efficient compiler support for ASIPs?
[Figure: heterogeneous MPSoC composed of CPU, ASIP, and ASIC blocks with memories]
C compiler in the exploration loop
"Compiler/Architecture Co-Design"
• Efficient C-compilers cannot be designed for ARBITRARY architectures!
[Figure: loop of application software, compiler, processor, results]
• Compiler and processor form a UNIT that needs to be optimized!
• "Compiler-friendliness" needs to be taken into account during the architecture exploration!
Retargetable compilers
• Classical compiler: source code → compiler → asm code; the processor model is built into the compiler
• Retargetable compiler: source code → compiler → asm code; the processor model is an external input to the compiler
GNU C compiler (gcc)
• Probably the most widespread retargetable compiler
• Mostly used as a native Unix/Linux compiler, but may operate as a cross-compiler, too
• Support for C/C++, Java, and other languages
• Comes with comprehensive support software, e.g. runtime and standard libraries, debug support
• Portable to new architectures by means of machine description file and C support routines
“The main goal of GCC was to make a good, fast compiler for machines in the class that the GNU system aims to run on: 32-bit
machines that address 8-bit bytes and have several general registers.
Elegance, theoretical power and simplicity are only secondary.”
CoSy compiler system (ACE)
© ACE - Associated Compiler Experts
• Universal retargetable C/C++ compiler
• Extensible intermediate representation (IR)
• Modular compiler organization
• Generator (BEG) for code selector, register allocator, scheduler
LISATek C compiler generation
[Figure: LISA processor model → autom. analyses → manual refinement (GUI) → CoSy system → C Compiler]
LISA processor model (excerpt):
SYNTAX {
  "ADD" dst, src1, src2 }
CODING {
  0b0010 dst src1 src2 }
BEHAVIOR {
  ALU_read (src1, src2);
  ALU_add ();
  Update_flags ();
  writeback (dst);
}
SEMANTICS {
  src1 + src2 → dst;
}
…
LISATek compiler generation
C-Code:
  int a,b,c;
  a = b+1;
  c = a<<3;
  …
Compiler: Frontend, Opt, Backend (code selector, register allocator, scheduler)
ASM-Code:
  LD R1, [R2]
  ADD R1, #1
  SHL R1, #3
  …
[Figure: processor model elements feeding the backend: pipeline (FE, DE, EX, WB) with instruction fetch, decoder, ALU, jump, write-back, and pipeline control; registers R[0..31]; Mem, data RAM, prog RAM; instruction patterns (ADD … R[i] … #1); scheduler table over ADD, SUB, MUL, JMP]
Compiled code quality: MIPS example
LISATek generated C-Compiler
• Out-of-the-box C-Compiler
• No manual optimizations
• Development time of model approx. 2 weeks
gcc C-Compiler
• gcc with MIPS32 4kc backend
• Used by most MIPS users
• Large group of developers, several man-years of optimization
[Figure: bar charts of cycles (axis 0 to 140,000,000) and size (axis 0 to 80,000) for gcc -O4, gcc -O2, cosy -O4, cosy -O2]
Overhead of 10% in cycle count and 17% in code density
Demands on code quality
• Compilers for embedded processors have to generate extremely efficient code
Code size:
» system-on-chip
» on-chip RAM/ROM
Performance:
» real-time constraints
Power/energy consumption:
» heat dissipation
» battery lifetime
Compiler flexibility/code quality trade-off
[Figure: variety of embedded processors (DSP, NPU, VLIW); specialization leads to dedicated optimization techniques, unification leads to retargetable compilation]
Adding processor-specific code optimizations
• High-level (compiler IR):
  – Enabled by CoSy's engine concept
• Low-level (ASM):
  – [Figure: .C → LISA C Compiler → unscheduled .asm → Assembly API (Optimization 1, Optimization 2, Optimization 3) → scheduled & optimized .asm → binary code generation (Assembler, Linker) → .out]
4. ASIP architecture design
ASIP implementation after exploration
Unified Description Layer
[Figure: abstraction levels LISA, Register-Transfer-Level, Gate-Level]
• LISA → Register-Transfer-Level: HDL generation
• Register-Transfer-Level → Gate-Level: gate-level synthesis (e.g. Synopsys Design Compiler)
Challenges in Automated ASIP Implementation
ADL: instructions grouped as Arithmetic (MUL, MAC) and Control (JMP, BRC)
Independent description of instruction behavior:
+ Efficient design space exploration
HDL (1:1 mapping): Multiplier (MUL), Multiplier (MAC)
Independent mapping to hardware blocks:
- Insufficient architectural efficiency by 1:1 mapping
Unified Description Layer
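[Figure: abstraction levels LISA, Unified Description Layer, Register-Transfer-Level, Gate-Level]
• LISA → Unified Description Layer: structure & mapping (incl. JTAG/DEBUG)
• Unified Description Layer: optimizations
• Unified Description Layer → Register-Transfer-Level: backend (VHDL, Verilog, SystemC)
• Register-Transfer-Level → Gate-Level: gate-level synthesis (e.g. Synopsys Design Compiler)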
Optimization strategies
LISA: separate descriptions for separate instructions
Goal: share hardware for separate instructions
• Instruction A (LISA Operation A): x = a + b
• Instruction B (LISA Operation B): y = c + d
• Mutual exclusiveness of A and B
Possible optimizations:
• ALU sharing: one adder produces x or y from multiplexed inputs (a or c, b or d)
Optimization strategies
LISA: separate descriptions for separate instructions
Goal: same hardware for separate instructions
• Instruction A (LISA Operation A): path PA to the register array (AddressA, DataA)
• Instruction B (LISA Operation B): path PB to the register array (AddressB, DataB)
• Mutual exclusiveness
Possible optimizations:
• ALU sharing
• Path sharing
• ...
Resource sharing: one register array access path carries DataA or DataB, addressed by AddressA or AddressB
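The sharing idea can be illustrated in two parts: grouping operations that are pairwise mutually exclusive, and a single adder whose inputs are multiplexed between two instructions. This is a sketch under assumptions, not the LISATek optimization algorithm: the greedy grouping strategy and all names are chosen for this example.

```python
from typing import FrozenSet, List, Set


def share_units(ops: List[str], exclusive: Set[FrozenSet[str]]) -> List[List[str]]:
    """Greedily group operations so that all members of a group are pairwise
    mutually exclusive; each group can then be mapped to one hardware unit."""
    groups: List[List[str]] = []
    for op in ops:
        for group in groups:
            if all(frozenset((op, other)) in exclusive for other in group):
                group.append(op)
                break
        else:
            groups.append([op])      # no compatible group: needs its own unit
    return groups


def shared_adder(select_a: bool, a: int, b: int, c: int, d: int) -> int:
    """One adder with input multiplexers: x = a + b when instruction A is
    active, otherwise y = c + d."""
    left = a if select_a else c
    right = b if select_a else d
    return left + right


# A and B never execute in the same cycle; C may overlap with either.
print(share_units(["A", "B", "C"], {frozenset(("A", "B"))}))
print(shared_adder(True, 1, 2, 3, 4), shared_adder(False, 1, 2, 3, 4))
```

A 1:1 mapping would instantiate three adders here. With the exclusiveness information, two are enough, at the cost of the input multiplexers.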
5. Case study
Motorola 6811
Project Goals:
• Performance (MIPS) must be increased
• Compatibility on the assembly level for reuse of legacy code
(Integration into existing tool flow)
• Royalty free design
→ compatible architecture developed with LISA using RTL processor synthesis
Motorola 6811
[Figure: legacy code passes through compiler, assembly, and assembler to binary code; target processors 6811 and 6812, marked with "?"]
Increase performance (MIPS)!
Motorola 6811
[Figure: Bluetooth app. → 6811 compiler → assembly → assembler → binary code, executed on the synthesized architecture generated from LISA]
Assembly level compatible
Architecture Development
Original 6811 processor:
• 8, 16, 24, 32, and 40 bit instructions
• Instruction is fetched by 8 bit blocks → up to 5 cycles for fetching!
LISA 6811 processor:
• 16 and 32 bit instructions
• 16 bit are fetched simultaneously → max 2 cycles for fetching!
+ pipelined architecture
+ possibility for special instructions
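The fetch cycle counts quoted above follow from dividing the instruction width by the fetch width and rounding up. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def fetch_cycles(instruction_bits: int, fetch_bits: int) -> int:
    """Cycles needed to fetch one instruction: ceil(instruction_bits / fetch_bits)."""
    if instruction_bits <= 0 or fetch_bits <= 0:
        raise ValueError("widths must be positive")
    return -(-instruction_bits // fetch_bits)   # integer ceiling division


original = max(fetch_cycles(width, 8) for width in (8, 16, 24, 32, 40))
lisa = max(fetch_cycles(width, 16) for width in (16, 32))
print("original 6811, worst case:", original, "cycles")
print("LISA 6811, worst case:", lisa, "cycles")
```

The gain comes from two changes at once: a wider fetch (16 instead of 8 bit) and an encoding limited to 16 and 32 bit instructions.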
Tools Flow and RTL Processor Synthesis
[Figure: C-Application → 6811 compiler → Assembly → LISA assembler → Executable; the LISA assembler belongs to the LISA tools built from the LISA model]
6811 compatible architecture generated completely in VHDL
1) VLSI Implementation:
   Area: <17kGates
   Clock Speed: ~154 MHz
2) Mapped onto XILINX FPGA
References
• R. Leupers: Code Optimization Techniques for Embedded Processors - Methods, Algorithms, and Tools, Kluwer, 2000
• R. Leupers, P. Marwedel: Retargetable Compiler Technology for Embedded Systems - Tools and Applications, Kluwer, 2001
• A. Hoffmann, H. Meyr, R. Leupers: Architecture Exploration for Embedded Processors with LISA, Kluwer, 2002
• C. Rowen, S. Leibson: Engineering the Complex SoC: Fast, Flexible Design with Configurable Processors, Prentice Hall, 2004
• M. Gries, K. Keutzer, et al.: Building ASIPs: The Mescal Methodology, Springer, 2005
• P. Ienne, R. Leupers (eds.): Customizable and Configurable Embedded Processor Cores, Morgan Kaufmann, to appear 2006