Designing Programmable Platforms:

(1)

Designing Programmable Platforms:

From ASIC to ASIP

MPSoC 2005

Heinrich Meyr

CoWare Inc., San Jose, and Integrated Signal Processing Systems (ISS), Aachen University of Technology, Germany

(2)

Agenda

Facts & Conclusions

Heterogeneous MPSoC

» Energy Efficiency vs. Flexibility

» How to explore the Design Space?

ASIP Design

Economics of SoC Development

Conclusions

(3)

Facts & Conclusions

(4)

Core Proposition

ASIP-based Platforms (heterogeneous MPSoC)

(5)

Agenda

ASIP Design

Conclusions

(6)

Heterogeneous MPSoC

Trade-off between Flexibility and Energy Efficiency

(7)

Architectural Objectives

Need more MOPS/Watt and MOPS/mm² to minimize the global performance measure for battery-driven devices:

Energy per decoded bit (Joule/bit)
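As a rough illustration of the energy-per-decoded-bit measure, the sketch below computes it once from average power and throughput, and once from a computational efficiency given in MOPS/Watt. The function names and all numbers are hypothetical and are not taken from the slides.

```python
def energy_per_bit(power_watts, bits_per_second):
    """Energy per decoded bit in Joule/bit: average power divided by throughput."""
    if bits_per_second <= 0:
        raise ValueError("throughput must be positive")
    return power_watts / bits_per_second


def energy_per_bit_from_efficiency(ops_per_bit, mops_per_watt):
    """Same metric derived from computational efficiency.

    1 MOPS/Watt equals 1e6 operations per Joule, so the energy per bit is
    the number of operations needed per decoded bit divided by that rate.
    """
    if mops_per_watt <= 0:
        raise ValueError("efficiency must be positive")
    return ops_per_bit / (mops_per_watt * 1e6)


# Hypothetical receiver: 0.5 W at 10 Mbit/s decoded throughput.
e_direct = energy_per_bit(0.5, 10e6)

# Hypothetical algorithm cost: 500 operations per decoded bit.
# Raising the efficiency tenfold lowers the energy per bit tenfold.
e_low = energy_per_bit_from_efficiency(500, 1_000)
e_high = energy_per_bit_from_efficiency(500, 10_000)
print(e_direct, e_low, e_high)
```

Under these assumptions, a higher MOPS/Watt figure translates directly into less energy per decoded bit for a fixed algorithm.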

(8)

Computational Efficiency vs. Flexibility

Source: T. Noll, RWTH Aachen

(9)

Enabling MP-SoC Design

(10)

System Level Tools I: Application & IP Creation

[Figure: design levels and supporting tools, from the algorithm domain to the micro architecture domain]

• System application design (algorithmic exploration): Matlab, SPW, System Studio
• High-level IP block design (block specification, Architecture Description Language): LISATek Processor Synthesis, ConvergenSC Buscompiler
• Block implementation (micro architecture): RTL Synthesis

(11)

System Level Tools II: MP-SoC Platform Design

[Figure: design levels and supporting tools, from the algorithm domain to the micro architecture domain]

• System application design (algorithmic exploration): Matlab, SPW, System Studio
• MP-SoC platform design (abstract architecture, virtual prototype, SystemC Transaction Level Modeling): MP-SoC Intermediate Representation, ConvergenSC Platform Creator
• High-level IP block design (micro architecture)

(12)

Agenda

ASIP Design

Conclusions

(13)

Processor Design Space

[Figure: processor core with FE, DC, EX, WB pipeline stages, cache, MMU, memory, and peripheral; execution units butterfly 0, butterfly 1, and load/store]

Design questions:

• Bypass?
• Pipeline length?
• Shared resources?
• Parallel execution units?
• Which cache required?
• Bus fast enough?
• Communication?
• Exploit regularity/parallelism in data flow/data storage
• VLIW, SIMD, ?
• Which instructions for compiler support?
• Instruction encoding?
• How many general purpose registers?
• Area constraints met?
• Clock frequency?

Design phases: Instruction Set Design, Micro Architecture Design, RTL Design, SoC Integration

Design tasks:

- Instruction-Set Design
- Compiler Design
- Micro Architecture Design
- RTL Design
- RTL ISS Co-verification
- System Integration
- Embedded Software Simulation

Optimal design requires powerful tools and automation!

MESCAL 2: Inclusively identify the architectural space

(14)

Architecture Description Language based Processor Design

The purpose of an architecture description language (e.g. LISA) is:

» To allow for an iterative design to efficiently explore architecture alternatives
» To jointly design "architecture and compiler" and on-chip communication
» To automatically generate hardware (path to implementation)
» To automatically generate tools:
  » Assembler, Linker, Compiler, Simulator, co-simulation interfaces

All from a single model at various levels of temporal and spatial abstraction

MESCAL 3: Efficiently describe the ASIP

(15)

LISA 2.0 - Abstraction Levels

[Figure: model accuracy along two axes, ranging from "no details" to "very detailed"]

• Time accuracy: Pseudo Instructions, Processor Instructions, Cycles, Phases
• Architecture accuracy: Pseudo Resources (e.g. C variables); Functional units, Registers, Memories; + Pipelines; + IRQ, etc.
• Resulting models: high level model, instruction accurate model, cycle accurate model, phase accurate model

(16)

[Figure: LISATek Processor Designer flow, shown for an FFT processor application]

• LISATek IP Samples: DSP Sample, VLIW Sample, RISC Sample, Empty Model
• Describe/Adopt Processor Model: Custom Processor Model (LISA 2.0 language)
• Generate Tools: Software Tool Chain, Executable Software Platform, RTL, SoC Integration Kit (e.g. SystemC)
• Function and instruction level profiling reveals hot-spots -> special purpose instructions

Rapid modeling and re-targetable simulation + code-generation allows for: joint optimization of application and architecture

MESCAL 3: Efficiently describe and evaluate the ASIP

MESCAL 5: Successfully deploy the ASIP

(17)

Current Work

[Figure: LISA-based design flow from the Target Architecture via the LISA Description and the LISA Compiler into an exploration branch and an implementation branch, with Model Verification & Evaluation]

• EXPLORATION: C-Compiler, Assembler, Linker, Simulator. Evaluation Results: Profile Information, Application Performance
• IMPLEMENTATION: Optimization, HDL Generator, SystemC/VHDL/Verilog Output, Gate Level Synthesis. Evaluation Results: Chip Area, Clock Speed, Power Consumption

• Instruction Set Synthesis
• Memory architecture
• Verification

MESCAL 3: ... evaluate the ASIP

(18)

June 10, 2004

A Novel Approach for Flexible and Consistent ADL-driven ASIP Design

Gunnar Braun, Achim Nohl

CoWare, Inc., DAC Booth #1844, www.CoWare.com

Weihua Sheng, Jianjiang Ceng, Manuel Hohenauer, Hanno Scharwächter, Rainer Leupers, Heinrich Meyr

Integrated Signal Processing Systems (ISS), Aachen University of Technology, Germany

(19)

Introduction

Architecture Description Languages (ADL)

• Automatic generation of Software Toolkit (Compiler, Assembler, Linker, IS-Simulator)
• Architecture Exploration
• SystemC models, RTL code, verification tools, ...

Challenges:

• Different tools need different information
• Unambiguous, redundancy-free architecture model (rather than tools description)
• Multiple abstraction levels (instruction-accurate and/or cycle-accurate)

(20)

Tool Requirements: Compiler

[Figure: tree patterns and the instructions they map to]

• + (rs, rt) → rd: add rd = rs, rt
• * (rs, rt) → rd: mul rd = rs, rt
• LD (@) → rd: ld rd = @
• ST (rs) → @: st @ = rs

C Compiler: C "a = b + c;" → Assembly "add c = a, b"

(21)

Tool Requirements: Simulator

add rd = rs, rt
  ALU_read (rs, rt);
  ALU_add ();
  Update_flags ();
  writeback (rd);

mul rd = rs, rt
  MUL_read (rs, rt);
  MUL_add ();
  writeback (rd);

ld rd = @
  LSU_addrgen();
  data_bus.req();
  data_bus.read();
  writeback (rd);

st @ = rs
  LSU_addrgen();
  LSU_read(rs);
  data_bus.req();
  data_bus.write(rs);

Simulator: Machine Code "add r5 = r2, r1" → Simulation Code (C):
  ALU_read (r2, r1);
  ALU_add ();
  writeback (r5);

(22)

ADL Model

[Figure: one ADL model serves both the C compiler and the simulator]

C Compiler: "a = b + c;" → "add c = a, b" (tree pattern + (rs, rt) → rd)

Simulator: "add r5 = r2, r1" → ALU_read (rs, rt); ALU_add (); writeback (rd);

ADL Model:

SYNTAX {
  "ADD" dst, src1, src2
}
CODING {
  0b0010 dst src1 src2
}
BEHAVIOR {
  ALU_read (src1, src2);
  ALU_add ();
  Update_flags ();
  writeback (dst);
}
SEMANTICS {
  src1 + src2 → dst;
}

(23)

Problem Statement

• Compiler and Simulator need different information:
  • Compiler: C operation to instruction(s)
    WHAT is the instruction good for? Purpose?
  • Simulator: instructions to sequence of operations
    HOW is the instruction executed? What actions to perform?
• Architecture Designer's Perspective:

[Figure: the purpose "src1 + src2 → dst;" and the action sequence "ALU_add (); Update_flags (); writeback (dst);" connected by question marks]
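The WHAT/HOW split can be illustrated with a toy model in which each instruction carries both views, so that a compiler-like and a simulator-like consumer read from one description. This is a minimal sketch in Python, not LISA; the `Operation` class and both helper functions are hypothetical, and only the mnemonics and action names are taken from the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Operation:
    syntax: str          # assembly mnemonic
    semantics: str       # WHAT: the C operator the instruction implements
    behavior: List[str]  # HOW: ordered actions the simulator performs


MODEL = [
    Operation("add", "+", ["ALU_read", "ALU_add", "Update_flags", "writeback"]),
    Operation("mul", "*", ["MUL_read", "MUL_add", "writeback"]),
]


def select_instruction(c_operator, dst, src1, src2):
    """Compiler view: map a C operator to an instruction via SEMANTICS."""
    for op in MODEL:
        if op.semantics == c_operator:
            return f"{op.syntax} {dst} = {src1}, {src2}"
    raise LookupError(f"no instruction implements {c_operator!r}")


def simulation_trace(asm_line):
    """Simulator view: map an instruction to its action sequence via BEHAVIOR."""
    mnemonic = asm_line.split()[0]
    for op in MODEL:
        if op.syntax == mnemonic:
            return list(op.behavior)
    raise LookupError(f"unknown instruction {mnemonic!r}")


print(select_instruction("+", "r5", "r2", "r1"))
print(simulation_trace("add r5 = r2, r1"))
```

Keeping both views in one record is what avoids redundancy here: neither consumer needs a second, separately maintained description of the instruction.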

(24)

Examples

(25)

ASDSP FPGA Implementation

ASDSP Core Design

• Supports a special instruction set for FFT operation and the BMU instruction
• Improves the performance for OFDM communication

FPGA Implementation

• iProve Xilinx xc2v6000

SEC 0.18 µm Synthesis

• Gates: 77,000
• Program Memory: 4 Kbyte, Data Memory: 8 Kbyte
• Frequency: 290 MHz
• Power consumption: 0.87 W (3 mW/MHz)

Myjung Sunwoo, Ajiou University

(26)

The ICORE

A low-power ASIP for the Infineon DVB-T 2nd generation single-chip receiver:

• ASIP for DVB-T acquisition and tracking algorithms

(sampling-clock-synchronization, interpolation / decimation, carrier frequency offset estimation)

• Harvard Architecture

• 60 mostly RISC-like instructions and special instructions for the CORDIC algorithm

• 8x32-Bit General Purpose Registers, 4x9-Bit Address Registers

• 2048x20-Bit Instruction ROM, 512x32-Bit Data Memory

• I2C Registers and dedicated interfaces for external communication

(27)

Increasing SW Content - but How?

The Motorola M68HC11 Architecture

(28)

Architecture Overview

M68HC11 CPU Architecture :

» 8-bit micro-controller.

» Harvard Architecture

» 7 CPU Registers.

» 6 different Addressing Modes.

» Shared data and program bus
» Instruction width: 8, 16, 24, 32, 40 bit
» 8-bit opcode: 181 instructions
» Clock speed: ~200 MHz
» Performance:
» Area: 15K to 30K (DesignWare® Library)

Hot spots: stalled data access, multi-cycle fetch, non-pipelined

(29)

Architecture Development with LISA

[Figure: LISA model of the architecture with FE, DC, EX pipeline stages; registers Accu A, Accu B (ACCU), Index X, Index Y, Stack Pointer, Condition; address space 0x0000 to 0x10000 with 512 Bytes int. RAM, 64 Bytes Conf. Reg., 3.5K ext. RAM, 61K ext. RAM; 16- and 32-bit connections]

+ pipelined architecture
+ separate program and data bus

(30)

Results

•Area

< 23k gates

•Clock speed

~ 200 MHz

•Execution time speed up

62 % for spanning tree application

•Mapped onto Xilinx FPGA

(31)

Architecture Development with LISA

Task                                         Effort
• Studying the architecture                  4 days
• Basic architecture modifications           2 days
• Grouping and coding of the instructions    1 day
• Writing the LISA model
  - basic syntax and coding                  4 days
  - behavior section                         6 days
• Validation                                 4 days
• HDL Generation                             2 days
Total                                        23 days

(32)

Institute for Integrated Signal Processing Systems

Design of Application Specific Processor Architectures

Rainer Leupers

RWTH Aachen University

Software for Systems on Silicon

leupers@iss.rwth-aachen.de

(33)

2005 © R. Leupers

Overview

1. Introduction
2. ASIP design methodologies
3. Software tools
4. ASIP architecture design
5. Case study
6. Advanced research topics

(34)

1. Introduction

(35)

Embedded system design automation

• Embedded systems
  – Special-purpose electronic devices
  – Very different from desktop computers
• Strength of European IT market
  – Telecom, consumer, automotive, medical, ...
  – Siemens, Nokia, Bosch, Infineon, ...
• New design requirements
  – Low NRE cost, high efficiency requirements
  – Real-time operation, dependability
  – Keep pace with Moore's Law

(36)

What to do with chip area ?

(37)

Example: wireless multimedia terminals

• Multistandard radio
  – UMTS
  – GSM/GPRS/EDGE
  – WLAN
  – Bluetooth
  – UWB
  – …
• Multimedia standards
  – MPEG-4
  – MP3
  – AAC
  – GPS
  – DVB-H
  – …

Key issues:

• Time to market (≤ 12 months)
• Flexibility (ongoing standard updates)
• Efficiency (battery operation)

(38)

Application specific processors (ASIPs)

"As the performance of conventional microprocessors improves, they first meet and then exceed the requirements of most computing applications. Initially, performance is key. But eventually, other factors, like customization, become more important to the customer..."

[M.J. Bass, C.M. Christensen: The Future of the Microprocessor Business, IEEE Spectrum 2002]

design budget = (semiconductor revenue) × (% for R&D)

(slide annotations: growth ≈ 15%, ≈ 10%)

# IC designs = (design budget) / (design cost per IC)

(slide annotations: growth ≈ 50-100%, growth ≈ 15%)

[Keutzer05]

→ Customizable application specific processors as reusable, programmable platforms

(39)

Efficiency and flexibility

Source: T. Noll, RWTH Aachen

[Figure: implementation styles between HW design and SW design, plotted over Log FLEXIBILITY, Log POWER DISSIPATION, and Log PERFORMANCE, with ranges 10³ ... 10⁴ and 10⁵ ... 10⁶ marked: Physically Optimized ICs, Application Specific ICs, Field Programmable Devices, Application Specific Instruction Set Processors, Digital Signal Processors, General Purpose Processors]

Why use ASIPs?

• Higher efficiency for given range of applications
• IP protection
• Cost reduction (no royalties)
• Product differentiation

(40)

2. ASIP design methodologies

(41)

ASIP architecture exploration

[Figure: exploration loop in which the Application passes through Compiler, Assembler, Linker, Simulator, and Profiler; the loop is repeated from the initial processor architecture to the optimized processor architecture]

(42)

Expression (UC Irvine)

(43)

Tensilica Xtensa/XPRES

Source: Tensilica Inc.

(44)

MIPS CorExtend/CoWare CorXpert

CorExtend Module

1. Profile and identify custom instructions (hotspot)
2. Replace critical code with special instruction (User Defined Instruction)
3. Synthesize HW and profile with MIPSsim and extensions

(45)

CoWare LISATek ASIP architecture exploration

• Integrated embedded processor development environment
• Unified processor model in LISA 2.0 architecture description language (ADL)
• Automatic generation of:
  – SW tools
  – HW models

(46)

LISA operation hierarchy

[Figure: tree of LISA operations, listed here level by level]

• main
• decode
• addr, cond, opcode, opnds
• imm, linear, cycl, control, arithm, move, short, long
• add, sub, mul, and, or

Reflects hierarchical organization of ISAs

(47)

LISA operations structure

Sections of a LISA operation:

• DECLARE: References to other operations
• CODING: Binary coding
• SYNTAX: Assembly syntax
• BEHAVIOR: Computation and processor state update
• EXPRESSION: Resource access, e.g. registers
• ACTIVATION: Initiate "downstream" operations in pipe
• SEMANTICS: C compiler generation

(48)

LISA operation example

OPERATION ADD {
  DECLARE {
    GROUP src1, src2, dest = { Register }
  }
  CODING { 0b1011 src1 src2 dest }
  SYNTAX { "ADD" dest "," src1 "," src2 }
  BEHAVIOR { dest = src1 + src2; }
}

OPERATION Register {
  DECLARE {
    LABEL index;
  }
  CODING { index }
  SYNTAX { "R" index }
  EXPRESSION { R[index] }
}

The BEHAVIOR section is C/C++ code.

[Figure: operation ADD references three Register operations: src1, src2, dest]
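To make the link between CODING and SYNTAX concrete, the following sketch hand-writes what an assembler and a disassembler for the ADD operation could look like. It is not output of any LISA tool. The opcode 0b1011 and the field order (src1, src2, dest) come from the example; the 4-bit register index width and both function names are assumptions made only for illustration.

```python
import re

OPCODE_ADD = 0b1011  # from the CODING section of the example
REG_BITS = 4         # assumption: 4-bit register index (not stated in the source)


def assemble_add(line):
    """Encode 'ADD Rd, Rs1, Rs2' as opcode | src1 | src2 | dest (CODING order)."""
    match = re.fullmatch(r"\s*ADD\s+R(\d+)\s*,\s*R(\d+)\s*,\s*R(\d+)\s*", line)
    if match is None:
        raise ValueError(f"not an ADD instruction: {line!r}")
    dest, src1, src2 = (int(group) for group in match.groups())
    word = OPCODE_ADD
    for index in (src1, src2, dest):
        if index >= (1 << REG_BITS):
            raise ValueError(f"register index out of range: R{index}")
        word = (word << REG_BITS) | index
    return word


def disassemble_add(word):
    """Decode an instruction word back into the SYNTAX form (dest first)."""
    mask = (1 << REG_BITS) - 1
    dest = word & mask
    src2 = (word >> REG_BITS) & mask
    src1 = (word >> (2 * REG_BITS)) & mask
    if (word >> (3 * REG_BITS)) != OPCODE_ADD:
        raise ValueError("opcode does not match ADD")
    return f"ADD R{dest}, R{src1}, R{src2}"


word = assemble_add("ADD R3, R1, R2")
print(hex(word), disassemble_add(word))
```

Note that the operand order differs between the two sections: SYNTAX lists dest first, while CODING places it last, which is why the encoder reorders the fields.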

(49)

Exploration/debugger GUI

• Application simulation

• Debugging

• Profiling

• Resource utilization analysis

• Pipeline analysis

• Processor model debugging

• Memory hierarchy exploration

• Code coverage analysis

• ...

(50)

Some available LISA 2.0 models

• DSP:
  – Texas Instruments TMS320C54x
  – Analog Devices ADSP21xx
  – Motorola 56000
• RISC:
  – MIPS32 4K
  – ESA LEON SPARC 8
  – ARM7100
  – ARM926
• VLIW:
  – Texas Instruments TMS320C6x
  – STMicroelectronics ST220
• µC:
  – MHS80C51
• ASIP:
  – Infineon PP32 NPU
  – Infineon ICore
  – MorphICs DSP

(51)

3. Software tools

(52)

Tools generated from processor ADL model

[Figure: the Application together with the generated tools: Compiler, Assembler, Linker, Simulator, Profiler]

(53)

Instruction set simulation

Interpretive:

• flexible
• slow (~ 100 KIPS)

[Figure labels: Run-Time: Application, Memory, Instruction, Decode, Execute]

Compiled:

• fast (> 10 MIPS)
• inflexible
• high memory consumption

[Figure labels: Compile-Time: Application, Program Memory, Simulation Compiler; Run-Time: Compiled Simulation, Instruction Behavior, Execute]

JIT-CCS™:

• "just-in-time" compiled
• SW simulation cache
• fast and flexible

[Figure labels: Run-Time: Application, Program Memory, Instruction, Decode, Compiled Simulation Cache, Instruction Behavior, Execute]
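The difference between interpretive simulation and a simulation cache can be sketched with a toy simulator that either decodes every instruction each time it executes, or decodes once per address and reuses the result. This is only an illustration of the caching idea, not the JIT-CCS implementation; the class, the two-instruction program, and the register file size are hypothetical.

```python
class ToySimulator:
    """Toy instruction-set simulator for 'add'/'mul' in 'op rd = rs, rt' syntax."""

    def __init__(self, program, use_cache):
        self.program = program      # list of assembly strings, index = address
        self.use_cache = use_cache
        self.cache = {}             # address -> decoded behavior (simulation cache)
        self.decode_count = 0
        self.regs = [0] * 8

    def _decode(self, text):
        """Translate one instruction into a Python closure (its behavior)."""
        self.decode_count += 1
        mnemonic, rest = text.split(maxsplit=1)
        dst, srcs = rest.split("=")
        rd = int(dst.strip()[1:])
        rs, rt = (int(s.strip()[1:]) for s in srcs.split(","))
        if mnemonic == "add":
            def behavior(regs):
                regs[rd] = regs[rs] + regs[rt]
        elif mnemonic == "mul":
            def behavior(regs):
                regs[rd] = regs[rs] * regs[rt]
        else:
            raise ValueError(f"unknown instruction: {text}")
        return behavior

    def run(self, iterations):
        """Execute the whole program 'iterations' times, like a loop body."""
        for _ in range(iterations):
            for address, text in enumerate(self.program):
                if not self.use_cache:
                    behavior = self._decode(text)      # interpretive: decode every time
                else:
                    behavior = self.cache.get(address)
                    if behavior is None:               # cache miss: decode once, keep it
                        behavior = self._decode(text)
                        self.cache[address] = behavior
                behavior(self.regs)


program = ["add r1 = r1, r2", "mul r3 = r1, r2"]
interpretive = ToySimulator(program, use_cache=False)
cached = ToySimulator(program, use_cache=True)
for sim in (interpretive, cached):
    sim.regs[2] = 2
    sim.run(10)
print(interpretive.decode_count, cached.decode_count)
```

Both runs end in the same register state, but the interpretive run decodes 20 times and the cached run only twice. A real simulator of this kind would presumably also have to invalidate cache entries when program memory changes; the sketch omits that.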

(54)

JIT-CC simulation performance

[Figure: Performance [MIPS] (0 to 9) and Cache Miss Ratio [%] (0 to 100) versus Cache size [records] (8 to 32768), with compiled and interpretive simulation shown for comparison]

• Dependent on simulation cache size

• 95% of compiled simulation performance @ 4096 cache blocks (10% memory consumption of compiled sim.)

• Example: ST200 VLIW DSP

(55)

Why care about C compilers?

• Embedded SW design becoming predominant manpower factor in system design
• Cannot develop/maintain millions of code lines in assembly language
• Move to high-level programming languages

(56)

Why care about compilers?

• Trend towards heterogeneous multiprocessor systems-on-chip (MPSoC)
• Customized application specific instruction set processors (ASIPs) are key MPSoC components
• How to achieve efficient compiler support for ASIPs?

[Figure: heterogeneous MPSoC composed of ASIC, CPU, ASIP, and memory blocks]

(57)

C compiler in the exploration loop

"Compiler/Architecture Co-Design"

• Efficient C-compilers cannot be designed for ARBITRARY architectures!

[Figure: loop over Application Software, Compiler, Processor, Results]

• Compiler and processor form a UNIT that needs to be optimized!
• "Compiler-friendliness" needs to be taken into account during the architecture exploration!

(58)

Retargetable compilers

[Figure: classical compiler and retargetable compiler side by side; both translate source code into asm code, each shown with a processor model]

(59)

GNU C compiler (gcc)

• Probably the most widespread retargetable compiler

• Mostly used as a native Unix/Linux compiler, but may operate as a cross-compiler, too

• Support for C/C++, Java, and other languages

• Comes with comprehensive support software, e.g. runtime and standard libraries, debug support

• Portable to new architectures by means of machine description file and C support routines

"The main goal of GCC was to make a good, fast compiler for machines in the class that the GNU system aims to run on: 32-bit machines that address 8-bit bytes and have several general registers. Elegance, theoretical power and simplicity are only secondary."

(60)

CoSy compiler system (ACE)

• Universal retargetable C/C++ compiler

• Extensible intermediate representation (IR)

• Modular compiler organization

• Generator (BEG) for code selector, register allocator, scheduler

(61)

LISATek C compiler generation

[Figure: the LISA processor model passes through autom. analyses and manual refinement (GUI) into the CoSy system, which produces the C Compiler]

LISA processor model:

SYNTAX {
  "ADD" dst, src1, src2
}
CODING {
  0b0010 dst src1 src2
}
BEHAVIOR {
  ALU_read (src1, src2);
  ALU_add ();
  writeback (dst);
}
SEMANTICS {
  src1 + src2 → dst;
}
…

(62)

LISATek compiler generation

[Figure: compiler structure (Frontend, Opt, Backend with Code-Selector, Register-Allocator, Scheduler) next to the processor model (pipeline FE, DE, EX, WB with Instruction-Fetch, Decoder, ALU, Jump, Write-Back, Pipeline Control, Registers R[0..31], Data RAM, Prog RAM); backend tables list instructions such as ADD, SUB, MUL, JMP]

C-Code:

int a,b,c;
a = b+1;
c = a<<3;
…

ASM-Code:

LD R1, [R2]
ADD R1, #1
SHL R1, #3
…

(63)

Compiled code quality: MIPS example

LISATek generated C-Compiler

• Out-of-the-box C-Compiler
• No manual optimizations
• Development time of model approx. 2 weeks

gcc C-Compiler

• gcc with MIPS32 4kc backend
• Used by most MIPS users
• Large group of developers, several man-years of optimization

[Figure: bar charts of Cycles (0 to 140,000,000) and Size (0 to 80,000) for gcc -O4, gcc -O2, cosy -O4, cosy -O2]

Overhead of 10% in cycle count and 17% in code density

(64)

Demands on code quality

• Compilers for embedded processors have to generate extremely efficient code

Code size:

» system-on-chip

» on-chip RAM/ROM

Performance:

» real-time constraints

Power/energy consumption:

» heat dissipation

» battery lifetime

(65)

Compiler flexibility/code quality trade-off

[Figure: variety of embedded processors (DSP, NPU, VLIW); specialization through dedicated optimization techniques versus unification through retargetable compilation]

(66)

Adding processor-specific code optimizations

• High-level (compiler IR): enabled by CoSy's engine concept
• Low-level (ASM):

[Figure: .C file, LISA C Compiler, unscheduled .asm, Assembly API with Optimization 1, Optimization 2, Optimization 3, scheduled & optimized .asm, then Binary Code Generation with Assembler and Linker producing .out]

(67)

4. ASIP architecture design

(68)

ASIP implementation after exploration

(69)

Unified Description Layer

[Figure: abstraction layers LISA, Register-Transfer-Level, Gate-Level]

LISA, then HDL Generation, then Gate-Level Synthesis (e.g. Synopsys Design Compiler)

(70)

Challenges in Automated ASIP Implementation

Independent description of instruction behavior:

+ Efficient Design Space Exploration

Independent mapping to hardware blocks:

- Insufficient architectural efficiency by 1:1 mapping

[Figure: ADL instructions (Arithmetic: Mul, Mac; Control: JMP, BRC) mapped 1:1 to HDL blocks Multiplier (MUL) and Multiplier (MAC)]

(71)

Unified Description Layer

[Figure: abstraction layers LISA, Unified Description Layer, Register-Transfer-Level, Gate-Level]

LISA, then Structure & Mapping (incl. JTAG/DEBUG), then Optimizations, then Backend (VHDL, Verilog, SystemC), then Gate-Level Synthesis (e.g. Synopsys Design Compiler)

(72)

Optimization strategies

LISA: separate descriptions for separate instructions

Goal: share hardware for separate instructions

[Figure: Instruction A (LISA Operation A) computes x = a + b and Instruction B (LISA Operation B) computes y = c + d; because of their mutual exclusiveness, a single adder with inputs a/c and b/d can produce x,y]

Possible Optimizations

• ALU Sharing

(73)

Optimization strategies

LISA: separate descriptions for separate instructions

Goal: same hardware for separate instructions

[Figure: Instruction A (LISA Operation A, Path P_A) and Instruction B (LISA Operation B, Path P_B) access a Register Array through separate Address_A/Data_A and Address_B/Data_B ports; with mutual exclusiveness, resource sharing merges them into one Register Array access carrying Address_A, Address_B and Data_A, Data_B]

Possible Optimizations

• ALU Sharing
• Path Sharing
• ...
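The ALU sharing idea can be checked with a small behavioral model: when two instructions are mutually exclusive, one adder with operand multiplexers produces the same result as two separate adders. This is a minimal sketch in Python rather than HDL; the function names and the adder count returned alongside the result are illustrative assumptions.

```python
from itertools import product


def unshared_datapath(select_a, a, b, c, d):
    """1:1 mapping: each instruction gets its own adder, a mux picks the result."""
    x = a + b  # adder for instruction A
    y = c + d  # adder for instruction B
    adders_used = 2
    return (x if select_a else y), adders_used


def shared_datapath(select_a, a, b, c, d):
    """Shared ALU: muxes pick the operands, a single adder does the work."""
    operand_1 = a if select_a else c
    operand_2 = b if select_a else d
    adders_used = 1
    return operand_1 + operand_2, adders_used


# Exhaustively compare both datapaths on a small operand range.
equivalent = all(
    unshared_datapath(sel, a, b, c, d)[0] == shared_datapath(sel, a, b, c, d)[0]
    for sel in (True, False)
    for a, b, c, d in product(range(4), repeat=4)
)
print(equivalent)
```

The equivalence only holds because exactly one of the two instructions is active at a time; if both could execute in the same cycle, the single adder would become a conflict.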

(74)

5. Case study

(75)

Motorola 6811

Project Goals:

• Performance (MIPS) must be increased

• Compatibility on the assembly level for reuse of legacy code

(Integration into existing tool flow)

• Royalty free design

→ compatible architecture developed with LISA using RTL processor synthesis

(76)

Motorola 6811

[Figure labels: legacy code, compiler, assembly, assembler, binary code, 6811, 6812, ?]

Increase Performance!!! (MIPS)

(77)

Motorola 6811

[Figure: Bluetooth app., 6811 compiler, assembly, LISA assembler, binary code, Synthesized Architecture; assembly level compatible]

(78)

Architecture Development

original 6811 Processor:

• 8, 16, 24, 32, and 40 bit instructions
• Instruction is fetched by 8 bit blocks: → up to 5 cycles for fetching!

LISA 6811 Processor:

• 16 and 32 bit instructions
• 16 bit are fetched simultaneously: → max 2 cycles for fetching!
+ pipelined architecture
+ possibility for special instructions
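The fetch-cycle figures follow from dividing the instruction width by the fetch width and rounding up. The short sketch below reproduces both numbers; the helper name `fetch_cycles` is hypothetical.

```python
def fetch_cycles(instruction_bits, fetch_bits):
    """Cycles needed to fetch one instruction: ceiling of width / fetch width."""
    return -(-instruction_bits // fetch_bits)


# Original 6811: instructions of 8 to 40 bit, fetched in 8 bit blocks.
original_worst_case = max(fetch_cycles(w, 8) for w in (8, 16, 24, 32, 40))

# LISA 6811: instructions of 16 or 32 bit, 16 bit fetched per cycle.
lisa_worst_case = max(fetch_cycles(w, 16) for w in (16, 32))

print(original_worst_case, lisa_worst_case)
```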

(79)

Tools Flow and RTL Processor Synthesis

[Figure: C-Application, 6811 compiler, Assembly, LISA assembler, Executable; the LISA model yields the LISA tools]

6811 compatible architecture generated completely in VHDL

1) VLSI Implementation: Area: <17kGates, Clock Speed: ~154 MHz
2) Mapped onto XILINX FPGA

(80)

References

• R. Leupers: Code Optimization Techniques for Embedded Processors - Methods, Algorithms, and Tools, Kluwer, 2000
• R. Leupers, P. Marwedel: Retargetable Compiler Technology for Embedded Systems - Tools and Applications, Kluwer, 2001
• A. Hoffmann, H. Meyr, R. Leupers: Architecture Exploration for Embedded Processors with LISA, Kluwer, 2002
• C. Rowen, S. Leibson: Engineering the Complex SoC: Fast, Flexible Design with Configurable Processors, Prentice Hall, 2004
• M. Gries, K. Keutzer, et al.: Building ASIPs: The Mescal Methodology, Springer, 2005
• P. Ienne, R. Leupers (eds.): Customizable and Configurable Embedded Processor Cores, Morgan Kaufmann, to appear 2006
