Integrating dataflow and non-dataflow real-time application models on multi-core platforms


FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

Integrating Dataflow and Non-Dataflow Real-time Application Models on Multi-core Platforms

Hazem Ismail Abdelaziz Ali

Programa Doutoral em Engenharia Eletrotécnica e de Computadores

Supervisor: Prof. Dr. Luís Miguel Rosário da Silva Pinho

Co-Supervisor: Dr. Kjell Benny Åkesson


Approved by:

President: Dr. José Alfredo Ribeiro da Silva Matos

External Referee: Dr. Sander Stuijk

External Referee: Dr. Johan Eker

Internal Referee: Dr. Luís Miguel Pinho de Almeida

Internal Referee: Dr. Mário Jorge Rodrigues de Sousa

Supervisor: Dr. Luís Miguel Rosário da Silva Pinho


Abstract

Applications in all segments of computing, including embedded systems, are steadily becoming more complex because of the growing range of functionality they offer. This complexity requires platforms with higher performance to satisfy the growing computational demands. This need has driven the adoption of multi-core processors in embedded systems, since they allow performance to be increased at a reasonable energy cost.

Future real-time embedded systems will increasingly incorporate mixed application models with timing constraints running on the same multi-core platform. These application models are dataflow applications with timing constraints and traditional real-time applications modelled as independent arbitrary-deadline tasks. Examples of such mixed embedded systems are Autonomous Driving Systems and Unmanned Aerial Vehicles. These systems require guarantees that all running applications satisfy their timing constraints. To be cost-efficient in terms of design, they also require efficient mapping strategies that maximize the use of system resources and thereby reduce the overall cost.

This work proposes a complete approach whose main goal is to integrate mixed application models (dataflow and traditional real-time applications) with timing requirements on the same multi-core platform. The approach guarantees that the mapped applications satisfy their timing constraints and maximizes the utilization of the platform resources. Three main algorithms achieve this goal. The first algorithm is called slack-based merging, an offline dataflow graph reduction technique that aims to decrease the complexity of dataflow applications and thereby their analysis time. The algorithm reduces the run-time of our approach by 82% to 90%, compared to when it is not used. The experimental evaluation with real application models from the SDF3 benchmark shows that: 1) the reduced graph respects the timing constraints, i.e. throughput and latency, of the original application graph, and 2) when the throughput constraint is relaxed with respect to the maximal throughput of the graph, the merging algorithm is able to achieve a larger reduction in graph size.

The second algorithm is called Timing Parameter Extraction, which extracts timing parameters, i.e. offsets, periods and deadlines, of dataflow applications with timing constraints, i.e. throughput and latency, converting them into periodic arbitrary-deadline tasks. These tasks execute in a way that preserves the dependencies of the original dataflow application using the offset parameter, while satisfying its timing constraints using the period and deadline parameters. This algorithm is a means to unify the two mixed application models into a single real-time task set. The main advantage of this algorithm is that the extraction of the timing parameters is independent of the specific scheduler being used, of other applications running in the system, and of the details of the particular platform. In addition, the experimental evaluation shows that the reduced-size dataflow graphs generated by the slack-based merging algorithm, in particular for applications that do not need to execute at maximum throughput, help speed up the extraction of the timing parameters.

The third algorithm is called communication-aware mapping, which allocates the mixed application models on a 2D-Mesh multi-core platform after unifying them. The mapping process takes the timing requirements of the applications into account and maximizes the utilization of the platform's computational resources while considering potential communication costs. This algorithm is based on a novel mapping heuristic called Sensitive-Path-First, which surpasses the well-known First Fit bin-packing heuristic in terms of number of allocated applications and run-time by up to 28% and 22%, respectively. The experimental evaluation reveals a direct relation between the number of allocated applications and the availability of communication resources, which demonstrates the importance of considering communication cost. We also show that ignoring communication cost, as frequently done in existing work, allows 76% more applications to be mapped, although the applications in the system are then no longer guaranteed to satisfy their timing constraints.

Together, these three algorithms achieve the main goal of this thesis and play a part in allowing embedded real-time systems to map and schedule mixed application models. The complete approach and the three algorithms presented in this thesis have been validated through proofs and experimental evaluation.


Resumo

As in other computing domains, embedded systems are becoming ever more complex because of the growth and diversity of the functionality they provide, which has created a need for platforms with higher performance. This demand has driven the increasing adoption of multi-core processing platforms in this type of system, allowing performance to be increased at a reasonable energy cost.

Future embedded systems will integrate, on the same multi-core platform, applications with different models of computation and with timing requirements. Among these, traditional real-time applications (modelled as independent tasks) are expected to be integrated with applications modelled as dataflow. Examples can be found in autonomous driving systems and unmanned aerial vehicles, which require a guarantee that the deadlines of all applications are met. In addition, these are systems for which automated strategies for mapping the computation, maximizing the use of the resources provided by the platform, are essential.

This dissertation proposes a complete methodology for integrating applications with distinct computational models (dataflow and traditional real-time) and with timing requirements on a single multi-core platform. The methodology guarantees that the applications meet their timing requirements while maximizing the utilization of system resources. To this end, the methodology includes three different algorithms.

In a first step, an algorithm called slack-based merging is used to reduce the complexity of the dataflow graphs that model the applications using this computational model, which reduces their analysis time. This algorithm reduces the run-time of the process by 82% to 90%. The experimental evaluation with real application models from the SDF3 benchmark demonstrates that: 1) the reduced graph respects the timing requirements of the original graph, i.e. throughput and latency, and 2) when the throughput requirement is relaxed with respect to the maximum allowed by the graph, the algorithm achieves a larger reduction in graph size.

The second algorithm, Timing Parameter Extraction, extracts the traditional timing characteristics of a real-time application, i.e. periods, deadlines and offsets, from dataflow models with throughput and latency requirements, thereby converting these dataflow graphs into independent periodic tasks. These tasks execute in a way that preserves the dependencies of the original dataflow model by offsetting the activation of successor tasks, while satisfying the throughput and latency requirements through the activation periods and deadlines. This algorithm thus unifies the two distinct models of computation into a single real-time task set. Its main advantage is that the parameter extraction is independent of the scheduler used, of other applications running in the system, and of the details of the platform. The experimental evaluation also shows that the run-time of this extraction is reduced by the graph reduction obtained with the previous algorithm, particularly for applications that do not need to execute at maximum throughput.

The third algorithm, communication-aware mapping, maps the applications that use the two models of computation, after unification, onto a multi-core platform with a two-dimensional inter-core interconnect (2D-Mesh). The mapping takes the timing requirements of the applications into account and maximizes the utilization of the platform's computational resources while considering potential communication costs. This algorithm is based on a novel heuristic, Sensitive-Path-First, which obtains better results than the First-Fit heuristic, both in terms of number of mapped applications and in terms of run-time (28% and 22% better, respectively). The experimental evaluation shows a direct relation between the number of mapped applications and the availability of communication resources, which demonstrates the importance of considering these costs during mapping. We also show that ignoring communication costs, as is commonly done in similar work, allows up to 76% more applications to be mapped, although without being able to guarantee that their timing requirements are satisfied.

Together, these three algorithms achieve the main objective of this dissertation, enabling the mapping and integration of applications with distinct computational models in real-time embedded systems. The complete methodology and the three algorithms presented in the dissertation were validated through proofs and experimental evaluation.


Acknowledgements

Undertaking this PhD has been a truly life-changing experience. Like most research work, this PhD is the result of a curious and inquisitive spirit, coupled with plenty of hard work and persistence. Naturally, it was difficult at times, but overall, the fulfilling moments far exceeded the hardships. This research would not have been possible without the support and guidance that I received from many people, to whom I will always be grateful.

First, I would like to express my sincere gratitude to my supervisor Prof. Luís Miguel Pinho for believing in me and giving me the chance to work with him. His continuous guidance, patience, motivation, and support throughout my PhD studies helped me at every stage of my research. I also wish to extend sincere and heartfelt thanks to my co-supervisor Dr. Benny Akesson, on both a professional and a personal level. On the professional level, I thank him for his dedication and comprehensive assistance throughout my research journey. His sharp insights, valuable feedback and detailed discussions helped shape my research into its final outcome. On the personal level, Benny is so friendly that you forget that he is actually your supervisor. He always maintains a personal relationship with his students, socializing and taking part in different activities with them. I will never forget our long runs, where we had fun and enjoyable discussions.

Second, a huge thank-you goes to Dr. Stefan Markus Petters. Although we did not work together directly, he was one of the main reasons I joined the CISTER research group. He was kind enough to listen to my counter-argument after he had sent an email declining my application for the PhD position. This does not normally happen when applying for PhD positions. I am really grateful to him.

Third and most important, none of these achievements would have been possible without the love and patience of my family. My parents, Ismail Abdelaziz Ali and Somaia Mohamed Elsayad, have been a constant source of love, concern, support and strength all these years. I especially thank my mother, Somaia Mohamed Elsayad, for the long hours she invested in teaching me mathematics, algebra and geometry, which made me like engineering. I owe her what I am right now. I would also like to express my heartfelt gratitude to my brothers and sister, Mohamed Abdelaziz, Ahmed and Reham, for their continuous encouragement during my long research journey, which started in 2008 when I went to Sweden for my master's degree.

Fourth, I thank my dear friend and CISTER companion Borislav Nikolić. We spent more than six years together at CISTER, where we shared many memorable moments of happiness, success and lifetime achievements. His valuable advice, along with his cheerful and funny spirit, made my PhD life easier. I thank him deeply. I will never forget those times, and I wish him all the best in his life and career.

Fifth, I thank Prof. Eduardo Tovar for creating an outstanding work environment at the CISTER Research Center. I have always enjoyed the working environment in our office, with great office mates, especially Muhammad Ali Awan and Claudio Maia, who have been good friends and colleagues. During these six years, we have had interesting discussions covering a variety of topics, such as technology, sports and culture. I would like to add that I feel fortunate to have known Harrison Kurunathan during these years. Last but not least, I extend my gratitude to all the staff members at the CISTER Research Center, who have made these years more enjoyable.

This work was partially supported by FCT (Fundação para a Ciência e Tecnologia) under the individual doctoral grant SFRH/BD/79872/2011.


List of Publications

Articles Included in this Thesis

• Hazem Ismail Ali, Luís Miguel Pinho and Benny Akesson, "Critical-Path-First based allocation of real-time streaming applications on 2D mesh-type multi-cores," in IEEE 19th International Conference on Embedded and Real-Time Computing Systems and Applications, Taipei, 2013, pp. 201-208. doi: 10.1109/RTCSA.2013.6732220, URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6732220&isnumber=6732192

• Hazem Ismail Ali, Benny Akesson and Luís Miguel Pinho, "Generalized Extraction of Real-Time Parameters for Homogeneous Synchronous Dataflow Graphs," in 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, Turku, 2015, pp. 701-710. doi: 10.1109/PDP.2015.57, URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7092796&isnumber=7092002

• Hazem Ismail Ali, Sander Stuijk, Benny Akesson, and Luís Miguel Pinho, "Reducing the complexity of dataflow graphs using slack-based merging," in ACM Transactions on Design Automation of Electronic Systems, 22, 2, Article 24 (January 2017), 22 pages. ISSN 1084-4309. doi: 10.1145/2956232, URL: http://dx.doi.org/10.1145/2956232

• Hazem Ismail Ali, Benny Akesson and Luís Miguel Pinho, "Combining dataflow applications and real-time task sets on multi-core platforms," in Proceedings of the 20th International Workshop on Software and Compilers for Embedded Systems (SCOPES '17), Sander Stuijk (Ed.). ACM, New York, NY, USA, 60-63. doi: 10.1145/3078659.3078671, URL: https://doi.org/10.1145/3078659.3078671

List of Figures

1.1 Dataflow application.

1.1(a)

1.1(b)

1.2 Example of SDF and HSDF graphs.

1.2(a) SDF graph

1.2(b) HSDF graph

1.3 Examples of embedded systems running streaming applications.

1.3(a) Smartphones [Kenya Tech News, 2015].

1.3(b) Autonomous driving systems [Daily Autonomous Car News, 2015]

1.4 Examples of Interconnection Networks (IN) [Sanchez et al., 2010].

1.4(a) 2D Mesh

1.4(b) Fat Tree

1.4(c) Flattened Butterfly

1.5 Problem to be addressed

1.6 Solution outline.

3.1 Real-time task parameters.

3.2 An SDF graph and its HSDF representation.

3.2(a) SDF graph.

3.2(b) HSDF graph.

3.3 An SDF graph and its HSDF representation with finite-size buffers.

3.3(a) SDF graph

3.3(b) HSDF graph

3.4 TILE64™ block diagram [Bell et al., 2008].

3.5 A TDM frame with frame size F of 6, where 2 slots κ_1 are allocated to application A_1 under the continuous slot assignment policy [Akesson et al., 2015].

4.1 An SDF graph and its HSDF representation.

4.1(a) SDF graph

4.1(b) HSDF graph

4.2 An SDF graph and its HSDF representation with finite-size buffers.

4.2(a) SDF graph

4.2(b) HSDF graph

4.3 A safe merge operation of two independent firings (v_ij, v_kl) into a new cluster V.

4.4 HSDF graph after adding s and t.

4.5 Example of slack-based merging.

4.5(a) Merging of v_b0 and v_b1

4.5(b) Merging of v_c0 and v_c1


5.2 Enumeration of time-constrained paths.

5.3 Partial path classes for offsets setting

5.3(a) Class Head partial path

5.3(b) Class Tail partial path

5.3(c) Class Middle partial path

5.4 HSDF example.

5.4(a) HSDF application.

5.4(b) HSDF timing diagram.

5.4(c) Actors' timing parameters.

5.5 h263encoder results.

5.5(a) Results in terms of number of actors

5.5(b) Results in terms of run-time

5.5(c) The percentage of change in the CP execution time of G_m compared to G_h

5.6 h263decoder results.

5.6(c) The percentage of change in the CP execution time of G_m compared to G_h

5.7 satellite results.

5.7(c) The percentage of change in the CP execution time of G_m compared to G_h

5.8 modem results.

5.8(c) The percentage of change in the CP execution time of G_m compared to G_h

6.1 Initial modelling of communication.

6.1(a) HSDF graph

6.1(b) HSDF graph with message actors G_com

6.2 Core selection methodology

6.2(a) spiral_move

6.2(b) find_nearest_core

6.3 Partial path classification used by SPF heuristic.

6.3(a) Class Head partial path

6.3(b) Class Tail partial path

6.3(c) Class Middle partial path

6.4 Effect of reservation bandwidth R.

6.5 Evaluation of the mapping heuristic.

6.5(a) Results in terms of average number of mapped applications.

6.5(b) Results in terms of run-time.

6.6 Mapping results for merged and original HSDF graphs.

6.6(a) Results in terms of average number of mapped applications.


List of Tables

4.1 SDF3 benchmark applications.

4.2 Run-time (seconds) of the algorithm.

4.3 Number of actors before and after merging.

6.1 General configuration of the experimental setup.

6.2 SDF3 benchmark applications.


List of Algorithms

1 Quick convergence Processor-demand Analysis (QPA) [Zhang and Burns, 2009a].

2 Slack-based merging

3 Extracting timing parameters of HSDF

4 Complete approach for integrating mixed application models on the same platform Π

5 Communication-aware mapping

6 Sensitive-Path-First (SPF)


List of Abbreviations

ADF Affine Dataflow

BFS Breadth First Search

bps bits per second

CPF Critical-Path-First

CP Critical Path

CSDF Cyclo-Static Dataflow

DAG Directed Acyclic Graph

dbf demand bound function

DCG Directed Cyclic Graph

DF Dataflow

DM Deadline Monotonic

DSP Digital Signal Processing

DVFS Dynamic Voltage and Frequency Scaling

EDF Earliest Deadline First

FF First Fit

FIFO First In First Out

GEDF Global Earliest Deadline First

HSDF Homogeneous Synchronous Dataflow

ILP Integer Linear Programming

IN Interconnection Network

MCM Maximum Cycle Mean

MLLF Modified Least Laxity First

MPAG Max-Plus Automaton Graph

NDF Non-Dataflow

NoC Network on Chip

PEDF Partitioned Earliest Deadline First

P/C Production/Consumption

RM Rate Monotonic

SCC Strongly Connected Component

SDF Synchronous Dataflow

SDM Space Division Multiplexing

SADF Scenario-Aware Dataflow

SPF Sensitive-Path-First

SPP Static-Priority Preemptive

TGFF Task Graphs For Free

TDM Time Division Multiplexing

TDMA Time Division Multiple Access

TPE Timing Parameters Extraction


List of Symbols

Task Parameters

τ A task set.

U The utilization of a task set τ.

τ_i The i-th task.

a_i The offset of τ_i (seconds).

C_i The WCET of τ_i (seconds).

T_i The period of τ_i.

D_i The relative deadline of τ_i (seconds).

D_i The absolute deadline of τ_i (seconds).

S_i The arrival time of τ_i (seconds).

F_i The finish time of τ_i (seconds).

U_i The utilization of τ_i.

ρ_i The density of τ_i.

J_i The job of τ_i.

R_i The response time of τ_i (seconds).

Feasibility Analysis

dbf(t_0, t_1) The demand bound function within the time interval [t_0, t_1].

H Hyperperiod (seconds).

t_a The upper bound on schedulability (seconds).

t_b The synchronous busy period of the processor (seconds).

h(t) The processor demand at time t (seconds).

Dataflow Applications

G A Synchronous Dataflow graph (SDF).

V Set of nodes (actors) in an SDF graph.

E Set of edges (channels) in an SDF graph.

G_h A Homogeneous Synchronous Dataflow graph (HSDF).

V_h Set of nodes (actors) in an HSDF graph.

E_h Set of edges (channels) in an HSDF graph.

G_p A graph representing a pipeline application.

V_p Set of nodes in a pipeline graph.

E_p Set of edges in a pipeline graph.

G_m A merged HSDF graph.

G_com An HSDF graph with message actors.

d Set of initial tokens in a dataflow (SDF/HSDF) graph.

Γ Topology matrix of an SDF graph.

q⃗ Repetition vector of an SDF graph.


A channel starting from actor v to actor v .

ζ Throughput requirement.

P_i The i-th time-constrained path in an HSDF graph.

P_i^p Partial path of the time-constrained path P_i.

LP_i^p List of partial paths in P_i.

γ_i The sensitivity of path P_i.

D End-to-end deadline constraint (seconds).

D_xy Latency constraint of a time-constrained path from v_x to v_y (seconds).

ε Laxity on a time-constrained path P (seconds).

δ Task slack (seconds).

Ω(v_ij) Set of predecessor firings of the firing v_ij.

Φ(v_ij) Set of successor firings of the firing v_ij.

ϑ_ij Earliest start time of a firing v_ij (seconds).

θ_ij Latest finish time of a firing v_ij (seconds).

V̂ Topologically ordered set of actors.

V Merged cluster of HSDF actors.

P Path cover for a DAG component of a G_h.

O The set of cycles in G_h.

σ_ij The slack of a firing v_ij (seconds).

C_k The execution time of a cycle k (seconds).

β A constant with a value in the range [1, ∞).

P The set of all time-constrained paths between actors with latency constraints in an HSDF graph.

Succ(v_x) The list of successor actors for the actor v_x.

System Model

Ψ The system.

Π The multi-core platform.

π_i The i-th core in the platform Π.

n One of the dimensions of the multi-core platform Π.

l_sw The router switching latency of a single flit (seconds).

l_t The transfer latency of a single flit (seconds).

L The link capacity of the IN of the platform Π (Gbps).

R_i The fraction of the link capacity L reserved for an application A_i (percentage).

h The number of hops of a packet p.

ĥ The maximum number of hops on the IN of any packet on the platform Π.

F The TDM frame size (slots).

κ_i Number of allocated slots for application A_i (slots).

p The packet size (bits).

f The flit size (bits).

A The application set.

A_i The i-th application of the application set A.

A_p A pipeline application.

m The size of the application set A.

C_i,p The time spent by a packet p of application A_i traversing the IN (seconds).

C_i,p^iso The isolation time of a packet p of application A_i traversing the IN (seconds).


Ĉ_i,p The initial value of the WCET of a message actor (seconds).

I_i^TDM The TDM interference time of any packet from application A_i traversing the IN (seconds).

I_i^TDM.co The TDM interference time of any packet from application A_i traversing the IN, assuming continuous slot assignment policy (slots).


Chapter 1 Introduction

We are living in the golden age of ubiquitous computing. If we look around, we find ourselves surrounded by computing devices embedded in systems that help or serve us in our daily lives. These systems range from simple portable gadgets, e.g. smartphones, cameras and gaming consoles, to large complex systems, e.g. airplanes, cars and industrial automation. These systems are called embedded systems.

An embedded system can be broadly defined as a computing system that performs a dedicated function within a larger system [Jiménez et al., 2014]. Unlike functions in general-purpose computing, this dedicated function is not designed to be programmed by the end user [Heath, 2002]. The concept of computing systems performing dedicated functions is old and predates the concept of a general-purpose computer [Jiménez et al., 2014]. The earliest forms of computing devices adhere better to the definition of an embedded system (in terms of performing a dedicated function) than to that of a general-purpose computer. An example of these devices is the Colossus computer [Copeland, 2006], which refers to a series of computers developed by British code-breakers in 1943-1945. The dedicated function of Colossus was to help in the cryptanalysis of German teleprinter messages during World War II.

At early stages, embedded system designs used microcontrollers as the main processing unit, since application demands were simple. As application demands and complexity grew, many embedded systems came to incorporate multi-core processor architectures to satisfy the increasing demands of their applications, since delivering high processing power within a low power budget is a major concern for such systems [Kim et al., 2010]. A real-life example of this trend is the cellular phone. The first generation of cellular phones incorporated a single-core digital signal processor chip [PratapSingh and Kumar Jain, 2014], since its main dedicated function was making phone calls. However, the latest generations feature at least a quad-core processor, e.g. the Samsung Galaxy S7 smartphone, which incorporates the Qualcomm® Snapdragon™ 820 processor [Qualcomm, 2016]. This is because the cellular phone has become a portable computing, multimedia and connectivity device.

The trend of the growing functionality of embedded systems can be demonstrated by the various types of applications that run simultaneously on the system [Jiménez et al., 2014]. These


with computationally intensive ones, such as multimedia and gaming applications. The fact that embedded systems run various applications with different requirements means that different applications may be represented using different computational models. In such systems running mixed computational models, guarantees are required to ensure that requirements (computational demands or timing constraints) are satisfied and that the system executes correctly, especially in the case of safety-critical applications. A current example of such systems is the high-end car, which may run an advanced multimedia entertainment system (which requires substantial computational resources) along with an autonomous driving function (a safety-critical application) that allows self-driving on highways, e.g. the Tesla Model S, X and 3 [TESLA, 2016].

Embedded systems running mixed computational models are a growing trend, since embedded systems are included in almost every device. In this thesis, we are concerned with embedded systems that incorporate mixed computational models with timing constraints running on the same multi-core platform. These computational models are dataflow with timing constraints and traditional real-time task sets, since they represent a wide range of applications running on embedded systems. The dataflow computational model represents Digital Signal Processing (DSP), streaming and multimedia applications, while the traditional real-time computational model covers a wide range of time-constrained applications with different levels of criticality. Examples of future embedded systems that run these two computational models are Autonomous Driving Systems [Elliott et al., 2014] and Unmanned Air Vehicles [Zhou and Wu, 2006]. These kinds of systems require real-time guarantees that all running applications will execute safely without missing their deadlines. They also require efficient use of system resources to minimize the overall cost of the system.

We begin by briefly introducing the two computational models considered in this thesis: the real-time computational model (Section 1.1) and the dataflow computational model (Section 1.2), where we detail the parameters and properties of each model. We then present an overview of processing platforms and architectures in Section 1.3. After these introductory sections, we introduce our problem statement in Section 1.4, followed by a detailed explanation of the proposed solution in Section 1.5. Finally, we end this chapter by summarising the thesis contributions and providing the thesis organisation in Sections 1.6 and 1.7, respectively.

1.1 Real-time Computational Model

A real-time computational model is a computing paradigm used to define a set of applications that have to respond to externally generated input stimuli within a finite and specified period of time [Buttazzo, 2004, Krishna, 1996]. The main characteristic that distinguishes real-time computing from other types of computation is time: the correct execution of the applications of such a computational model depends not only on the logical result but also on the time at which it is delivered. The instant by which a result must be produced is called a deadline. A response that arrives after the specified time interval may be useless or may even have fatal consequences. Based on these consequences, the real-time computational model classifies its applications into three categories [Buttazzo, 2004, Krishna, 1996]:

Hard real-time: An application is considered hard real-time if missing its deadline during execution may cause catastrophic consequences on the system under control, surrounding environment or people.

Firm real-time: An application is considered firm real-time if a result produced after its deadline is useless for the system, but does not cause any damage.

Soft real-time: An application is considered soft real-time if a result produced after its deadline still has some utility for the system, although it causes performance degradation.

These are the three basic categories of applications according to the real-time computational model. There exist other classifications that branch from these basic categories. Whatever their category, all the applications in this computational model are called real-time applications. In the following section, we shed more light on real-time applications and the criteria used to classify them.
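As a rough illustration of the three categories, the sketch below assigns a value to a result depending on when it is delivered relative to its deadline. The names `Criticality` and `result_value`, the `soft_decay` parameter and the linear decay used for soft real-time results are assumptions made only for this example; the text does not prescribe any particular value function.

```python
from enum import Enum


class Criticality(Enum):
    HARD = "hard"
    FIRM = "firm"
    SOFT = "soft"


def result_value(kind: Criticality, finish: float, deadline: float,
                 soft_decay: float = 0.5) -> float:
    """Illustrative value of a result delivered at time `finish`.

    On-time results are worth 1.0 in every category. The values given to
    late results are assumptions chosen only to contrast the categories.
    """
    if finish <= deadline:
        return 1.0
    if kind is Criticality.HARD:
        # A miss is treated as catastrophic.
        return float("-inf")
    if kind is Criticality.FIRM:
        # A late result is useless but causes no damage.
        return 0.0
    # SOFT: the value degrades linearly with lateness, never below zero.
    lateness = finish - deadline
    return max(0.0, 1.0 - soft_decay * lateness)
```

For instance, with a deadline of 10 time units, a soft real-time result delivered at time 11 keeps part of its value under these assumptions, whereas the same lateness makes a firm real-time result worthless.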

1.1.1 Real-time Applications

Real-time applications are widespread in everyday systems, e.g. telecommunications, aviation, nuclear reactors, autonomous driving systems and industrial automation. A real-time application can be modelled as a finite set of simple, highly repetitive entities that are recurrent in nature, called real-time tasks [Baruah and Goossens, 2004]. Each instance of a task is a basic unit of work that executes on the processing platform and is called a job [Liu, 2000]. A real-time task can be classified in different ways based on its timing parameters, which we discuss in detail below.

Real-time task classification:

A real-time task has several classifications that vary based on the criteria used. In this thesis, we are concerned with two criteria in real-time task classification. First, the frequency with which a task releases its jobs (task periodicity) classifies a real-time task into three categories [Isović and Fohler, 2000]:

Periodic tasks: A task that releases its jobs periodically after a fixed time interval is defined as a periodic task. The fixed duration between two consecutive job releases is called the period of the task.

Sporadic tasks: A task that releases its jobs at arbitrary time instants, but whose consecutive jobs are always separated by at least a predefined time interval, called the minimum inter-arrival time, is defined as a sporadic task.


Periodic tasks are the most well-known model in real-time systems. Sporadic tasks can be converted into periodic tasks using their predefined minimum inter-arrival time as the period [Buttazzo, 2004]. Aperiodic tasks can be handled using periodic servers with a budget. The server is modelled as a periodic task that serves aperiodic tasks until its budget expires; the budget is replenished every period [Sprunt, 1990].

Second, real-time tasks are always constrained by a timing requirement. A task should complete its execution within a predefined time interval called the relative deadline. The relative deadline of a task depends on the nature of the application. For example, the object recognition/detection application in an autonomous driving system has a relative deadline in the order of a few microseconds, while a room temperature monitoring application in an air conditioning system can have a relative deadline in the order of a few seconds. The relative deadline of a real-time task, whether it is periodic, sporadic or aperiodic, can be categorized into three main categories [Buttazzo, 2004, Krishna, 1996]:

Implicit-deadline task model: has a relative deadline equal to its period or minimum inter-arrival time.

Constrained-deadline task model: may have a relative deadline less than or equal to its period or minimum inter-arrival time.

Arbitrary-deadline task model: has a relative deadline that has no relation to the period or minimum inter-arrival time of a task. This means that the relative deadline can be set to any value regardless of the value of the task's period.
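The three deadline models above can be told apart by comparing a task's relative deadline with its period. The following is a minimal sketch of that comparison; the `Task` class, its field names and the function `deadline_model` are hypothetical names introduced only for illustration and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; all times use the same unit."""
    wcet: float      # worst-case execution time
    period: float    # period or minimum inter-arrival time
    deadline: float  # relative deadline


def deadline_model(task: Task) -> str:
    """Return the most specific deadline model that the task fits."""
    if task.deadline == task.period:
        return "implicit"
    if task.deadline < task.period:
        return "constrained"
    # The deadline exceeds the period, so only the arbitrary model applies.
    return "arbitrary"
```

Note that the models are nested: every implicit-deadline task also satisfies the constrained-deadline definition, and every constrained-deadline task is a special case of an arbitrary-deadline task. The sketch reports the most specific model only.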

In this thesis, we are concerned with real-time systems running periodic arbitrary-deadline tasks.

1.1.2 Worst-Case Execution Time

The execution time of a real-time task is an important parameter that defines its temporal behaviour. Different jobs of a task exhibit variation in their execution time depending on the hardware characteristics, the structure of the software, the input data and the behaviour of the environment with which the jobs are interacting. In order to guarantee temporal correctness, an upper bound on the execution time of a task, referred to as the Worst-Case Execution Time (WCET), is specified. The WCET of a task is a safe upper bound greater than or equal to the longest execution time of any job released by the task, under worst-case input conditions and without interference from other tasks. Any miscalculation of the WCET may cause a system failure, the severity of which depends on whether or not the system is hard real-time. Several methodologies and techniques for determining the WCET of a task are detailed in [Puschner and Burns, 2000, Wilhelm et al., 2008]. Real-time system designers consider the WCET of tasks while designing a system to guarantee its timing properties. However, different jobs of a task may execute for less than their WCET, leaving behind unused computing resources. To be safe, this bound is almost always pessimistic, so jobs typically execute faster.

Figure 1.1: Dataflow application.

1.2 Dataflow Computational Model

The dataflow computational model [Chamberlin, 1971, Estrin and Turn, 1963, Rodrigues, 1969, Shields, 1997] is a well-known, simple, and powerful model of parallel computation. In this model, there is no notion of a single point or locus of control as in conventional sequential computing. Instead, it models an application as a set of tasks with data dependencies. It is a very useful specification mechanism for signal processing systems since it captures the intuitive expressiveness of block diagrams, flow charts, and signal flow graphs, while providing the formal semantics needed for system design and analysis tools.

1.2.1 Dataflow Applications

A dataflow application is a directed graph, where the vertices represent computation tasks and the edges represent First-In First-Out (FIFO) queues that direct data values from the output port of one computation task to the input port of another. Hence, a dataflow application can be considered a set of computation tasks with dependencies. The graph's vertices (computation tasks) are called actors, while its edges (FIFO queues) are called channels. Channels thus represent data dependencies between actors.

A dataflow application executes by performing the functions defined by its actors. An actor can be a single instruction, or a sequence of instructions, since the dataflow model does not imply a limit on the size or complexity of actors. Initially, an actor is an idle task. Its execution is triggered once the required amount of data arrives on its input ports. The amount of input data is specified by each actor according to its functional requirements. Many actors may be ready to execute simultaneously, and thus represent many asynchronous concurrent computation events. An actor starts execution by consuming data from its corresponding input ports, performs computations, and then produces a certain amount of data on its output ports. The execution of an actor is called a firing, while the data produced or consumed in the firing process are referred to as tokens. Figure 1.1 shows an example of a dataflow graph that consists of two actors (a, b) and the channel between them, represented as a FIFO queue that directs tokens from the output port of actor a to the input port of actor b. Initially, actors a and b are idle. Once the required tokens are available on the input port of actor a, it consumes them, starting the firing process, then produces tokens on its output port. The tokens produced are transferred to the input port of actor b through the FIFO channel, triggering its firing process, which in turn produces tokens on its output port. The functions performed by the actors define the overall function of the dataflow graph. For example, Figure 1.1 could represent a water level control system, where actor a measures the current level of water in a tank and sends signals to actor b, which controls the operation of the water pump.

Figure 1.2: Example of SDF and HSDF graphs: (a) SDF graph, (b) HSDF graph.
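The firing behaviour of the two-actor graph of Figure 1.1 can be illustrated with a small simulation. This is only a sketch under assumptions that are not part of the original text: each firing consumes and produces exactly one token, the water level is a number between 0 and 1, and the function name `run_water_control` and the `threshold` parameter are hypothetical.

```python
from collections import deque


def run_water_control(samples, threshold=0.5):
    """Simulate actors a (sensor) and b (pump control) joined by a FIFO channel."""
    pending = deque(samples)  # tokens waiting on the input port of actor a
    channel = deque()         # FIFO channel from actor a to actor b
    commands = []             # tokens produced on the output port of actor b

    while pending or channel:
        if pending:
            # Firing of actor a: consume one input token, produce one token.
            level = pending.popleft()
            channel.append(level)
        if channel:
            # Firing of actor b: enabled as soon as a token is in the channel.
            level = channel.popleft()
            commands.append("pump_on" if level < threshold else "pump_off")
    return commands
```

For instance, `run_water_control([0.2, 0.9])` yields one pump command per measured level, in the order in which the tokens traversed the channel.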

A dataflow application has three important timing parameters:

Execution time of its actors: an actor may have different values of execution time. This may be due to different tokens consumed, which trigger different functions to be executed inside the actor. It may also be due to the same reasons that affect a real-time task, mentioned in Section 1.1.2. However, for predictable execution behaviour and analysis purposes, the execution time determined for each actor represents an upper bound (WCET) on all of its firing modes. The determination of the WCET is discussed in Section 1.1.2.

Throughput: is an important constraint and a crucial indicator of performance for dataflow applications. The throughput of a dataflow application refers to how often an actor produces an output token. To compute throughput, the WCET of the firing of each actor has to be determined and an execution scheme must be defined. The execution scheme is the self-timed execution of actors, where each actor fires as soon as all of its input data are available [Sriram and Lee, 1997].

Latency: is a timing constraint that defines a time bounded interval between firings of two actors in the dataflow application. It can be realised as a relative deadline for the firings that happen between these specific two firings.

There exist several dataflow computational models, e.g. Synchronous Dataflow (SDF), Homogeneous Synchronous Dataflow (HSDF) [Lee and Messerschmitt, 1987b], Cyclo-static Dataflow (CSDF) [Bilsen et al., 1995] and Scenario-Aware Dataflow (SADF) [Theelen et al., 2006], where each model has its own specifications and rules that enable capturing a wide range of applications. However, we focus on those that can be described by SDF and HSDF [Lee and Messerschmitt, 1987b].

SDF: is useful for modelling and analysis of Digital Signal Processing (DSP) and concurrent multimedia applications [Lee and Messerschmitt, 1987b, Poplavko et al., 2003, Sriram and Bhattacharyya, 2000, Wiggers et al., 2007], which represent computations on an indefinitely long data sequence. This is because of the ability to obtain periodic schedules for the SDF execution, where actors fire a determined number of times in a specific order, in a cyclic manner, where each cycle is called an iteration. Every actor in an SDF graph consumes/produces a fixed number of tokens every time it fires. SDF graphs are accompanied by several timing analysis techniques, which are used for evaluating performance metrics of such applications, most importantly throughput. Figure 1.2(a) shows an example of an SDF graph that consists of two actors a and b. Actor a represents a source task that produces two tokens every time it fires (denoted on its output port), while actor b represents a sink task that consumes a single token every time it fires (denoted on its input port). The periodic schedule for this SDF graph is (a, b, b), because actor a produces two tokens, which trigger actor b to fire twice, consuming a single token each time.

HSDF: is a restricted form of SDF, where actors consume/produce a single token every time they fire. Each actor in an HSDF graph fires once during an iteration of the graph. This restriction allows an HSDF graph to reveal the parallelism hidden in applications represented using more expressive models, e.g. SDF and CSDF. For example, Figure 1.2(b) shows an HSDF graph representation of the SDF graph shown in Figure 1.2(a). The HSDF graph reveals the parallelism hidden in the SDF graph by showing the two firings of actor b as separate actors (b0, b1) that can fire simultaneously. Graphs in many more expressive dataflow models, e.g. SDF and CSDF, can be converted to an equivalent HSDF graph using a conversion algorithm, such as the one presented in [Sriram and Bhattacharyya, 2000]. Although the transformation to HSDF reveals the parallelism in dataflow applications, it can lead to an exponential increase in the size of the original dataflow graph [Lee and Messerschmitt, 1987a, Sriram and Bhattacharyya, 2000], which may result in a significant increase in the run-time of many dataflow analysis algorithms, e.g. throughput analysis, as described in the following chapters. Further details on SDF and HSDF are given in Chapter 3.
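The number of times each actor fires in one iteration of an SDF graph follows from balancing, for every channel, the tokens produced against the tokens consumed. The sketch below computes these firing counts and lists one HSDF actor per firing, which reproduces the (a, b0, b1) example of Figure 1.2. It is an illustration only: it assumes a connected graph, the function names and the channel tuple layout are hypothetical, and it is not the conversion algorithm of [Sriram and Bhattacharyya, 2000], since it does not construct the HSDF channels.

```python
from fractions import Fraction
from math import lcm


def repetition_vector(actors, channels):
    """Firings per iteration for a connected SDF graph.

    channels: list of (src, dst, produced, consumed) tuples.
    """
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, dst, prod, cons in channels:
            # Balance equation: rate[src] * prod == rate[dst] * cons.
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * prod / cons
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * cons / prod
                changed = True
    for src, dst, prod, cons in channels:
        if rate[src] * prod != rate[dst] * cons:
            raise ValueError("inconsistent SDF graph: no periodic schedule")
    # Scale the rational rates to the smallest integer solution.
    scale = lcm(*(r.denominator for r in rate.values()))
    return {actor: int(r * scale) for actor, r in rate.items()}


def hsdf_actor_names(repetitions):
    """One HSDF actor per firing of each SDF actor, e.g. b -> b0, b1."""
    return [f"{actor}{i}" for actor, n in repetitions.items() for i in range(n)]
```

For the graph of Figure 1.2(a), `repetition_vector(["a", "b"], [("a", "b", 2, 1)])` gives one firing of a and two firings of b, which matches the periodic schedule (a, b, b). The size of the HSDF graph is the sum of these counts, which is why the conversion can grow quickly when production and consumption rates differ widely.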

1.2.2 Streaming Applications

Streaming applications constitute a huge application space for embedded systems. They are becoming increasingly important and widespread, since they run on many common devices and systems that affect our daily life. A well-known example is the smartphone, shown in Figure 1.3(a). It is a multi-purpose (i.e., communication, entertainment, navigation, etc.) embedded system that runs several streaming applications with purposes that range from communication to entertainment. Another example, considered safety-critical, is autonomous driving systems, shown in Figure 1.3(b), which have started to be integrated in many cars (e.g. Google, Tesla, Mercedes, etc.). These systems enable cars to sense their environment, navigate without human input and stay connected to the Internet [Gehrig and Stein, 1999]. Both of these example systems process audio and video streams on which streaming applications perform functions like audio/video encoding and decoding, object recognition, object detection and image enhancement [Elliott et al., 2014, Salunkhe et al., 2014, Siyoum et al., 2011]. These kinds of streaming applications have high processing requirements and timing constraints that must be satisfied, especially in the case of safety-critical applications.

Figure 1.3: Examples of embedded systems running streaming applications: (a) smartphones [Kenya Tech News, 2015]; (b) autonomous driving systems [Daily Autonomous Car News, 2015].

The high processing requirements raise the need for a parallelization model that enables applications to use massive computational power [Pankratius et al., 2009], which the dataflow model of computation is able to provide for streaming applications [Lee and Messerschmitt, 1987a]. This is because the dataflow model is inherently parallel and can work well in decentralized systems. Furthermore, since these applications are basically a series of transformations that are applied to a data stream, the dataflow model is a natural paradigm for representing them for concurrent implementation on multi-/many-core processors [Lee and Messerschmitt, 1987a].

The timing constraints of streaming applications require guarantees that they will be satisfied during execution. Recently, several works applied real-time scheduling and analysis techniques to dataflow applications [Bamakhrama and Stefanov, 2011, 2012, Di Natale and Stankovic, 1994, Kao and Garcia-Molina, 1997, Lipari and Bini, 2011, Liu et al., 2014, Saifullah et al., 2011]. However, they are limited to dataflow applications represented as Directed Acyclic Graphs (DAG) or implicit-deadline task models, which excludes a wide range of dataflow applications.

1.3 Processing Platform

This section discusses different processing platform architectures and the features of interconnection networks. The main goal is to explain the specifications of the processing platform assumed in this thesis.

The processing platform refers to the hardware responsible for running applications in the real-time embedded system. There is a paradigm shift towards multi-/many-cores in the design of processing platforms. Increasing the number of cores, rather than the clock speed of single processors, is the current way to improve the performance of high-end processors. One of the reasons why the clock rate gains of the past can no longer be continued is the unsustainable level of power consumption [Vajda, 2011].

Architecture:

A multi-/many-core platform has more than one core or processor. These cores can be similar or completely different in architecture. Consequently, multi-/many-core platforms can be categorised into two main types based on the relation between the cores on a given platform:

Homogeneous Architecture: in this architecture type, all cores in the platform are identical, have exactly the same properties in terms of computation (e.g. instruction set, frequency and cache size) and are interchangeable. The execution time and energy consumption of a task remain the same on all cores of such a platform. These platforms are also sometimes called symmetric multi-processor (SMP) platforms. Many platforms manufactured and deployed today in embedded systems fall under this category. For example, Cortex-A17 [Cor] from ARM (used in smartphones, tablets, smart TVs, etc.) has four identical cores on the same die.

Heterogeneous Architecture: this architecture type features at least two different kinds of cores that may differ in instruction set architecture, frequency and cache size. The most widespread example of a heterogeneous multi-core architecture is the Cell BE architecture, jointly developed by IBM, Sony and Toshiba [Gschwind et al., 2006] and used in areas such as gaming devices and computers targeting high performance computing.

Interconnection Networks (IN):

Since increasing the number of cores in multi-/many-core platforms is the current trend to increase performance, an efficient communication network, called an Interconnection Network (IN), is needed to connect them. The IN between multiple cores may be a performance bottleneck, since it is responsible for transferring and routing data between different cores. These data are in the form of packets with headers that contain information about their destination. Data transfer between distant cores can increase latency and consume extra power. In the following paragraphs, we look at traditional IN topologies.

2D-Mesh: shown in Figure 1.4(a), is a common topology that uses routers that are connected to other routers as well as a number of cores. Advantages include design simplicity and short links. Disadvantages include a potentially high number of hops.

Fat Tree: shown in Figure 1.4(b), is a tree topology where the cores are located at the leaves of a tree and the internal nodes are routers. Data travels upward in the tree until a common ancestor of source and destination is found. The number of links increases towards the root of the tree. Advantages include high bandwidth because of the increased number of links as data moves towards the root. Disadvantages include the need for more complex routers, again because of the increased number of connections toward the root.

Flattened Butterfly: shown in Figure 1.4(c), is a modified butterfly network that is essentially a mesh network with additional links. Advantages include a small number of hops. Disadvantages include complex routers and increased chip area due to the large number of links.

Figure 1.4: Examples of Interconnection Networks (IN) [Sanchez et al., 2010]: (a) 2D Mesh, (b) Fat Tree, (c) Flattened Butterfly.

Routing:

In all IN topologies, except the fully connected topology, not all router pairs are directly connected. Therefore, depending on the position of the sender and the receiver, packets may need to travel across multiple intermediate links and routers. The set of traversed network elements (routers and links) is called the route, while the number of traversed links is usually referred to as the number of hops.

The process of transferring packets from source to destination is called routing, which is the responsibility of the routers. Once a packet reaches a router, the router decides in which direction it will be forwarded. The logic inside the router that is responsible for making this decision is called the routing algorithm. There exist numerous criteria based on which routing decisions can be made. For example, minimal routing algorithms [Ni and McKinley, 1993] aim to minimise the route, and hence derive routing decisions such that packets always traverse the minimal possible number of hops. Deterministic routing algorithms always route packets between the same source and destination along the same path. Alternatively, adaptive routing algorithms [Bolotin et al., 2004] make routing decisions at runtime based on the status and load of individual links. Adaptive routing can improve the performance of the system (the average-case behaviour) by reducing the average communication time, but at the expense of predictability. Conversely, deterministic routing is predictable and much easier to implement, but may cause an inefficient utilisation of the NoC resources, where some links may be heavily congested while others are completely idle.

The selection of the routing mechanism depends on the purpose of the system. As already mentioned, in the real-time embedded domain the predictability of the system is essential, because it allows the temporal behaviour of the system to be analysed with significantly less pessimism. Thus, in the real-time domain, deterministic routing techniques are the preferable option.

One popular class of minimal deterministic routing algorithms for a 2D-Mesh IN is dimension-ordered routing. Under these schemes, packets are first routed along one dimension of the IN and, after reaching the coordinate of the destination in that dimension, continue along the other dimension if needed. One of the most popular routing algorithms of this class is X-Y routing, where the horizontal axis of the platform is usually denoted with the letter X and the vertical axis with the letter Y. The X-Y routing policy is deadlock-free [Hu and Marculescu, 2003].
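X-Y routing on a 2D mesh can be sketched in a few lines. The example below assumes that routers are identified by integer (x, y) coordinates; the function name `xy_route` is hypothetical. It returns the sequence of routers visited, so the number of hops is the length of the route minus one.

```python
def xy_route(src, dst):
    """Routers visited from src to dst under X-Y routing on a 2D mesh."""
    x, y = src
    route = [(x, y)]
    # First travel along the X dimension until the destination column.
    step = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step
        route.append((x, y))
    # Then travel along the Y dimension until the destination row.
    step = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step
        route.append((x, y))
    return route
```

Because the route depends only on the source and destination coordinates, the same pair of cores always uses the same links, which is the deterministic property discussed above. The hop count equals the Manhattan distance between the two routers, so the route is also minimal.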

Switching:

Switching defines how packets are transmitted from source to destination. When the IN resources are free, packets traverse routers and links on their route towards the destination. However, in the presence of other traffic, one of the links on a packet's route may be busy transferring other packets. In such cases, the switching mechanism resolves the situation. One such mechanism is store-and-forward switching [Tanenbaum, 2002], where the router stores the full packet before forwarding it to the next router on the route. With this mechanism, one must ensure that the buffer size at each router is sufficient to store the whole packet, otherwise the packet will be stalled. Another well-known mechanism is wormhole switching [Ni and McKinley, 1993], where the router makes the routing decision and forwards the packet as soon as the header arrives. The subsequent payload is split into smaller units called flits, which follow the header as they arrive. This reduces the latency within the router, but if a packet stalls, many links risk being blocked at once.
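The latency difference between the two switching mechanisms can be illustrated with a first-order model. The sketch below rests on simplifying assumptions that are not taken from the text: there is no contention, every link transfers one flit per cycle, routers add no extra delay, and the whole packet (header included) is counted in flits. The function names are hypothetical.

```python
def store_and_forward_cycles(hops, flits):
    """Each router receives the whole packet before forwarding it."""
    return hops * flits


def wormhole_cycles(hops, flits):
    """Flits are pipelined: the first flit needs `hops` cycles to arrive,
    and the remaining flits follow one cycle apart."""
    return hops + flits - 1
```

Under these assumptions, an 8-flit packet crossing 3 links takes 24 cycles with store-and-forward switching and 10 cycles with wormhole switching. The model says nothing about the contended case, where a stalled wormhole packet keeps several links occupied at once.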

Arbitration:

The main responsibility of the IN is to transfer and route communication data between different cores. During data transfer, significant contention may occur when accessing the shared resources of the IN, e.g. links and routers. Several approaches, called arbitration mechanisms, have been proposed to manage such contention. These mechanisms are provided by the IN to allow the multiplexing of several streams of data over the same physical medium (link). Common schemes are Space Division Multiplexing (SDM) [Banerjee et al., 2009, Lusala and Legat, 2011, Marchal et al., 2005, Modarressi et al., 2009] and Time Division Multiplexing (TDM) [Goossens et al., 2005, Liu et al., 2004, Wang et al., 2008, Zhang et al., 2010], either in the conventional slot allocation approach or in an arbitrated (e.g. round-robin, priority) link time sharing scheme. TDM is a commonly used arbiter for management of communication resources in multi-core platforms. The reasons for its popularity are that it is conceptually easy to understand and analyze and has efficient implementations both in hardware and software [Akesson et al., 2015]. Moreover, it provides temporal isolation between clients when used in a non-work-conserving manner [Goossens et al., 2013a]. Several platforms relying extensively on TDM for a variety of resources management are


Figure 1.5: Problem to be addressed

In this thesis, we are concerned with homogeneous multi-core processing platforms that incorporate a 2D-Mesh IN operated using X-Y routing, wormhole switching and TDM as the arbitration mechanism.
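The temporal isolation offered by non-work-conserving TDM arbitration can be illustrated with a slot table. The sketch below assumes a fixed, cyclically repeating table with one client per slot and requests that arrive at slot boundaries; the function name and the table layout are hypothetical and do not describe any particular platform.

```python
def worst_case_wait(table, client):
    """Largest number of slots a client waits for the start of its next slot.

    table: list of client names, one per slot, repeated cyclically.
    """
    if client not in table:
        raise ValueError(f"client {client!r} owns no slot in the table")
    n = len(table)
    return max(
        # Slots to wait when the request arrives at the start of slot t.
        next(k for k in range(n) if table[(t + k) % n] == client)
        for t in range(n)
    )
```

With the table `["A", "B", "A", "C"]`, client A waits at most one slot and client B at most three. The bound depends only on the table, not on whether the other clients actually use their slots, which is what makes the analysis of each client independent of the others.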

1.4 Problem Statement

In this thesis, we address the problem of real-time embedded systems incorporating mixed application models with timing constraints running on the same multi-core platform. These mixed application models are dataflow applications with timing constraints (latency and throughput) and traditional real-time applications, as shown in Figure 1.5. The design of such systems requires guarantees that all running applications mapped on the platform will execute safely, satisfying their timing constraints.

As shown in Figure 1.5, the traditional real-time applications are modelled as independent tasks. Each task is characterised by specific parameters, e.g. WCET, deadline and period. In contrast, dataflow applications are basically graphs of communicating tasks, which are actors. These actors are defined by a different set of parameters, e.g. WCET and Production/Consumption rate (P/C) of tokens. A dataflow application has timing constraints, i.e. latency and throughput requirements (Section 1.2.1), that must be satisfied. This leads to the main question of the thesis:

How can future real-time embedded systems safely incorporate mixed application models, dataflow and traditional real-time tasks, with timing constraints onto multi-core platforms, such that their timing constraints are satisfied?


Figure 1.6: Solution outline.

1.5 Solution Overview

In this section, we present an outline of our proposed solution to the problem stated in Section 1.4. The main goal of this solution is to provide guarantees for the mixed application model executing on the multi-core platform, such that timing constraints are satisfied.

To implement this kind of system, we have to address how to map and schedule such a mixed application model on the multi-core platform. Different solutions in mapping and scheduling have been proposed for each application model independently. The mapping problem has previously been tackled in several works from a high-performance point of view [Ennals et al., 2005, Evans and Kessler, 1992, Liu et al., 2007, Lo, 1988, Ma et al., 1982], where all applications are represented either as graphs or independent tasks. However, using these approaches in the mapping of real-time applications does not guarantee satisfying their timing constraints. Another


map-applying approaches that satisfy timing constraints and use FF, such as [Bamakhrama and Stefanov, 2011], results in over-dimensioned systems, as our experimental evaluation shows in [Ali et al., 2013] and Chapter 6. Moreover, such work [Guo and Bhuyan, 2006] does not consider the communication cost and its effect on the schedulability of the system.

The scheduling problem has been studied extensively for traditional real-time applications, resulting in several real-time scheduling algorithms, either for uniprocessors, e.g. Fixed Priority (FP) [Liu and Layland, 1973] and Earliest Deadline First (EDF) [Liu and Layland, 1973], or for multi-processors, e.g. Partitioned EDF (PEDF) [López et al., 2004] and hierarchical scheduling [Calandrino et al., 2007, Easwaran et al., 2009, Leontyev and Anderson, 2008, Zhu et al., 2011]. However, dataflow applications mostly use static scheduling, i.e. TDMA. Static scheduling works well for systems that only run dataflow applications. In contrast, for systems that run mixed real-time applications, a dynamic real-time scheduling algorithm may have a higher schedulability success rate than static scheduling, but it is not currently available for mixed systems. Furthermore, real-time scheduling algorithms can enable efficient real-time analysis techniques for such mixed systems. Recently, several works scheduled dataflow applications using real-time scheduling algorithms [Bamakhrama and Stefanov, 2011, 2012, Di Natale and Stankovic, 1994, Kao and Garcia-Molina, 1997, Lipari and Bini, 2011, Liu et al., 2014, Saifullah et al., 2011]. However, they are either limited to dataflow applications represented as Directed Acyclic Graphs (DAG), or they represent them as implicit-deadline tasks.
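To make the partitioned scheduling idea concrete, the sketch below assigns tasks to cores with a first-fit rule and accepts a core only while its total utilisation stays at or below 1, which is the EDF schedulability condition of [Liu and Layland, 1973] for implicit-deadline periodic tasks on one processor. This is a simple baseline for illustration only: it is not the mapping algorithm proposed in this thesis, it ignores communication cost, and the utilisation test does not carry over unchanged to the arbitrary-deadline tasks considered here. The function name and the (WCET, period) tuple layout are hypothetical.

```python
from fractions import Fraction


def first_fit_pedf(tasks, num_cores):
    """Partition implicit-deadline periodic tasks onto cores, first-fit.

    tasks: list of (wcet, period) pairs with integer values.
    Returns a list of task-index lists, one per core, or None if some
    task does not fit on any core.
    """
    load = [Fraction(0)] * num_cores
    cores = [[] for _ in range(num_cores)]
    for index, (wcet, period) in enumerate(tasks):
        utilisation = Fraction(wcet, period)
        for core in range(num_cores):
            # EDF on one core is feasible while total utilisation <= 1.
            if load[core] + utilisation <= 1:
                load[core] += utilisation
                cores[core].append(index)
                break
        else:
            return None  # no core can accept this task
    return cores
```

For example, the tasks `[(1, 2), (1, 2), (1, 4)]` fit on two cores as `[[0, 1], [2]]`, while a single core is rejected because the total utilisation is 1.25. Exact fractions are used so that a core loaded to exactly 1 is not rejected because of floating-point rounding.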

The proposed system runs two types of application models: traditional real-time applications and dataflow applications. The traditional real-time applications are a set of independent periodic arbitrary-deadline real-time tasks. These tasks are characterised by timing parameters that define their temporal behaviour in execution, e.g. WCET, period and relative deadline. Independent real-time tasks have a set of well-established real-time scheduling and analysis techniques in the literature that allow satisfying their timing constraints. The main idea is to apply these techniques and methods to dataflow applications to get the same guarantees. However, these techniques cannot be applied directly to dataflow applications, because dataflow applications lack the task model parameters that the techniques require. Therefore, a unified model for both types of application models is needed to apply traditional real-time scheduling and analysis techniques on the system, thereby guaranteeing that timing constraints are satisfied.

The unified modelling is a process that transforms the dataflow applications into traditional real-time tasks. This transformation is done using the timing parameter extraction algorithm shown in Figure 1.6 and detailed in Chapter 5. However, before the dataflow graph is passed to the timing parameter extraction algorithm, it has to go through two processes. The first is the graph reduction process, discussed in Chapter 4. It generates a reduced-size HSDF graph from the original HSDF graph. This is needed because the transformation to HSDF graphs can result in an exponential explosion in the graph size, which slows down the timing parameter extraction algorithm when applied to them. Therefore, the graph reduction process speeds up the overall design process, as the experiments show in Chapter 5. The second is the communication modelling process, which models the communication in the reduced-size HSDF graph, generating an extended HSDF graph that accounts for the communication cost. The extended communication-aware graph is then used as input to the mapping algorithm, as explained in Chapter 6. Following these two steps, the timing parameter extraction algorithm takes the HSDF graph with modelled communication as input, transforming it into a set of independent arbitrary-deadline tasks.

At this stage, we have a unified set of arbitrary-deadline real-time tasks. This enables applying traditional real-time scheduling and analysis techniques while mapping them on the platform. The mapping algorithm, shown in Figure 1.6, allocates the task set on the platform, guaranteeing that all applications satisfy their timing constraints. The proposed mapping algorithm is also communication-aware, which means that it considers the communication overhead resulting from the token exchange between different actors in the dataflow applications. The communication-aware mapping algorithm, detailed in Chapter 6, is able to do this because of the communication modelling of the HSDF graph performed in the earlier stages of the solution.

1.6 Thesis Contributions

As highlighted in the problem statement (Section 1.4), the main goal of this thesis is to allow future real-time embedded systems to map and schedule mixed application models with timing constraints on the same multi-core platform, guaranteeing that timing constraints are satisfied. To achieve this goal, we propose the solution discussed in Section 1.5 and shown in Figure 1.6, which consists of three main contributions:

1. An offline dataflow graph reduction algorithm, called slack-based merging, that aims to speed up the process of timing parameter extraction and finding a feasible real-time schedule, thereby reducing the overall design time of the real-time system. To achieve this goal, the algorithm combines two main concepts:

(a) The slack, which is the difference between the WCET of the SDF graph’s firings and its timing constraints.

(b) The safe merge, which is a novel merging concept that we prove cannot cause a live HSDF graph to deadlock.

The output is a reduced-size HSDF graph that satisfies the throughput and latency constraints of the original application graph.

2. A timing parameter extraction algorithm that extracts timing parameters of HSDF graphs with timing constraints, converting them into periodic arbitrary-deadline tasks. This algorithm provides a method to unify mixed application models into a single real-time task set. A main advantage of our proposal is that the extraction of the timing parameters is independent of the specific scheduler being used, of other applications running in the system and of the details of the particular platform. The proposed algorithm:

(b) Captures overlapping iterations, which are a main characteristic of the execution of dataflow applications, by modelling actors as tasks with arbitrary deadlines.

3. A mapping algorithm, called communication-aware mapping, dedicated to allocating HSDF graphs on 2D-Mesh multi-core platforms. The algorithm is based on a novel mapping heuristic called Sensitive-Path-First. For each HSDF graph, this heuristic first allocates the most critical paths in terms of schedulability (a path consists of a set of tasks), maximizing path parallelism when possible. The mapping process aims to satisfy the applications' timing constraints and to maximize the resource utilization of the platform, while accounting for the communication cost.
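As an illustration of item 2(b), the sketch below models a periodic arbitrary-deadline task in Python and counts how many of its jobs can be pending at the same time. The class and field names are hypothetical and are not taken from the thesis. The bound follows from releasing one job every period while each job's deadline window stays open for `deadline` time units.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodicTask:
    wcet: float      # worst-case execution time of one job
    period: float    # time between consecutive job releases
    deadline: float  # relative deadline; may exceed the period

    @property
    def utilization(self) -> float:
        return self.wcet / self.period

    def max_overlapping_jobs(self) -> int:
        """Upper bound on jobs whose deadline windows are open at once."""
        return math.ceil(self.deadline / self.period)


# An actor fired every 10 time units whose result is needed within 25.
actor_task = PeriodicTask(wcet=4, period=10, deadline=25)
overlap = actor_task.max_overlapping_jobs()  # 3 iterations may overlap
```

With a deadline no larger than the period the bound is one job, which is the constrained-deadline case; a deadline larger than the period is what lets consecutive iterations of the same actor overlap.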

Together, these three contributions achieve the main goal of this thesis: allowing embedded real-time systems to map and schedule mixed application models.

1.7 Thesis Organization

This thesis addresses the problem of mapping and scheduling mixed application models with timing constraints running on the same multi-core platform in real-time embedded systems. The thesis is organized as follows:

• Chapter 2 discusses the state of the art in three main topics that correspond to the three main contributions of this thesis: dataflow graph analysis, timing parameter extraction techniques and mapping methodologies.

• Chapter 3 provides a background on topics and terminology essential for understanding the research problem and the system model.

• Chapter 4 introduces the proposed graph reduction technique for dataflow applications called slack-based merging. It provides a detailed explanation of the algorithm, supported by proofs, examples and experiments that show its validity and functionality.

• Chapter 5 presents the timing parameter extraction algorithm that transforms dataflow applications into independent real-time tasks. The chapter starts by discussing similar mechanisms for timing parameter extraction for pipelines. Then, it shows how these mechanisms are incorporated in the proposed algorithm to extend its functionality to cover dataflow graphs. We present proofs, examples and experiments that show the validity and functionality of our proposed algorithm. Moreover, the experiments show the speed-up effect of the graph reduction technique on the timing parameter extraction process.

• Chapter 6 describes the proposed mapping algorithm called communication-aware mapping. It begins by presenting the mechanism for communication modelling in dataflow graphs. Then, it lists and describes the components of the communication-aware mapping algorithm, in particular its main mapping heuristic called Sensitive-Path-First, which is inspired by the Critical-Path-First (CPF) mapping heuristic proposed in [Ali et al., 2013]. In addition, the chapter provides a full view of our proposed solution by integrating the three algorithms together. This allows experimenting with both the communication-aware mapping algorithm and the whole system.
