An AI based tool for networks-on-chip design space exploration

Department of Informatics and Applied Mathematics Graduate Program in Systems and Computing

Master’s Degree in Systems and Computing

An AI based Tool for Networks-on-Chip Design Space Exploration

Jefferson Igor Duarte Silva

Natal-RN August 2018

An AI based Tool for Networks-on-Chip Design Space Exploration

Master’s thesis proposal presented to the Graduate Program in Systems and Computing of the Department of Informatics and Applied Mathematics at the Federal University of Rio Grande do Norte as a partial requirement for the degree of Master in Systems and Computing.

Supervisor:

Prof. PhD. Márcio Eduardo Kreutz

Graduate Program in Systems and Computing (PPgSC) Department of Informatics and Applied Mathematics (DIMAp)

Center of Exact and Earth Sciences (CCET) Federal University of Rio Grande do Norte (UFRN)

Natal-RN August 2018

Silva, Jefferson Igor Duarte.

An AI based tool for networks-on-chip design space exploration / Jefferson Igor Duarte Silva. - 2018. 91f.: il.

Dissertation (master's) - Federal University of Rio Grande do Norte, Center of Exact and Earth Sciences, Graduate Program in Systems and Computing. Natal, 2018.

Advisor: Márcio Eduardo Kreutz.

1. Artificial intelligence - Dissertation. 2. Networks-on-chip - Dissertation. 3. Design space exploration - Dissertation. I. Kreutz, Márcio Eduardo. II. Title.

RN/UF/CCET CDU 004.8

Cataloging-in-Publication. UFRN - Biblioteca Setorial Prof. Ronaldo Xavier de Arruda - CCET

I would like to thank my family, including my girlfriend, Miss Thuanny Gomes, for supporting me in everything that I do and need, including my angry moments. I also thank all my friends; without them this master's course would not have been possible. For each moment of despair or wrong simulation, I always had someone to help me and to make a joke about the situation.

This course allowed me to make many friendships, to know other realities, to explore other possibilities and to open my mind. So, I am very grateful to the Federal University, through the Department of Informatics and Applied Mathematics, and to my supervisor, Mr. Márcio Kreutz, for all the support. I had many classes, and in all of them I grew, as a person, as a student and as a teacher. So, once more, I am very grateful to this institution.

I am thankful for the time that I spent in the laboratory (LASIC), where I could live with excellent people. They are full of knowledge and humility, and every day I was able to learn from each one of them.

I need to mention the manufacturing lab (LABMAN/UFRN) and its amazing people, where I learned the practical meaning of the term "research", and where, after each difficulty or doubt, I went to open my mind and listen to other points of view. 80% of my results were obtained via simulations made there (even without any link with DIMAp), and therefore without them this work and the publications involved would not have been possible.

One person must be mentioned by name, Mr. Adilson Oliveira. With him I learned to do everything with the utmost care and dedication.

Finally, I am sure that I forgot some people and I am sorry for it, but I cannot forget all the help that I received during these twenty months.

Author: Jefferson Igor Duarte Silva

Advisor: PhD. Márcio Eduardo Kreutz

Abstract

With the increasing number of cores in Systems on Chip (SoCs), bus architectures have run into performance limitations. As applications demand more bandwidth and lower latencies, buses cannot meet such requirements because of longer wires and increased capacitances. Facing this scenario, Networks-on-Chip (NoCs) emerged as a way to overcome the limitations found in bus-based systems. NoCs are composed of a set of routers and communication links, and each component has its own characteristics. Fully exploring all possible settings of NoC characteristics is unfeasible due to the huge design space to cover. Therefore, methods to speed up this process are needed. In this work, we propose the use of Artificial Intelligence techniques to optimize NoC architectures. This is accomplished by developing an AI based tool that explores the design space by predicting area, latency, and power for different configurations of NoC components. Up to now, nine classifiers have been evaluated. To evaluate this tool, tests were performed on Audio/Video applications with Bit-Reversal, Butterfly, Uniform, Perfect Shuffle, and Transpose Matrix traffic patterns, with four different communication requirements. The first results show an accuracy of up to 88% when predicting latency and up to 100% when predicting area/power values, using Decision Trees. As a second step, a Genetic Algorithm was applied to explore the design space, and the results confirm that the solutions found are valid and meet the constraints of the designer.

List of figures

1 Example of SoC

2 Example of NoC topologies

3 Example of Decision Trees

4 Example of Logistic Regression

5 Example of SVM

6 Non-linear SVM

7 Example of SVR

8 Example of Crossover

9 Example of mutation

10 Example of GA flow

11 Workflow for the entire solution: first it checks the values; if they are correct, the GA is called and runs until it meets the requirements.

12 Class diagram for this work. There are three predictors, one for each NoC metric, and their base class (Predictor), and two classes for the Genetic Algorithm (Population and GA).

13 Example of a file generated by the solution; the name represents its characteristics.

14 Example of NoC Chromosome.

15 Example of two-point crossover where three attributes were copied from mother and father.

16 Dissertation workflow; each step follows chronological order.

17 Weka dataset created with ten attributes, all of them numeric.

20 Result of the CfsSubsetEval analysis of the attributes. The higher the percentage, the more relevant the attribute.

21 CfsSubsetEval using the Evolutionary Search method

22 CfsSubsetEval using the PSOSearch method

23 Area dataset with feature selection

24 Three variations of the application, keeping the same number of packets and size, but varying the required bandwidth.

25 Now, with 8192 packets, the same three variations of the application, keeping the same number of packets and size, but varying the required bandwidth.

26 Now, with 24000 packets per communication flow, the same three variations of the application, keeping the same number of packets and size, but varying the required bandwidth.

27 Now, with 256 packets per communication flow, but size equal to 14x14, and the same three variations of the application, keeping the same number of packets and size, but varying the required bandwidth.

28 Now, with 8192 packets per communication flow, the same three variations of the application, keeping the same number of packets and size, but varying the required bandwidth.

29 Now, with 24000 packets per communication flow and size equal to 14, the same three variations of the application, keeping the same number of packets, but varying the required bandwidth.

30 NoC configuration suggested.

31 For both sizes, 10x10 and 14x14, the GA with self-adaptation reached better results (lower latency values).

32 Analysis of the gap in accuracy for many NoC sizes; in the worst case the error percentage was less than 7%.

List of tables

1 Comparison among cited papers and brief description of them.

2 All possibilities to Design Space Exploration in this work.

3 Comparison between cited papers

4 Classifiers obtained results

5 Comparison between results with all attributes and three methods to attribute selection. The better results were obtained by M5P and Random Forest in all situations.

6 t-test statistical test execution over all classifiers. Four classifiers obtained the same statistical significance.

7 Classifiers configuration used during the experiments.

8 Comparison among classifiers for area Prediction

9 Classifiers configuration used in evaluation

List of abbreviations

IC – Integrated Circuits

SoC – System on Chip

MPSoC – Multiple Processor System on Chip

NoC – Network on Chip

GA – Genetic Algorithm

ML – Machine Learning

ILP – Integer Linear Programming

NI – Network Interface

AI – Artificial Intelligence

TLM – Transaction-level Modeling

RTL – Register-Transfer Level

QoS – Quality of Service

P2P – Point-to-Point

RA – Routing Algorithm

TTL – Time To Live

LR – Logistic Regression

SVM – Support Vector Machine

RBF – Radial Basis Function kernel

EC – Evolutionary Computing

ACO – Ant Colony Optimization

PSO – Particle Swarm Optimization

LS – Local Search

MOSA – Multi-Objective Simulated Annealing

MOPSO – Multi-Objective Particle Swarm Optimization

KPN – Kahn Process Network

NoP – Number of Packets

RB – Required Bandwidth

DSENT – Design Space Exploration of Networks Tool

MLP – Multilayer Perceptron

BFGS – Broyden-Fletcher-Goldfarb-Shanno

RF – Random Forest

CC – Correlation Coefficient

MAE – Mean Absolute Error

RMSE – Root Mean Squared Error

RAE – Relative Absolute Error

RRSE – Root Relative Squared Error

PCA – Principal Component Analysis

SA – Self-Adaptation

Contents

1 Introduction

1.1 Motivation

1.2 Goals

1.3 Contributions of This Work

2 Background

2.1 Network on Chip

2.1.1 Router

2.1.2 Communication Problems

2.1.3 Topology

2.1.4 Communication Mechanisms

2.2 Artificial Intelligence

2.2.1 Methods overview

3 Related Works

3.1 Design Exploration tools

3.2 Paper analysis

4 Methodology

4.1 Solution description

4.2 Latency dataset

4.3 Area and Power datasets

4.6 Genetic Algorithm

4.6.1 Genetic operators

4.6.2 Selection methods

4.6.3 Genetic algorithm configuration

4.7 Comparison with related works

4.8 Dissertation workflow

5 Results

5.1 Methodology Experiments

5.2 Latency Predictor

5.2.1 Features Selection

5.2.2 Latency Predictor Classifiers

5.3 Area and Power Predictor

5.3.1 Feature analysis

5.3.2 Classifiers evaluated

5.4 Analysis of classifiers

5.4.1 Latency Predictor

5.4.2 Area and Power Predictor

5.5 Genetic Algorithm

5.5.1 Time comparison

5.5.2 Parameterization of genetic algorithm

5.5.3 Comparison between optimized and non-optimized GA

5.5.4 Exploring GA self-adaptation

5.6 Discussion

References

Appendix A -- Systematic Review

A.1 Research questions

A.2 Search protocol

A.3 Criteria for selecting articles

A.4 Evaluation of selected studies

A.5 Data extraction

1 Introduction

The evolution of the lithography process allows ever smaller circuits, which makes Integrated Circuits (IC) more complex. Such systems are called Systems on Chip (SoC) (ZEFERINO, 2003; JAIN et al., 2015). This integration is a continuous process, and it therefore allows a continuous increase in the complexity of the circuits involved. At a certain point, systems with multiple cores on a single silicon die emerged, creating the Multiple Processor System on Chip (MPSoC) (WOLF; JERRAYA; MARTIN, 2008). Along with this evolution, some limitations became evident, in particular in the communication architecture.

Initially, the bus architecture was the best solution (NI; MCKINLEY, 1993), but due to its physical characteristics it has shown insufficient performance for novel high-performance MPSoC architectures. Researchers therefore proposed new communication architectures focused on high bandwidth, low latency, component reuse and scalability, as mentioned by Guerrier and Greiner (2000) and Benini and Micheli (2002). As a result, the concept of Network on Chip (NoC) was created, which uses some elements from Ethernet networks, such as routers and links (BJERREGAARD; MAHADEVAN, 2006; ABBAS et al., 2014).

However, this new approach brought challenges and opened a new research field, because there is a wide range of parameters to optimize; for instance, one can change the operating frequency of the routers and achieve a better result. What to optimize and how to optimize it become important questions, but optimization requires evaluation and analysis, both of which demand time. It is a combinatorial problem that the literature places in the NP-hard class (PAPADIMITRIOU; STEIGLITZ, 1998; BELLO et al., 2016). Consequently, there is no known algorithm that solves this problem in polynomial time (JUAN et al., 2015).

This problem leads to a high number of possibilities to evaluate. With only four attributes, each having four options, there are two hundred fifty-six possibilities; with six attributes of four options each, there are four thousand and ninety-six. Thus, the larger the problem, the higher the number of possibilities, and it is not feasible to evaluate them exhaustively.
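The growth described above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `design_space_size` and the idea of encoding each attribute by its number of options are assumptions made for the example, not part of the proposed tool.

```python
from itertools import product
from math import prod


def design_space_size(options_per_attribute):
    """Number of distinct configurations: the product of the option counts."""
    return prod(options_per_attribute)


# Four attributes with four options each, then six attributes with four options each.
four_attributes = design_space_size([4] * 4)
six_attributes = design_space_size([4] * 6)

# Exhaustive enumeration is still possible for tiny spaces, but it grows exponentially.
all_configs = list(product(range(4), repeat=4))

print(four_attributes, six_attributes, len(all_configs))
```

Each additional attribute with four options multiplies the number of candidates by four, which is why exhaustive evaluation quickly becomes impractical.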

To overcome this restriction and explore the design space adequately, many techniques are employed, such as Genetic Algorithms (GA), Machine Learning (ML), Integer Linear Programming (ILP) and heuristics (FARAGARDI; SHOJAEE; YAZDANI, 2012; OST et al., 2013; TEI et al., 2013; WU et al., 2017). Nevertheless, each case may need to be evaluated individually to decide which technique to apply. In this context, optimization is always directed at decreasing the communication latency (increasing performance), improving energy efficiency, or reducing costs (area, manufacturing or design costs).

Figure 1 presents an example of SoC.

Figure 1: Example of SoC, the yellow blocks represent the SoC components, the rectangles the routers of the NoCs and the arrows the links.

Figure 1 shows an example of SoC, where the yellow blocks are the SoC components, the blue rectangles are the NoC routers, each red line represents a connection from the Network Interface (NI) component to a router and the black lines interconnect routers.

1.1 Motivation

NoC design requires several choices, such as the topology definition, the routing strategy and the mapping of the application to cores (MARCULESCU; HU; OGRAS, 2005; SAHU; CHATTOPADHYAY, 2013), among others. In the literature, many studies concentrate on optimizing NoCs, as discussed in Le et al. (2017) and Manokaran and Khalid (2017), and as listed below:

• Proposed architectures: novel architectures, routing algorithm;

• Energy efficiency: reduction of clock frequency, creation of irregular topology;

• Fault tolerance: addition of extra components, additional links;

• Communication: wireless links, optical NoCs, photonic NoCs, mapping problem.

Besides these categories, many authors paid attention to the communication problem, since communication latency is one of the performance indicators: as the latency decreases, the performance tends to improve (QIAN et al., 2014).

Faced with this restriction, different approaches have been tried to minimize these costs, such as the use of Artificial Intelligence (AI) methods (SANGAIAH; HEMPSTEAD; TASKIN, 2015; ZHANG et al., 2016). The goal is to find a good solution, but we cannot affirm that it is the best solution, because that statement would require evaluating the whole design space, which is what we want to avoid. As previously stated, no polynomial-time algorithm is known for this problem; hence, evaluating all possibilities can demand an unfeasible amount of time. Thus, some methods must be used to speed up these tests.

Traditional approaches rely on simulations to run the architectures and obtain, for instance, performance figures. However, precise results (at cycle level) can only be achieved when simulations run at lower abstraction levels, which in turn take too much time because of the high computational effort needed to simulate all components at each clock cycle. To overcome this, many approaches rely on simulations performed at higher abstraction levels. Although this brings faster simulation times, results can be compromised by lower accuracy (HSU et al., 2015). As an example, merely by using the (less accurate) Transaction-level Modeling (TLM) abstraction level, the simulation of an 8x8 mesh is 81 times faster than its Register-Transfer Level (RTL) version, according to Lehtonen, Salminen and Hämäläinen (2010).

Therefore, the present work investigates how to choose an optimized NoC configuration for a given application, without needing to rely on RTL simulations, aiming at minimizing the average latency.

1.2 Goals

In this work, we aim to investigate the possibility of reaching optimized NoC configurations without having to simulate at cycle precision, while keeping the highest possible accuracy. We believe that this can be achieved by using Artificial Intelligence to learn both the behaviour of the NoC’s characteristics and the characteristics of the applications and, based on this, to predict the average latency for other similar architectures. To achieve this, we list the following specific goals:

• Design a NoC model to describe the network characteristics;

• Analyze the application communication graph to extract features to use as input;

• Acquire application instances for the training set;

• Train several classifiers to compare their prediction accuracy;

• Validate the results by comparing them with the RTL simulator RedScarf;

• Implement a Genetic Algorithm to search for the best NoC solution (latency-based);

• Investigate the time required for design space exploration using the proposed tool and the RedScarf/DSENT simulation tools.

1.3 Contributions of This Work

This work contributes to optimizing the exploration of the design space, with the intent of reducing the time needed to find a solution that meets the application's time requirements or, failing that, the best solution the tool can provide. Our proposed technique provides a solution optimized in terms of physical components and, indirectly, power.

A classifier is used to predict the application's average latency based on the communication graph of the application and the characteristics of the desired NoC. As a second step, a GA is used to choose the NoC characteristics, aiming at reducing the average latency. This approach is not novel in the literature; Chapter 3 presents some works with similar goals. According to Qian et al. (2016), there are three open problems, all of which are addressed in our approach:

• Assumptions in current queuing models (analytical models) are very tight, not supporting a wide range of traffic patterns: this work proposes the use of seven NoC characteristics and two application attributes to predict a reliable average latency;

• Efficient resource management strategies: these can be reached through genetic algorithms, by changing NoC parameters to comply with the application latency requirements;

• Scalability challenge in NoC simulations: the authors state that simulating NoCs with more than one hundred routers is complicated because of the time required. The proposed software was trained with up to 256 routers, and it is still possible to increase this number, because with our approach it is not necessary to simulate the NoC behaviour on every single iteration of the Genetic Algorithm, only what is required to feed the classifier.

This approach also has some restrictions: only static task mapping is used, so all applications are known at design time; and the experiments used regular and homogeneous NoCs, optimizing only physical resources.
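The two-step idea described in this section (a predictor that estimates latency, and a GA that searches NoC parameters using that predictor instead of a simulator) can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool developed in this work: the parameter names and option values in `OPTIONS` are hypothetical, and `predict_latency` is a toy formula standing in for a trained model.

```python
import random

# Hypothetical design space; attribute names and values are illustrative only.
OPTIONS = {
    "buffer_depth": [2, 4, 8, 16],
    "frequency_mhz": [100, 250, 500, 1000],
    "link_width_bits": [16, 32, 64, 128],
    "virtual_channels": [1, 2, 4, 8],
}
GENES = list(OPTIONS)


def predict_latency(cfg):
    """Toy stand-in for a trained latency predictor (an assumption, not the real model)."""
    return (1e4 / cfg["frequency_mhz"] + 64 / cfg["link_width_bits"]
            + 8 / cfg["buffer_depth"] + 4 / cfg["virtual_channels"])


def random_config(rng):
    return {g: rng.choice(OPTIONS[g]) for g in GENES}


def two_point_crossover(mother, father, rng):
    """Copy the genes between two cut points from the father, the rest from the mother."""
    i, j = sorted(rng.sample(range(len(GENES) + 1), 2))
    return {g: (father if i <= k < j else mother)[g] for k, g in enumerate(GENES)}


def mutate(cfg, rng, rate=0.1):
    """Redraw each gene with a small probability."""
    return {g: (rng.choice(OPTIONS[g]) if rng.random() < rate else v)
            for g, v in cfg.items()}


def tournament(population, rng, k=3):
    """Pick the lowest-latency individual among k random candidates."""
    return min(rng.sample(population, k), key=predict_latency)


def run_ga(generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        elite = min(population, key=predict_latency)  # elitism: keep the best so far
        children = [
            mutate(two_point_crossover(tournament(population, rng),
                                       tournament(population, rng), rng), rng)
            for _ in range(pop_size - 1)
        ]
        population = [elite] + children
    return min(population, key=predict_latency)


best = run_ga()
print(best, predict_latency(best))
```

Because each fitness evaluation is a cheap function call rather than a cycle-accurate simulation, the cost per GA iteration stays small; the accuracy of the search then depends entirely on how well the predictor was trained.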

This work is organized as follows. Chapter 2 describes Networks on Chip and Artificial Intelligence concepts. Chapter 3 introduces the related works on AI and NoC, focused on optimization. Chapter 4 presents the methodology of the experiments. Chapter 5 presents the obtained results. Chapter 6 presents our conclusions and proposed future work.

2 Background

The evolution of the lithography process, which allowed smaller circuits to appear, has sustained Moore’s law. This advance has made it possible to incorporate more circuits (cores, memories, buses, digital signal processors) in a single die. These systems are called Systems on Chip (ZEFERINO, 2003). The benefits of the SoC approach are numerous, including improvements in system performance, cost, size, power dissipation, and design turn-around time (LAHIRI; RAGHUNATHAN; DEY, 2004; MAQSOOD et al., 2015).

SoCs traditionally perform specific tasks. Another characteristic is that their design is guided by the principle of consuming the lowest possible energy (BENINI; MICHELI, 2002; YANG et al., 2017). Despite this, they commonly have hard constraints on performance and Quality of Service (QoS).

Despite the hardware evolution, applications are also growing (requiring more memory and faster processing), so the hardware needs to keep evolving to meet their requirements. In this context arose the Multiprocessor System-on-Chip, which uses multiple programmable processors as system components. MPSoCs are commonly used in communications, networking, and multimedia, among other applications (WOLF; JERRAYA; MARTIN, 2008).

In both cases, SoC and MPSoC, communication is a requirement. This communication occurs over bus architectures, either point-to-point (P2P) or multipoint. According to Carara, Calazans and Moraes (2014), the performance of a system, especially a multiprocessor one, heavily depends upon the efficiency of its bus architecture. In a System on Chip, the bus architecture can be devised with advantages such as shorter propagation delay (resulting in a faster bus clock) and multiple buses (RYU; SHIN; MOONEY, 2001; YANG; ANDRIAN, 2015).

According to Zeferino (2003), when using a multipoint bus, the longer the bus wire, the higher its energy consumption. Beyond that, the parasitic capacitance also increases with the bus length, and a larger capacitance directly affects the operating frequency. P2P buses offer the advantage of a shorter length and, hence, a higher operating frequency. Because of the characteristics listed above, bus architectures have some restrictions, such as low reuse, since each MPSoC has its own characteristics and requirements. A bus applied to an MPSoC used in the biomedical area could be very different from the bus applied to an MPSoC used in an unmanned aerial vehicle, which makes this component difficult to reuse. Another drawback is the lack of scalability, as the degradation is unacceptable when the number of cores in a SoC exceeds a dozen (BERTOZZI; BENINI, 2004; PARK et al., 2017).

This scenario has led to the emergence of alternative technologies, such as Networks on Chip. NoCs emerged as a way to overcome the limits of bus architectures (BENINI; MICHELI, 2002; REZAEI et al., 2016). Compared with the bus approach, NoCs support multiple simultaneous transmissions, resulting in more efficient utilization of network resources (TAYAN, 2009). NoCs share many concepts with traditional Ethernet networks. The main components of this architecture are the routers and the links (ZEFERINO, 2003); the router is the element that connects a core to the network. Each router can have one or more cores connected to it, and it also has other interfaces to connect to the rest of the network.

2.1 Network on Chip

Networks on Chip are a new paradigm for designing core-based Systems on Chip that supports a high degree of reusability and is scalable (LEI; KUMAR, 2003; GAUR et al., 2015). According to Kumar et al. (2002) and Budiarto, Siregar and Stiawan (2017), NoCs give the designer two major advantages: the possibility of developing the hardware resources independently, as stand-alone blocks, and creating the NoC by connecting the blocks as elements in the network; and a flexible platform that can be adapted to the needs of different workloads, while maintaining the generality of application development methods and practices.

A NoC consists of a set of resources connected by channels, so this infrastructure provides communication among the cores. It has basically three components: routers, cores and links. Each piece and its physical organization are discussed in detail in the next sections.

2.1.1 Router

The function of a router is to forward packets/messages. Routers are composed of a crossbar, logic to handle the routing process, arbitration, and ports (ZEFERINO, 2003). Ports are used to connect to other routers and enable communication; at this point the routing algorithm is used to handle the whole process of forwarding packets.

A Routing Algorithm (RA) must implement some technique, for instance XY (deterministic), West-First (partially adaptive) or Odd-Even (QIAN et al., 2016; CHIU, 2000). There is also a large number of papers focused on optimizing the routing process (AHMED; ABDALLAH, 2012; EBRAHIMI et al., 2012; SCHLEY et al., 2016; VALINATAJ; SHAHIRI, 2016; ZHOU et al., 2016). Besides that, the RA can cause communication problems, as can be seen in the next subsection.
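As an illustration of the deterministic XY technique mentioned above, the sketch below computes the path of a packet in a 2D mesh by first correcting the X coordinate and only then the Y coordinate. The function name `xy_route` and the use of `(x, y)` tuples to identify routers are assumptions made for this example.

```python
def xy_route(src, dst):
    """Return the list of routers visited from src to dst using XY routing."""
    x, y = src
    path = [(x, y)]
    # First travel along the X dimension until the destination column is reached.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Then travel along the Y dimension until the destination row is reached.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))
```

Because every packet between the same pair of routers always takes the same path, the algorithm is deterministic; the number of hops equals the Manhattan distance between source and destination.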

2.1.2 Communication Problems

Routers can cause three problematic situations in communication: starvation, deadlock and livelock. Starvation happens when many packets require the same output port but, because of the priorities, one message is never sent, so its communication never occurs. To fix this problem, an adequate arbitration policy can be used.

Deadlock occurs when the packets in a router cannot be forwarded because the next router is itself waiting on another router. Communication then stops, because the routers depend on one another, forming a cycle.

Livelock is the opposite situation: packets are forwarded indefinitely because the best path is not available, so they follow alternative paths but never reach the destination. In traditional computer networks, the IP header has a field called Time To Live (TTL) that prevents this: the sender sets an initial value (at most 255) and each router that forwards the packet decreases it by 1. When the TTL reaches 0, the packet is discarded.
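The effect of a hop limit such as TTL on livelock can be illustrated with a small simulation. This is a simplified sketch, not a model of a real IP stack: the function `forward` and the `next_hop` table (a dictionary mapping each node to the node it forwards to) are hypothetical names introduced for the example.

```python
def forward(ttl, next_hop, start, destination):
    """Forward a packet hop by hop; drop it once its hop budget (TTL) is exhausted."""
    node, hops = start, 0
    while node != destination:
        if ttl == 0:
            return ("dropped", hops)   # the packet is discarded instead of looping forever
        ttl -= 1                       # each forwarding consumes one unit of TTL
        node = next_hop[node]
        hops += 1
    return ("delivered", hops)


# A routing loop A -> B -> C -> A that never reaches D: a livelock without a hop limit.
looping_routes = {"A": "B", "B": "C", "C": "A"}
# A healthy route A -> B -> D.
healthy_routes = {"A": "B", "B": "D"}

print(forward(255, looping_routes, "A", "D"))
print(forward(255, healthy_routes, "A", "D"))
```

Without the hop budget, the first call would never terminate; with it, the looping packet is removed from the network after a bounded number of forwardings.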

The set of routers and links forms a topology, as discussed in the next subsection.

2.1.3 Topology

Network topologies define how the nodes are interconnected. If every node is connected directly to every other node, the network topology is fully connected (NI; MCKINLEY, 1993;

Figure 2: Different representations of NoCs: (a) Spin, (b) Mesh 2D, (c) Torus, (d) Double Torus, (e) Octagon, (f) Fat Butterfly Tree. Images from Concer (2008).

Figure 2 shows three NoC components:

• Processing element, or core: the white rectangles in the figure;

• Router: routers are connected to the cores; they are the black rectangles in the figure;

• Communication link: the lines connecting the routers in the figure. Each link can have a different bandwidth.

Therefore, any network has at least these elements, which may vary in the form of connection, quantity or physical organization. Figure 2 shows six examples of topologies. In Spin, for instance, a router can have one or more cores connected to it. In Torus, each router needs five ports (four to connect to other routers and one to the local core); each node is connected to its nearest neighbors, and corresponding nodes on opposite edges are also connected (SILVA; OLIVEIRA; MORAES, 2014).
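The difference between a mesh and a torus can be made concrete by computing the neighbors of a router. The sketch below is only illustrative; the function names and the `(x, y)` addressing are assumptions made for the example. The wrap-around links of the torus are obtained with modular arithmetic.

```python
def mesh_neighbors(x, y, cols, rows):
    """Neighbors in a 2D mesh: border routers have fewer than four neighbors."""
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(cx, cy) for cx, cy in candidates if 0 <= cx < cols and 0 <= cy < rows]


def torus_neighbors(x, y, cols, rows):
    """Neighbors in a 2D torus: opposite edges are connected, so every router has four."""
    return [((x + 1) % cols, y), ((x - 1) % cols, y),
            (x, (y + 1) % rows), (x, (y - 1) % rows)]


# Corner router of a 4x4 network.
print(mesh_neighbors(0, 0, 4, 4))
print(torus_neighbors(0, 0, 4, 4))
```

In the torus every router has exactly four router-to-router links, which together with the local core port gives the five ports mentioned above.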

The next section covers the internal router components, such as the arbiter and the buffer, and the switching techniques.

2.1.4 Communication Mechanisms

Flow control deals with the allocation of links and buffers to a packet as it travels along a path through the network. A good flow control policy avoids link congestion while reducing the network latency (NI; MCKINLEY, 1993; BASU et al., 2017). The main techniques for flow control in NoCs are packet-based flow control and flit-based flow control (KIM et al., 2016).

Inside the router there are two more internal structures: the arbiter and the buffer. The arbiter handles the incoming packets, which can be forwarded or wait in the buffer. It can be centralized or distributed, which is a design choice. The choice of arbiter directly impacts performance (WANG et al., 2009; JAIN et al., 2015). One of the functions of the arbiter is to resolve all conflicting requests for the same output port. A wrong choice of arbitration policy may lead to performance degradation (WISSEM et al., 2011; KAMAL; AROSTEGUI, 2016).

According to Nair and Habeeb (2016), an efficient arbiter design for a NoC should have these characteristics:

• Avoid the starvation problem;

• Decrease the average packet delay in the NoC;

• Reduce the required buffer length and consume as few resources as possible.

The main kinds of arbiter are listed below.

• Centralized: each router has one arbiter for all ports. When a packet arrives in some input buffer, the arbiter is contacted to handle the packet. If the output port is free, the packet is forwarded to this port; if the output port is busy, the packet waits to be sent (SILVA; OLIVEIRA; MORAES, 2014);

• Distributed: each port of the router has an arbiter. Thus, when a packet arrives, the arbiter receives a signal to operate. If the desired output port is free, the packet is forwarded, but if the output port is busy, the packet waits in the buffer (some topologies implement buffers in both output and input ports; in this case, the packet waits in the output buffer) (NAIR; HABEEB, 2016).

Another role of the arbiter is priority control. Because a NoC can be used in hard real-time applications, the packets must meet their deadlines. To address this, there are two main arbiter categories, the Fixed Priority Arbiter and the Round-Robin Arbiter (SUBHASAKTHE; MANOJ; MITHUNRAJ, 2014).

The Fixed Priority Arbiter is the simplest form of arbiter: it uses a priority order, determined at design time, to grant access to a shared resource. The problem with this approach is the possibility of starvation (WANG et al., 2009; JAIN et al., 2015). Another option to control the priorities is the Round-Robin arbiter, which provides a high degree of fairness among the agents (LIU; JIN; LAI, 2013).
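The contrast between the two policies can be seen in a small behavioural sketch. The class names, the `grant` method and the representation of requests as a list of booleans (one per input port) are assumptions made for this illustration; they do not describe a specific hardware implementation.

```python
class FixedPriorityArbiter:
    """Grants the requesting port with the lowest index (highest fixed priority)."""

    def grant(self, requests):
        for port, requesting in enumerate(requests):
            if requesting:
                return port
        return None


class RoundRobinArbiter:
    """Grants the next requesting port after the one served most recently."""

    def __init__(self, n_ports):
        self.n_ports = n_ports
        self.last = n_ports - 1  # so that port 0 is checked first

    def grant(self, requests):
        for offset in range(1, self.n_ports + 1):
            port = (self.last + offset) % self.n_ports
            if requests[port]:
                self.last = port
                return port
        return None


# Ports 0 and 2 request the same output on every cycle.
requests = [True, False, True]
fixed, rr = FixedPriorityArbiter(), RoundRobinArbiter(3)
fixed_grants = [fixed.grant(requests) for _ in range(4)]
rr_grants = [rr.grant(requests) for _ in range(4)]
print(fixed_grants, rr_grants)
```

Under persistent contention the fixed-priority policy keeps serving port 0, so port 2 starves, while the round-robin policy alternates between the two requesters.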

According to Zoni, Flich and Fornaciari (2016), a buffer is a memory region used as temporary data storage. Each application may require a different buffer size. On the other hand, increasing the buffer size increases the area needed for the router. For instance, by increasing the buffer size at each input channel from 2 to 3 flits, the router area of a 4x4 NoC increases by 25% or more (ZEFERINO, 2003; ZONI; FLICH; FORNACIARI, 2016). As another example, according to Kundu (2006), router buffers are responsible for 46 percent of router power. Thus, the overall use of buffering resources has to be minimized to reduce the implementation overhead in NoCs (MARCULESCU; HU; OGRAS, 2005; DITOMASO et al., 2013). For the buffer size allocation problem there are some algorithms, but they are valid only under certain circumstances (HU; MARCULESCU, 2004; OVEIS-GHARAN; KHAN, 2015).

As in Ethernet networks, NoCs have different ways to forward packets over the network. The switching technique determines when the routing decisions are made, how the switches inside the routers are set/reset, and how the packets are transferred along the switches (MARCULESCU; HU; OGRAS, 2005). The schemes addressed in this dissertation are circuit switching and wormhole switching.

• Circuit switching: before the transmission, the complete path used to transmit the message from source to destination is reserved. Consequently, no other node can transmit on the same path during the transmission, so the path is truly reserved. When the entire message arrives at the destination, the channel is freed for other messages (PAKDAMAN; MAZLOUMI; MODARRESSI, 2015). Telephone networks used this method, but there is current research using this scheme for NoCs (HANSSON; GOOSSENS; RĂDULESCU, 2007; JERGER; PEH; LIPASTI, 2008; LIU; JANTSCH; LU, 2012). Therefore, circuit switching is an alternative, despite its implementation complexity and static nature (MARCULESCU; HU; OGRAS, 2005; LOTFI-KAMRAN; MODARRESSI; SARBAZI-AZAD, 2016);

• Wormhole packet switching: a packet is divided into a number of flits for transmission. The header flit governs the route. As the header advances along the specified route, the remaining flits follow in sequence. If the header flit encounters a channel already in use, it is blocked until the channel becomes available (NI; MCKINLEY, 1993; ABDALLAH et al., 2015). According to Marculescu, Hu and Ogras (2005), in data networks, wormhole routing performs better than circuit switching under dynamic traffic.

However, the wormhole switching technique has a drawback: the channel remains blocked until the entire message arrives at the destination. Only when the whole packet has been transmitted and has arrived at the destination core can the used links serve another communication (distinct messages cannot be interleaved over a physical channel) (PANDE et al., 2005).

2.2 Artificial Intelligence

This section introduces some concepts and categories, such as supervised and unsupervised techniques, and then presents some examples of classifiers and how they work.

2.2.1 Methods overview

Much research has been conducted in an attempt to find methods to optimize NoCs, as can be seen in Section 1.1. One possibility is to use Artificial Intelligence to meet these demands. In categorical terms, there are two classes of techniques: supervised and unsupervised (COATES; NG; LEE, 2011).

Supervised techniques require a training set containing the inputs and the expected outputs; the role of the classifier is to learn how to reach the expected result. In the optimal scenario, the classifier will correctly determine the output for unseen instances (SOMVANSHI; CHAVAN, 2016). In addition, the more instances available, the better the learning tends to be. Some supervised methods are listed below.

Decision Trees (DT) use a decision model and its possible consequences (for instance, add 870 to the value and continue down the tree). For each question answered, the algorithm makes a choice and performs a calculation. One advantage is that they are simple to understand and interpret (BARROS et al., 2012; BUSCHJÄGER; MORIK, 2017). According to Vlahovic (2016), decision trees are inspired by the mathematical concept of a graph. As can be seen in Figure 3, there is a chain of questions; each question leads to another decision and, in the end, a class is obtained. In this case, the R759 attribute has two

Figure 3: Example of a Decision Tree. There is a flow to follow: each attribute is evaluated and, depending on the option chosen, the instance is classified.

conditions: less than, or greater than or equal to, 33738.2. If the value is greater than or equal to this threshold, the instance is classified as “SES”; if it is smaller, the method goes on to analyze the other attributes, and all attributes will be analyzed. Despite this, the method is relatively fast at classification.
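The chain of questions described above can be written as nested conditions. The sketch below is only an illustration: the root test on R759 and the "SES" label come from the text, while the second attribute (`A2`), its threshold and the remaining class labels are hypothetical placeholders, not values from Figure 3.

```python
from typing import Dict


def classify(instance: Dict[str, float]) -> str:
    # Root question from the text: R759 >= 33738.2 leads directly to "SES".
    if instance["R759"] >= 33738.2:
        return "SES"
    # Hypothetical second question; attribute name, threshold and labels
    # are placeholders chosen only to show how the chain continues.
    if instance["A2"] < 10.0:
        return "CLASS_A"
    return "CLASS_B"


print(classify({"R759": 40000.0, "A2": 0.0}))   # SES
print(classify({"R759": 1200.0, "A2": 3.5}))    # CLASS_A
```

Classification only walks one root-to-leaf path, which is why the method is fast once the tree has been built.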

Naive Bayes Classification is a probabilistic classifier based on Bayes’ theorem; it assumes strong independence among features (TURHAN et al., 2009; POON et al., 2017). According to Gong and Yu (2010), Naive Bayes has very high learning efficiency, and it can estimate all the probabilities with a single scan of the training data.

Logistic Regression (LR) is a regression model that can handle the situation where the dependent variable is categorical (WANG et al., 2016).

Figure 4 shows an example of linear Logistic Regression. The line divides the points into two groups (above or below the line). There is also another type of LR, non-linear Logistic Regression, but it is not shown in this work.

Another supervised method is the Support Vector Machine (SVM). It is a binary algorithm that can be used for classification or regression. In a dimensional space, each point is a data instance. The algorithm traces a line separating the two types of points. Finally, the method tries to maximize the distance between the line and the two groups, while minimizing penalties (BHASKAR; SINGH; NAYAK, 2014; FENG et al., 2016).

Figure 4: Example of LR. A plane contains several data points, and the classifier traces a line trying to divide them into two classes.

Figure 5: Example of an SVM applied to a plane with many data points. The method generates a line to split the data into two groups: above or below the line. In this case, the line cannot correctly separate all points.

Figure 5 shows a linear SVM. The line divides the points into two groups (green and red). In this case, the method may categorize some points incorrectly (as can be seen in the figure). This happens because the model can use only a straight line. Figure 6 shows an example of a non-linear SVM.

Figure 6: A non-linear SVM applied to a plane with several data points. As can be seen, the classifier used a line, although not a straight one, to try to split the groups of data.

In Figure 6, two groups of points can be seen (blue and brown points). Besides the points, there are two lines that cut the plane, dividing the points. Because of this, this model could achieve a better result than the linear model.

Figure 7: A regression model with three kernels applied: RBF, linear and polynomial. The RBF is the most accurate option (its line is closest to the points), while the linear and polynomial models cannot fit the data correctly.

When used for regression, the method is called Support Vector Regression. It implements several kernels, and Figure 7 presents three examples. It is possible to notice that the polynomial model is not adequate for this dataset; on the other hand, the Radial Basis Function (RBF) kernel fits better than the linear model.

The other category is unsupervised techniques. This class of algorithms needs to find relationships, patterns or categories in the input data and transform them into output. An example of an unsupervised method is clustering, in which the instances do not have a label, only the features. The classifier finds the differences between the instances and gives a precise output (MUKHOPADHYAY et al., 2014; LU et al., 2016). An example of a clustering algorithm is K-means (XIONG et al., 2016).
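To make the clustering idea concrete, the following is a minimal K-means sketch in plain Python. It is an illustration only: the data points are invented, and the deterministic initialisation (taking the first k points as centroids) is a simplifying assumption rather than the usual random start.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def squared_distance(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def k_means(points: List[Point], k: int, iterations: int = 20) -> Tuple[List[Point], List[int]]:
    # Assumption of this sketch: the first k points are the initial centroids.
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return centroids, labels


# Unlabelled instances: only the features are given.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, labels = k_means(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

No label is supplied at any point; the two groups emerge only from the distances between the feature vectors.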

Evolutionary Algorithms are techniques inspired by biological evolution and the behaviors of organisms (FOGEL, 2006). As in the theory of evolution, where the stronger survives, here the stronger individual tends to be the best solution to a given problem. Another name for this class of methods is Evolutionary Computing (EC). Some examples of these methods are Ant Colony Optimization (ACO), Particle Swarm Optimization (PSO), and the Genetic Algorithm (GA). Of these, the most used is the GA (RAHMAT-SAMII; GIES; ROBINSON, 2003). According to Chou, Jiau and Huang (2016), the benefits of PSO include easy implementation and effective solution convergence.

A GA receives as input a population in which each element is encoded to represent a solution to the given problem. The encoding process discretizes the features, using 0 and 1 to mark the presence or absence of each feature. A gene does not necessarily represent all features, and a chromosome is composed of a set of genes. The algorithm defines the way in which the genetic information of all individuals is combined to generate a new set of chromosomes (RAHMAT-SAMII; GIES; ROBINSON, 2003).

The combination can be made using crossover or mutation, as shown in Figures 8 and 9. Selection, mutation and reproduction are known as genetic operators (PEREIRA; GRASSI; NABETA, 2015).

Figure 8: A third chromosome is created based on crossover operation between two other individuals.

In Figure 8, there are two initial chromosomes (the red and green bars). After the crossover, there is a third one, partially red and partially green, which is the result of combining the two initial chromosomes.

Figure 9 shows an example of mutation: a random change occurs in only one chromosome, instead of two. Usually, mutation occurs less often than crossover in a population;

Figure 9: A single individual changes one or more characteristics in an attempt to reach a good solution.

mutation is responsible for new individuals that do not necessarily inherit any characteristic, and it is done for the purpose of trying to make a big change.
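The two operators can be sketched on binary chromosomes as follows. This is a generic illustration, not the operators implemented in this work: one-point crossover and bit-flip mutation are assumed here as common variants, and the function names are hypothetical.

```python
import random
from typing import List


def one_point_crossover(parent_a: List[int], parent_b: List[int], cut: int) -> List[int]:
    """Child takes genes [0, cut) from parent_a and [cut, end) from parent_b."""
    return parent_a[:cut] + parent_b[cut:]


def bit_flip_mutation(chromosome: List[int], rate: float, rng: random.Random) -> List[int]:
    """Flip each binary gene independently with probability `rate`."""
    return [1 - gene if rng.random() < rate else gene for gene in chromosome]


red = [1, 1, 1, 1, 1, 1]      # first parent (the "red" bar)
green = [0, 0, 0, 0, 0, 0]    # second parent (the "green" bar)
child = one_point_crossover(red, green, cut=2)
mutant = bit_flip_mutation(child, rate=0.1, rng=random.Random(42))
print(child)   # [1, 1, 0, 0, 0, 0]: partially red, partially green
print(mutant)
```

Crossover needs two parents and only recombines existing genes, whereas mutation acts on a single chromosome and can introduce values that neither parent had.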

The learning process occurs through adaptation, and adaptation results in optimization. In each iteration, the chromosomes are submitted to the fitness function. This function defines how good each possible solution is and selects the individuals for a new population; each population needs to be better than the previous one (evolution process). After that, every chromosome has a value, by which all of them are ranked. The lowest-ranked ones undergo some adaptation method to try to improve their fitness value in the next evaluation (QIU et al., 2015).

Figure 10 summarizes the Genetic Algorithm flow. First, the population is initialized. If the desired result is reached, the algorithm stops; if not, the fitness function is applied and the population is evaluated. The next step is population reproduction and variation, where crossover and mutation happen. The following steps are optional. The Algorithm Adaptation phase involves analyzing the search process and deciding the values of the algorithm parameters (ROSCA; BALLARD, 1994). The Local Search (LS) aims to improve a solution based on its neighborhood. After these steps, the solution found is analyzed and the algorithm may stop or execute more iterations.
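The mandatory part of this flow (initialization, evaluation, stop test, reproduction and variation) can be sketched as a short loop. The code below is a simplified illustration under stated assumptions: it minimizes a toy objective, keeps the best half of the population as parents, and omits the optional Algorithm Adaptation and Local Search phases. All names and parameter values are hypothetical.

```python
import random
from typing import Callable, List

Chromosome = List[int]


def run_ga(fitness: Callable[[Chromosome], float], n_genes: int, pop_size: int,
           max_generations: int, target: float, rng: random.Random) -> Chromosome:
    # Initialise a random binary population.
    population = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(max_generations):
        # Evaluate and rank (lower fitness is better in this sketch).
        population.sort(key=fitness)
        if fitness(population[0]) <= target:
            break  # desired result reached
        # Reproduction and variation: keep the best half, refill with children.
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)
            child = a[:cut] + b[cut:]            # one-point crossover
            if rng.random() < 0.2:               # occasional mutation
                pos = rng.randrange(n_genes)
                child[pos] = 1 - child[pos]
            children.append(child)
        population = parents + children
    return min(population, key=fitness)


# Toy objective (an assumption for illustration): minimise the number of 1 genes.
best = run_ga(fitness=sum, n_genes=12, pop_size=20, max_generations=200,
              target=0, rng=random.Random(1))
print(best, sum(best))
```

The loop ends either when the target fitness is reached or when the generation budget is exhausted, mirroring the two exits of the flow in Figure 10.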

Usually, Evolutionary Algorithms (EA) have been used for the automatic processing of large amounts of raw data to discover significant information (MUKHOPADHYAY et al., 2014). In this context, the first step in using EAs is to encode the problem in chromosome format. This chromosome can be represented by a vector, where each position is a

Figure 10: Operation flow of the GA.

3 Related Works

This chapter discusses works related to tools for NoC design space exploration that use Artificial Intelligence methods. Appendix A presents the protocol for the systematic review.

3.1 Design Exploration tools

The main goal of this dissertation is to develop a machine learning-based tool for NoC design space exploration. That said, this section presents some papers that share this goal. Every cited paper uses only regular topologies; irregular topologies are out of scope.

Wang et al. used ACO to optimize the number of virtual channels for a given regular NoC. This optimization is made after analyzing the application graph and trying to detect wasted resources. However, this work restricts the NoC configuration to the XY routing algorithm and does not support varying other NoC attributes. To analyze the obtained results, they adopted the MPEG4 and VOPD applications to compare performance and resource savings. The gains were up to 33% and 24%, respectively, for the evaluated applications (WANG et al., 2013).

Another work that optimizes the number of virtual channels is that of Rout et al., who developed a solution that analyzes the traffic at each router and determines how many virtual channels are needed. This implementation was made in hardware and does not use any machine learning resource; instead, it uses two units: the Power Management Controller (PMC) and the Utilization Computation Unit (UCU) (ROUT et al., 2018). To compare their technique with a baseline router, they used the Noxim simulator with six traffic patterns (Bit-reversal, Butterfly, Random, Shuffle, Transpose1, and Transpose2), and the results showed a power reduction of up to 83% with a throughput penalty of only 4.2%.

Sangaiah, Hempstead, and Taskin implemented a regression model to explore the design space. The main goal achieved was finding the best NoC configuration for a given CPU (single-core or multicore). This work has two steps: first, the solution explores the CPU architecture, and then it finds the optimized NoC. Because the CPU cache memory influences the overall system performance (codependent impact on performance), the authors sought a trade-off among the quantities of each resource. The proposed solution uses nine attributes to predict performance. To account for the required area, it used Cacti (cache memory) and Orion 2.0 (NoC) with 65nm technology.

The achieved accuracy is up to 88.2%. Another result was the reduction in simulation time, from 1370 years (using gem5) to 180 hours (using the proposed work); however, neither the use of validation techniques nor the NoC attributes that were optimized were reported (SANGAIAH; HEMPSTEAD; TASKIN, 2015).

Another work is that of Zhang et al., who created an approach to explore the NoC design space using a Congestion Matrix (ZHANG et al., 2016). This technique is based on the congestion of the routing path and the diameter of the network. It relies on the proof by Leighton et al. that the upper limit of latency in any packet routing network can be estimated from the congestion of the routing path and the diameter of the network (LEIGHTON; MAGGS; RAO, 1994).

Moreover, this framework provides circuit area and power consumption (static and dynamic) estimates, both obtained with Orion 2.0 (KAHNG et al., 2009) and the DSENT tool (SUN et al., 2012). To calculate the congestion matrix, a Local Search (LS) algorithm is used to find target configurations. Each configuration has seven common NoC parameters.

All data used to evaluate the solution were collected from the Booksim 2.0 simulator. Eight traffic patterns and nine benchmarks were used to drive the simulator. The experiments were based on a 64-node (8x8) NoC with 65nm manufacturing technology and a 1GHz working frequency. The results showed up to 92% accuracy. According to the authors, when the injection rate increases from 0.10 to 0.55, the reliability falls to 63% using ANN and SVM. However, in the same experiments, they reported that the reliability based on congestion increases from 52% to 76% (ZHANG et al., 2016).

Sinaei and Fatemi compared three multi-objective approaches to explore the NoC design space. The first two, Multi-Objective Simulated Annealing (MOSA) and Multi-Objective Particle Swarm Optimization (MOPSO), were developed by the authors themselves, and the third is the NSGA-II algorithm. Such as metrics Energy consumption, Performance,

Each metric was modeled as a Kahn Process Network (KPN) to estimate the individual metric cost. As a case study, an M-JPEG encoder application was used, and the architecture model is composed of a general-purpose CPU, DSPs, ASICs, SRAMs, and DRAM. The authors verified that MOSA accuracy is better than that of MOPSO and NSGA-II in all generations. Besides that, MOPSO had a more evenly distributed set of solutions, which means more diversified solutions.

Obaidullah and Khan used Hybrid Multi-swarm Optimization to improve the NoC configuration. They focused on optimizing task mapping and the NoC architecture. The approach has two steps: first, the solution optimizes the task mapping; then, the NoC configuration is optimized. The proposed application searches for the configuration that results in optimal communication cost, power, and chip area (OBAIDULLAH; KHAN, 2017).

They combined the two tasks (task mapping and NoC configuration) in the same swarm (each one is a sub-swarm). This helps to find a better final solution, although this design choice expands the dimensions of the search space compared with optimizing only the mapping or only the NoC configuration. Four applications were used to evaluate it (PIP, DVOPD, VOPD, and MPEG4). According to the obtained results, area and power were reduced in the ranges of 30-120% and 8-40%, respectively.

Each paper cited in this section is presented in the next section in a table for comparison and description.

3.2 Paper analysis

Table 1 gives a brief summary of all papers presented in the previous section.

Paper Lithography Process Benchmark Used Simulator Goal

Synthetic Real applications

Wang et al. (2013) X Not specified Optimize virtual channel number

Sangaiah, Hempstead and Taskin (2015) 65nm X Cacti 6.5 and Orion 2.0 Optimize CPU architecture

Zhang et al. (2016) 65nm X BookSim 2.0 Optimize four NoC attributes

Sinaei and Fatemi (2016) 65nm X Not specified Optimize power, performance and area

Obaidullah and Khan (2017) 90nm X Not specified Optimize task mapping and NoC configuration

Rout et al. (2018) 32nm X Noxim Optimize virtual channel number

Table 1: Comparison among cited papers and brief description of them.

Two main pieces of information can be seen in this table. First, all papers used 32nm, 65nm or 90nm manufacturing technology and do not support predictions about it; all of them require a library. Second, three of four papers adopted synthetic applications for their benchmarks. This can be explained by the ease of creating applications that have certain

characteristics, such as a specific number of packets or number of tasks. Concerning the simulators employed, not all papers presented this information, but those that did used a high-level simulator.

Regarding the advantages and disadvantages of each approach, this work surpasses all cited papers in the number of attributes. This characteristic is important because more attributes allow us to consider the behavior of the router’s internal components, which yields more precise results. On the other hand, other approaches only support area or power prediction when tied to a particular lithography process supported by a library. Our work overcomes this, since it can produce different predictions for these metrics. Currently, the main drawback of our work is that it applies only mono-objective solutions.

4 Methodology

This work aims to find an adequate NoC for a given application, focusing on minimizing the amount of physical resources while taking the timing requirements into account. The sections below describe the developed solution, starting with a solution overview.

4.1 Solution description

This work has two steps: first, AI methods (classifiers) are used to predict the area, latency, and power; second, a Genetic Algorithm tests the NoC attributes to find a proper NoC configuration.

First, the designer enters the NoC characteristics, such as the number of packets (NoP), required bandwidth (RB), size, and the maximum expected average latency. The application checks the entered values. If the values are valid (for instance, none of them is negative or zero), the genetic algorithm is initialized and proceeds with the execution; if not, the designer is informed and the execution ends. Another way to use this tool is to enter a latency value equal to one; in this case, the tool executes until the maximum number of evaluations is reached and shows the best NoC configuration found.
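The input check described above can be sketched as follows. This is an illustrative Python sketch and not the tool's Java code; the class name, field names and the encoding of the size field are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DesignerInput:
    number_of_packets: int
    required_bandwidth: float    # in Mbps
    size: int                    # assumed encoding: 8 stands for an 8x8 network
    max_average_latency: float   # 1 means "run until the evaluation budget ends"


def is_valid(params: DesignerInput) -> bool:
    """Reject any zero or negative value, as described in the text."""
    values = (params.number_of_packets, params.required_bandwidth,
              params.size, params.max_average_latency)
    return all(v > 0 for v in values)


print(is_valid(DesignerInput(8192, 16384, 8, 1)))   # True: the GA can start
print(is_valid(DesignerInput(0, 16384, 8, 1)))      # False: designer is informed
```

A latency target of one passes this check, which is consistent with using that value to let the search run until the maximum number of evaluations.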

Based on the average latency value entered by the designer, the GA executes many iterations to comply with this restriction; in other words, the GA searches for an adequate NoC configuration. Every individual (NoC configuration) has its latency value evaluated, but the power and area values are calculated only for the final individual. This restriction was implemented to speed up the evaluation process: a generation may have thousands of individuals, hence calculating these values for every individual would increase the time required to execute all steps. Also, this work proposes a mono-objective algorithm with latency as its metric, therefore the power and area values are not essential to the GA operation; they are only reported at the end of execution.

and at the end of execution, the NoC configuration, as well as the latency, area, and power values, is displayed to the designer. All of these operations are presented in Figure 11.

Figure 11: Workflow of the entire solution. First, it checks the values; if they are correct, the GA is called and executes until the requirements are met.

Figure 11 presents the steps "Compute Latency" and "Compute Area and Power values"; these calculations are carried out by the implemented classifiers. The simulators used are not shown in this workflow because they were only used to feed the classifiers' datasets.

All adequate classifiers were applied to this problem, aiming to find the best option. The tests were done using the Weka¹ software, version 3.8, and the datasets specified in Sections 4.2 and 4.3. A t-test was applied to validate the results² (HSU, 1938).

Another figure that describes the solution is exhibited below.

¹ https://www.cs.waikato.ac.nz/ml/weka/

Figure 12: Class diagram for this work. There are three predictors, one for each NoC metric and its base class (Predictor) and two classes for Genetic Algorithm (Population and GA).

Figure 12 shows the class diagram of this work. The predictors for area, latency, and power inherit from their base class, Predictor, the methods that involve prediction, training and dataset manipulation. The NoC class consists of a vector (representing a NoC; each field is one characteristic, as presented in Figure 14), the predictors, and two auxiliary methods: isValid and random. The first checks whether a NoC is a valid possible solution, and it is used mainly in the repair stage of the GA operation. The latter creates NoCs based on the NoC model. This is important because all individuals of the population need to follow the designer's restrictions (number of packets, required bandwidth, size, and topology).

Still regarding Figure 12, the classes related to the GA operation are Population and GA. Population controls access to a population through the methods getNoC and saveNoC, both of which handle the vector of NoCs, called nets (getting and saving at a specific position), and it manages the size of the vector and the NoC model. The GA class controls the parameters of the GA execution (operators, selection methods). Its attributes define the mutation rate (mutationRate attribute), a flag (flag attribute) that indicates whether the roulette wheel method or tournament should be used, a weight vector (weight attribute) needed by the roulette technique, and whether an elitist strategy must be used. Two special methods are fitness and isEqual. The first implements an objective function for the problem, and the second checks whether two individuals are equal. This is important because the same population should not contain two or more equal individuals.
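The two selection methods that the flag chooses between can be sketched generically. The code below is not the GA class of this work: the function names are hypothetical, and the inverse-latency weighting used for the roulette wheel is an assumption made here so that lower latency receives a larger share of the wheel.

```python
import random
from typing import Sequence


def tournament_select(fitness: Sequence[float], size: int, rng: random.Random) -> int:
    """Pick `size` random contestants; the one with the lowest fitness wins."""
    contestants = rng.sample(range(len(fitness)), size)
    return min(contestants, key=lambda i: fitness[i])


def roulette_select(weights: Sequence[float], rng: random.Random) -> int:
    """Pick index i with probability weights[i] / sum(weights)."""
    spin = rng.random() * sum(weights)
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if spin < cumulative:
            return i
    return len(weights) - 1  # guard against floating-point round-off


latencies = [120.0, 80.0, 200.0, 95.0]           # lower is better
weights = [1.0 / value for value in latencies]   # assumed weighting: inverse latency
rng = random.Random(7)
print(tournament_select(latencies, size=len(latencies), rng=rng))  # 1 (full tournament)
print(roulette_select(weights, rng))
```

A larger tournament size increases selection pressure, while the roulette wheel still gives weaker individuals a proportional chance of being chosen.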

this.

Figure 13: Example of a file generated by the solution; the name represents its characteristics.

The filename records all characteristics used by the experiments. Figure 13 shows a file that stores the results of an experiment that used 8192 packets, a required bandwidth of 16384, and a NoC of size 8; the second 8 is the size of the tournament (the selection method used by the GA for this experiment), and the evaluated population was 50. The 2D Mesh topology was used in all experiments.

Next, the datasets are addressed, starting with the latency dataset.

4.2 Latency dataset

Each instance has a network size between 2x2 and 16x16. These sizes were needed because the non-linear behaviour in some circumstances, such as latency in a 16x16 network using a non-deterministic routing algorithm, does not obey any linear formula. There is also the possibility of contention or deadlocks. As a result, the created dataset contains 686 instances.

Besides the NoC attributes, two pieces of information about the application were used: number of packets and required bandwidth. The NoP values chosen were 128, 1024, 2000 and 8192 per communication flow. For the RB, four values were used: 64, 512, 1024, and 2000Mbps. All experiments were performed using all values of NoP and RB. This variety is necessary to improve the classifiers' learning process (more distinct instances), in an attempt to expose the non-linear values so that the classifiers can represent them.

The latency dataset was fed with the Bit-Reversal, Butterfly, Uniform, Perfect Shuffle, and Transpose Matrix traffic patterns. All instances used are from audio/video and signalling applications; for the simulation, a channel bandwidth of 3200Mbps and a frequency of 100MHz were defined, and the simulation runs until all packets are delivered. It is necessary to specify this because another application could result in better or worse classifier accuracy.

All instances were collected from the results of simulations carried out in the RedScarf tool (SILVA et al., 2017). It is a simulation environment focused on performance evaluation

parameters to be controlled independently of the network itself. This tool was implemented using SystemC RTL.

Next, the power and area datasets are explained in detail.

4.3 Area and Power datasets

Initially, both datasets are similar, differing only in the last attribute (power or area). Each dataset contains seven attributes, which are listed below (along with the respective range of values evaluated):

• Injection rate: 0 to 0.99;

• Number of Buffers per Virtual Channel: 4 to 64;

• Number of Virtual Channels: 1 to 16;

• Number of bits per flit: 16 to 64;

• Frequency: 100MHz to 10GHz;

• Lithography: 11, 22, and 45nm;

• Area/Power.

The first attribute, injection rate, is important because power consumption and injection rate are directly proportional. The second attribute is important because the size of the buffer directly affects the amount of area and power required: as seen in Marculescu, Hu and Ogras (2005) and Cilardo and Fusella (2016), just increasing the size from two to three in a 4x4 mesh NoC increases the router area by 30%. Hence, this parameter should influence the area and power values. According to Mello et al., virtual channels may be adopted to reduce contention (increasing throughput) or to help meet deadlines. Both uses have an impact on the power and area required (because each virtual channel has its own buffers, a larger number of virtual channels leads to a larger number of buffers) (MELLO et al., 2005).

The width of the network channels, called the number of bits per flit, directly impacts performance, because the bandwidth is the product of the channel frequency and the channel width (MARCULESCU; HU; OGRAS, 2005; CILARDO et al., 2015). In this way, it is

away change in power consumption. Lithography affects the area and power required; both metrics are directly proportional to resource usage.
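The bandwidth relation stated above (channel frequency times channel width) can be checked with a one-line calculation. In the sketch below the function name is hypothetical and the 32-bit width is an assumption, chosen because it reproduces the 3200Mbps channel bandwidth at 100MHz used for the latency dataset in Section 4.2.

```python
def channel_bandwidth_mbps(frequency_mhz: float, width_bits: int) -> float:
    """Bandwidth = channel frequency x channel width (MHz x bits = Mbit/s)."""
    return frequency_mhz * width_bits


print(channel_bandwidth_mbps(100, 32))  # 3200
print(channel_bandwidth_mbps(100, 64))  # 6400: doubling the width doubles the bandwidth
```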

Lastly, the area or power attribute is the result of the combination of the previous attributes. To standardize the datasets, all attributes were kept. Frequency, for instance, does not directly impact the area required by the router but, for the sake of consistency, it was included in both datasets.

Instance generation was carried out with the Design Space Exploration of Networks Tool (DSENT), which uses analytical models to calculate the area and power required by routers (including optical and electrical links) (SUN et al., 2012). Sun et al. (2012) compared their software with the ORION 2.0 tool (KAHNG et al., 2009) and SPICE simulations (RASHID; RASHID, 2005); the average error rates reported were 949% and 9%, respectively. Given this result, DSENT was used to generate the instances; in total, 108,180 instances were collected.

Next, Section 4.5 covers the DSE for this work in which some NoC characteristics are also explored to reduce the communication latency.

4.4 Attribute selection

When we model the datasets, we use the attributes that we think are important, but this may not be the best way to extract information from these characteristics. A common option is therefore to use attribute selection methods, such as Best First or Evolutionary Search (XUE et al., 2016). According to Wang, Wang and Chang (2016), these methods can reduce the time required to create the model, but when used incorrectly they can degrade accuracy. Therefore, it is necessary to observe the impact of feature selection on accuracy; in the optimal scenario, the number of features is reduced while the accuracy remains the same as with all attributes (LIU et al., 2017).

These methods work by detecting that one attribute can be inferred from another, that is, that redundant information exists in the data set (SHEIKHPOUR et al., 2017). For instance, given two attributes with a positive correlation equal to 1, after a feature selection method is applied only one attribute remains in the data set, since the second attribute can be inferred from it (DERNONCOURT; HANCZAR; ZUCKER, 2014; HIRA;

4.5 Network on Chip Space Design

Considering all possible NoCs, the total number of possibilities is shown in Table 2.

Characteristic Number of possibilities

Topology 2

Size 15

Routing Protocol 6

Virtual Channels 16

Input Buffer Depth 28

Output Buffer Depth 33

Arbiter Type 4

Total 10,977,120

Table 2: All possibilities for Design Space Exploration in this work.

Table 2 shows the number of possibilities (options for each attribute) for each characteristic and the total number. We can conclude that there is a huge number of possibilities and, hence, a high number of simulations would be needed to exhaust all options. Assuming that each simulation executes in fifteen seconds, 45,738 hours would be necessary to simulate all NoC options (sequential simulations). Therefore, the design space is huge, and it is almost unfeasible to evaluate all possibilities in a reasonable time. There is also another problem: varying one attribute can impact others (chain reaction). Thereby, the exploration requires some method to speed up the process so that more NoC configuration options can be analyzed. This work uses an AI method for this purpose, in order to achieve faster design space exploration with the highest possible accuracy.
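The arithmetic behind the exhaustive-simulation estimate can be reproduced in a few lines. The sketch below uses only numbers from Table 2 and the text (fifteen seconds per simulation); the variable names are illustrative.

```python
# Number of options per characteristic, exactly as listed in Table 2.
options = {
    "Topology": 2, "Size": 15, "Routing Protocol": 6, "Virtual Channels": 16,
    "Input Buffer Depth": 28, "Output Buffer Depth": 33, "Arbiter Type": 4,
}
product_of_listed_counts = 1
for count in options.values():
    product_of_listed_counts *= count

reported_total = 10_977_120     # total stated in Table 2
seconds_per_simulation = 15     # assumption made in the text
hours = reported_total * seconds_per_simulation / 3600

print(product_of_listed_counts)  # 10644480
print(hours)                     # 45738.0
```

The 45,738-hour figure follows directly from the total reported in Table 2. Multiplying the per-characteristic counts exactly as listed gives 10,644,480, slightly below that total, so one of the listed counts may differ from the value used to compute it; either way, the estimate stays above 44,000 hours of sequential simulation.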

In Table 2, Virtual Channels and Output Buffer Depth were limited to 16 and 33 options, respectively, because these were the maximum values tested (1 to 16 and 0 to 32). Simulations with more than 16 virtual channels would need a large amount of RAM (even in intermediate-size scenarios): more than 5GB, with more than 3.5GB for each simulation thread alone.

The routing protocol options were limited by the simulator, and implementing more protocols is outside the scope of this work. The same justification applies to the arbiter and size characteristics (limited by the simulator). The topologies used will be Mesh and Torus, both 2D. The use of 3D topologies is also outside the scope.

4.6 Genetic Algorithm

To optimize and to reduce the time required for design exploration, the design space exploration is performed by a Genetic Algorithm. GAs have been successfully used as a solution for a large set of minimization problems, including mapping problems, and they are able to deal with hard constraints such as performance or power (STRUM; CHAU et al., 2015). This technique uses the predictors presented earlier to avoid simulations. Thereby, the execution is expected to be even faster than a design space exploration using simulations. An important issue in GAs is the objective function; for this work, the average latency value is used as the objective function, and the goal is to minimize this value.

The whole tool was developed in Java, specifically version 8. As presented in Figure 11, the GA is executed until it reaches one of these conditions:

• Maximum number of evaluations;

• Expected latency.

These conditions limit the execution, avoiding an endless loop and not wasting time and resources going beyond the designer's requirement. Another important characteristic is the population size. Each individual of the population is a potential solution for the problem, but in this case there are many invalid NoC configurations, for instance a NoC without buffers or without a routing algorithm. Therefore, a repair strategy was required for these cases. This strategy checks each vector field (NoC characteristic) and, on detecting something wrong, modifies the value using a random function, respecting the restrictions.
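The stop conditions and the repair strategy can be sketched as follows. This is a minimal illustration and not the code of the tool: the class name, the method names and the representation of the restrictions as per-field minimum and maximum values are all assumptions made for the example.

```java
import java.util.Random;

// Hypothetical sketch of the two guards described above.
final class GaGuards {
    /** Stop when the evaluation budget is spent or the latency target is met. */
    static boolean shouldStop(int evaluations, int maxEvaluations,
                              double bestLatency, double expectedLatency) {
        return evaluations >= maxEvaluations || bestLatency <= expectedLatency;
    }

    /**
     * Repairs a chromosome in place: any field outside [min[i], max[i]]
     * (assumed form of the restrictions) is replaced by a random value
     * inside that range. Returns the number of fields that were changed.
     */
    static int repair(int[] genes, int[] min, int[] max, Random rng) {
        int changed = 0;
        for (int i = 0; i < genes.length; i++) {
            if (genes[i] < min[i] || genes[i] > max[i]) {
                genes[i] = min[i] + rng.nextInt(max[i] - min[i] + 1);
                changed++;
            }
        }
        return changed;
    }
}
```

Valid fields are left untouched, so the repair does not disturb the parts of an individual that already respect the restrictions.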

Figure 14 details the defined chromosome for this problem.

Figure 14: Example of NoC Chromosome.

Figure 14 shows that the first four fields, or positions in the vector, are set by the designer; they compose a NoC model. The fifth to ninth attributes are operated on by the GA, and the last one is calculated based on the previous field. Based on this information, a population is created following the model informed by the designer. After the evolution process starts, two genetic operators may occur: mutation and crossover, as explained in Section 2.2. Both are presented in the next subsection.
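The chromosome layout described above (four designer fields, five GA-operated fields and one derived field) could be initialized as in the sketch below. The names, the integer encoding, the range arrays and in particular the `deriveLast` function are hypothetical; the text does not state how the last field is computed, so the rule is passed in as a parameter.

```java
import java.util.Random;
import java.util.function.ToIntFunction;

// Hypothetical sketch of individual creation; not the tool's actual code.
final class ChromosomeFactory {
    static final int DESIGNER_FIELDS = 4; // positions 0..3, fixed by the designer
    static final int GA_FIELDS = 5;       // positions 4..8, operated on by the GA
    static final int LENGTH = DESIGNER_FIELDS + GA_FIELDS + 1; // last field is derived

    static int[] create(int[] designerModel, int[] gaMin, int[] gaMax,
                        ToIntFunction<int[]> deriveLast, Random rng) {
        int[] c = new int[LENGTH];
        // Copy the designer's model unchanged into the first four positions.
        System.arraycopy(designerModel, 0, c, 0, DESIGNER_FIELDS);
        // Draw each GA-operated field uniformly from its allowed range (assumed ranges).
        for (int i = 0; i < GA_FIELDS; i++) {
            c[DESIGNER_FIELDS + i] = gaMin[i] + rng.nextInt(gaMax[i] - gaMin[i] + 1);
        }
        // The last field is computed from the others by a caller-supplied rule.
        c[LENGTH - 1] = deriveLast.applyAsInt(c);
        return c;
    }
}
```

A population following the designer's model is then obtained by calling `create` once per individual with the same `designerModel`.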

4.6.1 Genetic operators

Mutation is applied when the diversity of the population needs to be improved or when many individuals have the same allele (SASTRY; GOLDBERG; KENDALL, 2014). One possible improvement is self-adaptation: according to Karafotias, Hoogendoorn, and Eiben, it allows the EA to use appropriate parameter values in different stages of the search process (KARAFOTIAS; HOOGENDOORN; EIBEN, 2015).

Initially, the mutation rate is 1%, but if the GA seems to be stuck in a local optimum (a sub-optimal solution), it increments this rate (by 5% each time), aiming to search the whole design space and not only one region. This technique allows reaching better results and greater population diversity (ZAMUDA; BREST, 2015).
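One way to realize this adaptive mutation rate is sketched below. Several details are assumptions made only for illustration: stagnation is detected as a number of generations without improvement (`patience`), the 5% increment is read as five percentage points, the rate is capped at 100%, and it returns to 1% once the best latency improves.

```java
// Hypothetical sketch of the adaptive mutation rate; not the tool's actual code.
final class AdaptiveMutationRate {
    static final double INITIAL_RATE = 0.01; // 1%, as stated in the text
    static final double STEP = 0.05;         // +5 percentage points (assumed reading)

    private final int patience;              // generations without improvement (assumed criterion)
    private double rate = INITIAL_RATE;
    private double bestLatency = Double.POSITIVE_INFINITY;
    private int stalled = 0;

    AdaptiveMutationRate(int patience) {
        this.patience = patience;
    }

    /** Call once per generation with that generation's best (lowest) latency. */
    double update(double generationBestLatency) {
        if (generationBestLatency < bestLatency) {
            bestLatency = generationBestLatency;
            stalled = 0;
            rate = INITIAL_RATE;               // assumption: reset after an improvement
        } else if (++stalled >= patience) {
            rate = Math.min(1.0, rate + STEP); // search seems stuck: raise the rate
            stalled = 0;
        }
        return rate;
    }
}
```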

Crossover was implemented using the two-point crossover technique, in which the points are selected randomly, as shown in Figure 15.

Figure 15: Example of two-point crossover where three attributes were copied from the mother and three from the father.

Figure 15 presents a crossover with two split points. Three attributes were copied from the mother individual (identified as "M" in the image) and three from the father ("F") to the child ("C"). The red lines delimit this segmentation: the region between the lines is copied from the father.
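A two-point crossover of this kind can be sketched in a few lines. The class and method names are hypothetical, individuals are assumed to be integer vectors, and the half-open interval convention for the cut points is a choice made for the example.

```java
import java.util.Random;

// Hypothetical sketch of two-point crossover; not the tool's actual code.
final class TwoPointCrossover {
    /** The child takes the father's genes in [p1, p2) and the mother's genes elsewhere. */
    static int[] cross(int[] mother, int[] father, int p1, int p2) {
        int[] child = mother.clone();
        System.arraycopy(father, p1, child, p1, p2 - p1);
        return child;
    }

    /** Picks two distinct cut points at random in [from, to] (requires from < to) and crosses. */
    static int[] crossRandom(int[] mother, int[] father, int from, int to, Random rng) {
        int a = from + rng.nextInt(to - from + 1);
        int b = from + rng.nextInt(to - from + 1);
        while (b == a) {
            b = from + rng.nextInt(to - from + 1);
        }
        return cross(mother, father, Math.min(a, b), Math.max(a, b));
    }
}
```

For two six-gene parents, `cross(mother, father, 2, 5)` copies three genes from the father and keeps three from the mother, which corresponds to the situation depicted in Figure 15.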


The selection methods adopted in this work are detailed below.

4.6.2 Selection methods

For the selection method, this work gives the designer the option to choose between the tournament and roulette wheel methods. When tournament is selected, the designer needs to provide the tournament size. Both are traditional selection methods, but there is no guarantee that good individuals will be selected (YADAV; SOHAL, 2017).

Despite this, roulette wheel tends to give better results because chromosomes are selected with probabilities proportional to their fitness values, a principle similar to that of a roulette wheel, while tournament may lead to lower diversity and, hence, a sub-optimal solution (SHUKLA; PANDEY; MEHROTRA, 2015).
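Both selection methods can be sketched as below. This is an illustration under stated assumptions rather than the tool's implementation: the names are hypothetical, each individual is represented only by its predicted latency, and, since the objective is to minimize latency, the roulette wheel uses 1/latency as the fitness value (latencies are assumed to be strictly positive).

```java
import java.util.Random;

// Hypothetical sketch of the two selection methods; not the tool's actual code.
final class Selection {
    /** Tournament: draw `size` random individuals and return the index with the lowest latency. */
    static int tournament(double[] latency, int size, Random rng) {
        int best = rng.nextInt(latency.length);
        for (int i = 1; i < size; i++) {
            int rival = rng.nextInt(latency.length);
            if (latency[rival] < latency[best]) {
                best = rival;
            }
        }
        return best;
    }

    /** Roulette wheel: selection probability proportional to fitness = 1 / latency (assumed). */
    static int rouletteWheel(double[] latency, Random rng) {
        double total = 0.0;
        for (double l : latency) {
            total += 1.0 / l;
        }
        double spin = rng.nextDouble() * total;
        double accumulated = 0.0;
        for (int i = 0; i < latency.length; i++) {
            accumulated += 1.0 / latency[i];
            if (spin < accumulated) {
                return i;
            }
        }
        return latency.length - 1; // guard against floating-point rounding
    }
}
```

In the tournament, a larger `size` increases the selection pressure, which is consistent with the risk of lower diversity mentioned above.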

As no articles were found in the literature that analyze both methods for this specific problem, experiments with these two methods are analyzed in Section 5.

The next subsection addresses the choice of parameter values for the genetic algorithm execution.

4.6.3 Genetic algorithm configuration

Another issue is the choice of GA parameter values. There are two options to guide this process: test all possibilities for each parameter, analyze all results and find the best configuration; or use automatic algorithm configuration. The latter explores the configuration options using statistical tests to verify which are the best configurations (solutions) without having to perform all tests.

In this work, the tool called irace was used. It implements an iterated racing procedure, which is an extension of Iterated F-race (I/F-race) (LÓPEZ-IBÁÑEZ et al., 2016).

4.7 Comparison with related works

Table 3 presents information about all cited papers and this work. Except for this work and the first, all are multi-objective, considering at least area and performance, such as Sangaiah, Hempstead and Taskin (2015). Only Sinaei and Fatemi (2016) and Obaidullah and Khan (2017) used Particle Swarm Optimization, but with different strategies: while the first uses a Kahn Process Network to model the application mapping, the second uses PSO and SA to do it.

Another difference is that the first was developed with a focus on MPSoC, while the second is focused on NoC. On the other hand, Zhang et al. (2016) supports a larger number of attributes for design space exploration (eight, five of which are topologies). In relation to Wang et al. (2013), our approach is more comprehensive regarding NoC attributes, whereas the cited paper focuses on optimizing only the number of virtual channels.

Table 3: Comparison between cited papers

Works Employed technique Objective function Architecture

Regression model, Congestion Matrix, Particle Swarm, Simulated Annealing, Genetic Algorithm, Random Forest, MultiLayer Perceptron, Area, Performance, Power, MPSoC, NoC

Wang et al. (2013) X X X
Sangaiah, Hempstead and Taskin (2015) X X X X
Zhang et al. (2016) X X X X
Sinaei and Fatemi (2016) X X X X X X
Obaidullah and Khan (2017) X X X X X
Rout et al. (2018)3 _X _X
This work X X X X X

This work differs from previous works in two points:

• Support for many NoC attributes.

• Support for different lithographic processes.

First, this solution supports nine attributes for latency and six for area/power, including the kind of arbiter, output buffer, number of bits per flit, and a semi-adaptive routing protocol (OddEven, in this case). Compared with Obaidullah and Khan (2017), our approach supports four additional attributes. Secondly, the area and power predictors are trained with three different lithographic processes, and they are able to predict the correct values for other processes as the designer needs, unlike Zhang et al. (2016) and Sinaei and Fatemi (2016), which only support the 65 nm manufacturing technology. In practical terms, our approach can handle more kinds of arbiters, output buffers and routing protocols than the papers presented.