The Monte Carlo neutronics code TRIPOLI-4


Monte Carlo codes are widely used in various reactor physics applications, as they provide an excellent platform for modeling neutron transport. The simulations are performed by following particle paths that are independent, either throughout the problem (radiation shielding calculations) or on a per-generation basis (criticality calculations). For complex problems, such as criticality problems in reactor models, the required computation time can be extremely large.

By using parallel computation, the time required for large or complex transport computations can be reduced by more than an order of magnitude, depending on the number of processors used in each run. In the present thesis, the parallel performance of TRIPOLI-4 on a specific criticality problem has been investigated. Performance metrics such as speedup, efficiency and scalability have been used for the performance evaluation.

The problem used is a simulation of a thermal reactor of high complexity, which requires a large amount of computation time.

Introduction

Objective

Structure of Thesis

Background

  • Flynn's Taxonomy
    • Single Instruction, Single Data (SISD)
    • Single Instruction, Multiple Data (SIMD)
    • Multiple Instruction, Single Data (MISD)
    • Multiple Instruction, Multiple Data (MIMD)
  • Memory Structure
    • Shared Memory
    • Distributed Memory
    • Hybrid Memory
  • Problem decomposition
    • Data Parallelism
    • Task Parallelism
  • Overhead
    • Effect of Communication
    • Effect of Load balancing
    • Effect of Synchronization
  • Parallel performance metrics
    • Speedup
    • Efficiency
    • Scalability
    • Amdahl's Law
  • Beowulf Cluster
    • Thales Cluster

In shared-memory parallel computers, each processor has access to the whole memory at any time. The main advantage of these machines is their ease of programming, since no explicit message passing between the processors is required. OpenMP uses a portable, scalable model that gives programmers a simple and flexible interface to develop parallel applications for platforms ranging from the standard desktop computer to the supercomputer. [10]
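
As a minimal illustration of this model, a sketch in C with OpenMP (not taken from TRIPOLI-4; the array and workload are placeholders): every thread reads and writes one shared array directly, so no messages need to be exchanged.

    /* Shared-memory sketch: compile with e.g. cc -fopenmp example.c */
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const int n = 1 << 20;
        double *x = malloc(n * sizeof *x);   /* one array, visible to all threads */

        #pragma omp parallel for             /* loop iterations split across threads */
        for (int i = 0; i < n; i++)
            x[i] = 2.0 * i;                  /* each thread writes its own slice */

        printf("x[42] = %.1f, max threads = %d\n", x[42], omp_get_max_threads());
        free(x);
        return 0;
    }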

The most popular message passing technology is the Message Passing Interface (MPI), a message passing library for C and Fortran. In hybrid shared/distributed memory parallelism, a cluster of shared-memory parallel machines uses distributed-memory parallelism to distribute a computation among the machines, while each machine uses shared-memory parallelism to compute its part of the solution. [12] Currently, a common example of a hybrid model is the combination of the message passing model (MPI) with the threading model (OpenMP).
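
A minimal sketch of this hybrid MPI + OpenMP pattern in C (illustrative only; the rank/thread work split and the summed quantity are placeholders, not TRIPOLI-4 source):

    /* Hybrid sketch: each MPI rank owns part of the work (distributed memory);
       inside a rank, OpenMP threads share that part's memory. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank, size;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double local = 0.0;
        #pragma omp parallel for reduction(+:local)  /* shared memory within a rank */
        for (int i = rank; i < 1000000; i += size)   /* work interleaved across ranks */
            local += 1.0 / (1.0 + i);

        double total;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("total = %f over %d ranks\n", total, size);

        MPI_Finalize();
        return 0;
    }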

The advantage of (distributed) shared memory is that it provides a unified address space in which all data can be found. The computation is viewed as a series of operations on these distributed data structures, and concurrency is achieved by acting simultaneously on different parts of the data [5,17]. The compiler converts the program into standard code plus function calls to a message passing library that distributes the data to the processes [9].

In a task parallelism model, execution of the problem is distributed across multiple tasks, each assigned to a specific processor. Parallel programs require communication both when the subproblems are distributed to the processors and during processing. This communication delay is an overhead that can be limited by carefully choosing the number of processors, so that each processor has enough work relative to the amount of communication it performs. A sketch of the task-parallel pattern follows.
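
A minimal sketch of task parallelism in C with OpenMP sections (the two task functions are hypothetical names, chosen only to contrast with splitting one loop's data):

    #include <omp.h>
    #include <stdio.h>

    /* Two independent, hypothetical tasks (illustrative names only). */
    static double simulate_batch(int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += 1.0 / (i + 1);
        return s;
    }

    static double tally_results(int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += (double)i;
        return s;
    }

    int main(void) {
        double a = 0.0, b = 0.0;
        #pragma omp parallel sections
        {
            #pragma omp section
            a = simulate_batch(1000000);   /* task 1 runs on one thread */
            #pragma omp section
            b = tally_results(1000000);    /* task 2 runs on another thread */
        }
        printf("a = %f, b = %f\n", a, b);
        return 0;
    }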

Synchronization problems can occur due to poor load balancing, forcing threads that have completed their work to wait until the last one reaches the barrier, adding that idle time to the parallel program's runtime. The speedup of a program that uses many processors in parallel is limited by the time required for the sequential fraction of the program (Amdahl's law). Parallel efficiency is a performance metric with a value between zero and one, measuring the fraction of the ideal linear speedup that the algorithm actually achieves.

The complement of the efficiency indicates how much effort is wasted due to parallelization overhead. Scalability expresses the ability of a program/system to improve its performance after hardware is added, proportional to the capacity added.
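
For reference, the standard definitions behind these metrics, with T_1 the serial runtime, T_p the runtime on p processors, and s the serial fraction of the program:

    S(p) = \frac{T_1}{T_p}, \qquad
    E(p) = \frac{S(p)}{p} \in (0, 1], \qquad
    S(p) \le \frac{1}{s + (1 - s)/p} \quad \text{(Amdahl's law)}

Amdahl's law gives the hard ceiling mentioned above: as p grows without bound, the speedup is bounded by 1/s.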

Figure 2: Tightly (left) and loosely (right) coupled systems [5]

Parallelism scheme of the TRIPOLI-4 code

Monte Carlo

  • Monte Carlo Criticality Problem

Monte Carlo Implementation on TRIPOLI-4

  • Parallel Implementation on TRIPOLI-4

The communication library takes care of providing processes with communication objects: message type and content, storage, send/receive actions in blocking/non-blocking mode, XDR formatting, etc. It is built on a specific programming interface (MPI, PVM, TLI, BSD sockets), performing the unitary operations of the processes and managing the command/data flow.
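
As an illustration of the kind of blocking send/receive primitive such a layer wraps, a minimal MPI sketch in C (the message content is hypothetical; this is not the TRIPOLI-4 library itself):

    /* Blocking point-to-point exchange: run with at least two ranks,
       e.g. mpirun -np 2 ./a.out */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char msg[64];
        if (rank == 0) {
            strcpy(msg, "batch results");                 /* placeholder payload */
            MPI_Send(msg, 64, MPI_CHAR, 1, 0, MPI_COMM_WORLD);      /* blocking send */
        } else if (rank == 1) {
            MPI_Recv(msg, 64, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                            /* blocking receive */
            printf("rank 1 received: %s\n", msg);
        }

        MPI_Finalize();
        return 0;
    }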

TRIPOLI-4 Results

Methodology and Results

As the number of processors increases, the deviation between the actual speedup and the ideal one grows steadily. As can be seen from the curve in Figure 10, the efficiency drops sharply when two processors are used. This poor performance (an efficiency of almost 50%) is due to the communication time between the two processors, which appears to exceed the computation time.
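
As a quick sanity check of the two-processor figure, using the definitions above:

    E(2) \approx 0.5 \;\Rightarrow\; S(2) = 2\,E(2) \approx 1.0

That is, the two-processor run takes roughly as long as the serial one, which is consistent with communication dominating computation at this point.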

However, adding more processors improves performance: both speedup and efficiency rise up to the 16-processor case, where the highest efficiency is observed. Beyond this point, efficiency begins to decrease as more processors are added, even though the speedup keeps increasing. Further multi-processor runs were performed to gain more insight into how efficiency and speedup behave as the number of processors grows.

Since a total of 88 processors were available, four sets of runs were performed with the number of processors increasing from 70 to 85 (70, 75, 80 and 85 processors). As more processors are added, the deviation between the actual and the ideal speedup line becomes more pronounced (Fig. 11) and the efficiency continues to decline (Fig. 12). It should be noted, however, that distributing the work over 85 processors is still more parallel-efficient than distributing it over two.

To evaluate the scalability of the code on a reduced problem, the number of batches in the initial input was halved while the batch size was kept the same. As expected, the serial case required half the computation time of the corresponding full-size run. The execution time of the two-processor case is once again greater than that of the corresponding serial run.

From the plots, one can observe that the scalability in the second case is clearly worse. The efficiency versus number-of-CPUs graph (Fig. 14) likewise shows that the efficiency is lower for the reduced problem. In both cases the communication per processor stays the same as the number of processors increases, but the amount of work per processor is smaller for the reduced problem, so its communication-to-computation ratio is higher and its efficiency correspondingly lower.

Figure 10: Blue line: The speedup versus the number of processors (10000 cycles). Red line: Perfect speedup

Conclusions

References

[3] J. Leppänen, "Development of a New Monte Carlo Reactor Physics Code", Ph.D. thesis, Helsinki University of Technology, 2007.
[4] TRIPOLI-4 Project Team, "TRIPOLI-4 Version 8.1, 3D Continuous Energy General Purpose Monte Carlo Transport Code", Database/Computer Program Services, 2013.
[5] Massingill, "Integrating Task and Data Parallelism", Master's thesis, California Institute of Technology.
[6] http://en.wikipedia.org/wiki/Flynn's_taxonomy
[7] https://computing.llnl.gov/tutorials/parallel_comp/parallelClassifications.pdf
[8] https://computing.llnl.gov/tutorials/parallel_comp/#Flynn
[9] Soviany, "Data Embedding and Task Parallelism in Image Processing Applications", Ph.D. thesis.
[10] http://en.wikipedia.org/wiki/OpenMP
[11] "To Parallelize or Not to Parallelize, Speed Up Issue", International Journal of Distributed and Parallel Systems (IJDPS).
[12] "On the Parallel Efficiency and Scalability of the Correntropy Coefficient for Image Analysis", Journal of the Brazilian Computer Society.
[25] L. Russell, "Simulation of Time-Dependent Neutron Populations for Reactor Physics Applications Using the GEANT Monte Carlo Toolkit", Master's thesis, McMaster University, 2012.
[26] "TRIPOLI-4: A 3D Continuous Energy Monte Carlo Transport Code", First International Conference on Physics and Technology of Reactors and Applications.
