Universidade de Aveiro, Departamento de Electrónica, Telecomunicações e Informática, 2019
Miguel Ângelo Crespo Ferreira
Teste de desempenho de padrões de Vulkan com path-tracing
Testing Vulkan pattern performance with path-tracing
Dissertation presented to the Universidade de Aveiro in fulfilment of the requirements for the degree of Master in Computer and Telematics Engineering, carried out under the scientific supervision of Prof. Dr.-Ing. Joaquim João Estrela Ribeiro Silvestre Madeira, Assistant Professor at the Universidade de Aveiro.
the jury
president: Prof. Doutor Tomás António Mendes Oliveira e Silva, Associate Professor at the Universidade de Aveiro
examiners committee: Prof. Doutor António Fernando Vasconcelos Cunha Castro Coelho, Associate Professor with Habilitation at the Faculdade de Engenharia da Universidade do Porto (examiner)
Prof. Dr.-Ing. Joaquim João Estrela Ribeiro Silvestre Madeira
keywords Vulkan, path-tracing, threading, memory, compute shaders, guidelines.
abstract Performance and low overhead are the selling points of modern graphics APIs, such as Vulkan, when compared to previous versions. However, these come with added complexity over previous APIs. Understanding all the API elements involved takes time and, sometimes, there is no clear perception of the benefits and drawbacks of each. The focus of this thesis is to draw a clear view of a subset of Vulkan patterns and lay out their pros and cons. The thesis features a path-tracer implemented in Vulkan solely out of Compute shaders. It is used to evaluate the effects of multi-threading, command scheduling and memory management with a consistent test method. Furthermore, it gathers the experience of working with Vulkan into a set of guidelines and abstractions. These can be used to flatten the learning curve of other API users or to suggest improvements to existing applications.
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Goal
  1.3 Structure
2 Background
  2.1 Rendering
    2.1.1 Rasterisation
    2.1.2 Ray-Tracing
    2.1.3 Path-Tracing
    2.1.4 Overview
  2.2 Modern graphics APIs
    2.2.1 Modern vs old paradigm
    2.2.2 SPIR-V
    2.2.3 Vulkan
  2.3 Related Work
3 Path-Tracer
  3.1 Overview
    3.1.1 Pipeline
    3.1.2 Commands
  3.2 Resources
    3.2.1 Arrangement criteria
    3.2.2 Layout
  3.3 Rendering call
    3.3.1 Pre-processing
    3.3.2 Sampling
    3.3.3 Post-processing
4 Benchmark
  4.1 Overview
    4.1.1 Tested patterns
    4.1.2 Method
  4.2 Tests
    4.2.1 Results
    4.2.2 Analysis
5 Conclusion and Future Work
  5.1 Summary
  5.2 Future work
6 References
A Initial Version
  A.1 Overview
  A.2 Issues
  A.3 Results
B Vulkan Guidelines
  B.1 Abstraction
  B.2 Generic
  B.3 Shaders and Resources
List of Figures
1.1 2D rendering example.
1.2 3D rendering example.
1.3 Path-tracing and rasterisation example.
1.4 Real-time ray-tracing and rasterisation example.
2.1 A triangle and an example of its raster image.
2.2 Flat and Phong shading.
2.3 Ambient occlusion example.
2.4 Shadow umbra and penumbra.
2.5 Ray-trace route example made from two hit mirror primitives.
2.6 Axis oriented cuboid illustration.
2.7 Mirror reflection.
2.8 Refraction.
2.9 Glossy, or fuzzy, reflection.
2.10 Relation between ray-tracing and path-tracing in identical circumstances.
2.11 Diffuse reflection.
2.12 Bidirectional path-tracing.
2.13 Vulkan logo.
2.14 SPIR logo.
2.15 Compute shader dispatch example.
2.16 Execution barrier example.
3.1 Path-tracer pipeline structure.
3.2 Path-tracer command schedule.
3.3 Path-tracer buffer memory organization.
3.4 Path-tracer descriptors layout.
3.5 Camera and ray-launcher crossed perspectives.
4.1 Benchmark scene showcase.
4.2 Additional sampling test spreadsheet.
5.1 Glass Suzanne render.
List of Tables
2.1 Rundown of all render features.
2.2 Vulkan's changes from OpenGL.
4.1 Pattern combinations.
4.2 Test scene primitives.
4.3 Results from the simple-sample tests.
4.4 Multi-sample pattern combination results.
4.5 Single and multi-sample time normalised results.
Figure 1.1: 2D rendering example from the game Child of Light. PS4 version screenshot.
Figure 1.2: 3D rendered scene and its wireframe. Images source: https://www.artstation.com/artwork/6GmqO
1 Renders that do not require instant results or have a large time-wise tolerance.
2 Or 30 frames per second in some console and less motion-intensive games, where low frame-rate issues are not that visible.
1 Introduction
From architecture to design, from visual effects to entire movies, from small simulations to complex video games, all share a common point: rendering, a term which describes the techniques used to synthesise images from scenes, either 2D, figure 1.1, or 3D, figure 1.2.
Rendering can be either real-time or not. When there is "no time limit"¹, quality is all that matters. Real-time rendering, in contrast, needs to maintain at least a set frame-rate to present fluid motion. Even then, unless stated otherwise, the objective is always to deliver the best quality at the targeted frame-rate. Video games, for example, are tuned to maintain 60 frames per second² on a predefined base specification.
In "no time limit" renderers, such as Pixar's RenderMan [21], it is common to use path-tracing [6]. It produces near-realistic results, figure 1.3 left image, by acquiring an abundance of light paths for each pixel in the frame. A significant amount of computational time is required to fulfil this task, which is why path-tracing is often unfit for real-time rendering.
Then there is ray-tracing [4, 3]. It is similar to path-tracing but does not require an abundance of light paths. Instead, it uses illumination methods to determine object shadows and lighting. Due to this difference in concept, it is much faster but sacrifices some quality.
Finally, rasterisation takes an entirely distinct approach to rendering from both tracing types. There, each of a scene's polygons is rendered and lit, one by one, with the same illumination system as ray-tracing. Compared to the other two, it is far faster, yet, by itself, incapable of displaying the same reflection, refraction and shadow accuracy. Still, it showcases incredible results, as shown in figure 1.3 right image, which led it to reign over real-time rendering as the best quality achievable in those conditions. Not anymore. In recent years, ray-tracing has become attainable in real-time. A change undoubtedly sought after, due to the apparent differences in quality, as seen in figure 1.4.
Enabling this change is the increase in performance of both GPUs and modern graphics APIs, such as Vulkan, Direct3D and Metal. These factors eventually led to the creation of dedicated ray-tracing pipelines for the newer GPU generations. The APIs' performance increase, however, brought a shift of driver duties to the developers and added complexity to their use.
Figure 1.3: Path-tracing and rasterisation example. The left image shows a scene created in Pixar's RenderMan, according to its source (https://www.flickr.com/photos/protohiro/38197653325). It has a near-realistic, photo-like look. The right image, on the other hand, is a screenshot from The Witcher® 3: Wild Hunt. It has impressive results, but is far from a "no time limit" render.
Figure 1.4: Real-time ray-tracing and rasterisation example. The left image is a scene from project PICA-PICA, taken from the book Ray Tracing Gems: High-Quality and Real-Time Rendering with DXR and Other APIs [29]. Glossy reflections are present on multiple surfaces, and object shadows are entirely accurate. The right image, on the other hand, is a screenshot from The Witcher® 3: Wild Hunt. On the table, shadows are missing and the metallic objects show no reflections.
3 Two or more program builds with the same output quality, but a different internal layout.
4 The ideas, methods and solutions practised within a single trait.
1.1 Motivation
What is the level of added complexity in these new APIs? How can this performance improvement be achieved? Is it possible to worsen the performance?
These are the types of questions that can arise with modern APIs, especially since in Vulkan, for example, a renderer can have multiple valid builds³. Differences between builds can be as small as decisions over command order or even a separation between two data types, and yet have an unforeseen impact on performance.
When jumping from an OpenGL to a Vulkan build, overlooking such small differences is quite easy and may lead to unexpected results. Nevertheless, there are not many works describing these issues and how to solve them.
Books such as Vulkan Programming Guide: The Official Guide to Learning Vulkan [19] and Vulkan Cookbook [20] provide information on Vulkan's objects and how to set them up. Still, they leave all ideas about possible API abstractions or object groups to the reader. Gaining insight into every object's utility and impact takes time, and might not be doable within a project's lifetime. That is where those ideas come in. By narrowing down where the information is spread, it becomes easier to locate targets and improve productivity.
This thesis intends to fill that gap: analyse the impact of these small differences, and provide a simple but helpful abstraction over Vulkan.
1.2 Goal
The main objective of this thesis is to evaluate a set of patterns⁴ and their performance impacts. The set includes:
G.1 The influence of multi-threading on performance. Where to obtain the best performance and how.
G.2 Explore the schedule options in command records. Clarify the difference between pre-recorded and on-the-fly commands.
G.3 The overall impact of memory management. The effects of sub-allocation and the like.
Then, there are two secondary objectives. The first is to create an abstraction of all of Vulkan's objects. The second is to compile all results into a set of guidelines, tips and suggestions. Both are gathered in an annexe.
The second objective is intended to help new Vulkan users and to warn about small but significant aspects that are easily overlooked.
1.3 Structure
This thesis description spreads over five chapters, including this one, and contains both an appendix and an annexe.
Background, chapter 2, expands on both renderers and modern APIs, and also refers to some recent publications on these themes. Chapter 3, Path-Tracer, describes the built path-tracer, with all its functions. Benchmark, chapter 4, states all obtained results and their analysis. And chapter 5, Conclusion and Future Work, sums up the thesis and proposes some extensions to the work.
Lastly, Appendix A specifies a previously built path-tracer and Annexe B, the proposed abstraction and guidelines.
5 Images made by pixels.
6 The points, lines and shapes, or polygons, constituting each model's mesh.
Figure 2.1: An example image of a green gradient triangle raster and all its possible pixels. Some processes can cull pixels with low shape coverage.
Figure 2.2: Example of a quad sphere, made with quads and not triangles, shaded with both flat, top, and Phong shading, bottom.
Examples created with Blender.
2 Background
Before delving deeper into the work, this chapter broadens the insight of involved themes.
It describes rasterisation, ray-tracing and path-tracing in subsections 2.1.1 to 2.1.3 and compares them in subsection 2.1.4. Then it introduces modern graphics APIs at the start of section 2.2, their differences in paradigm from previous versions in subsection 2.2.1, and both SPIR-V and Vulkan in subsections 2.2.2 and 2.2.3. In the end, it refers to some closely related works in section 2.3.
2.1 Rendering
Recreating accurate scene illumination would need an immense amount of time. It is just impractical. To counter this issue and speed up the process, renderers use strategies to simplify lighting traits, such as reflection, refraction and shadow.
2.1.1 Rasterisation
Rasterisation is the process of converting forms to raster images⁵ [7], as seen in figure 2.1. As a renderer, it uses this mechanism to transform all scene primitives⁶, one by one, and imprint them onto the output frame. It is aided by an auxiliary image, the depth buffer, which saves the depth of every covered pixel so that a depth test can filter out hidden pixels. Without it, there would be no way to know which objects are closest.
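The depth filtering described above can be sketched in a few lines. This is a minimal illustration, not the thesis's implementation: the function name `rasterise_fragments` and the fragment tuple layout are assumptions made for the example, and smaller depth values are assumed to be closer to the viewer.

```python
import math

def rasterise_fragments(width, height, fragments):
    """Resolve visibility with an auxiliary depth image.

    fragments: iterable of (x, y, depth, colour); smaller depth means closer.
    The tuple layout is a hypothetical choice for this sketch.
    """
    depth = [[math.inf] * width for _ in range(height)]
    frame = [[None] * width for _ in range(height)]
    for x, y, z, colour in fragments:
        if z < depth[y][x]:          # depth test: keep only the closest fragment
            depth[y][x] = z
            frame[y][x] = colour
    return frame

# Two fragments compete for pixel (0, 0); the closer one wins regardless of order.
frame = rasterise_fragments(
    2, 1, [(0, 0, 5.0, "far"), (0, 0, 2.0, "near"), (1, 0, 3.0, "only")]
)
```

Because the comparison is made per pixel, primitives can be drawn in any order and the result stays the same.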
Surfaces are lit using two distinct types of models, lighting and shading. Lighting describes a material's nature and shading how the surface illumination is scattered.
Out of the existing lighting models, Lambertian [14] and Phong lighting [15] are the most used types, expressing diffuse and specular materials, respectively. On the other hand, flat shading and Phong shading [2] are two well-known shading models. The first scatters light per polygon and the second per pixel, as depicted in figure 2.2.
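As a rough sketch of the two lighting models named above, the Lambertian term is the cosine between the surface normal and the light direction, and the Phong specular term raises the cosine between the mirrored light direction and the view direction to a shininess power. The helper names and the argument conventions (vectors pointing away from the surface) are assumptions of this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalise(v):
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)

def lambert(normal, to_light):
    """Diffuse term: cosine of the angle between normal and light direction."""
    return max(0.0, dot(normalise(normal), normalise(to_light)))

def phong_specular(normal, to_light, to_eye, shininess):
    """Specular term: light direction mirrored about the normal, compared to the view."""
    n, l, v = normalise(normal), normalise(to_light), normalise(to_eye)
    n_dot_l = dot(n, l)
    if n_dot_l <= 0.0:               # light is behind the surface
        return 0.0
    r = tuple(2.0 * n_dot_l * nc - lc for nc, lc in zip(n, l))
    return max(0.0, dot(r, v)) ** shininess
```

Evaluating these terms once per polygon corresponds to flat shading, while evaluating them per pixel with an interpolated normal corresponds to Phong shading.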
Rasterisation has no direct way to draw said light effects and, instead, requires auxiliary algorithms to do so. Effects may require more than one algorithm or even more than one frame sample [11] to be fully displayed.
Figure 2.3: The same scene rendered without ambient occlusion, left, and with HBAO+, right, in The Witcher® 3: Wild Hunt. The effects are most noticeable in the gaps between the floor planks and inside the box at the back. HBAO+, Horizon Based Ambient Occlusion, is a type of ambient occlusion based on the depth image from the eye's point of view [12].
Figure 2.4: When the light source consists of an area, and not a point light, the shadow breaks down into two parts, the umbra, blue ground section, and the penumbra, yellow ground section. Original image from [29].
Figure 2.5: Ray-trace route example made from two hit mirror primitives. Here the path is formed by three segments, one hitting the sphere, other the box and a miss.
Shadows, for example, are built in two steps. The first step is either shadow mapping, shadow volumes, stencil solutions [33] or even ray-tracing to produce direct shadows, those created from a light source. The second step is Ambient Occlusion [13], an estimate of all present indirect shadows, the light absorbed by the surfaces. This algorithm further boosts the visual perception of curved surfaces and the frame's overall depth, as seen in figure 2.3.
Additionally, to be perfected, shadows need multi-sampling. Using only the first step once yields just the shadow's umbra. To obtain the penumbra, as seen in figure 2.4, it is necessary to acquire more samples.
2.1.2 Ray-Tracing
In general, the word ray-tracing is misused quite often. One example is Shirley's ray-tracing books, which feature a path-tracer, described in the next subsection. The confusion grows with the misconceptions about what it means to ray-trace.
To ray-trace is to pave a ray’s travel route within a scene by repeatedly casting rays for each route’s segment.
As described above, and shown in figure 2.5, a trace is an amalgamation of ray-casts, an outline of the means employed to cast new rays. A ray-cast, on the other hand, is the mechanism used to find the nearest primitive in a ray's path. It is the core of ray-tracing, path-tracing, and ray-casting [1], the predecessor of both tracing types.
In its simplest version, a ray-cast intersects a ray against all primitives in a scene. Due to this property, it becomes increasingly slow as a scene grows in size. Nevertheless, it can be sped up with the help of auxiliary acceleration structures such as bounding volume hierarchies, or BVHs [23].
Figure 2.6: An illustration of an axis oriented cuboid's points, regions and the cuboid itself.
7 While Baldwin and Weber's algorithm [17] is said to be faster, it is not as reliable as Möller-Trumbore's [9].
8 The lighting model types introduced in the rasterisation subsection, 2.1.1.
Figure 2.7: Mirror reflection. With the incident ray in red, the reflected ray in green, and the normal in blue.
Figure 2.8: Refraction. With the incident ray in red, the refracted ray in green, and the normal in blue.
Each ray-primitive intersection crosses a ray, a half-line with an origin, $O$, which travels in a single direction, $\vec{d}$, as per eq. (2.1), with a scene primitive and returns their closest shared point, if any.
$$P(t) = O + t\,\vec{d} \qquad (2.1)$$
The calculation method differs between primitive types, of which the most common are:
I.1 Axis oriented cuboids. Defined by two points, $V_0$ and $V_1$, which delimit a region along each axis, forming the cuboid where the regions cross, as seen in figure 2.6. Intersected if there is a point of the half-line inside all of the cuboid's regions at once.
I.2 Spheres. Defined by a single point, $C$, and a value, $r$, indicating the sphere's centre and radius respectively. Intersected if there is at least one solution to eq. (2.2), a combination of the sphere and ray geometric equations, as seen in [25].
$$t^2(\vec{d} \cdot \vec{d}) + 2t(\vec{d} \cdot \vec{CO}) + \vec{CO} \cdot \vec{CO} = r^2 \qquad (2.2)$$
I.3 Triangles. Defined by three points, $V_0$, $V_1$ and $V_2$, each designating one of the triangle's vertices. Möller-Trumbore's algorithm [9] is a fast and reliable method⁷ to calculate the intersection using eq. (2.3), where $(u, v)$ are the barycentric coordinates.
$$O + t\,\vec{d} = (1 - u - v)V_0 + uV_1 + vV_2 \qquad (2.3)$$
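The three intersection tests listed above can be sketched as follows. This is an illustrative version under stated assumptions, not the thesis's shader code: the function names are hypothetical, the cuboid test uses the common slab formulation and assumes $V_0 \le V_1$ on every axis, and the triangle test follows the usual Möller-Trumbore steps for eq. (2.3).

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def hit_cuboid(o, d, v0, v1):
    """Slab test: nearest t >= 0 inside all three axis regions, or None."""
    t_near, t_far = 0.0, math.inf
    for axis in range(3):
        if d[axis] == 0.0:                      # ray parallel to this slab
            if not (v0[axis] <= o[axis] <= v1[axis]):
                return None
            continue
        t0 = (v0[axis] - o[axis]) / d[axis]
        t1 = (v1[axis] - o[axis]) / d[axis]
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return None
    return t_near                               # 0.0 when the origin is inside

def hit_sphere(o, d, c, r):
    """Smallest non-negative root of eq. (2.2), or None."""
    co = sub(o, c)                              # vector from centre C to origin O
    a, b, k = dot(d, d), 2.0 * dot(d, co), dot(co, co) - r * r
    disc = b * b - 4.0 * a * k
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    for t in ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a)):
        if t >= 0.0:
            return t
    return None

def hit_triangle(o, d, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore: returns (t, u, v) solving eq. (2.3), or None."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(d, e2)
    det = dot(e1, p)
    if abs(det) < eps:                          # ray parallel to the triangle
        return None
    inv = 1.0 / det
    s = sub(o, v0)
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(d, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return (t, u, v) if t >= 0.0 else None
```

All three return the ray parameter $t$ of eq. (2.1), so the hit point can be recovered as $O + t\,\vec{d}$ and results from different primitive types can be compared directly.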
As said before, ray-casting is the origin of both tracing methods. In this renderer, images are created by shooting rays from a source, known as the camera's viewpoint, towards each image pixel to find the nearest objects with the use of ray-cast.
At each hit, the renderer generates new rays, known as shadow rays, in the direction of every scene light and checks whether each light is visible or not. In the end, it gathers all the info and, together with a lighting model⁸, shades the found surface.
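A compact sketch of this ray-casting step, assuming a scene made only of spheres, point lights, a Lambertian lighting model and unit-length ray directions; all names (`cast_and_shade`, `nearest_hit`) and the `t_min` offset used to avoid self-intersection are choices made for this example.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def scale(a, s):
    return tuple(x * s for x in a)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hit_sphere(o, d, centre, radius, t_min=1e-4):
    """Nearest t > t_min where o + t*d (d of unit length) hits the sphere."""
    co = sub(o, centre)
    b = 2.0 * dot(d, co)
    k = dot(co, co) - radius * radius
    disc = b * b - 4.0 * k
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    for t in ((-b - root) / 2.0, (-b + root) / 2.0):
        if t > t_min:
            return t
    return None

def nearest_hit(o, d, spheres):
    """Brute-force ray-cast: test every primitive, keep the closest hit."""
    best = None
    for centre, radius in spheres:
        t = hit_sphere(o, d, centre, radius)
        if t is not None and (best is None or t < best[0]):
            best = (t, centre)
    return best

def cast_and_shade(o, d, spheres, lights):
    """Shade the nearest surface with one shadow ray per light."""
    hit = nearest_hit(o, d, spheres)
    if hit is None:
        return 0.0
    t, centre = hit
    p = add(o, scale(d, t))
    n = scale(sub(p, centre), 1.0 / math.sqrt(dot(sub(p, centre), sub(p, centre))))
    brightness = 0.0
    for light in lights:
        to_light = sub(light, p)
        distance = math.sqrt(dot(to_light, to_light))
        l = scale(to_light, 1.0 / distance)
        blocker = nearest_hit(p, l, spheres)
        if blocker is not None and blocker[0] < distance:
            continue                            # light hidden: point is in shadow
        brightness += max(0.0, dot(n, l))       # Lambertian contribution
    return brightness
```

Each pixel costs one primary ray plus one shadow ray per light, which is what keeps this renderer cheap compared with the tracing variants described next.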
Ray-tracing expands this technique by applying ray-trace meth-ods on all hit specular surfaces. Doing so enables the render to depict accurate refraction and reflection.
Figure 2.9: Glossy, or fuzzy, reflection. With the incident ray in red, the reflected rays in green, the normal in blue, and the spread area in yellow.
Figure 2.10: A relation between ray-tracing, on the left, and path-tracing, on the right, when both types launch the same rays, coloured in blue, green, red and yellow, in identical circumstances. On the ray-tracing side, the purple striped lines show all shadow rays spawned from the ray hits. On the path-tracing side, the green ray shows just one of the possible traced paths.
9 Monte Carlo algorithms are processes which rely on repeated or continuous random sampling to draw the desired result.
In ray-trace, upon hitting a clear specular surface, i.e. a mirror, a new ray is spawned from the hit point along the reflection path, as per the law of reflection [10] and the normal of the surface, $\vec{n}$, figure 2.7 and eq. (2.4).
$$\cos\theta_{\vec{d}} = -(\vec{d} \cdot \vec{n}), \qquad \vec{d}_{new} = \vec{d} + 2\vec{n}\cos\theta_{\vec{d}} \qquad (2.4)$$
Likewise, upon intersecting a translucent surface, to obtain the object's refraction, a new ray is spawned as per Snell's law of refraction [10], figure 2.8 and eq. (2.5), using the refraction indices of the media on each side of the surface, $\eta_1$ and $\eta_2$. If the root is unattainable, the incident ray is past the critical angle: the refraction fails and a reflection ray is produced instead.
$$\sin^2\theta_{\vec{d}_{new}} = \left(\frac{\eta_1}{\eta_2}\right)^2 \left(1 - \cos^2\theta_{\vec{d}}\right), \qquad \vec{d}_{new} = \frac{\eta_1}{\eta_2}\vec{d} + \left(\frac{\eta_1}{\eta_2}\cos\theta_{\vec{d}} - \sqrt{1 - \sin^2\theta_{\vec{d}_{new}}}\right)\vec{n} \qquad (2.5)$$
Furthermore, to produce the light's Fresnel effect, the change of reflection strength with the viewer's perspective, refractive materials can apply Schlick's approximation [8], eq. (2.6), when deciding the ray type.
$$R_0 = \left(\frac{\eta_1 - \eta_2}{\eta_1 + \eta_2}\right)^2, \qquad R_{Schlick}(\theta_{\vec{d}}) = R_0 + (1 - R_0)(1 - \cos\theta_{\vec{d}})^5 \qquad (2.6)$$
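Equations (2.4) to (2.6) translate almost directly into code. The sketch below assumes unit-length $\vec{d}$ and $\vec{n}$, with $\vec{n}$ facing against the incident ray; the function names are hypothetical and `refract` signals a failed refraction by returning `None`.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def scale(a, s):
    return tuple(x * s for x in a)

def reflect(d, n):
    """Eq. (2.4): mirror the incident direction about the normal."""
    cos_i = -dot(d, n)
    return add(d, scale(n, 2.0 * cos_i))

def refract(d, n, eta1, eta2):
    """Eq. (2.5): refracted direction, or None past the critical angle."""
    ratio = eta1 / eta2
    cos_i = -dot(d, n)
    sin2_t = ratio * ratio * (1.0 - cos_i * cos_i)
    if sin2_t > 1.0:                 # root unattainable: total internal reflection
        return None
    return add(scale(d, ratio), scale(n, ratio * cos_i - math.sqrt(1.0 - sin2_t)))

def schlick(cos_i, eta1, eta2):
    """Eq. (2.6): approximate reflection strength for the incidence cosine."""
    r0 = ((eta1 - eta2) / (eta1 + eta2)) ** 2
    return r0 + (1.0 - r0) * (1.0 - cos_i) ** 5
```

A renderer would typically call `refract` first, fall back to `reflect` when it returns `None`, and otherwise use the value of `schlick` as the probability of choosing the reflected ray over the refracted one.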
Secondary rays, those spawned from either refraction or reflection, shade each object following the same technique as in ray-casting, and repeat the process until hitting a diffuse or emissive object or reaching a maximum bounce limit.
Effects, such as depth of field, glossy or fuzzy specular reflection and soft shadows, can be performed with the use of stochastic ray-tracing [5]. With it, some effects obtain various samples in a spread area, instead of one result, creating a blurred effect.
Of the listed examples, the first partially scatters pixel’s directions, the second effect spreads the direction of spawned rays, figure 2.9, and the third distributes shadow rays over the surface of each source light.
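As an illustration of the second effect, a glossy reflection can be approximated by jittering the mirror direction with a random offset. The sketch below uses one common formulation (a random point inside the unit sphere scaled by a `fuzz` factor), which is an assumption of this example rather than the method of [5]; samples that end up below the surface would normally be discarded by the caller.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalise(v):
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)

def mirror(d, n):
    """Mirror direction of unit vector d about unit normal n."""
    k = 2.0 * dot(d, n)
    return tuple(dc - k * nc for dc, nc in zip(d, n))

def random_in_unit_sphere(rng):
    """Rejection-sample a point strictly inside the unit sphere."""
    while True:
        p = tuple(rng.uniform(-1.0, 1.0) for _ in range(3))
        if dot(p, p) < 1.0:
            return p

def glossy_reflect(d, n, fuzz, rng):
    """Spread the mirror direction; fuzz in [0, 1) controls the spread area."""
    m = mirror(d, n)
    j = random_in_unit_sphere(rng)
    return normalise(tuple(mc + fuzz * jc for mc, jc in zip(m, j)))
```

With `fuzz` set to zero the result is the plain mirror direction; averaging many samples with a larger `fuzz` produces the blurred reflection of figure 2.9.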
2.1.3 Path-Tracing
Path-tracing [6] is a Monte Carlo approach⁹ to ray-tracing. It changes the ray-trace design to cover reflection on diffuse surfaces, and discards lighting models and shadow rays, as shown in figure 2.10.
Feature | Rasterisation | Ray-tracing | Path-tracing
Shading | With models. | With models. | Direct.
Direct shadows | External. | Internal. | Internal.
Indirect shadows | External. | External. | Internal.
Reflection | External. | Internal. | Internal.
Refraction | External. | Internal. | Internal.
Render speed | Fast to very fast. | Moderate. | Slow.
Implementation effort | Very high. | High. | Medium to low.
Table 2.1: A rundown of each renderer's shading, direct and indirect shadows, reflection and refraction, as well as its overall render speed and implementation effort. In the table, external and internal denote whether the renderer covers the trait through external methods or algorithms, or by itself. Speed and effort, on the other hand, are appraisals of each renderer inferred from its details.
Figure 2.11: Diffuse reflection. With the incident ray in red, the reflected rays in green, the normal in blue, and the spread area in yellow.
Figure 2.12: In bidirectional path-tracing, a camera ray is paired with a source light ray using shadow rays to obtain better illumination in shadings.
The reason why path-tracing is a Monte Carlo renderer comes from the fact that, due to their irregular traits, diffuse surfaces reflect incident rays randomly, as depicted in figure 2.11. Hence, as countless lighting contributions make up their illumination, many rays are needed to establish it.
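The need for many rays can be seen in a small Monte Carlo experiment. The sketch below estimates the light leaving a diffuse surface under a sky of known radiance by averaging random hemisphere directions. The uniform hemisphere sampling, the `albedo / pi` diffuse reflectance and all names are assumptions of this example; with a constant sky of radiance 1 the exact answer equals the albedo.

```python
import math
import random

def sample_hemisphere(rng):
    """Uniform random direction on the hemisphere around the normal (0, 0, 1)."""
    z = rng.random()
    phi = 2.0 * math.pi * rng.random()
    r = math.sqrt(1.0 - z * z)
    return (r * math.cos(phi), r * math.sin(phi), z)

def estimate_diffuse_radiance(albedo, sky_radiance, samples, rng):
    """Average of reflectance * incoming light * cosine / pdf over random rays."""
    pdf = 1.0 / (2.0 * math.pi)                 # uniform hemisphere density
    total = 0.0
    for _ in range(samples):
        w = sample_hemisphere(rng)
        cos_theta = w[2]
        total += (albedo / math.pi) * sky_radiance(w) * cos_theta / pdf
    return total / samples

rng = random.Random(1)
rough = estimate_diffuse_radiance(0.8, lambda w: 1.0, 16, rng)
fine = estimate_diffuse_radiance(0.8, lambda w: 1.0, 20000, rng)
```

The estimate with few samples is noisy, while the one with many samples settles close to the exact value, which mirrors the grainy-to-clean progression of a path-traced image.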
Nonetheless, there are quite a few benefits to path-tracing over ray-tracing. Due to diffuse reflections, the renderer already gathers all the info needed to produce both direct and indirect shadows, as well as the entire scene illumination.
Because of this, it can cast aside lighting models, shadow rays and ambient occlusion techniques. Furthermore, not relying on estimates makes path-tracing, out of all renderers, the one with the highest fidelity. It is, as well, the easiest to implement.
Path-tracing has a variant, bidirectional path-tracing [27], which fuses the shading info from rays launched from the light source with those from the camera.
To fuse both shadings, it employs a shadow ray to check whether both intersection points are visible to each other, as depicted in figure 2.12. If so, it updates the camera ray shading with the light's remaining intensity, increasing the overall scene illumination without the need for extra bounces per ray.
2.1.4 Overview
Table 2.1 compiles the details of the three renderers and shows each renderer's benefits and drawbacks. From there, it is easy to understand why path-tracing is a staple in assessing the quality of a renderer, or of an algorithm for that matter. It showcases scenes with the least amount of approximations in lighting, which makes it a good quality target.
Rasterisation and ray-tracing represent diffuse surfaces in the same manner. Neither gains anything over the other there. Hybrid renderers exploit this by initially using rasterisation and then ray-tracing to cover shadow, refraction and reflection. These kinds of renderers, in scenarios with many diffuse surfaces, offer huge speed-ups over plain ray-tracing with minimal to no losses.
OpenGL | Vulkan
Architected for graphics workstations with direct renderers and split memory. | Modern platform support including mobile with uniform memory and tiled rendering.
Driver does state validation, dependency tracking and error checking. | Only covers GPU control, with everything else as part of the user tasks.
Threading model does not support parallel command generation and execution. | Multi-threading support with parallel command generation and execution.
Built-in shader language compiler with GLSL only support. | SPIR-V as compiler target enables language flexibility and reliability.
Table 2.2: A summary of all Vulkan design changes from OpenGL, described in one of Vulkan's presentations, Graphics and Compute Belong Together [16].
Figure 2.13: Vulkan logo. Obtained from [31].
10 Vulkan further distinguishes itself by taking OpenGL's defining trait, its cross-platform capability, and extending it to cover both desktop and mobile.
11 For example, some complex application scenarios push OpenGL to a performance bottleneck on the CPU front. Parallelism, a non-supported feature, would fix the issue.
12 These issues, also verified in Mantle and Direct3D, led to similar changes in both APIs.
2.2 Modern graphics APIs
Graphics APIs are designed to help with GPU integration. They offer ways to ease development by abstracting underlying specifics. Examples are Direct3D, Metal, OpenGL and Vulkan, represented by figure 2.13.
Modern graphics APIs differ from previous versions by adopting a new paradigm and joining both graphics and general programming into a single API. Such is the case of Vulkan¹⁰ and of the latest versions of both Direct3D and Metal.
2.2.1 Modern vs old paradigm
Since their first release, OpenGL [30] and other graphics APIs kept their architecture unchanged. Throughout the years, they started to drift away from modern paradigms¹¹.
To cope with these issues, OpenGL's developers proposed a complete redesign, detailed in table 2.2, focusing on both parallelism and object-oriented support¹², which later came to be known as Vulkan [31]. The result was a fully versatile structure, employing numerous objects, able to execute all types of work, both graphics and generic, using the same API. However, it lost OpenGL's straightforwardness, which allowed the user to focus on the rendering and not on how to render.
By removing everything besides hardware control, newer APIs not only minimise their own work but also open up new room for improvement in tasks such as commands, context and memory. For example, users can record commands in advance on multiple threads and reuse them, eliminating a number of recordings from rendering calls.
13 Shaders are programmable sequences of instructions designed for the GPU.
Figure 2.14: SPIR logo. Obtained from [28].
14 Push constants hold as little as 128 to 256 bytes.
15 Changing the size of a fixed-size array or altering a renderer option are two examples.
Still, there are drawbacks to the removal of most of the driver's jobs. Even the most naive errors may end up in unexpected behaviour or even crash the system. To mitigate these cases, Vulkan offers validation layers, an optional component that provides additional debug information, validation and profiling. As it is, as the name implies, an extra layer between the app and the driver, it can be removed in a released version.
2.2.2 SPIR-V
Previous graphics APIs parsed shaders¹³ written in a human-readable syntax, such as the OpenGL Shading Language, GLSL, and the High-Level Shading Language, HLSL, during their execution. Vulkan changes this by requiring the provided shader code to be in the Standard Portable Intermediate Representation, SPIR, in its variant targeted at Vulkan, SPIR-V [28], represented by figure 2.14.
SPIR-V shaders internally use a set of resources to map their internal variables from one language to the other. The first resource in the set is the descriptor set layout, a resource type that defines a large variety of storage containers that can be used as input and output by the shader. Storage containers include:
• Storage buffer, image and texel buffers, a type of container that allows all kinds of operations.
• Uniform buffer and texel buffers, a type of container that only allows read operations and is usually faster than its storage counterparts, but has a limited size.
• Samplers and sampled images, an image resource that can be associated with additional properties such as filters.
• Input attachments, an image resource applied only in load operations within the fragment shaders.
The second resource type is push constants, data the shader cannot change, supplied with the shader call command itself. They hold a tiny data volume¹⁴ but permit frequent updates, even on successive calls of the same shader in one command buffer. The third and last type is the specialisation constant, a variable type used to perform on-the-run adjustments to the shader operations¹⁵.
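Because push constants are so small, it helps to check the size of the data block before recording the command. The sketch below packs a hypothetical block (a 4x4 camera matrix, four float options, a frame index and a sample count; none of these fields come from the thesis) and verifies that it fits in 128 bytes, the lower bound that Vulkan guarantees for `maxPushConstantsSize`.

```python
import struct

# Lower bound Vulkan guarantees for the maxPushConstantsSize device limit.
MIN_GUARANTEED_PUSH_CONSTANT_BYTES = 128

# Hypothetical layout: 16 floats (camera matrix), 4 floats (options), 2 uint32.
PUSH_LAYOUT = struct.Struct("<16f4f2I")

def pack_push_constants(camera, options, frame_index, sample_count):
    """Serialise the block and reject it if it exceeds the guaranteed size."""
    data = PUSH_LAYOUT.pack(*camera, *options, frame_index, sample_count)
    if len(data) > MIN_GUARANTEED_PUSH_CONSTANT_BYTES:
        raise ValueError("push constant block exceeds the guaranteed 128 bytes")
    return data

identity = [1.0, 0.0, 0.0, 0.0,
            0.0, 1.0, 0.0, 0.0,
            0.0, 0.0, 1.0, 0.0,
            0.0, 0.0, 0.0, 1.0]
block = pack_push_constants(identity, [0.5, 0.0, 0.0, 1.0], frame_index=3, sample_count=16)
```

Data that does not fit, such as scene geometry, belongs in a uniform or storage buffer bound through a descriptor set instead.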
The use of SPIR, or SPIR-V, as the shader format offers plenty of benefits to the application and driver:
• Native code is much simpler to interpret, is consistent and limits unexpected behaviours on different platforms.
• Shaders are cross-platform and support multiple source shading languages.
• Unlike GLSL and HLSL code, for example, there is no need to supply the source code with the app, just the binary.
16 Render passes are also included in the Framework.
Figure 2.15: A compute shader dispatch example of a 3 × 2 × 1 group of 2 × 2 × 1 shader workgroups in a 10 × 4 × 1 core grid.
2.2.3 Vulkan
To better explain Vulkan's capabilities, components, and how everything interconnects, the description uses a conceived abstraction. It separates all objects into four groups that define certain functionalities within the API, namely the Core, Framework, Resources and Command.
Core encircles Vulkan's object creation frame, a sequence of four objects, Instance, Surface, Physical and Logical Device, which draft the program's capabilities. They define the validation layers and extensions in use, create window connections and outline the GPU in use along with its functions.
The Framework embodies a direct relation to SPIR-V shaders. It defines, on the program’s side, the used pipelines along with their descriptor set layouts, push and specialisation constants.
A pipeline is a predefined sequence of tasks that operate over data within the GPU. Graphics, Compute and Ray-Trace, consti-tute the three defined types of Vulkan pipelines. The first two are part the Vulkan’s base features, where the Ray-Trace pipeline is an extension created by Nvidia, available to only a sub-set of GPUs.
The Graphics and Ray-Trace pipelines are sequences of tasks envisioned to render images with rasterisation and ray-tracing, respectively. Graphics pipelines are set to draw triangle-based meshes on a single command, whereas Ray-Trace pipelines build the ray-tracing sequence within them and trace ray paths on demand. They use an auxiliary structure, the render pass¹⁶, which helps with attachment management and command scheduling.
Nevertheless, they are not limited to rendering. Most tasks can be activated and deactivated, which allows users to employ them as they wish.
Compute pipelines, on the other hand, are entirely versatile and do not have a complex sequence attached. Instead of doing a set job on command, they use a grid dispatch system that lets the user do whatever they wish with the multiple GPU cores. Two parts make up the dispatch: the shader workgroup and the dispatch count. Shader workgroups define the number and pattern of parallel executions in one dispatched group. Dispatch groups, on the other hand, define the number and direction of the dispatched shader workgroups. Together they form a pattern similar to figure 2.15.
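The relation between the two dispatch parts can be illustrated with a small sketch. It is not renderer code: it only enumerates, in Python, the global invocation IDs a dispatch produces, assuming the usual rule that a global ID is the dispatch group ID times the workgroup size plus the local ID inside the workgroup. The function name and the tuple layout are illustrative assumptions.

```python
from itertools import product


def global_invocation_ids(group_count, workgroup_size):
    """Enumerate the global invocation IDs of a compute dispatch.

    group_count    -- dispatch groups per axis, e.g. (3, 2, 1)
    workgroup_size -- invocations per workgroup and axis, e.g. (2, 2, 1)
    Both names are hypothetical and only serve this illustration.
    """
    ids = []
    for group in product(*(range(n) for n in group_count)):
        for local in product(*(range(n) for n in workgroup_size)):
            # Per axis: global ID = group ID * workgroup size + local ID.
            ids.append(tuple(g * s + l
                             for g, s, l in zip(group, workgroup_size, local)))
    return ids


# The dispatch of figure 2.15: 3 x 2 x 1 groups of 2 x 2 x 1 invocations.
ids = global_invocation_ids((3, 2, 1), (2, 2, 1))
extent = tuple(max(i[axis] for i in ids) + 1 for axis in range(3))
```

With these numbers the dispatch covers a 6 × 4 × 1 block of invocations, which fits inside the 10 × 4 × 1 grid of the figure.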
17 Descriptor sets can be updated again as many times as needed, as long as they are not in use.
18 Graphics commands are associated with Graphics pipelines, compute with Compute, transfer with memory transfer operations, and sparse with sparse resource operations, a resource type that can be bound to multiple memories and descriptors at the same time.
Resources covers all objects related to the descriptor set instances, their containers and respective memories, as well as the swap-chain.
Descriptor sets are the link between the program and the SPIR-V shaders. They are instances of the shader descriptor set layouts that are bound and used in a shader call. Through an update process, or a re-update¹⁷, they associate the program containers with the existing inputs and outputs of the shader.
In Vulkan, memory splits into two types, host-visible and device-only. The first type can be directly accessed, read and written by the host program. The second type, as the name implies, is exclusive to the device and cannot be accessed directly by the host. To reach it, the host needs a host-visible memory to serve as the middleman between itself and the device-only memory.
The swap-chain is a dedicated display structure associated with a surface. It holds a set of images in the desired format and has a specified presentation system that can use one of the following modes [20]:
• Immediate mode replaces the displayed image the moment a new one is ready. As there is no waiting involved, it may induce screen tearing.
• FIFO mode has an internal queue used to display images in sync with the blank periods (v-sync), always in the same order. It has an alternative, FIFO relaxed, which only uses v-sync when the image generation rate is higher than the screen refresh rate. If the generation rate is lower than the refresh rate, it may induce screen tearing.
• Mailbox mode uses a single internal slot to hold an image, replaced at every newly produced image. Just like FIFO, it has v-sync but ensures the displayed image is always the latest.
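As an illustration of how a program might choose between these modes, the sketch below picks one from the list a surface reports. The strings mirror the Vulkan enumerant names, but the selection policy (prefer Mailbox, otherwise fall back to FIFO) is an assumption made for this example and is not taken from the renderer. The fallback relies on FIFO being the one mode every implementation must support.

```python
FIFO = "VK_PRESENT_MODE_FIFO_KHR"
MAILBOX = "VK_PRESENT_MODE_MAILBOX_KHR"
IMMEDIATE = "VK_PRESENT_MODE_IMMEDIATE_KHR"


def choose_present_mode(available, allow_tearing=False):
    """Pick a presentation mode from those reported for the surface.

    Hypothetical policy: latest-image v-sync (Mailbox) first, then
    Immediate only if tearing is acceptable, then FIFO as the fallback.
    """
    if MAILBOX in available:
        return MAILBOX
    if allow_tearing and IMMEDIATE in available:
        return IMMEDIATE
    return FIFO


mode = choose_present_mode([FIFO, IMMEDIATE])
```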
The last of the four groups is Command. It shapes the GPU control sequence (commands and queues) and its internal and external synchronisation (barriers and sync objects).
Commands are the operations to be executed on the GPU. They divide into four categories, graphics, compute, transfer and sparse, defined by the related operation type¹⁸, and are recorded into buffers that are later submitted to queues.
Queues are the internal GPU work schedulers. They accept the submitted command buffers and ensure that, within a queue, execution follows the order of the submissions and of the sequence inside each submission.
One device may have multiple queues, which execute submitted commands in parallel. Queues are associated in groups, or families, with similar characteristics, which define the command types a queue accepts.
Figure 2.16: An example of an execution barrier, placed after the 3rd command of a 5-command buffer.
19 CUDA and OpenCL are two dedicated general-purpose GPU (GPGPU) programming APIs. Vulkan, on the other hand, does both GPGPU and dedicated rendering work.
A command buffer may hold more than one command. These commands start in the recorded order but do not necessarily execute one after the other.
A specific command exists to ensure that commands in mid-execution finish before a new one starts: the barrier, or execution barrier. It creates a point in the execution time where the queue halts the command buffer and waits for the respective ongoing commands to finish, as seen in figure 2.16.
Lastly, there are the sync objects, split into Fences and Semaphores. Both do the same task, signalling that work has finished, yet on different levels. Fences perform host-device sync, whereas Semaphores perform sync within the device, between queue submissions.
2.3 Related Work
In ‘Evaluation of Multi-Threading in Vulkan’ [18], Blackert assessed Vulkan’s multi-threading by making a direct comparison to OpenGL. Although his thesis concludes on the benefit of multi-threading and its relation to command buffers, it gives no details on where exactly it applies.
In ‘VComputeBench: A Vulkan Benchmark Suite for GPGPU on Mobile and Embedded GPUs’ [22], Mammeri and Juurlink presented an average speed-up, in Vulkan’s favour, of 1.53× and 1.66× against CUDA and OpenCL¹⁹, respectively. They also drew a couple of guidelines to help future users, including:
• The use of memory barriers to synchronise iterative algorithms within a command buffer and the preference of push constants over bindings.
• The reduction of scattered CPU accesses to GPU memory and the request of large memory transfers on the dedicated transfer queues.
However, there are no details on the patterns’ influence on the overall program’s performance.
Close to the end of this thesis, Lavrič published his master’s thesis, ‘Volumetric path-tracing using the Vulkan API’ [32], also featuring a path-tracer built with Vulkan. It differs from this one in two details: the use of the latest ray-tracing pipelines and the focus on the renderer’s performance. In this thesis, by contrast, the focus is the backend’s performance and a path-tracer using compute pipelines. He also proposes an abstraction over Vulkan’s objects that can be employed with the one introduced here.
20 The features include versatile two-sided primitives, rotation, translation and scaling, model loading, multiple light sources, and movable cameras.
21 While a translucent emissive material can be controversial, there is a logic behind the design. Flames, for example, emit light and still display translucency in some areas. This design tries to cover these cases.
3 Path-Tracer
This chapter first gives an overview of all the renderer’s capabilities in section 3.1. Section 3.2 describes how all resources are laid out within the shader and the device’s memory. Finally, section 3.3 gives details on the rendering process itself.
3.1 Overview
The developed renderer is built using only compute pipelines, in a context where dedicated ray-tracing is not available. It is a simplified and modified version of the path-tracer from Shirley’s books Ray Tracing in One Weekend [25] and Ray Tracing: The Next Week [26].
The renderer, while focused on the backend, is still capable of displaying a scene with a decent set of features²⁰. It supports three types of primitive, three types of material, transformations and a movable camera with depth of field.
The primitives are axis-oriented cuboids, spheres and triangles, and the materials are diffuse, specular and emissive. The latter two materials may be translucent or not²¹. It is also possible to form a single entity, made of multiple primitives, with a collective transform and material.
While Ray Tracing: The Next Week [26] covers BVHs, the developed path-tracer does not. Although BVHs are crucial for general renderer performance, they have no direct implications for the Vulkan backend, and so were not a critical feature. The same applies to motion blur.
Additionally, the renderer has adjustable window size and type, ray depth, sample number per frame (anti-aliasing), and maximum and minimum ray distance. There are also three options targeting the benchmarks: number of program threads, memory cohesion and frame time output.
3.1.1 Pipeline
The renderer employs a composed pipeline with a sequence of six distinct compute pipelines, where each shader has a precise duty. It is an upgrade of an earlier version, described in appendix A, built to correct some of its issues.
Figure 3.2: A diagram of an arbitrary command schedule, for any number of bounces and samples. Each double-lined box indicates a type of command buffer and shader group, see figure 3.1. Nodes indicate shader dispatches from the respective pipeline groups: large green nodes are ray-gen shaders, and small green nodes are intersect or colour and scatter shaders. Bars represent execution barriers, in yellow, or layout transitions, in blue.
22 In Vulkan, images have an internal layout that can be changed depending on the targeted purpose. The image layout must be “General” within shaders and “Present” to be displayed.
Figure 3.1: Renderer’s pipeline structure. Shader colours denote their task type, where ray tasks are green, image tasks are blue, scene tasks are red, and sample tasks are purple. Solid arrows specify the full sequence and dashed arrows the secondary paths. Lastly, the surrounding boxes indicate task groups.
The full pipeline is envisioned to acquire, render and present a frame in one interlinked process, as seen in figure 3.1.
The small rounded boxes outline the beginning and end of the process, comprising the obligatory image layout transitions²² necessary for the shader operations and the frame presentation. In contrast, each sharp box represents one shader and, consequently, one pipeline.
Pre-process and vertex are shaders that clear the samples for a new frame and update all primitives in the scene with their transforms, respectively. Ray-gen, intersect, and colour and scatter spawn the camera rays, calculate the ray-casts, and determine both the surface’s shading and the spawned ray, respectively. Finally, post-process updates the frame with the rendered image.
3.1.2 Commands
In figure 3.1, there are three defined groups of shaders and transitions, named pre-processing, sampling and post-processing. Those groups are associated with the commands used by the renderer in a rendering call.
A rendering call is specified by at least three command buffers, one for each group, scheduled in the pipeline’s order. The number of command buffers increases by one sampling command for each sample requested by the renderer’s anti-aliasing setting.
The first command buffer type, pre-processing, records a layout transition and, whenever there is a scene update, dispatches for both pre-process and vertex. The second type, sampling, records one ray-gen dispatch followed, for each allowed ray bounce, by a chain of execution barrier, intersect dispatch, execution barrier, and colour and scatter dispatch. The third and last type, post-processing, records a post-process dispatch and the final layout transition for the frame presentation.
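The recording order described above can be written down as a short sketch. It models each command buffer as a plain list of labels rather than as real Vulkan calls, so every name in it (the function, the label strings) is a hypothetical stand-in chosen for this illustration.

```python
def record_render_call(samples, bounces, scene_updated):
    """Return the command buffers of one rendering call, in submission order.

    Each buffer is modelled as a list of labels; no Vulkan call is made.
    """
    # Pre-processing: layout transition, plus two dispatches on a scene update.
    pre = ["transition:general"]
    if scene_updated:
        pre += ["dispatch:pre-process", "dispatch:vertex"]

    # Sampling: one ray-gen dispatch, then one barrier-separated
    # intersect / colour-and-scatter pair per allowed bounce.
    sampling = ["dispatch:ray-gen"]
    for _ in range(bounces):
        sampling += ["barrier", "dispatch:intersect",
                     "barrier", "dispatch:colour-scatter"]

    # Post-processing: final dispatch and the transition for presentation.
    post = ["dispatch:post-process", "transition:present"]

    # One sampling buffer per requested sample.
    return [pre] + [list(sampling) for _ in range(samples)] + [post]


buffers = record_render_call(samples=12, bounces=12, scene_updated=True)
```

For 12 samples the call holds 14 command buffers, and each sampling buffer holds 1 + 4 × 12 = 49 recorded entries.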
Put together, they form the sequence seen in figure 3.2, where every command waits for the completion of the one immediately before.
23 Some tutorials, like Overvoorde’s Vulkan Tutorial [24], use pre-recorded commands and therefore create a descriptor set for every possible combination of resources.
24 The specification only mentions the possibility of a performance change, and not if it increases or decreases.
25 Only one image is being rendered at any given time.
3.2 Resources
The layout of the renderer’s resources took a few aspects into account: each resource’s descriptor set update frequency, its buffer type and host access, and lastly, its relation to other resources.
3.2.1 Arrangement criteria
As the renderer records the commands at every call, there is no need to create all possible descriptor variations²³. Instead, the renderer can update the necessary descriptors and bind them again in the same command layout. Moreover, some resources are immutable, like the scene: unlike the frame being rendered, which is updated frequently, they are independent of it and require only one update in the renderer’s lifetime.
And so, with these two criteria in mind, the renderer uses only one descriptor set per layout and splits updated descriptors from immutable ones.
Buffers have two criteria in their creation: the buffer type, storage or uniform, and the memory type, with host access or device-only. The buffer type depends solely on the resource size and usage, read-only or not. Therefore, if a resource is small and read-only, its buffer is uniform; otherwise, it is storage.
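The buffer type rule can be expressed as a small decision function. The text does not say what counts as "small", so the 16384-byte threshold below is an assumption: it is the minimum value the Vulkan specification requires for the `maxUniformBufferRange` device limit, and a real program would query the actual limit of its device instead. The function name is likewise hypothetical.

```python
# Assumed threshold in bytes; a real program would read the device limit.
ASSUMED_UNIFORM_LIMIT = 16384


def choose_buffer_type(size_bytes, read_only, uniform_limit=ASSUMED_UNIFORM_LIMIT):
    """Small and read-only resources go to uniform buffers, the rest to storage."""
    if read_only and size_bytes <= uniform_limit:
        return "uniform"
    return "storage"


camera_type = choose_buffer_type(size_bytes=128, read_only=True)
rays_type = choose_buffer_type(size_bytes=1920 * 1080 * 64, read_only=False)
```

The two sizes in the usage lines are made up for the example; they only stand for a small read-only resource and a large shader-written one.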
Defining the memory type is not as straightforward. Citing the specification, ‘Device local memory may have different performance characteristics than host local memory,(...)’ (Khronos® Group Vulkan Specification [31]), which means there might²⁴ be benefits in preferring device-only memory. However, considering the transfer operations necessary for the host to access device-only memory, this type of memory is most desirable for shader-only storage. Hence, the renderer limits device-only memory to those types of buffers.
The last aspect was the resource relation. It serves not only to organise the resources but also to ensure all the information of a group is present with one shader binding, even when there are multiple buffers.
3.2.2 Layout
The renderer has a Scene, Camera, Rays, presenting frames (Images) and Settings as resources. Under the renderer’s conditions²⁵, only the Images require a constant update. Here is a breakdown of all resources:
Scene Encompasses the vertices, transforms, materials and primitives. Changes over time and has an update under the vertex shader.
Camera Ray generation elements. It has fixed resources through each execution.
Rays Spawned rays definition, intersection and pixel colour accumulation. A shader-only resource which continually changes over time.
Image The acquired frame. It is used only in the pre and post-processing commands.
Settings Bounces and samples in a render call, as well as the maximum and minimum ray length and the number of primitives. It is immutable throughout the renderings.
And so, camera and settings use uniform buffers, and the scene uses storage buffers, all in host visible memory. Rays, on the other hand, use storage buffers in device-only memory, as described by figure 3.3.
Figure 3.3: Path-tracer buffer memory organisation. Blue blocks are memory allocations, green blocks buffers and yellow blocks descriptor layouts.
Like this, it is possible to bind only the used resources in every shader, as seen in figure 3.4, and limit the updates to the Image resource in every rendering call.
3.3 Rendering call
There are three tasks in a rendering call: image acquisition, update and command submission.
In the image acquisition, the program requests an image from the swap-chain. Upon acquiring the image index, it moves to the next task, where it updates the Image descriptor to the acquired image and applies any changes to either the scene or camera. After updating everything that needs it, the renderer dispatches a thread for each command in the submission, and submits everything to the queue, together with a presentation request at the end of the submission.
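The thread-per-command step can be sketched with a thread pool. This is a structural illustration only: `record` is a hypothetical stand-in for recording one command buffer, and the thread limit corresponds to the benchmark's thread option. In a real Vulkan program each recording thread would also need its own command pool, since command pools must not be used from several threads at once.

```python
from concurrent.futures import ThreadPoolExecutor


def record(name):
    """Stand-in for recording one command buffer; returns a label for it."""
    return f"recorded:{name}"


def render_call(samples, max_threads):
    """Record every command of one rendering call, one task per command."""
    names = (["pre-processing"]
             + [f"sampling-{i}" for i in range(samples)]
             + ["post-processing"])
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map() yields results in input order, so the submission order
        # follows the pipeline order whichever thread finishes first.
        recorded = list(pool.map(record, names))
    return recorded  # would be submitted to the queue in this order


submission = render_call(samples=2, max_threads=8)
```

Keeping the result in input order is the design point: recording may happen in parallel, but the submitted sequence must still match the pipeline.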
With the submission, the program’s job ends and the device’s work starts, described in the following subsections.
3.3.1 Pre-processing
Both the update task and this command are directly linked. When there are no scene or camera updates, the buffer contains only the layout transition of the acquired image. With updates, two dispatches follow the layout transition.
Figure 3.4: Descriptor set usage per shader. Lines symbolise bindings. The Settings descriptor is used in every shader but was left out to keep the others clear.
26 Shader dispatch design used in all shaders besides vertex. It translates the shader global invocation IDs to pixel coordinates while ignoring any ID outside the image. A global invocation ID is a combination of both the shader workgroup and dispatch group IDs.
27 Or, in the background’s terms, the camera ray generation, and a ray-cast and ray-trace loop.
28 Increasing the focus increases the viewport dimensions, and vice-versa.
29 The vertical field of view defines the vertical angle of the camera’s captured image.
The first dispatch is the pre-process shader. Dispatched with the image dimensions²⁶, it clears all the accumulated pixel samples of every ray in the Rays resource. Then there is the vertex shader. It traverses all primitives and replaces them with their respective transformed versions, limiting the transform operations to this shader instead of repeating them on every ray-cast.
As there is no execution barrier between the dispatches, they can execute concurrently.
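Dispatching "with the image dimensions" and ignoring IDs outside the image, as the margin note on the dispatch design describes, can be sketched as follows. The 16 × 16 workgroup size is an assumption made for the example, as the text does not state the size the renderer uses, and both function names are hypothetical.

```python
def dispatch_group_count(width, height, local_x, local_y):
    """Number of dispatch groups needed to cover a width x height image."""
    # Ceiling division, so partial tiles at the right and bottom are covered.
    return (-(-width // local_x), -(-height // local_y), 1)


def pixel_for_invocation(global_id, width, height):
    """Map a global invocation ID to a pixel, or None if it is off-image."""
    x, y = global_id[0], global_id[1]
    if x >= width or y >= height:
        return None  # the shader would simply return here
    return (x, y)


groups = dispatch_group_count(1920, 1080, 16, 16)
```

With the assumed workgroup size, the 1080-pixel height is not a multiple of 16, so the last row of groups reaches past the image and the bounds check becomes necessary.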
3.3.2 Sampling
Every Sampling command is, in practice, one run of the renderer. The command structure is a ray-gen dispatch, followed by an intersect and colour and scatter loop until the ray reaches its allowed bounces²⁷. When updating the camera, the program rewrites it in a different representation, called the ray-launcher, using the method of Shirley’s Ray Tracing in One Weekend [25].
The camera’s look-from, look-at, vertical orientation, focus²⁸ and vertical field-of-view²⁹ are converted into a definition of the viewport made of its top-left corner, orientation unit vectors, and horizontal and vertical lengths. The conversion also forwards the look-from to the ray-launcher origin and the aperture to the launcher’s lens dimension, as depicted in figure 3.5.
Figure 3.5: A visual illustration of the camera and ray-launcher elements, in orange and green, respectively, representing the same viewport. It also shows a spawned ray example.
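The camera conversion can be sketched along the lines of the camera in Ray Tracing in One Weekend [25]. This is an approximation under stated assumptions rather than the renderer's code: the dictionary keys, the function names and the explicit aspect-ratio parameter are hypothetical, and storing the lens as a radius (half the aperture) follows Shirley's book.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def scale(a, s):
    return tuple(x * s for x in a)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a):
    return scale(a, 1.0 / math.sqrt(sum(x * x for x in a)))


def make_ray_launcher(look_from, look_at, up, vfov_deg, aspect, focus, aperture):
    """Convert camera parameters into a viewport-based ray-launcher."""
    # Half extents of the viewport at the focus distance.
    half_h = math.tan(math.radians(vfov_deg) / 2.0) * focus
    half_w = aspect * half_h
    w = unit(sub(look_from, look_at))  # points from the scene to the camera
    u = unit(cross(up, w))             # viewport horizontal direction
    v = cross(w, u)                    # viewport vertical direction (up)
    centre = sub(look_from, scale(w, focus))
    top_left = add(sub(centre, scale(u, half_w)), scale(v, half_h))
    return {
        "origin": look_from,
        "top_left": top_left,
        "right": u,
        "down": scale(v, -1.0),
        "width": 2.0 * half_w,
        "height": 2.0 * half_h,
        "lens_radius": aperture / 2.0,
    }


launcher = make_ray_launcher((0.0, 0.0, 0.0), (0.0, 0.0, -1.0), (0.0, 1.0, 0.0),
                             vfov_deg=90.0, aspect=2.0, focus=1.0, aperture=0.0)
```

Because the half extents are multiplied by the focus, a larger focus yields a larger viewport, in line with the margin note on focus.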
The ray-gen shader picks up the ray-launcher and spawns a ray into every pixel in the image while using a random seed to randomise the ray’s origin inside the ray-launcher lens. To prevent any unexpected conditions in the subsequent shaders, it treats zero-length rays as misses.
30 Increased transparency with lower alphas while maintaining the material’s base colour.
After generating the rays, the program enters an intersect, and colour and scatter shader cycle.
Intersect shaders perform the ray-cast mechanism on all spawned, non-miss rays. Colour and scatter shaders, alternatively, perform two tasks. The first is the absorption of the intersected rays by the intersected materials. The second is to spawn the secondary rays as part of the path-trace, or to set the ray as a miss.
The first task is ray absorption, an update of the ray albedo, the colour intensity carried by the ray, usually conducted according to eq. (3.1).
$A_f = A_i \times \alpha C$ (3.1)
Where $A_i$ and $A_f$ are the ray’s initial and final albedo, $C$ is the colour of the intersected material or background, and $\alpha$ is the material’s alpha value, which is ignored for a solid material intersection or a background.
If it is a miss, it updates the ray albedo with the background and adds the result to the accumulated samples. If it is a hit, then it depends on the material’s type.
Solid and emissive materials, either diffuse or specular, update the ray’s albedo with their colour as described. Translucency, however, is different. Linked with specular or emissive materials with alphas below 1.0, it uses a different absorption strategy, as per eq. (3.2), where lower alphas increase the whiteness of the material and produce the desired effect³⁰.
$A_f = A_i \left(1 - \alpha (1 - C)\right)$ (3.2)
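Both absorption rules can be written per colour channel, as in the sketch below. The function names are hypothetical, and treating an ignored alpha as 1.0 in eq. (3.1) is an assumption made for the example.

```python
def absorb(albedo, colour, alpha=1.0):
    """Eq. (3.1): A_f = A_i * alpha * C, applied per channel.

    An ignored alpha (solid material or background) is taken as 1.0 here.
    """
    return tuple(a * alpha * c for a, c in zip(albedo, colour))


def absorb_translucent(albedo, colour, alpha):
    """Eq. (3.2): A_f = A_i * (1 - alpha * (1 - C)), applied per channel."""
    return tuple(a * (1.0 - alpha * (1.0 - c)) for a, c in zip(albedo, colour))


white = (1.0, 1.0, 1.0)
blue = (0.2, 0.4, 0.8)
fully_clear = absorb_translucent(white, blue, alpha=0.0)
half_clear = absorb_translucent(white, blue, alpha=0.5)
opaque = absorb_translucent(white, blue, alpha=1.0)
```

At alpha 0 the ray keeps its albedo unchanged, at alpha 1 eq. (3.2) reduces to a plain multiplication by the colour, and values in between move the result towards white while keeping the base hue.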
The second task is to ray-trace and update the ray’s details with the new secondary ray, using both the reflection and refraction laws, in line with the material’s type and transparency ratio, or to tag the ray as a miss where applicable.
3.3.3 Post-processing
The last command is remarkably simple. It updates the frame with the average of all accumulated samples and updates the image layout for its presentation.
31 In this renderer, it was the pipeline arrangement, resource dispersion, memory layout, command scheduling, and threading. Some required multiple revisions and still have room for improvements, later discussed in future work.
32 Allocating and binding separate memories to buffers is pretty straightforward. Sharing memories, however, requires precise control of the buffer’s memory alignment, offset and size.
4 Benchmark
This chapter details the benchmark process. Section 4.1 introduces the tested patterns, their changes and objectives, and a breakdown of the test method. Section 4.2 describes the obtained test results and their analysis.
4.1 Overview
During the development of a Vulkan application, no matter the kind, it is easy to encounter multiple possible solutions to certain aspects³¹.
Solutions are easily replicable in similar situations, forming approach patterns for these aspects. For example, it is possible to allocate memory for each resource, unite related resources, or even use a global allocation for all resources.
Some of these are simple to make, like the first memory example, and others require a couple of extra steps, like the other two³². Depending on the situation, those steps may not be worth it, but to know, the user must have an idea of the underlying performance implications.
To obtain a good benchmark of the patterns’ performance, the tested variations do not change the program’s structure. Instead, they make the minimum possible changes, to prevent the program from clouding the results.
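The extra steps that shared memory requires (control of alignment, offset and size) come down to computing an aligned offset for each buffer inside one allocation. The sketch below shows that arithmetic only. In a Vulkan program the size and alignment of each buffer would come from `vkGetBufferMemoryRequirements`; the numbers used here are made up for the example, and the function names are hypothetical.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def layout_buffers(requirements):
    """Place buffers back to back in one allocation.

    requirements -- list of (size, alignment) pairs, one per buffer
    Returns the offset of every buffer and the total allocation size.
    """
    offsets = []
    cursor = 0
    for size, alignment in requirements:
        cursor = align_up(cursor, alignment)  # respect the buffer's alignment
        offsets.append(cursor)
        cursor += size
    return offsets, cursor


# Three buffers with made-up sizes and alignments, in bytes.
offsets, total = layout_buffers([(100, 16), (40, 256), (8, 64)])
```

An allocation per buffer needs none of this bookkeeping, which is why the split arrangement is the simpler one to implement.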
4.1.1 Tested patterns
The thesis tests eight possible solutions, built from two pattern variations each for memory, scheduling and threading, as follows:
T.1 Memory tests change the memory arrangement from partial cohesion to fully split memory. This is done by slicing the Rays and Scene descriptor set memory, from the layout seen in figure 3.3, into one allocation per buffer.
They evaluate the process’s performance when either one or multiple memories are involved.
T.2 Tests in command scheduling increase the number of commands created and scheduled in one batch by increasing the renderer’s anti-aliasing from 1 to 12.
Single-sample  Single-thread cohesive  Single-thread split  Multi-thread cohesive  Multi-thread split
Samples        1                       1                    1                      1
Threads        1                       1                    8                      8
Split Memory   No                      Yes                  No                     Yes

Multi-sample   Single-thread cohesive  Single-thread split  Multi-thread cohesive  Multi-thread split
Samples        12                      12                   12                     12
Threads        1                       1                    8                      8
Split Memory   No                      Yes                  No                     Yes
Table 4.1: All single and multi-sample tests and their corresponding names, used in the result tables.
33 Possible in virtue of the internal command schedule, as every task is sequential.
These tests have two objectives. The first relates to threading, explained in the next point. The second is to estimate the time of one sample by averaging the total render time, as it cannot be measured directly³³.
T.3 Threading tests alter the allowed program threads between 1, single-thread, and 8, multi-thread.
They evaluate the benefits of multi-threading in on-the-fly command recording. As such, and due to the program design, they are linked to the number of commands used in each batch. Together, the tests form the organisation described in table 4.1, which lists all possible test combinations.
4.1.2 Method
The benchmark employs a fixed two-part test method for all performed tests, in which all submission times are measured, on a system with the following specifications:
CPU: Intel Core i7-4700HQ 2.40GHz; GPU: Nvidia GeForce GTX 860M; RAM: 16 GB.
In the first part, it renders a freshly updated scene 10 times, retrieving the first frame’s time and the average of all submissions. This part serves not only to measure the single frame that contains the pre-processing shader dispatches, but also to set apart some irregularities observed in the first few frames.
In the second part, it renders the same scene an additional 60 times, but without any update. This set of renders is stable and serves as a good performance measure between the tests that do not involve the pre-processing shaders.
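The two-part method can be outlined as a small timing harness. It is a sketch under assumptions, not the benchmark's actual code: `render_frame` and `update_scene` are hypothetical callables supplied by the caller, the timer is Python's `time.perf_counter`, and the population standard deviation is used because the text does not say which variant was applied.

```python
import statistics
import time


def run_benchmark(render_frame, update_scene, fresh_runs=10, stable_runs=60):
    """Time one scene update followed by fresh_runs + stable_runs frames."""

    def timed():
        start = time.perf_counter()
        render_frame()
        return (time.perf_counter() - start) * 1000.0  # milliseconds

    update_scene()                                  # only the first frame sees it
    fresh = [timed() for _ in range(fresh_runs)]    # first part
    stable = [timed() for _ in range(stable_runs)]  # second part, no update
    return {
        "first_frame": fresh[0],
        "fresh_mean": statistics.mean(fresh),
        "fresh_std": statistics.pstdev(fresh),
        "stable_mean": statistics.mean(stable),
        "stable_std": statistics.pstdev(stable),
    }


frames, updates = [], []
results = run_benchmark(lambda: frames.append(1), lambda: updates.append(1))
```

The usage lines pass two trivial callables that only count calls, to show the shape of the result without a renderer attached.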
Figure 4.1: Benchmark scene. It features one grey fuzzy specular sphere, one blue translucent twelve-triangle mesh cube, and seven axis-oriented cuboids, of which five are diffuse (walls), one is fully translucent specular (lamp cover), and the last one is emissive (lamp).
Type Quantity
Cuboids ×7
Spheres ×1
Triangles ×12
Table 4.2: The number of primitives, by type, in the used test scene. Cuboids stand for axis-oriented cuboids.
34 The previous renderer used the Vulkan’s debug FPS counter to measure the frame times. Using frames per second was both inconsistent and imprecise, thus changed in the new version.
All tests render the same scene to a 1920x1080 fullscreen window with up to 12 bounces. The scene is a Cornell box with 20 primitives in total, described in table 4.2 and seen in figure 4.1, with two light sources: one ceiling lamp and another from the missed rays.
The miss rays, those that leave the scene through the missing wall on the camera’s side, contribute to the overall illumination due to a renderer design where the background counts as ambient light.
4.2 Tests
Before presenting the results, it is worth remembering that this path-tracer is a redesign of a previous version, described in appendix A, which had considerably worse performance.
In a scene similar to figure A.1, the new version obtained a performance improvement of 8x. However, because the previous version was measured in FPS³⁴ and not in milliseconds, the comparison tests do not operate under the same circumstances and are therefore not shown.
4.2.1 Results
Tables 4.3, 4.4 and 4.5 display the results of all test runs after a small treatment. They show the first frame time, and both the average and standard deviation for the 10-frame and 60-frame runs.
Single-sample                 Single-thread cohesive  Single-thread split  Multi-thread cohesive  Multi-thread split
First frame
  t (ms)                      235.832                 322.726              293.860                362.723
10 runs with fresh scene
  t̄ (ms)                      213.317                 191.893              210.245                200.385
  σ (ms)                      34.637                  65.775               48.042                 65.627
60 runs without scene update
  t̄ (ms)                      194.374                 197.070              192.511                185.313
  σ (ms)                      41.221                  39.091               39.689                 37.991
Table 4.3: Results from the single-sample tests. At a glance, the first frame (which, in contrast to all others, includes the pre-process and vertex shader dispatches) takes considerably longer with the split memory solution than with the cohesive one. In the 60-frame run, with the same memory type, multi-threaded solutions are a bit faster. The first part times were too inconsistent, and the rest remains inconclusive.
Multi-sample                  Single-thread cohesive  Single-thread split  Multi-thread cohesive  Multi-thread split
First frame
  t (ms)                      1983.160                2351.430             2288.380               2410.700
10 runs with fresh scene
  t̄ (ms)                      2209.609                2285.485             2316.922               2273.592
  σ (ms)                      122.975                 127.072              137.874                154.999
60 runs without scene update
  t̄ (ms)                      2293.231                2297.726             2342.429               2302.966
  σ (ms)                      143.441                 140.580              125.719                124.245
Table 4.4: Results from the multi-sample tests. At a glance, there is a time increase from cohesive to split memory in the first frame. In the 60-frame run, there is a slight slowdown from single to multi-thread. The first part times were too inconsistent, and the rest remains inconclusive.
                              Single-thread cohesive  Single-thread split  Multi-thread cohesive  Multi-thread split
10 runs with fresh scene
  Single t̄ (ms)               213.317                 191.893              210.245                200.385
  Multi t̄ (ms)                184.134                 190.457              193.077                189.466
60 runs without scene update
  Single t̄ (ms)               194.374                 197.070              192.511                185.313
  Multi t̄ (ms)                191.103                 191.477              195.202                191.914
Table 4.5: Single and multi-sample results normalised to the time of one sample. In the second part, single-thread solutions show a speed-up both from split to cohesive memory and from single-sample to multi-sample, whereas multi-thread solutions show the opposite. From single-thread to multi-thread there is an improvement in single-sample and a slowdown in multi-sample.
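The normalisation behind table 4.5 is a division of the multi-sample averages of table 4.4 by the 12 samples of each frame. The short check below reproduces it; the variable names are illustrative, and the values are copied from the two tables.

```python
SAMPLES = 12

# Multi-sample averages from table 4.4, in ms. Column order: single-thread
# cohesive, single-thread split, multi-thread cohesive, multi-thread split.
multi_sample_ms = {
    "fresh": (2209.609, 2285.485, 2316.922, 2273.592),
    "stable": (2293.231, 2297.726, 2342.429, 2302.966),
}
# Single-sample averages of the 60-frame run, from table 4.3, in ms.
single_stable_ms = (194.374, 197.070, 192.511, 185.313)

# Time of one sample, estimated by averaging the total render time.
normalised = {part: tuple(round(t / SAMPLES, 3) for t in times)
              for part, times in multi_sample_ms.items()}

# Per-sample gain of multi-sample over single-sample in the 60-frame run.
gain_ms = tuple(round(s - m, 3)
                for s, m in zip(single_stable_ms, normalised["stable"]))
```

The first two entries of `gain_ms` are the single-thread gains, roughly 3.3 ms and 5.6 ms per sample, while the last two are negative, which matches the multi-thread slowdown stated in the caption.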
35 In the renderer, primitives hold the indices to all composing vertices, transform and material. They also include a type indication and a space for the radius.
Though all tables show the first part average and standard deviation, the underlying values were too inconsistent for any conclusion, as expected.
In the first frame, table 4.3 shows a considerable time increase from cohesive to split memory, verified in both threading types. In the 60-frame run, there is a speed-up from single to multi-thread in both memory types. Table 4.4 maintains the memory conclusions of table 4.3 but shows single-threading to be slightly faster. Finally, table 4.5 shows a slight gain from split to cohesive memory and from single-sample to multi-sample in single-thread, and the reverse situation in multi-thread. Additionally, from single-thread to multi-thread there is a small increase in performance in single-sample and, likewise, the opposite in multi-sample.
4.2.2 Analysis
Memory approaches seem to have wavering weights during the 60-frame run, but a clear outline in the first frame, where there are two additional shaders, pre-process and vertex. Here are the shaders’ details:
• The pre-process stage clears all ray accumulations located in the Pixels buffer of the Rays descriptor.
• The vertex stage updates all vertices in the Vertex buffer using the transformation matrices in the Transforms buffer. Both are accessed through the indices located in each primitive³⁵ in the Primitives buffer of the Scene descriptor.
While the pre-process shader accesses one memory in either case, the vertex shader accesses one memory in the cohesive case and three in the split case. This difference draws a clear line between both cases and exhibits the conditions where programs should use memory cohesion.
The small speed-up in the single-thread solutions from single-sample to multi-sample, in table 4.5, shows an expected benefit of bulk-batching. By eliminating a set of jumps from the CPU to the GPU, it reduces the total sample time by around 3 ms to 6 ms.
The average times of the multi-threaded multi-sample tests were unexpected. In a solution with multiple threads, one thread hands each command record to a separate thread. In theory, this would speed up the process, as seen in the single-sample results. However, when raising the number of samples, and consequently the number of records, the average frame time increases instead.
Twelve additional tests were performed to ascertain the veracity of the multi-thread results. The new experiments employed a fixed state of multi-threading and cohesive memory and varied the number of samples from 1 to 12.
Figure 4.2: Spreadsheet of the additional sampling test. It shows the results for a fixed state of cohesive memory and multi-threading with a varied number of samples.
Samples  Times
10       52.494
30       52.419
50       52.450
Table 4.6: Results of the third multi-threading test. It shows the frame time of a multi-threaded rendering with a maximum of two bounces and 10, 30 and 50 samples per frame.
The spreadsheet in figure 4.2 displays all normalised result times. The logarithmic regression applied to them follows from three additional tests, performed under the same criteria as the previous twelve but with fewer bounces and a much larger number of samples, described in table 4.6. Their results indicate that there is a limit to the improvement.
And so, using multi-threading can slightly reduce the overall render time, and the reduction grows with both command quantity and complexity up to a limit.
While these changes in performance are quite low in these examples, they are mostly independent of the program. The performance improvement from bulk-batching, for example, will have the same impact in a real-time solution. There, every millisecond gained is crucial, and so these patterns can produce improvements resulting in a significant difference.
5
Conclusion and Future Work
This final chapter answers the questions raised in the motivation and describes the conclusions for each goal in section 5.1. Lastly, section 5.2 proposes future work in the form of possible changes and tests to be conducted.
5.1
Summary
To answer the motivation's questions: over the course of this work, the increased complexity of modern graphics APIs became clear. The sheer quantity of additional elements that must be handled is far higher than in previous APIs, which opens room for extra mistakes and for opportunities to worsen an application's performance.
Nevertheless, there are also positive points. Compared to former APIs, a modern API gives the user a better understanding of all the elements involved and, consequently, of possible improvements, ultimately resulting in increased overall performance.
The thesis goals were laid out as a set of targets, and the conclusions are described in the same way:
G.1 The impact of multi-threading is related to the command record operations.
From the evaluated criteria, performing several records in parallel offers a performance increase. The gain grows with the quantity and complexity of the record operations, but has an upper limit.
G.2 Command scheduling is, by far, the most crucial factor in an application using Vulkan. Decent command scheduling may radically improve performance, and poor scheduling may likewise degrade it.
When descriptor sets update frequently, on-the-fly recording reduces the quantity of both allocated descriptors and buffers when compared to pre-recorded commands.
G.3 Memory management proved to be a delicate aspect. From the observed situations, the performance of the related patterns is linked to the shader operations and the underlying memory accesses. When many buffers or images are involved in a shader, grouping them under the same memory may lead to a good performance increase. However, in cases where there is no direct relationship between the memories, there is no need for cohesion.
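One way to read "grouping them under the same memory" in G.3 is to sub-allocate several resources from a single memory block. The following is a minimal sketch of that idea, assuming each resource reports a size and an alignment (as VkMemoryRequirements does in Vulkan). Requirement, Layout, alignUp and packIntoOneBlock are hypothetical names, and no Vulkan calls are made.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Size and alignment of one resource, mirroring two fields of VkMemoryRequirements.
struct Requirement {
    std::uint64_t size;
    std::uint64_t alignment;
};

// Offsets of each resource inside the shared block, plus the block's total size.
struct Layout {
    std::vector<std::uint64_t> offsets;
    std::uint64_t totalSize;
};

// Rounds offset up to the next multiple of alignment.
std::uint64_t alignUp(std::uint64_t offset, std::uint64_t alignment) {
    assert(alignment > 0);
    return (offset + alignment - 1) / alignment * alignment;
}

// Places the resources one after another in a single block,
// padding where needed so that every offset respects its alignment.
Layout packIntoOneBlock(const std::vector<Requirement>& requirements) {
    Layout layout{{}, 0};
    for (const Requirement& requirement : requirements) {
        const std::uint64_t offset = alignUp(layout.totalSize, requirement.alignment);
        layout.offsets.push_back(offset);
        layout.totalSize = offset + requirement.size;
    }
    return layout;
}
```

In a real application the resulting offsets would be passed to calls such as vkBindBufferMemory, and the resources would also need compatible memory types, which this sketch ignores.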
Figure 5.1: A render example of Suzanne in a translucent material. The scene has a light source above the model.
As an additional proposed result and to help out any future user, Annexe B describes some guidelines compiled from this work’s experience and conclusions.
5.2
Future work
Further enhancements fall into two categories: extra tests to be performed and changes to the renderer design.
On the test front, multi-queue dispatches, descriptor set arrays and render passes are some of the untested Vulkan features that may have a direct impact on this path-tracer.
After performing the tests, some possible improvements were noticed. The layout transition could be moved from the pre-processing command to the post-processing command. The change would limit the Image descriptor set to that command.
Additionally, the pre-processing command recording could be exchanged for a pre-recorded command, as it neither performs descriptor updates nor uses push constants.
Sub-dividing the image into smaller parts, either by tiles or ray-by-ray, would most definitely improve the renderer. It would not only make multi-queue dispatches useful but also allow the use of more complex meshes than the one shown in figure 5.1, without acceleration structures or stopping the GPU, for the reasons described in section A.2.
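The tile-based sub-division proposed above can be sketched as follows. This is only an illustration, assuming square tiles of a fixed size with smaller tiles at the right and bottom edges; Tile and splitIntoTiles are hypothetical names, and the text does not prescribe a tile size or a traversal order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// A rectangular region of the image, in pixels.
struct Tile {
    std::uint32_t x;
    std::uint32_t y;
    std::uint32_t width;
    std::uint32_t height;
};

// Splits an image into tiles of at most tileSize x tileSize pixels, row by row.
// Tiles on the right and bottom edges are clipped to the image bounds.
std::vector<Tile> splitIntoTiles(std::uint32_t imageWidth,
                                 std::uint32_t imageHeight,
                                 std::uint32_t tileSize) {
    assert(tileSize > 0);
    std::vector<Tile> tiles;
    for (std::uint32_t y = 0; y < imageHeight; y += tileSize) {
        for (std::uint32_t x = 0; x < imageWidth; x += tileSize) {
            tiles.push_back({x, y,
                             std::min(tileSize, imageWidth - x),
                             std::min(tileSize, imageHeight - y)});
        }
    }
    return tiles;
}
```

Each tile could then be rendered by its own, shorter dispatch, which is what would let separate queues work on different tiles.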
There is also the possibility to use other pipeline types, such as the raster or ray-tracing pipeline. Yet, this would imply a complete change in the whole path-tracer.