
Faculdade de Engenharia Elétrica e de Computação

Diego Arturo Pajuelo Castro

A VIDEO ENCODING PROPOSAL FOR HIGH DYNAMIC RANGE TELEVISION SYSTEM

(UMA PROPOSTA DE CODIFICAÇÃO DE VIDEO PARA UM SISTEMA DE TELEVISÃO COM CONTEÚDO DE ALTA FAIXA DINÂMICA)

Campinas

2017


Faculdade de Engenharia Elétrica e de Computação

Diego Arturo Pajuelo Castro

A VIDEO ENCODING PROPOSAL FOR HIGH DYNAMIC RANGE TELEVISION SYSTEM

(UMA PROPOSTA DE CODIFICAÇÃO DE VIDEO PARA UM SISTEMA DE TELEVISÃO COM CONTEÚDO DE ALTA FAIXA DINÂMICA)

Thesis presented to the School of Electrical and Computer Engineering of the University of Campinas in partial fulfilment of the requirements for the degree of Master in the area of Telecommunications and Telematics.

Tese apresentada à Faculdade de Engenharia Elétrica e de Computação da Universidade Estadual de Campinas como parte dos requisitos exigidos para a obtenção do título de Mestre em Engenharia Elétrica na área de Telecomunicações e Telemática.

Supervisor: Prof. Dr. Yuzo Iano

ESTE EXEMPLAR CORRESPONDE À VERSÃO FINAL DA TESE DEFENDIDA PELO ALUNO DIEGO ARTURO PAJUELO CASTRO, E ORIENTADA PELO PROF. DR. YUZO IANO

Campinas

2017


Ficha catalográfica

Universidade Estadual de Campinas
Biblioteca da Área de Engenharia e Arquitetura

Luciana Pietrosanto Milla - CRB 8/8129

Pajuelo Castro, Diego Arturo,

P169v  A video encoding proposal for High Dynamic Range Television System / Diego Arturo Pajuelo Castro. – Campinas, SP : [s.n.], 2017.

Orientador: Yuzo Iano.

Dissertação (mestrado) – Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação.

1. Compressão de imagens - Codificação. 2. Televisão. 3. MPEG (Padrão de codificação de vídeo). I. Iano, Yuzo, 1950-. II. Universidade Estadual de Campinas. Faculdade de Engenharia Elétrica e de Computação. III. Título.

Informações para Biblioteca Digital

Título em outro idioma: Uma proposta de codificação de video para um sistema de televisão com conteúdo de alta faixa dinâmica

Palavras-chave em inglês: Image compression - Encoding; Television; MPEG (Video encoding standard)

Área de concentração: Telecomunicações e Telemática

Titulação: Mestre em Engenharia Elétrica

Banca examinadora: Yuzo Iano [Orientador]; Evaldo Gonçalves Pelaes; Carlos Eduardo Câmara

Data de defesa: 10-07-2017

Programa de Pós-Graduação: Engenharia Elétrica


Candidato: Diego Arturo Pajuelo Castro | RA: 180539
Data da Defesa: 10 de julho de 2017

Título da Tese: “A video encoding proposal for High Dynamic Range Television System”.

Título da Tese: “Uma proposta de codificação de video para um sistema de televisão com conteúdo de alta faixa dinâmica”.

Prof. Dr. Yuzo Iano (Presidente, FEEC/UNICAMP)
Prof. Dr. Evaldo Gonçalves Pelaes (UFPA)
Prof. Dr. Carlos Eduardo Câmara (UniAnchieta)

A ata de defesa, com as respectivas assinaturas dos membros da Comissão Julgadora, encontra-se no processo de vida acadêmica do aluno.


First, I would like to thank my grandfather, who taught me to have passion for the wonderful things that engineers can achieve with their creativity, and who advised me always to respect the ideas and thoughts of the people around me. Although he is no longer with us, his memory is still alive in my heart.

I am also very thankful to my mom for encouraging me to follow my dreams; to my dad for being my friend and mentor; to my sister, whom I admire more every day and who gives me strength in difficult times; and to my grandmother for the unconditional love she always gave me.

I would like to thank God for his help, guidance and strength in every moment of my life.

I am grateful to Dr. Yuzo Iano whose guidance enabled me to complete this thesis.

Finally, special thanks to my professor and mentor, Dr. Guillermo Kemper, for motivating me to always be a better professional and person.

I must express my gratitude to the CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior) programme for the financial support and the academic incentive that enabled the realization of this thesis.


‘. . . become inevitable’. (Moises Pajuelo)


Abstract

High Dynamic Range Television (HDR TV) is a topic of current interest in academia and industry. The possibility of radically improving the visual quality of television systems is a challenge. Each improvement in digital television systems has always been related to higher resolutions, such as the High Definition Television (HD TV) and the latest Ultra High Definition Television (UHD TV) formats, which require increasing the currently allocated bandwidth. However, High Dynamic Range (HDR) technology can deliver the same level of realism without the need to increase the resolution, since a TV could reproduce natural scenes with the same visual impact as viewing the scene in the real world.

The reference end-to-end HDR system is based on the HDR10 system due to its encoding efficiency and visual quality. However, it cannot be directly applied to the current Standard Dynamic Range (SDR) television system. For this reason, this work focuses on proposing an HDR video encoding system with backward compatibility, considering the legacy of television sets and Set-Top Boxes (STB). Finally, this work presents objective metrics in different coding scenarios with respect to the HDR10 system, using the H.264 codec, the codec most used in video workflows.

Keywords: High Dynamic Range, Video Coding, Television System, Video Compression.


Resumo

Um sistema de televisão com alta faixa dinâmica (High Dynamic Range Television – HDR TV) é um tema de interesse atual na academia e na indústria. A possibilidade de melhorar radicalmente a qualidade visual dos sistemas de televisão é um desafio. Cada melhora nos sistemas de televisão digital sempre foi relacionada a resoluções mais altas, como a televisão de alta definição (High Definition Television – HD TV) e os mais recentes formatos de televisão como o sistema de ultra-alta definição (Ultra High Definition Television – UHD TV), o que requer o aumento da largura de banda atual alocada. No entanto, a tecnologia com alta faixa dinâmica (High Dynamic Range – HDR) pode atribuir o mesmo nível de realismo sem a necessidade de aumentar a resolução, uma vez que um televisor pode reproduzir cenas naturais com o mesmo impacto visual com o qual se olha para uma cena no mundo real.

O sistema de referência de ponta a ponta com alta faixa dinâmica baseia-se no sistema HDR10 devido à sua eficiência de codificação e qualidade visual. No entanto, não pode ser aplicado diretamente no atual sistema de televisão com faixa dinâmica padrão (Standard Dynamic Range – SDR). Por esta razão, este trabalho se concentra em propor um sistema de codificação de vídeo com alta faixa dinâmica, mantendo o legado dos aparelhos de televisão e as caixas de decodificação (Set-Top Boxes – STB). Finalmente, este trabalho apresenta métricas objetivas em diferentes cenários de compressão em relação ao sistema HDR10, usando o codificador de vídeo H.264, usado principalmente nos atuais sistemas de vídeo.

Palavras-chave: Alta faixa dinâmica; Codificação de vídeo; Sistemas de televisão; Compressão de vídeo.

List of Figures

Figure 2.1 – Spectral radiance
Figure 2.2 – CIE standard color matching functions x̄, ȳ and z̄
Figure 2.3 – Additive reproduction
Figure 2.4 – Gamuts of ITU-R BT.709, ITU-R BT.2020 and DCI P3
Figure 2.5 – 4:2:0 format
Figure 2.6 – GOP in H.264/AVC
Figure 2.7 – Prediction in H.264/AVC for a macroblock
Figure 2.8 – Transform in H.264/AVC for a macroblock
Figure 2.9 – Quantization in H.264/AVC for a macroblock
Figure 2.10 – Basic coding structure for H.264/AVC for a macroblock
Figure 2.11 – Subdivision of a CTU into CUs. (a) Spatial partitioning. (b) Corresponding quadtree representation
Figure 2.12 – Subdivision of a frame into (a) slices and (b) tiles. (c) Illustration of WPP
Figure 2.13 – The conceptual television system
Figure 2.14 – CRT transfer function
Figure 2.15 – Conventional luma/color difference encoder
Figure 2.16 – The BT.709 HD TV system architecture
Figure 2.17 – Multiplexed MPEG-2 transport stream packets
Figure 2.18 – MPEG-2 transport stream packet
Figure 2.19 – The dynamic range of a real-world scene along with the capabilities of the HVS and of capture and display technology
Figure 2.20 – Red-green-blue component encoding using half-precision floating point numbers
Figure 2.21 – 32-bit per pixel RGBE encoding
Figure 2.22 – HDR10 system
Figure 2.23 – 10-bit PQ curves compared to the 10-bit BT.1886 (gamma system) curve at 1000 nits peak
Figure 2.24 – The block diagram of the two visual metrics for visibility (discrimination) and quality (mean opinion score) predictions and the underlying visual model
Figure 3.1 – Objective metrics of the BeerFest sequence
Figure 3.2 – Objective metrics of the FirePlace sequence
Figure 3.3 – Objective metrics of the ShowGirl sequence
Figure 3.4 – HDR scalable encoding scheme
Figure 3.8 – High dynamic range reducer scheme
Figure 3.9 – Tone mapping curve for frame 190 of the BalloonFestival sequence
Figure 4.1 – Transformed domain-based metrics of the Market sequence
Figure 4.2 – CIEDE 2000 metrics of the Market sequence
Figure 4.3 – HDR-VDP2
Figure 4.4 – Objective metrics of the FireEater sequence
Figure 4.5 – Histogram of the tone-mapped version of the Market sequence
Figure 4.6 – SDR metrics
Figure 4.7 – Photograph of the day of realization of the subjective test
Figure A.1 – Stimulus presentation in the ACR method
Figure C.1 – Market sequence
Figure C.2 – ShowGirl sequence
Figure C.3 – FireEater sequence
Figure C.4 – BalloonFestival sequence
Figure C.5 – Sunrise sequence
Figure C.6 – EBU_04 sequence
Figure C.7 – EBU_06 sequence
Figure D.1 – Objective metrics of the Sunrise sequence
Figure D.2 – Objective metrics of the BalloonFestival sequence
Figure D.3 – Objective metrics of the ShowGirl sequence
Figure D.4 – Objective metrics of the EBU_04 sequence

List of Tables

Table 1.1 – Summary of required bitrates
Table 2.1 – CIE chromaticities for sRGB reference primaries and CIE standard illuminant
Table 2.2 – Chroma downsampling coefficients
Table 2.3 – Chroma upsampling coefficients
Table 2.4 – Full HD parameters
Table 2.5 – SMPTE digital interfaces
Table 2.6 – Common dynamic range terms and their applications
Table 2.7 – LDR and HDR bitrates required for an uncompressed movie in high definition (HD, spatial resolution of 1,920 × 1,080 at 25 Hz)
Table 2.8 – Input variables of the HDR-VDP-2 algorithm
Table 3.1 – AVC/H.264 parameters
Table 3.2 – Variable definitions
Table 4.1 – HDR test sequences
Table 4.2 – Bitrate of the sequences
Table 6.1 – Bitrate of digital television channels

List of Acronyms

ACR Absolute Category Rating

ADC Analog to Digital Converter

AVC Advanced Video Coding

bpp Bits Per Pixel

CABAC Context-Adaptive Binary Arithmetic Coding

CAVLC Context-Adaptive Variable Length Coding

CIE International Commission on Illumination

CRT Cathode Ray Tube

CSF Contrast Sensitivity Function

CTU Coding Tree Unit

CU Coding Unit

DAG Data Adaptive Grading

DCT Discrete Cosine Transform

DTH Direct-to-Home

EDR Enhanced Dynamic Range

EOTF Electro-Optical Transfer Function

ES Elementary Stream

Gbps Giga Bits Per Second

GOP Group Of Pictures

HD High Definition

HD TV High Definition Television

HDR High Dynamic Range

HDR-TV High Dynamic Range Television


ICT Integer Cosine Transform

IEC International Electrotechnical Commission

IP Internet Protocol

IPTV Internet Protocol Television

IRD Integrated Receiver Decoder

ISDB Integrated Services Digital Broadcasting

ISO International Organization for Standardization

ITU-T International Telecommunication Union - Telecommunication Standardization Sector

JCT-VC Joint Collaborative Team on Video Coding

JND Just Noticeable Difference

LCD Liquid Crystal Display

MB Macroblock

Mbps Mega Bits Per Second

MOS Mean Opinion Score

MPEG Moving Picture Experts Group

MSE Mean Square Error

OETF Opto-Electrical Transfer Function

OLED Organic light-emitting diode

OOTF Opto-Optical Transfer Function

OTT Over-The-Top

PAT Program Association Table

PID Packet Identification

PMT Program Map Table


PU Prediction Unit

QP Quantization Parameter

RDO Rate-Distortion Optimization

SD Standard Definition

SD TV Standard Definition Television

SDI Serial Digital Interface

SDR Standard Dynamic Range

SMPTE Society of Motion Picture and Television Engineers

SNR Signal-to-Noise Ratio

STB Set-top boxes

TF Transfer Function

TIFF Tagged Image File Format

TMO Tone Mapping Operator

tPSNR Transform Peak Signal-to-Noise Ratio

TS Transport Stream

TU Transform Unit

UHD TV Ultra-High Definition Television

VDP Visual Difference Predictor

WCG Wide Color Gamut

Contents

1 Introduction
1.1 Motivation
1.2 Objectives
2 Fundamentals
2.1 Photometry and Colorimetry
2.1.1 Light
2.1.2 Color
2.2 Color Encoding
2.2.1 Color Model
2.2.1.1 RGB
2.2.1.2 YCbCr
2.2.2 Color Spaces
2.2.2.1 CIELAB
2.2.2.2 BT.709
2.2.2.3 BT.2020
2.2.2.4 DCI P3
2.2.3 Chroma Subsampling
2.3 Video Compression
2.3.1 H.264
2.3.2 H.265
2.4 The Legacy Television Architecture
2.4.1 Gamma System
2.4.2 MPEG-2 Transport Stream
2.5 High Dynamic Range Television System
2.5.1 High Dynamic Range
2.5.1.1 EXR
2.5.1.2 Radiance
2.5.1.3 TIFF
2.5.2 HDR10 System
2.5.3 HDR Metrics
2.5.3.1 PSNR
2.5.3.2 Transformed Domain-Based Metrics
2.5.3.3 CIEDE 2000
2.5.3.4 HDR-VDP-2
3 Proposed HDR System
3.4 High Dynamic Range Reducer
4 Experimental Results
4.1 Laboratory Experiments
4.2 Objective and Subjective Results
5 Conclusions
5.1 Future Work
6 Study Case
Bibliography
Appendix
APPENDIX A – The ACR Method
APPENDIX B – Program Code
APPENDIX C – HDR Sequences


1 Introduction

High Dynamic Range (HDR) imaging is the process of capturing the brightness of a scene so as to produce the best rendering of the artist's intent. Many technical terms and technologies are involved in this process, ranging from sciences such as photometry and radiometry to the most advanced digital video processing algorithms. Video science theory and new technologies form part of a new multimedia service with numerous applications and a promising future in broadcasting.

One of the most widely used day-to-day services is television. For many years, people have been able to watch video content through microwave signals, known as broadcasting. However, revolutionary new technologies, known as streaming services, have emerged in recent years, such as IPTV (Internet Protocol Television) and OTT (Over-The-Top) services. Thanks to these technologies, people can access a vast variety of multimedia content every day and enjoy different types of television programming, making television a more democratic service. Hence, better video quality and a better user experience are required and demanded by people.

For this reason, video engineers explore many possibilities to make this a reality. Many advances in this field have been proposed, and they can be summarized in four groups: higher frame rates, higher spatial resolution, higher bit depth and higher dynamic range solutions. Each of these improvements has a different impact on the viewer experience according to the human visual system. Therefore, finding a technology that can achieve this is a challenge, considering that any change in the current television system must respect backward compatibility: the assigned bandwidth of each service must not be exceeded, and the signal must remain transmittable over the current communication systems. Also, this technology must achieve a substantial improvement in subjective quality when watching television.

High Dynamic Range Television (HDR-TV) is the solution for undertaking a new evolution of the current television systems in the short term, mainly for two reasons: it provides a better user experience using the current standard video resolutions, and it can use the current video infrastructure without exceeding the assigned bitrate thresholds. Therefore, it is a cost-effective solution that generates a better pixel, which will positively impact the user experience in the near future.

New HDR video encoding frameworks with SDR compatibility have been proposed in recent years. Single- and dual-layer coding are the two types of solutions. The dual-layer approach [1] generates an 8-bit base layer and a residual encoded in an enhancement layer using the Scalable Video Coding (SVC) extension. However, due to its multi-layer design, this solution is not suited to all distribution workflows [2]. Thus, a single-layer solution requires the use of additional metadata with a base layer compatible with the 8-bit infrastructure.

The work proposed by François et al. [2] follows the single-layer idea and presents an HDR decomposition process to generate an SDR (Standard Dynamic Range) signal in linear-light values. The square-root function is then applied to these values to reproduce a transfer function close to the BT.709 OETF (Opto-Electrical Transfer Function). Finally, a color correction is performed to overcome color saturation and hue problems. Topiwala et al. [3] present a single-layer approach to HDR encoding with backward compatibility, called FV10. This approach proposes a new color encoding scheme, YFbFr, and a Data Adaptive Grading (DAG) process based on histogram analysis and a piecewise linear mapping scheme to convert the HDR samples into SDR samples. Ploumis et al. [4] propose a perception-based histogram equalization for the tone mapping process with a 10-bit PQ signal as the input. Our previous work ([5, 6]) proposed the use of a modified HDR10 encoding, using an H.264 codec (Main Profile) at 8 bits. The three HDR sequences demonstrated that 8-bit PQ encoding, compared to the 10-bit system, achieves good results in terms of objective metrics, tested in an ISDB-Tb broadcast system. However, this solution is only applicable to PQ display devices.


1.1 Motivation

The main motivation of this work was the search for a solution that can be implemented in the current television system and substantially improve the visual experience of the viewer in the short term. Different proposals have been discussed over the past few years and are divided into four groups: higher frame rates, higher spatial resolution, higher bit depth, and higher dynamic range solutions. In some cases, these can be used jointly to generate a better visual experience; however, the high implementation cost and the high bitrate required for transmission over the current communication channels make the development of these technologies nonviable.

The new television standard, called UHD TV (Ultra-High Definition Television), increases the spatial resolution of HD TV (High Definition Television) by factors of four and sixteen: the 3840 x 2160 system (4K) and the 7680 x 4320 system (8K), specified in Recommendation ITU-R BT.2020 [7]. The main advantage of this technology is the high sense of realism perceived by viewers when comparing an image with a real scene. According to the latest studies on this technology, the number of pixels on the horizontal axis of an image should be approximately 8000, with an aspect ratio of 16:9, in order to enjoy content closer to reality, thus reaching the minimum level of visual acuity of the viewer [8]. However, these new resolution formats require a high bit rate to be transmitted to the end user. According to the latest study conducted in Japan, the bit rate required in a broadcasting system for transmitting the 7680 x 4320 (4320p) format at 60 Hz progressive ranges from 80 Mbps (Megabits per second) to 100 Mbps [9], considering the use of the latest, most efficient video encoder, HEVC (High Efficiency Video Coding)/H.265 [10] [11], standardized by MPEG (Moving Picture Experts Group) and ISO/IEC (International Organization for Standardization/International Electrotechnical Commission), which has approximately twice the compression capacity of AVC (Advanced Video Coding)/H.264 [12]. Table 1.1 shows the required bitrates for the UHD TV and Full HD TV systems.

Table 1.1 – Summary of required bitrates [9]

Video Format   Required bit rate
4320/60/P      80 Mbps - 100 Mbps
2160/60/P      30 Mbps - 40 Mbps
1080/60/P      10 Mbps - 15 Mbps
1080/60/I      10 Mbps - 15 Mbps

The bitrates of UHD resolutions could not fit into the current communication channels used in broadcasting. For example, the ISDB-Tb (Integrated Services Digital Broadcasting) digital television standard [13], developed and deployed in Brazil, uses a 6 MHz spectrum with a maximum transmission bit rate of 23.234 Mbps. In many cases, broadcasters use an average of 18.4 Mbps to compensate for the high noise levels in the radio channel, by altering the convolutional code and the guard interval [14].

Regarding the high frame rate proposals, it has been demonstrated that doubling the frame rate from 30 to 60 Hz does not yield a substantial improvement in terms of user experience; also, the bit rates fluctuate in the same range after compression. For this reason, new studies assert that a frame rate of 120 Hz can have an impressive effect on the audience due to its better motion representation. However, the problem lies in the fact that existing chips were designed for a progressive output at, at most, 60 Hz. This would involve an engineering redesign of the silicon (chips) in the display manufacturing process [15]. The same problem is reflected in the design of existing television sets, which were not conceived for this type of technology. It is intended to include in new STBs (Set-Top Boxes) a conversion process from 120 Hz to 60 Hz so as to be compatible with existing television sets.

Increasing the bit depth from 8 to 10 bits may bring improvements to existing television systems due to the coding gain in the compression process. It also enables sending a WCG (Wide Color Gamut) signal, covering almost 75.8% of the CIE (International Commission on Illumination) 1931 chromaticity diagram, as specified in Recommendation ITU-R BT.2020 [7]. This new palette could reproduce colors that cannot be displayed under Recommendation ITU-R BT.709 [16], specified for the Full HD TV system.

According to recent studies on the subjective quality of different video formats, unexpected results came to light. It is a fact that the subjective quality of the HD format can be improved by approximately the same amount either by increasing the dynamic range or by increasing the resolution to UHD, and only a small additional improvement is achieved when using both technologies at the same time [17]. For this reason, HDR TV systems are being discussed in the academic literature [18] as well as in global innovation companies [19], in order to have an international standard and interoperability among the different technologies.

Clearly, the next steps of television are encouraging new discussions about how a scene-referred signal is best represented as a display-referred signal. Each proposed improvement for the television system discussed above will no doubt improve the viewer experience. However, the adoption of these technologies will be a gradual process. Therefore, techniques and technologies that can take advantage of the current television infrastructure to produce a better artistic expression of a scene, so that it can be displayed as faithfully as possible on a television set, are required and must be demystified.

1.2 Objectives

This work explores the most efficient way to implement an HDR TV system on the legacy television architecture. This implies that HDR content can be interpreted by existing STBs and television sets, which are widely deployed across all television networks.

The system constraints to consider are that the bitrates generated by the system must fit in the current communication systems. It is very important to test all kinds of television programming, covering different contrasts and color gamuts. Another constraint is the production of a single content version, at the Production/Post-Production stage, that can feed the current Full High Definition (HD) television system and from which, based on reconstruction techniques developed for limited-range content, the HDR content can be reliably recovered at the reception stage.

This work also considers the physical infrastructure of current television systems and provides a way to integrate the new system on this same physical infrastructure, considering the same transport stream that is produced at the processing stage and that allows the multiplexing of several programmes into a single stream, which could feed either a modulation system or a streaming service.

Finally, this work will be evaluated strictly under objective and subjective metrics, with the mission of evaluating the visual quality of the generated sequences in the best possible way. Also, the HDR10 system, which is widely discussed in academia and industry, will be considered as the reference when comparing the results obtained.

The rest of this work is organized as follows:

• Chapter 2 presents the fundamentals of a digital video signal, the legacy television system, and everything related to the HDR TV system.

• Chapter 3 presents the video encoding proposal for HDR TV.

• Chapter 4 discusses the experimental results of this work.

• Chapter 5 presents the conclusions and future work.

• Chapter 6 presents a study case, considering the results reached by the proposed scheme.


2 Fundamentals

Traditionally, video and still image formats were tailored to a particular display device: CRT (Cathode Ray Tube), Plasma, OLED (Organic Light-Emitting Diode) and LCD (Liquid Crystal Display) monitors, with reduced brightness. The colors were usually encoded with 8 bits, because higher bit depths do not yield a noticeable subjective quality improvement. These formats are called display-referred signals. On the other hand, HDR formats work with samples that are directly related to the physical magnitude of light, luminance. These formats are called scene-referred signals. These different representations are important in order to study solutions that can achieve an efficient compression of HDR images.

This chapter presents the fundamental concepts of a digital video signal and the processing stages of a television system. The strict definition of HDR content is detailed, along with all the topics related to the HDR-TV system.

2.1 Photometry and Colorimetry

HDR images store real-world luminance levels, a physical quantity derived from the study of light. This section focuses on the principal concepts of light and color.

2.1.1 Light

In terms of physics, light is electromagnetic radiation within a certain portion of the electromagnetic spectrum, whether visible or not. Radiometry and photometry are two sciences that study and measure light: radiometry is the measurement of radiant energy, and photometry is the art of making visual comparisons of light. In the past ten years, the measurement systems have been combined: photometric units are now defined in terms of radiometric units. Photometry currently means radiometry with sensors whose spectral sensitivity matches the response of the human eye, which varies by several orders of magnitude over the 380 nm to 780 nm wavelength range, the range of visible light [20]. In imaging systems, it is required that the measures try to emulate how the world is perceived by human vision. For this reason, photometric measures are useful for the following three important reasons:

‘First, the human eye is used as the sensor and tested with the most environmental sources of light. Second, cameras extract things that humans find interesting in terms of color tone reproduction. And third, cameras are designed with the same light response as humans so that television images appear to be natural. Photometric measurements show directly how an object will appear to a camera’. [21]

The spectral radiance is a wavelength-dependent function expressed in Equation 2.1, where dA is an infinitely small area, dω is an infinitely small solid angle, Φ is the radiant flux, and θ is the angle between the rays and the surface.

$$L(\lambda) = \frac{d^2\Phi(\lambda)}{d\omega \, dA \, \cos(\theta)} \tag{2.1}$$

According to the geometric representation shown in Figure 2.1, the radiance is the amount of radiant flux arriving at or leaving a point in a particular direction, measured in watts per steradian per square meter (W·sr⁻¹·m⁻²) [22]. The luminous flux, the power of a light source, measured in lumens, is the spectral radiant flux weighted by the appropriate eye response function. Illuminance is the luminous flux per unit area, in lumen/m², or lux. Luminous intensity is the luminous flux per solid angle; its unit is the candela. Finally, the luminance is the luminous flux per unit area per unit solid angle; its unit is candela/m², or nits [23]. In this work, the unit nits will be used hereafter. Luminance is the light measure commonly used in the HDR pipeline.

Figure 2.1 – Spectral radiance [24]

Before continuing with this section, it is important to differentiate four terms that generate confusion for video engineers. Luminance, Y, is a linear-light quantity, proportional to physical intensity weighted by the spectral sensitivity of human vision, and is expressed in units of nits. Lightness is a non-linear transfer function of luminance that approximates the perception of brightness. This last word, brightness, is the attribute of a visual sensation according to which an area appears to exhibit more or less light; it cannot be measured. Luma, with the symbol Y', is an approximation of lightness, indirectly related to the relative luminance, and forms part of the Y'Cb'Cr' color model [25].

2.1.2 Color

Color is a combined physiological and psychological response to light [20]; color exists only in the eye and the brain. As Isaac Newton put it in 1675:

‘Indeed rays, properly expressed, are not coloured’. [25]

For this reason, scientists and international organizations have tried to describe how to represent colors faithfully on a display device. Colorimetry helps in this issue because it quantifies how humans perceive color through empirical experiments. In 1931, the CIE 1931 XYZ color space was defined [26], deducing that a combination of three stimuli, or tristimulus values, encloses the colors that a human can perceive. In practical terms, a tabulation of these tristimulus values can reproduce any color without the need to reproduce the entire spectrum of light. Mathematically, the CIE X, Y and Z tristimulus values are defined as:

$$X = \int_{\lambda=380}^{780} J(\lambda)\,\bar{x}(\lambda)\,d\lambda \tag{2.2}$$

$$Y = \int_{\lambda=380}^{780} J(\lambda)\,\bar{y}(\lambda)\,d\lambda \tag{2.3}$$

$$Z = \int_{\lambda=380}^{780} J(\lambda)\,\bar{z}(\lambda)\,d\lambda \tag{2.4}$$

Consider a given spectral power distribution J(λ), which is the concentration of power per unit area per unit wavelength of light; x̄, ȳ and z̄ are the standard color matching functions. The Y component quantifies the amount of luminance corresponding to a certain image area, and its units are nits. Figure 2.2 represents each matching function as a wavelength-dependent function.

Figure 2.2 – CIE standard color matching functions x̄, ȳ and z̄ [22]

This fact made possible the massification of digital video formats and their rapid adoption worldwide, due to the practical way they code an image. From this color model, RGB was derived.

Another representation derived from the XYZ tristimulus values is chromaticity. It presents another approach to the definition of color in the absence of lightness, and helps to define the attributes of hue and saturation of a color. For this, it is important to normalize the XYZ tristimulus values to obtain the two chromaticity values x and y, defined as:

$$x = \frac{X}{X+Y+Z} \tag{2.5}$$

$$y = \frac{Y}{X+Y+Z} \tag{2.6}$$

Each coordinate [x, y] of Figure 2.4 represents a color within the chromaticity diagram, which encompasses the entire gamut of human vision. The white point has coordinates x = 1/3 and y = 1/3. Note that this diagram is ideal: today's television and film standards have not yet been able to cover the entire diagram, although they keep approaching its boundary.
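As an illustrative sketch of Equations 2.2–2.6 (not code from this thesis; the sampled data arrays are assumed inputs), the tristimulus integrals can be approximated with a discrete sum, after which the chromaticity coordinates follow directly:

```python
import numpy as np

def xyz_from_spd(wavelengths, spd, xbar, ybar, zbar):
    """Approximate the CIE integrals (Eqs. 2.2-2.4) with a discrete sum.

    wavelengths: sampled wavelengths in nm (e.g., 380..780 nm)
    spd: spectral power distribution J(lambda) at those wavelengths
    xbar, ybar, zbar: sampled CIE 1931 color matching functions
    """
    dlam = np.gradient(wavelengths)      # per-sample wavelength step
    X = np.sum(spd * xbar * dlam)
    Y = np.sum(spd * ybar * dlam)
    Z = np.sum(spd * zbar * dlam)
    return X, Y, Z

def chromaticity(X, Y, Z):
    """Normalize tristimulus values into [x, y] coordinates (Eqs. 2.5-2.6)."""
    s = X + Y + Z
    return X / s, Y / s
```

For an equal-energy stimulus, the chromaticity reduces to the white point (1/3, 1/3) mentioned above.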

2.2 Color Encoding

There are many ways to represent colors; however, the representations used by the different television standards are accepted worldwide. In some cases, these are designed to take advantage of the human visual model and achieve certain goals, such as bandwidth reduction. This section focuses on the color models, color spaces and chroma subsampling techniques employed throughout this work.

2.2.1 Color Model

A color model is a mathematical model describing the way colors can be represented with a set number of components. For example, CIE 1931 XYZ defines three components and stores samples in a device-independent way, because it is not tailored to a specific device. Despite this, the most representative color models among the most used formats are device dependent, such as R'G'B' or Y'Cb'Cr'.


2.2.1.1 RGB

The RGB color model defines three primary colors, R (red), G (green) and B (blue), whose wavelengths are 435.8 nm, 546.1 nm and 700.0 nm respectively [25]. It is an additive color model, so the sum of these three components reproduces any of the colors in the color space, as shown in Figure 2.3.

Figure 2.3 – Additive reproduction [25]

When defining this type of color model in a color space, the international recommendations take into account the following considerations: the coordinates [x, y] of each primary component in the chromaticity diagram; the white point, known as D65; the bit depth; and the gamma value used. For example, a color space widely used by LCD and CRT monitors is sRGB [27], created by Hewlett-Packard and Microsoft Corporation, which contains the following parameters:

Table 2.1 – CIE chromaticities for sRGB reference primaries and CIE standard illuminant

      Red    Green   Blue    White, D65
x     0.64   0.30    0.15    0.3127
y     0.33   0.60    0.06    0.3290

Once the color space is defined, it can be transformed to or from a different color space, such as CIE XYZ, using matrix multiplications. For example, to transform from CIE XYZ into linear sRGB, the following matrix operation is performed:

$$\begin{bmatrix} R_{sRGB} \\ G_{sRGB} \\ B_{sRGB} \end{bmatrix} = \begin{bmatrix} 3.2410 & -1.5374 & -0.4986 \\ -0.9692 & 1.8760 & 0.0416 \\ 0.0556 & -0.2040 & 1.0570 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} \tag{2.7}$$

In the same way, to transform from linear sRGB into CIE XYZ:

$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} 0.4124 & 0.3576 & 0.1805 \\ 0.2126 & 0.7152 & 0.0722 \\ 0.0193 & 0.1192 & 0.9505 \end{bmatrix} \begin{bmatrix} R_{sRGB} \\ G_{sRGB} \\ B_{sRGB} \end{bmatrix} \tag{2.8}$$

The XYZ-to-sRGB matrix contains negative coefficients because some CIE XYZ values fall outside the sRGB gamut. In many cases, a clipping process is applied to saturate those values.
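A minimal sketch of the matrix conversions of Equations 2.7 and 2.8 (the coefficients are copied from the equations above; the clipping step stands in for the saturation process mentioned in the text):

```python
import numpy as np

# XYZ -> linear sRGB (Eq. 2.7)
M_XYZ_TO_SRGB = np.array([[ 3.2410, -1.5374, -0.4986],
                          [-0.9692,  1.8760,  0.0416],
                          [ 0.0556, -0.2040,  1.0570]])

# linear sRGB -> XYZ (Eq. 2.8); inverse of the matrix above
M_SRGB_TO_XYZ = np.array([[0.4124, 0.3576, 0.1805],
                          [0.2126, 0.7152, 0.0722],
                          [0.0193, 0.1192, 0.9505]])

def xyz_to_srgb(xyz):
    """Convert an XYZ triple to linear sRGB, clipping out-of-gamut values."""
    rgb = M_XYZ_TO_SRGB @ np.asarray(xyz)
    return np.clip(rgb, 0.0, 1.0)   # saturate values outside the sRGB gamut

def srgb_to_xyz(rgb):
    """Convert a linear sRGB triple to CIE XYZ."""
    return M_SRGB_TO_XYZ @ np.asarray(rgb)

print(xyz_to_srgb(srgb_to_xyz([1.0, 1.0, 1.0])))  # white round-trips to ~[1, 1, 1]
```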

The importance of this color model to this work is that the HDR sequences provided to the author are coded as linear-light RGB samples; for sequences that are instead non-linear, R'G'B' (the prime indicates a non-linear coding of the samples), a preprocessing stage is required to return to the linear domain.

2.2.1.2 YCbCr

The YCbCr color model defines three components: Y, an approximation of lightness indirectly related to the relative luminance; Cb, a scaled version of B − Y; and Cr, a scaled version of R − Y.

This color model is derived from RGB to take advantage of the poor color acuity of human vision, since the human eye is more sensitive to lightness than to color. For this reason, each color channel in a digital video has half the data rate of the lightness channel, which saves a lot of memory when storing the samples [25]. In broadcasting, the video workflow uses the non-linear samples of this color model, expressed as Y'Cb'Cr', considering the gamma transfer function applied to the linear samples.

The process to quantize Y'Cb'Cr' samples (Y' in the range [0, 1], Cb' and Cr' in [−0.5, 0.5]) into a signal of bit depth BitDepth_Y for the Y' component and BitDepth_C for the chroma components Cb' and Cr' [28] is as follows:

$$D_{Y'} = \mathrm{Clip1}_Y(\mathrm{Round}((1 \ll (BitDepth_Y - 8)) \cdot (219 \cdot Y' + 16))) \tag{2.9}$$

$$D_{Cb'} = \mathrm{Clip1}_C(\mathrm{Round}((1 \ll (BitDepth_C - 8)) \cdot (224 \cdot Cb' + 128))) \tag{2.10}$$

$$D_{Cr'} = \mathrm{Clip1}_C(\mathrm{Round}((1 \ll (BitDepth_C - 8)) \cdot (224 \cdot Cr' + 128))) \tag{2.11}$$

where $\ll$ denotes a left bit shift.

Consider:

$$\mathrm{Clip1}_Y(x) = \mathrm{Clip3}(0,\, (1 \ll BitDepth_Y) - 1,\, x) \tag{2.12}$$

$$\mathrm{Clip1}_C(x) = \mathrm{Clip3}(0,\, (1 \ll BitDepth_C) - 1,\, x) \tag{2.13}$$

$$\mathrm{Clip3}(x, y, z) = \begin{cases} x & \text{if } z < x \\ y & \text{if } z > y \\ z & \text{otherwise} \end{cases} \tag{2.14}$$

The inverse quantization of $D_{Y'}$, $D_{Cb'}$ and $D_{Cr'}$ is as follows [28]:

$$Y' = \mathrm{Clip}_Y\!\left(\left(\frac{D_{Y'}}{1 \ll (BitDepth_Y - 8)} - 16\right) \Big/\, 219\right) \tag{2.15}$$

$$Cb' = \mathrm{Clip}_C\!\left(\left(\frac{D_{Cb'}}{1 \ll (BitDepth_C - 8)} - 128\right) \Big/\, 224\right) \tag{2.16}$$

$$Cr' = \mathrm{Clip}_C\!\left(\left(\frac{D_{Cr'}}{1 \ll (BitDepth_C - 8)} - 128\right) \Big/\, 224\right) \tag{2.17}$$

Consider:

$$\mathrm{Clip}_Y(x) = \mathrm{Clip3}(0,\, 1.0,\, x) \tag{2.18}$$

$$\mathrm{Clip}_C(x) = \mathrm{Clip3}(-0.5,\, 0.5,\, x) \tag{2.19}$$

As part of the processing stages involved in this work, a quantization stage with 8 and 10 bits will be necessary to test and evaluate the proposed scheme.
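A sketch of the quantization in Equations 2.9–2.19 (a direct transcription of the formulas above, assuming numpy arrays of normalized Y' in [0, 1] and Cb'/Cr' in [−0.5, 0.5]):

```python
import numpy as np

def quantize(yp, cb, cr, bit_depth_y=10, bit_depth_c=10):
    """Quantize normalized Y'Cb'Cr' samples (Eqs. 2.9-2.14)."""
    sy = 1 << (bit_depth_y - 8)            # scale factor for luma
    sc = 1 << (bit_depth_c - 8)            # scale factor for chroma
    dy = np.clip(np.round(sy * (219.0 * yp + 16.0)), 0, (1 << bit_depth_y) - 1)
    dcb = np.clip(np.round(sc * (224.0 * cb + 128.0)), 0, (1 << bit_depth_c) - 1)
    dcr = np.clip(np.round(sc * (224.0 * cr + 128.0)), 0, (1 << bit_depth_c) - 1)
    return dy.astype(np.int32), dcb.astype(np.int32), dcr.astype(np.int32)

def dequantize(dy, dcb, dcr, bit_depth_y=10, bit_depth_c=10):
    """Inverse quantization back to normalized samples (Eqs. 2.15-2.19)."""
    sy = 1 << (bit_depth_y - 8)
    sc = 1 << (bit_depth_c - 8)
    yp = np.clip((dy / sy - 16.0) / 219.0, 0.0, 1.0)
    cb = np.clip((dcb / sc - 128.0) / 224.0, -0.5, 0.5)
    cr = np.clip((dcr / sc - 128.0) / 224.0, -0.5, 0.5)
    return yp, cb, cr
```

Setting bit_depth_y and bit_depth_c to 8 or 10 reproduces the two coding scenarios tested in this work.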

2.2.2 Color Spaces

Once the color model is chosen, video engineers need to work with standardized formats that represent colors in the same way each time content is created, whether for movies or for television. For this reason, color spaces are important in the video and image field, and new spaces have been proposed over the years, in line with technological progress in digital production and display devices. The CIELAB, BT.709, BT.2020 and DCI P3 color spaces are reviewed below. Their gamuts are shown in Figure 2.4.

Figure 2.4 – Gamuts of ITU-R BT.709, ITU-R BT.2020 and DCI P3

2.2.2.1 CIELAB

The CIELAB or CIE L*a*b* [30] color space defines three components: L* represents the lightness of the color, while a* and b* are the color components. It helps to distinguish the color difference between two images.

The color space conversion required to transform from XYZ linear samples to L*a*b* [28] is as follows:

$$L^* = 116 \cdot f(Y/Y_n) - 16 \tag{2.20}$$

$$a^* = 500 \cdot [f(X/X_n) - f(Y/Y_n)] \tag{2.21}$$

$$b^* = 200 \cdot [f(Y/Y_n) - f(Z/Z_n)] \tag{2.22}$$

where:

$$f(t) = \begin{cases} t^{1/3} & \text{if } t > (24/116)^3 \\ (841/108) \cdot t + 16/116 & \text{otherwise} \end{cases} \tag{2.23}$$

Consider that the tristimulus values $X_n$, $Y_n$ and $Z_n$ represent a specified white object color stimulus. In this case, $Y_n = 100$, $X_n = Y_n \cdot 0.95047$ and $Z_n = Y_n \cdot 1.08883$ [28].

This work uses this color space for obtaining the CIEDE 2000 metrics of the HDR sequences that are part of the assessments between the original and the reconstructed images.
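A sketch of Equations 2.20–2.23 (a direct transcription, with the D65-scaled white point given in the text):

```python
import numpy as np

def xyz_to_lab(X, Y, Z, Xn=95.047, Yn=100.0, Zn=108.883):
    """Convert XYZ to CIE L*a*b* (Eqs. 2.20-2.23) for the stated white point."""
    def f(t):
        t = np.asarray(t, dtype=float)
        return np.where(t > (24.0 / 116.0) ** 3,
                        np.cbrt(t),
                        (841.0 / 108.0) * t + 16.0 / 116.0)
    fx, fy, fz = f(X / Xn), f(Y / Yn), f(Z / Zn)
    L = 116.0 * fy - 16.0
    a = 500.0 * (fx - fy)
    b = 200.0 * (fy - fz)
    return L, a, b

print(xyz_to_lab(95.047, 100.0, 108.883))  # the white point maps to L*=100, a*=b*=0
```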

2.2.2.2 BT.709

Recommendation ITU-R BT.709 [16] defines the parameter values for the HD TV standards for production and international programme exchange. The fundamental parameters are the pixel count, the frame rate, the digital representation, the primary chromaticities, the luma coefficients and the transfer characteristics.

Currently, most television content is transmitted in HD or Full HD resolution formats. For this reason, BT.709 serves as the reference in the HDR compression process to keep backward compatibility.

The color space conversion required to transform from R709 G709 B709 linear samples to the XYZ color space [28] is as follows:

$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} 0.412391 & 0.357584 & 0.180481 \\ 0.212639 & 0.715169 & 0.072192 \\ 0.019331 & 0.119195 & 0.950532 \end{bmatrix} \begin{bmatrix} R_{709} \\ G_{709} \\ B_{709} \end{bmatrix} \tag{2.24}$$

The inverse conversion required for linear samples is [28]:

$$\begin{bmatrix} R_{709} \\ G_{709} \\ B_{709} \end{bmatrix} = \begin{bmatrix} 3.240970 & -1.537383 & -0.498611 \\ -0.969244 & 1.875968 & 0.041555 \\ 0.055630 & -0.203977 & 1.056972 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} \tag{2.25}$$

It is also important to define the color space conversions in the non-linear domain between R'G'B' and Y'Cb'Cr'. To transform from Y'Cb'Cr' to the R'709 G'709 B'709 color space [28]:

$$\begin{bmatrix} R'_{709} \\ G'_{709} \\ B'_{709} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1.57480 \\ 1 & -0.18733 & -0.46813 \\ 1 & 1.85563 & 0 \end{bmatrix} \begin{bmatrix} Y' \\ Cb' \\ Cr' \end{bmatrix} \tag{2.26}$$

And the inverse conversion required for non-linear samples is [28]:

$$\begin{bmatrix} Y' \\ Cb' \\ Cr' \end{bmatrix} = \begin{bmatrix} 0.212600 & 0.715200 & 0.072200 \\ -0.114572 & -0.385428 & 0.500000 \\ 0.500000 & -0.454153 & -0.045847 \end{bmatrix} \begin{bmatrix} R'_{709} \\ G'_{709} \\ B'_{709} \end{bmatrix} \tag{2.27}$$

2.2.2.3 BT.2020

Recommendation ITU-R BT.2020 [7] defines the parameter values for the UHD TV standards for production and international programme exchange. It also uses a wide color gamut compared to BT.709, as shown in Figure 2.4, and can produce greater realism in the images; however, its reproduction is subject to restrictions in the video chain.

The color space conversion required to transform from R2020 G2020 B2020 linear samples to the XYZ color space is [28]:

$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} 0.636958 & 0.144617 & 0.168881 \\ 0.262700 & 0.677998 & 0.059302 \\ 0.000000 & 0.028073 & 1.060985 \end{bmatrix} \begin{bmatrix} R_{2020} \\ G_{2020} \\ B_{2020} \end{bmatrix} \tag{2.28}$$

To transform from Y'Cb'Cr' to the R'2020 G'2020 B'2020 color space [28]:

$$\begin{bmatrix} R'_{2020} \\ G'_{2020} \\ B'_{2020} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1.47460 \\ 1 & -0.16455 & -0.57135 \\ 1 & 1.88140 & 0 \end{bmatrix} \begin{bmatrix} Y' \\ Cb' \\ Cr' \end{bmatrix} \tag{2.29}$$

And the inverse conversion required for non-linear samples is [28]:

$$\begin{bmatrix} Y' \\ Cb' \\ Cr' \end{bmatrix} = \begin{bmatrix} 0.262700 & 0.678000 & 0.059300 \\ -0.139630 & -0.360370 & 0.500000 \\ 0.500000 & -0.459786 & -0.040214 \end{bmatrix} \begin{bmatrix} R'_{2020} \\ G'_{2020} \\ B'_{2020} \end{bmatrix} \tag{2.30}$$

2.2.2.4 DCI P3

DCI P3 [31] is a common color space for digital movie projection and covers 45.45% of all chromaticities. However, monitors that support this color space are not easily found on the market.

Some film producers use this color space as a reference throughout the capture process. This work therefore includes a color mapping process to be compatible with BT.709.

The color space conversion required to transform from DCI P3 to BT.709 is a two-step conversion. First, the conversion from RP3 GP3 BP3 linear samples to the XYZ color space is [28]:

$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} 0.486571 & 0.265668 & 0.198217 \\ 0.228975 & 0.691739 & 0.079287 \\ 0.000000 & 0.045113 & 1.043944 \end{bmatrix} \begin{bmatrix} R_{P3} \\ G_{P3} \\ B_{P3} \end{bmatrix} \tag{2.31}$$

The second step is the transform from XYZ to R709 G709 B709 linear samples, given in Equation 2.25.
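The two-step conversion can be sketched as a single chained matrix product (an illustrative implementation; the matrices are copied from Equations 2.31 and 2.25):

```python
import numpy as np

# DCI P3 -> XYZ (Eq. 2.31)
M_P3_TO_XYZ = np.array([[0.486571, 0.265668, 0.198217],
                        [0.228975, 0.691739, 0.079287],
                        [0.000000, 0.045113, 1.043944]])

# XYZ -> BT.709 (Eq. 2.25)
M_XYZ_TO_709 = np.array([[ 3.240970, -1.537383, -0.498611],
                         [-0.969244,  1.875968,  0.041555],
                         [ 0.055630, -0.203977,  1.056972]])

# The two steps collapse into one 3x3 matrix applied to linear P3 samples
M_P3_TO_709 = M_XYZ_TO_709 @ M_P3_TO_XYZ

def p3_to_bt709(rgb_p3):
    """Map linear DCI P3 samples into linear BT.709, clipping out-of-gamut values."""
    return np.clip(M_P3_TO_709 @ np.asarray(rgb_p3), 0.0, 1.0)
```

Pre-multiplying the matrices once is the usual design choice here: a single 3x3 product per pixel instead of two.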

2.2.3 Chroma Subsampling

One of the first discoveries made in television was the coding alternatives presented by the chroma components at transmission time. As is well known, the main limitation video engineers usually face is that the bandwidth of a communications channel is restricted. Because of this, it was found that the analog chroma components could pass through a low-pass filter before being modulated, saving bandwidth without any visual impact. Accordingly, in the digital world, the chroma samples are subsampled at a ratio of 2 both horizontally and vertically, saving bits at storage time. This section addresses the subsampling techniques considered in this work.

4:4:4 to 4:2:0

The 4:4:4 format means that the number of samples is the same for all components; it is the format most used by content producers, as it keeps all the raw information. The HDR sequences provided are stored in this format and, after the compression stage, the samples are expected to be returned to the initial format so that the original and reconstructed sequences can be evaluated objectively.

The subsampling format most used in broadcasting is 4:2:0, which specifies a subsampling by a factor of 2 for the chroma components, both horizontally and vertically. Figure 2.5 shows where the chroma sample C (comprising Cb and Cr) is located in a frame, for either progressive or interlaced scanning.


Figure 2.5 – 4:2:0 format [25]

The chroma downsampling process considered in this work is done in two steps or phases. Table 2.2 defines the coefficients involved in both processes.

Table 2.2 – Chroma downsampling coefficients

Phase k   Coef c1[k]   Coef c2[k]
  −1          1            0
   0          6            4
   1          1            4

Consider that the input picture is declared as s[i][j]; H and W are the height and width of the chroma samples, respectively; the variable shift equals 6 and the variable offset equals 32. The first process involves the generation of an intermediate signal, declared as f[i][j], to convert from 4:4:4 to 4:2:2, and is done as follows [28]:

$$f[i][j] = \sum_{k=-1}^{1} c_1[k] \cdot s[i][\mathrm{Clip3}(0,\, W-1,\, 2j+k)] \tag{2.32}$$

The second process serves to convert from the 4:2:2 to the 4:2:0 format and is derived from the intermediate samples as follows:

$$r[i][j] = \left(\sum_{k=-1}^{1} c_2[k] \cdot f[\mathrm{Clip3}(0,\, H-1,\, 2i+k)][j] + \mathit{offset}\right) \gg \mathit{shift} \tag{2.33}$$

where $\gg$ denotes a right bit shift.
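A sketch of the downsampling filter of Equations 2.32–2.33 (a direct transcription; `plane` is assumed to be one chroma plane as a 2-D integer array with even dimensions):

```python
import numpy as np

C1 = [1, 6, 1]        # horizontal coefficients (Table 2.2)
C2 = [0, 4, 4]        # vertical coefficients (Table 2.2)
SHIFT, OFFSET = 6, 32

def downsample_420(plane):
    """Filter one chroma plane from 4:4:4 down to 4:2:0 (Eqs. 2.32-2.33)."""
    H, W = plane.shape
    # horizontal stage: 4:4:4 -> 4:2:2 (intermediate signal f)
    f = np.zeros((H, W // 2), dtype=np.int64)
    for j in range(W // 2):
        for k in (-1, 0, 1):
            f[:, j] += C1[k + 1] * plane[:, np.clip(2 * j + k, 0, W - 1)]
    # vertical stage: 4:2:2 -> 4:2:0, with rounding offset and shift
    r = np.zeros((H // 2, W // 2), dtype=np.int64)
    for i in range(H // 2):
        for k in (-1, 0, 1):
            r[i, :] += C2[k + 1] * f[np.clip(2 * i + k, 0, H - 1), :]
    return (r + OFFSET) >> SHIFT
```

The two filter stages each have a gain of 8, so the combined gain of 64 is removed by the final 6-bit shift with a rounding offset of 32.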

4:2:0 to 4:4:4

In the reconstruction process, an upsampling stage is necessary to return to the original, uncompressed sample format.

The chroma upsampling process considered in this work is done by filtering: first, vertical filtering is applied on the 4:2:0 picture, and then horizontal filtering. Consider that shift1 equals 6, offset1 equals 32, shift2 equals 12 and offset2 equals 2048 (each filter in Table 2.3 has a gain of 64, so the cascaded odd-sample path needs a 12-bit shift with a rounding offset of 2048). Table 2.3 defines the coefficients involved in the filtering processes.

Table 2.3 – Chroma upsampling coefficients

Phase k   Coef c[k]   Coef d0[k]   Coef d1[k]
  −2         −4           −2           −4
  −1         36           16           54
   0         36           54           16
   1         −4           −4           −2

An intermediate signal is generated as follows [28]:

$$f[2i][j] = \sum_{k=-2}^{1} d_0[k] \cdot s[\mathrm{Clip3}(0,\, H-1,\, i+k)][j] \tag{2.34}$$

$$f[2i+1][j] = \sum_{k=-2}^{1} d_1[k] \cdot s[\mathrm{Clip3}(0,\, H-1,\, i+k+1)][j] \tag{2.35}$$

Finally, the output samples r[i][j] are obtained as follows [28]:

$$r[i][2j] = (f[i][j] + \mathit{offset}_1) \gg \mathit{shift}_1 \tag{2.36}$$

$$r[i][2j+1] = \left(\sum_{k=-2}^{1} c[k] \cdot f[i][\mathrm{Clip3}(0,\, W-1,\, j+k+1)] + \mathit{offset}_2\right) \gg \mathit{shift}_2 \tag{2.37}$$

Both the downsampling and the upsampling techniques reviewed in this section are applied in the Y'Cb'Cr' domain, and the equations described serve to program each technique efficiently.
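A matching sketch of the reconstruction filter of Equations 2.34–2.37, under the same assumptions (including the shift2 = 12 value inferred above):

```python
import numpy as np

D0 = [-2, 16, 54, -4]   # vertical filter, even output rows (Table 2.3)
D1 = [-4, 54, 16, -2]   # vertical filter, odd output rows (Table 2.3)
C  = [-4, 36, 36, -4]   # horizontal filter, odd output columns (Table 2.3)
SHIFT1, OFFSET1 = 6, 32
SHIFT2, OFFSET2 = 12, 2048   # two cascaded 6-bit filter gains

def upsample_444(plane):
    """Filter one chroma plane from 4:2:0 back up to 4:4:4 (Eqs. 2.34-2.37)."""
    H, W = plane.shape
    # vertical stage: interpolate rows (intermediate signal f, gain 64)
    f = np.zeros((2 * H, W), dtype=np.int64)
    for i in range(H):
        for k in (-2, -1, 0, 1):
            f[2 * i, :] += D0[k + 2] * plane[np.clip(i + k, 0, H - 1), :]
            f[2 * i + 1, :] += D1[k + 2] * plane[np.clip(i + k + 1, 0, H - 1), :]
    # horizontal stage: even columns only need renormalizing; odd are filtered
    r = np.zeros((2 * H, 2 * W), dtype=np.int64)
    for j in range(W):
        r[:, 2 * j] = (f[:, j] + OFFSET1) >> SHIFT1
        acc = np.zeros(2 * H, dtype=np.int64)
        for k in (-2, -1, 0, 1):
            acc += C[k + 2] * f[:, np.clip(j + k + 1, 0, W - 1)]
        r[:, 2 * j + 1] = (acc + OFFSET2) >> SHIFT2
    return r
```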


2.3 Video Compression

A digital video signal is a sequence of frames displayed at a constant interval on a television set. Each frame contains an array of pixels, where each pixel is a combination of tristimulus values representing a color. Linking these concepts with those described in the previous sections, a digital video is defined by its color model, color space, chroma subsampling, spatial resolution, bits per sample and frame rate. In this work, the digital video standard used is the Full HD resolution format, whose main parameters are summarized in Table 2.4:

Table 2.4 – Full HD parameters

Resolution      1920 x 1080
Frame Rate      30 Hz
Color Model     YCbCr
Chroma Format   4:2:0
Coding Format   8 or 10 bits

According to these values, the bitrate of an uncompressed Full HD digital video signal approaches or exceeds 1 Gbps. In current communication channels it would be unfeasible and unimaginable to use all the available spectrum just to send one video signal. Hence, the video compression process plays a crucial role: its main objective is to encode a digital video signal with the least possible information without degrading its subjective quality.
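As an illustrative calculation (the arithmetic is a worked example, not taken from the thesis), using the parameters of Table 2.4:

$$1920 \times 1080 \times 30\ \mathrm{fps} \times 24\ \mathrm{bpp} \approx 1.49\ \mathrm{Gbps} \quad (4{:}4{:}4,\ 8\ \text{bits per sample}),$$

$$1920 \times 1080 \times 30\ \mathrm{fps} \times 12\ \mathrm{bpp} \approx 0.75\ \mathrm{Gbps} \quad (4{:}2{:}0,\ 8\ \text{bits per sample}).$$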

Based on the study of the human visual system, there is redundant information that is not perceived by the human eye. A video encoder implements different compression strategies to remove the redundant data according to the case. A reference model of a video encoder contains three main units: a temporal model, which attempts to reduce temporal redundancy by motion-estimating objects between frames; a spatial model, which attempts to reduce spatial redundancy by transforming the difference of neighboring blocks in a frame to a frequency domain; and an entropy encoder, which attempts to reduce the statistical redundancy of the transformed residual blocks by using the information of the coefficients of residual blocks of past frames (this is part of information theory).


To address the goal of most compression methods, it is important to discuss the trade-off between video quality, bitrate and coding complexity. The most desirable thing in a compression scheme is to increase the quality of the video and reduce the bitrate; however, this is achieved in exchange for a high level of processing, since it requires more iterations when deciding the best type of prediction for a particular block, using information either from the current frame or from past frames. The RDO (Rate-Distortion Optimization) process measures the cost of encoding a video to a certain quality and bitrate with different operation modes [32].

Rate-distortion curves are very common in the video compression field. A curve assigns an objective metric to each bitrate and serves to show whether there is a coding gain between two compression schemes, or to verify whether the same values can be achieved. In this work, this type of curve is used to check whether the proposed scheme can reach the same metrics as the reference HDR compression scheme.

The video codecs were standardized through the joint efforts of global organizations such as the ITU-T (International Telecommunication Union - Telecommunication Standardization Sector), MPEG and the JCT-VC (Joint Collaborative Team on Video Coding). Although there are other efforts, such as the VP8 codec [33] developed by On2 Technologies, the MPEG family of codecs is accepted by most digital television standards, whether terrestrial, cable or satellite, and by the best-known streaming services. With each new codec, the coding gain is always expected to be 50% with respect to its predecessor. In this section, the H.264 and H.265 codecs are addressed.

2.3.1 H.264

MPEG-4 AVC/ITU-T H.264 [12] is an industry standard for video coding which offers a set of tools and a syntax for video compression. It was first published in 2003 and is based on the concepts of earlier standards such as MPEG-2 [34].

The operation of the H.264 codec is organized in units of a macroblock (MB), a 16 x 16 pixel group. The process starts by splitting the input video signal into MBs. Then each MB is subtracted from a predicted MB to form a residual MB, as shown in Figure 2.7. The type of prediction depends on which redundant information must be removed. To remove spatial redundancy, the Intra-Prediction process uses previously coded pixels to form the prediction of the current block. A frame with this type of prediction is called an I-Frame, and it is used as the reference image because it is the first frame decoded when an STB is turned on. Inter-Prediction uses the information of similar regions in previously coded frames to form the prediction of the current block, removing temporal redundancy. This prediction defines motion estimation as the search for the most suitable inter prediction, and motion compensation as the subtraction of an inter prediction from the current MB. The frames with this type of prediction are designated B-Frames and P-Frames; the difference between them is that the first predicts a frame from two reference frames (either P-Frames or I-Frames), whereas the second makes use of a single reference frame. A group of pictures (GOP) specifies the order in which intra and inter frames are sent, as shown in Figure 2.6. Television displays 30 frames per second, hence a GOP is set to 15, or 0.5 seconds between intra frames. So, when channels are changed with the TV remote control, the maximum waiting time for the viewer is 0.5 seconds.

Figure 2.6 – GOP in H.264/AVC [35]

Figure 2.7 – Prediction in H.264/AVC for a macroblock [35]

Once the residual MB is generated, a transformation process is applied to decorrelate the pixel information, moving to the frequency domain thanks to the DCT (Discrete Cosine Transform). H.264 developed an approximation of the cosine transform, the 4x4 and 8x8 integer transform called the ICT (Integer Cosine Transform). It defines integer operations and avoids inverse-transform mismatches. These operations work at the bit level, using bitwise operations, which alleviates the computational load inherent in the processing of the DCT. The process is shown in Figure 2.8.

Figure 2.8 – Transform in H.264/AVC for a macroblock [35]

Figure 2.9 – Quantization in H.264/AVC for a macroblock [35]

At the output of the transform process, the transformed coefficients are quantized by dividing each coefficient by an integer value, as shown in Figure 2.9. H.264 defines a QP (Quantization Parameter) to control the quality and the bitrate of the video coder. The parameter can take 52 values, and an increase of 1 in its value reduces the final bitrate of the sequence by approximately 12% [35]. This work uses four QP values to evaluate different compression scenarios. Finally, entropy coding is used to remove the statistical redundancy of the quantized coefficients. H.264 defines two types of coding: CABAC (Context-Adaptive Binary Arithmetic Coding) [36] and CAVLC (Context-Adaptive Variable Length Coding) [37]. The compression efficiency of the first is 5-15% better compared to CAVLC [35]. However, the complexity of CABAC can increase the computational load of the system. Hence, an in-depth analysis of the trade-off between quality, bitrate and coding complexity is required when designing a compression system. Figure 2.10 presents the entire compression process for an MB in H.264.

Figure 2.10 – Basic coding structure for H.264/AVC for a macroblock [35]
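As a rough rule of thumb derived from the 12% figure above (the arithmetic is illustrative, not a statement of the standard), an increase of 6 in QP roughly halves the bitrate, since

$$(1 - 0.12)^6 \approx 0.46.$$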

The H.264 standard specifies Profiles and Levels to make the use of the standard more flexible for users. A Profile defines the constraints on the type of digital video supported at the input. The Main Profile supports 8 bits per sample, 4:2:0 chroma format, CAVLC or CABAC coding, the use of I-frames, B-frames and P-frames, and the 4x4 transform. The High Profile supports the same features but adds the 4:0:0 (monochrome) chroma format, the 8x8 transform and a different QP for Cr and Cb. The High10 Profile supports 9 or 10 bits per sample. In this work, the Main Profile is used because it is more suitable for broadcast and entertainment applications.

2.3.2 H.265

HEVC/ITU-T H.265 [11] is the most recent video project of MPEG, and the first edition of the standard was published in 2013. The emergence of beyond-HD formats, such as UHD television, requires a more efficient coding standard compared to the prior standards. The compression performance with respect to H.264 is in the range of 50% bit reduction at similar visual quality [38].

HEVC uses the same hybrid approach as many of the previous standards, using a combination of inter- and intra-frame prediction and 2D transform coding [38].

HEVC has been designed both to have better compression efficiency with 4K and 8K formats and to improve on the current processing architecture. Typically, the coding process of a frame is sequential: each MB in H.264 needs to be coded before the neighboring MBs can be encoded. This leads to efficiency problems; HEVC increases the use of parallel processing architectures by adding wavefront parallel processing (WPP) [38], as shown in Figure 2.12 (c).

The macroblock was replaced by the CTU (Coding Tree Unit) in HEVC. The CTU has a non-fixed size, which can be larger than a traditional MB, up to 64x64 pixels. Each CTU can be subdivided into coding units (CU) using a quadtree structure, as shown in Figure 2.11. The prediction unit (PU) splits each CU and uses an intra or inter prediction. The transform process is applied to each PU and forms the transform units (TU). Unlike H.264, the sizes of the transform are 4x4, 8x8, 16x16 and 32x32. HEVC specifies the 4x4 DST (Discrete Sine Transform) as an alternative, since it presents better coding efficiency compared to DCT-only solutions [39].

The main improvements in H.265 lie in the increased number of intra prediction modes, up to 33 directional modes (compared to eight such modes in H.264). Advanced motion vector signaling derives several most probable candidates from data in adjacent prediction blocks, and 7-tap or 8-tap filters are used for the interpolation of fractional sample positions (compared to the six-tap filtering in H.264). CABAC is the only entropy coder standardized in HEVC [38].

Figure 2.11 – Subdivision of a CTU into CUs. (a) Spatial partitioning. (b) Corresponding quadtree representation [40]

HEVC specifies the slice, a partition of the picture that can be decoded independently from other slices in the same picture, as shown in Figure 2.12 (a), and the tile, which divides the picture into rectangular-shaped groups, as shown in Figure 2.12 (b). The main purpose of these partitions is parallel processing, error resilience and resynchronization in the event of data losses [38].

Figure 2.12 – Subdivision of a frame into (a) slices and (b) tiles. (c) Illustration of WPP, extracted from [38]

The H.265 standard specifies the Main Profile, which allows a bit depth of 8 bits per sample with 4:2:0 chroma subsampling, and the Main 10 Profile, which allows a bit depth of 8 or 10 bits with 4:2:0 chroma format.

2.4 The Legacy Television Architecture

The theoretical and technical fundamentals of a digital video signal and the video codecs reviewed in the previous sections are a fundamental part of the infrastructure of the current television system.

Each of the processing steps in a television system assumes that tristimulus values have a linear response with respect to human perception. When video engineers design or propose improvements to current television systems, they assume that the system is linear. However, this is not entirely true: each processing step has a linear response with respect to its internal circuitry (otherwise, the communication between physical interfaces would be impossible), but the luminance samples, a physical quantity, have a non-linear response with respect to human perception. Therefore, from the beginning of television, it was important to apply a perceptual uniformity process called gamma correction, which matched the non-linear relation between the input voltages and output luminances of a CRT television.

This section will explain the details of a gamma system and the generation of the transport stream in a television system.

2.4.1 Gamma System

Real-world physical luminance, given in nits, represents the light intensity of a scene and is proportional to absolute scene light. These samples are the basic information and the input of a Television System. The scene-referred representation is very useful because it defines the actual relation between light intensity and the scene. Contrary to the human visual system, current television sets do not have the capability to display high dynamic range scenes due to technological limitations. Studies concluded that the human eye is sensitive to relative, rather than absolute, luminance values [41]. For this reason, the display in a television set has always shown a scaled version of the real scene in terms of luminance.

In a typical Television system, two types of environment are important: the Reference Viewing Environment and the Non-Reference Viewing Environment. The first one is used to create the artistic intention; hence, two stages are necessary: the Reference Opto-Optical Transfer Function (OOTF), whose main function is to map the high dynamic range into the standard dynamic range, and the Artistic Adjust, whose main function is to adjust the rendering intention of the producer. The second one is used to simulate the receiver side. Under ideal conditions, such as a dim room, a human viewer could watch the same image on a consumer display (Non-Reference display) as on a Reference display; however, due to external conditions, the Non-Reference Viewing Environment needs to adjust parameters such as contrast, color or brightness to match the initial intention of the television producer. Figure 2.13 schematizes the conceptual TV system. Currently, reference displays use the gamma correction process in the reference viewing environment and, at the receiver side, many TV manufacturers have developed consumer displays with the gamma law as the Electro-Optical converter.

Figure 2.13 – The conceptual television system [18]

Gamma correction is inherently the conversion from physical-domain values into perceptual-domain values. In an end-to-end reproduction, the gamma function is not mathematically linear, but it allows the original scene to be matched on actual limited-brightness television sets with no distortion. Moreover, a CRT television set has a non-linear characterization between voltage and luminance, as shown in Figure 2.14.

Figure 2.14 – CRT transfer function [25]

However, by a remarkable coincidence, the CRT transfer function is very nearly the inverse of the luminance-to-lightness relationship [25]. For this reason, a typical Television system uses the gamma law as part of the design of the Electro-Optical Transfer Function (EOTF), which converts the video signal into the linear light output of the display, and of the Opto-Electrical Transfer Function (OETF), which converts linear scene light into the digital video signal, typically within a camera [18].

The EOTF mostly used by the traditional television system is standardized by Recommendation ITU-R BT.1886 [42] and is specified as follows:

L = a (max(V + b, 0))^γ    (2.38)

where:

L : Screen luminance in nits.
V : Input video signal level, normalized to the range 0 to 1.
a : Variable for user gain (legacy "contrast" control), a = (L_W^{1/γ} − L_B^{1/γ})^γ.
b : Variable for user black lift (legacy "brightness" control), b = L_B^{1/γ} / (L_W^{1/γ} − L_B^{1/γ}).
γ : Exponent of the power function, 2.2 or 2.4. These values have been shown to be a satisfactory match to the legacy CRT display.
L_W : Screen luminance for white; the reference setting is 100 nits.
L_B : Screen luminance for black; for moderate black level settings, 0.1 nits.
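A minimal numerical sketch of Equation 2.38, assuming the reference settings quoted above (L_W = 100 nits, L_B = 0.1 nits, γ = 2.4):

# BT.1886 EOTF (Equation 2.38): normalized video level V -> luminance in nits.
GAMMA = 2.4
L_W, L_B = 100.0, 0.1   # reference white and black (nits)
a = (L_W ** (1 / GAMMA) - L_B ** (1 / GAMMA)) ** GAMMA
b = L_B ** (1 / GAMMA) / (L_W ** (1 / GAMMA) - L_B ** (1 / GAMMA))

def eotf_bt1886(v):
    # Screen luminance for a normalized video signal level v in [0, 1].
    return a * max(v + b, 0.0) ** GAMMA

print(eotf_bt1886(0.0))  # ~0.1 nits (L_B)
print(eotf_bt1886(1.0))  # ~100 nits (L_W)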

And the OETF for an HD TV system is standardized by Recommendation ITU-R BT.709 [16] and is specified as follows:

V = 1.099 L^{0.45} − 0.099   for 1 ≥ L ≥ 0.018
V = 4.5 L                    for 0.018 > L ≥ 0    (2.39)

where:

L : Luminance of the image, normalized so that 0 ≤ L ≤ 1.
V : Corresponding electrical signal.
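Equation 2.39 translates just as directly into code; a minimal sketch:

# BT.709 OETF (Equation 2.39): linear scene light L in [0, 1] -> signal V.
def oetf_bt709(l):
    # The linear segment near black avoids the infinite slope of the pure
    # power law at zero; the power-law segment covers the rest of the range.
    if l < 0.018:
        return 4.5 * l
    return 1.099 * l ** 0.45 - 0.099

print(oetf_bt709(0.018))  # ~0.081, where the two segments meet
print(oetf_bt709(1.0))    # 1.0 at reference white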

According to BT.1886, "the ITU Radiocommunication Assembly considers that for consistency of picture representation, it is desirable that newly introduced display technologies have an EOTF that closely matches the CRT's response" [42]. This is controversial because it encourages manufacturers to develop technology with a limited dynamic range; new EOTF and OETF proposals aim at a new concept, based on psychovisual models and the contrast sensitivity function of the eye, to enhance visual quality and performance compared with the CRT technology [19].

One advantage of gamma correction is the performance gain when encoding luminance information, in terms of bits per sample. Linear-light luminance needs about 14 bits to represent one sample, so the system would encode it inefficiently. However, when gamma correction is applied, luminance encoding needs only about 8 or 9 bits per sample, because the human viewer responds roughly logarithmically to relative luminance [25]; thus, some adjacent luminance levels are impossible for the human eye to distinguish.


A conventional luma and color-difference encoder applies the gamma transfer function to the linear RGB samples. The output signals are labeled as uniform samples in the perceptual domain, or non-linear samples, and are represented with a prime mark when used in the video workflow: for example, the non-linear samples R′G′B′ or Y′Cb′Cr′ depicted in Figure 2.15.

Figure 2.15 – Conventional luma/color difference encoder [25]
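A minimal sketch of such an encoder, reusing oetf_bt709 from the previous snippet and assuming the BT.709 luma coefficients (K_R = 0.2126, K_B = 0.0722) with analog signal ranges, i.e. without the digital offsets and quantization of a real interface:

# Conventional encoder: gamma-correct each channel, then form luma and
# scaled color differences. Y' lies in [0, 1]; Cb', Cr' in [-0.5, 0.5].
K_R, K_B = 0.2126, 0.0722

def rgb_to_ycbcr(r, g, b):
    rp, gp, bp = oetf_bt709(r), oetf_bt709(g), oetf_bt709(b)  # R'G'B'
    y = K_R * rp + (1 - K_R - K_B) * gp + K_B * bp            # luma Y'
    cb = (bp - y) / (2 * (1 - K_B))                           # scaled B' - Y'
    cr = (rp - y) / (2 * (1 - K_R))                           # scaled R' - Y'
    return y, cb, cr

print(rgb_to_ycbcr(1.0, 1.0, 1.0))  # white -> (1.0, 0.0, 0.0)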

Finally, Figure 2.16 schematizes the BT.709 HD TV system architecture; this is much closer to an actual Television System and is the reference for this work, because the proposed method maintains backward compatibility with this system.


2.4.2 MPEG-2 Transport Stream

The transport of an uncompressed digital video signal is done through physical interfaces known as digital interfaces; the most widely used in broadcasting is the Serial Digital Interface (SDI), standardized by the Society of Motion Picture and Television Engineers (SMPTE). Table 2.5 summarizes the HD serial interfaces. It considers that the video samples are non-linearly coded samples, represented by Y′Cb′Cr′.

Table 2.5 – SMPTE digital interfaces

Standard        | Name             | Bitrate    | Video Format | Color Model | Color Subsampling | Bit Depth
SMPTE 292M [43] | HD-SDI           | 1.485 Gbps | 720p, 1080i  | YCbCr       | 4:2:0             | 8 or 10 bits
SMPTE 372M [44] | Dual Link HD-SDI | 2.970 Gbps | 1080p        | YCbCr       | 4:2:0             | 8 or 10 bits
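The HD-SDI figure in Table 2.5 can be checked with a short computation; the raster parameters below assume the SMPTE 274M 1080i/30 format (2200 total samples per line, 1125 total lines) and two 10-bit words per sample period on the interface:

# Sanity check of the 1.485 Gbps HD-SDI bitrate (assumed 1080i/30 raster).
samples_per_line = 2200   # active samples plus horizontal blanking
lines_per_frame = 1125    # active lines plus vertical blanking
frames_per_second = 30
words_per_sample = 2      # one Y' word plus one multiplexed Cb'/Cr' word
bits_per_word = 10

bitrate = (samples_per_line * lines_per_frame * frames_per_second
           * words_per_sample * bits_per_word)
print(bitrate / 1e9, "Gbps")  # 1.485 Gbps, matching Table 2.5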

The first stage in an end-to-end television system, the Production and Post-Production Stage, is responsible for the generation of the uncompressed digital video signal, adjusting the rendering intention of the producer. Then, an SDI signal carries the uncompressed data out of this stage to the Processing Stage, which compresses the digital video to lower bitrates, between 5 and 20 Mbps per channel, and multiplexes all the incoming compressed signals into a single Transport Stream (TS). This stage is the core of the current digital television system because it allows the video, audio and metadata to be signaled into a single program, avoiding lip-sync issues. An Elementary Stream (ES) carries the information of video, audio or metadata. A Program Stream (PS) is the union of different elementary streams, and the union of several program streams generates the TS signal, as shown in Figure 2.17. The MPEG multiplexer adds a timestamp to each program and generates a single stream.

Figure 2.17 – Multiplexed MPEG-2 transport stream packets [45]

MPEG-2 Systems is formally known as ISO/IEC 13818-1 and ITU-T Rec. H.222.0 [46] and defines the bit representation of a TS. According to this standard, a TS consists of 188-byte transport stream packets with four bytes of header, where one byte is for synchronization, 1 bit is for the transport error indicator and 13 bits are for the packet identifier (PID); the remaining 184 bytes of the packet are the payload, as shown in Figure 2.18.

Figure 2.18 – MPEG-2 transport stream packet [45]

The TS signalizes each packet with a Packet Identifier (PID), and the information is accessed through a hierarchical process. The most relevant information, after the synchronization of a TS packet, is the Program Association Table (PAT), which is assigned a PID equal to 0x00. This table indicates the PID of the Program Map Table (PMT) of the corresponding program, represented by its service identification. The PMT identifies and indicates the location of the elementary streams which make up a program, whether video, audio or metadata. Once the video PID is stored, the payload data of these packets are signaled by the bit representation of the video codec,
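The header layout described above maps onto a few bit operations. The sketch below extracts only the fields mentioned in the text, the sync byte (whose fixed value is 0x47), the transport error indicator and the 13-bit PID, and ignores the other header flags; the sample packet is fabricated for illustration:

# Parse the 4-byte header of a 188-byte MPEG-2 TS packet.
TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47

def parse_ts_header(packet):
    if len(packet) != TS_PACKET_SIZE or packet[0] != SYNC_BYTE:
        raise ValueError("not a valid transport stream packet")
    transport_error = (packet[1] >> 7) & 0x1      # 1-bit transport error indicator
    pid = ((packet[1] & 0x1F) << 8) | packet[2]   # 13-bit packet identifier
    payload = packet[4:]                          # remaining 184 bytes
    return transport_error, pid, payload

# A fabricated PAT packet: the PAT always travels with PID 0x00.
packet = bytes([0x47, 0x40, 0x00, 0x10]) + bytes(184)
err, pid, payload = parse_ts_header(packet)
print(err, hex(pid), len(payload))  # 0 0x0 184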
