Master's Dissertation
Exploiting Block Size as a Parameter to Reduce
Integer Motion Estimation Complexity on HEVC
Luiz Henrique De Lorenzi Cancellier
Universidade Federal de Santa Catarina
Graduate Program in Computer Science
Dissertation submitted to the Graduate Program in Computer Science of the Universidade Federal de Santa Catarina in fulfillment of the requirements for the degree of Master in Computer Science.
Advisor: Prof. Dr. José Luís A. Güntzel
Florianópolis
2019
Catalog record prepared by the author using the automatic generation program of the UFSC University Library.
Cancellier, Luiz Henrique De Lorenzi
Exploiting Block Size as a Parameter to Reduce Integer Motion Estimation Complexity on HEVC / Luiz Henrique De Lorenzi Cancellier; advisor, José Luís Almada Güntzel, 2019.
90 p.
Dissertation (master's) - Universidade Federal de Santa Catarina, Centro Tecnológico, Graduate Program in Computer Science, Florianópolis, 2019.
Includes references.
1. Computer Science. 2. Video Coding. 3. HEVC. 4. Integer Motion Estimation. 5. Block Sizes. I. Güntzel, José Luís Almada. II. Universidade Federal de Santa Catarina. Graduate Program in Computer Science. III. Title.
I dedicate this work to my parents, Enio Luiz and Eunice Edna, and to my brother Gabriel.
ACKNOWLEDGMENTS
I begin by thanking my advisor, Prof. Dr. José Luís A. Güntzel, for all the time invested in supervising this work and for his attention to small details. Although he could not officially be my co-advisor, since this work was produced in parallel with his doctorate, I consider that you, Ismael Seidel, contributed enough to deserve that position (even if informally). Both Prof. Güntzel and Ismael Seidel were fundamental to my academic training.
I also thank my laboratory colleagues. From my point of view, they all form one big family. Problems exist in the best of families, but we always (or almost always) find a way around them. Among the members, I highlight André B. Bräscher and Tiago A. Fontana, who gave me all possible support when my grandfather passed away. It was the first loss of a close relative, and we were attending a scientific event far from home. We presented our work successfully, and I believe I would not have managed to do so without your help.
I also thank my family. Not only my parents and brother, but also my grandparents, uncles, aunts and cousins, for understanding the importance of education. Everyone's support is fundamental to my journey, and I hope to serve as an example for the younger generations.
Last but not least, this work was carried out with the support of the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. The challenges encountered during this master's program would certainly not have been overcome without full-time dedication, which was only possible thanks to the resources provided by CAPES.
“Work it harder, make it better Do it faster, makes us stronger More than ever, hour after hour Work is never over.” (Daft Punk - Harder, Better, Faster, Stronger)
ABSTRACT (RESUMO)
The increase in video resolutions, combined with the growing demand for streaming services, requires advanced compression techniques to make it feasible to store and transfer the large volume of data generated. High Efficiency Video Coding (HEVC), the state-of-the-art coding standard, reduced the bitrate by 50% in comparison with its predecessor, AVC, while maintaining the perceived visual quality. However, this compression rate comes with a 2× to 10× increase in encoding complexity. Related works propose to reduce the encoding complexity by eliminating operations based on information obtained from different depth levels of the recursive search that the encoder performs to find the best way to represent each block of a frame. One such piece of information is the size of the evaluated block. However, its role in complexity reduction is not clear, since other variables influence the algorithms presented by the related works. This work explored the impact of the block size as a parameter to reduce the encoding complexity. The exploration was carried out in the context of Integer Motion Estimation, which is one of the key steps in the encoding flow. To highlight the impact of the block size on encoding, the Integer Motion Estimation step was selectively eliminated based only on this characteristic. The proposed approach resulted in three strategies, which made it possible to identify that the gains did not compensate for the large Integer Motion Estimation effort required to predict some block sizes. The best strategy, which eliminates the Integer Motion Estimation of nonsquare blocks with sizes between 8×8 and 64×64, as well as of all asymmetric ones, reduced the Integer Motion Estimation complexity by 43.9% on average, while the loss in encoding efficiency was below 1% in BD-Rate. In addition, further explorations were carried out to analyze the results of the best proposed strategy combined with different encoding parameters and fast algorithms.
Keywords: Video Coding. HEVC. Motion Estimation
EXTENDED ABSTRACT (RESUMO EXPANDIDO)
Introduction
The increase in video resolutions, combined with the growing demand for streaming services, requires advanced compression techniques to make it feasible to store and transfer the large volume of data generated. High Efficiency Video Coding (HEVC), the state-of-the-art coding standard, reduced the bitrate by 50% in comparison with its predecessor, while maintaining the perceived visual quality. However, this compression rate comes with a 2× to 10× increase in encoding complexity.
Motion Estimation is among the most complex steps of the encoding process, so several works in the literature seek ways to reduce its complexity. The goal of Motion Estimation is to predict the content of a block of a video frame using data extracted from other, already encoded frames. Motion Estimation is divided into two further steps, called Integer Motion Estimation and Fractional Motion Estimation. The scope of this work is Integer Motion Estimation.
L. Li et al. (2013) proposed eliminating Integer Motion Estimation from the encoding flow of Advanced Video Coding (AVC), the predecessor of HEVC. They showed that the encoder compensated for the absence of Integer Motion Estimation by using other tools, which reduced the loss in coding efficiency. While that study makes contributions in the context of AVC, it does not apply to HEVC. Eliminating Integer Motion Estimation from the Test Model (HM), the HEVC reference implementation, results in efficiency losses that make its encoding worse than that of its predecessor.
In the context of HEVC, T. S. Kim et al. (2017) reduce the encoding complexity by exploring motion information from different levels of the recursive block partitioning performed by the encoder itself. Pan et al. (2016) and Ahn et al. (2016) also explore data from different levels of the recursive block partitioning, but emphasize the use of the block size in their proposals. Despite their contributions, these works do not isolate the impact of the block size. I. K. Kim et al. (2012) and Sinangil et al. (2013) performed experiments seeking to identify the impact of the block size on HEVC. However, the former only reports that the partitioning scheme introduced by HEVC yields better coding efficiency, while the latter concentrates only on evaluations related to the development of hardware blocks. This work aims to address an open problem in the literature, which consists of isolating the impact of the block size on the encoder in order to identify the sizes that could be used to reduce the encoding complexity in the context of Integer Motion Estimation.
Objectives
The main goal of this work is to identify whether the block size can be exploited as a parameter to reduce the complexity of Integer Motion Estimation, while keeping the loss in coding efficiency as close to 0% as possible.
The following secondary goals were necessary to accomplish the main one:
• To propose Integer Motion Estimation elimination strategies that use only the block size;
• To identify how the encoder adapts its decisions to cope with the elimination of Integer Motion Estimation;
• To identify which encoder parameters affect the coding efficiency and complexity reduction results of the proposed strategies.
Method
To evaluate the impact of the block size on encoding, three different Integer Motion Estimation elimination strategies were defined. Given U as the set of all possible block sizes, each strategy defines:
\[ P_0 = \emptyset, \qquad P_1, P_2, \ldots, P_n \subset U \;\mid\; P_i \cap P_j = \emptyset \;\; \forall\, i \neq j \tag{1} \]
Given p ∈ U, an elimination strategy classifies p using the following equation:
\[ f(p) = \ell \iff p \in P_\ell \tag{2} \]
where \ell \in \mathbb{N}^* is the level of p. Each encoding run is configured with a parameter called the elimination level, which defines a lower bound on the level a block size must have for Integer Motion Estimation to be executed. At elimination level 0, no Integer Motion Estimation is eliminated, and this experiment generates the results used as reference (baseline). At elimination level 1, block sizes categorized at level 1 have their Integer Motion Estimation eliminated. At elimination level n, all block sizes categorized at level n have their Integer Motion Estimation eliminated, as well as all other block sizes classified with level \ell < n. The results of each elimination level under the different strategies make it possible to identify the block sizes best suited to be used as a parameter for reducing the encoder complexity.
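The classification and the elimination-level rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `(width, height)` representation of a block size and the two example sets are hypothetical and are not taken from the HM software or from the dissertation's implementation.

```python
from typing import Dict, FrozenSet, Optional, Tuple

BlockSize = Tuple[int, int]  # (width, height) in samples; representation is an assumption


def build_level_map(levels: Dict[int, FrozenSet[BlockSize]]) -> Dict[BlockSize, int]:
    """Build f(p): block size -> level, enforcing that P_1..P_n are pairwise disjoint."""
    level_of: Dict[BlockSize, int] = {}
    for level, sizes in levels.items():
        if level < 1:
            raise ValueError("levels start at 1; P_0 is the empty set")
        for size in sizes:
            if size in level_of:
                raise ValueError(f"{size} is in both P_{level_of[size]} and P_{level}")
            level_of[size] = level
    return level_of


def ime_is_eliminated(size: BlockSize, level_of: Dict[BlockSize, int],
                      elimination_level: int) -> bool:
    """True when IME is skipped for this block size at the given elimination level."""
    level: Optional[int] = level_of.get(size)  # None: the size belongs to no P_l
    return level is not None and level <= elimination_level


# Illustrative sets only (assumption): two levels of nonsquare symmetric sizes.
EXAMPLE_LEVELS = {
    1: frozenset({(32, 64), (64, 32)}),
    2: frozenset({(16, 32), (32, 16)}),
}
LEVEL_OF = build_level_map(EXAMPLE_LEVELS)
```

With elimination level 0 nothing is skipped, which corresponds to the baseline run; raising the level skips every block size whose level is less than or equal to it.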
To evaluate the coding efficiency results, the BD-Rate metric was used. It compares the average percentage difference in bitrate needed to encode at the same quality (measured in PSNR) using two different configurations. As for complexity, the metric used considers the total area in which Integer Motion Estimation performed some search (measured in number of samples). This complexity metric was adopted because it is not biased by software or hardware architectures, and because it increases the reproducibility of the results presented in this work.
Results and Discussion
The first proposed strategy showed a large loss in coding efficiency. Using the second elimination level resulted in an average BD-Rate increase of 2.34% (the worst case reported by the related works was 2.63%). In addition, the results were quite specific to each video sequence, which made it impossible to identify a common behavior that could be exploited. Despite these negative characteristics, this strategy made it possible to identify that the encoder does adapt its decisions to cope with the elimination of Integer Motion Estimation. When a block size had its Integer Motion Estimation eliminated, the encoder made fewer decisions with that size and favored asymmetric blocks with the same value of N, or larger blocks.
The second proposed strategy showed a smaller complexity reduction than the first one. Despite this drawback, the impact on coding efficiency was significantly reduced, to an average of 1.34% BD-Rate in the worst case. In addition, the results of strategy 2 were much more predictable and provided the basis for the definition of the third strategy.
The third strategy defines the sets of block sizes eliminated at each level as shown in Table 1. The block sizes are shown in Figure 1.

Level   N         Block type
0       -         -
1       32        N×2N and 2N×N
2       16        N×2N and 2N×N
3       8         N×2N and 2N×N
4       4 to 32   Asymmetric
5       4         N×2N and 2N×N
Table 1: Levels, with their respective block sizes, defined by strategy 3.

Figure 1: Possible block sizes, where N ∈ {4, 8, 16, 32}. Symmetric: 2N×2N, N×2N, 2N×N, N×N. Asymmetric: N/2×2N (L), N/2×2N (R), 2N×N/2 (U), 2N×N/2 (D).

The average results obtained for each level of strategy 3 are shown in Table 2. Note that at elimination level 4 it was possible to reduce the complexity of Integer Motion Estimation by 43.85% while increasing BD-Rate by only 0.8756%. At level 5 the complexity is reduced even further, but this reduction is no longer considered significant and the BD-Rate increase exceeds 1%. Strategy 3 made it possible to achieve the main goal of this work: it can be stated that the block sizes at levels 1, 2, 3 and 4 can be used as a parameter to reduce the complexity of Integer Motion Estimation with low impact on coding efficiency.
Level   BD-Rate (%)   Complexity Reduction (%)
1       0.0376        14.75
2       0.1341        26.44
3       0.3325        35.11
4       0.8756        43.85
5       1.3405        49.57
Table 2: Average BD-Rate and complexity reduction results for each level defined by strategy 3.
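The reading of Table 2 given in the text (level 4 is the last level that stays below 1% BD-Rate, and the extra reduction obtained at level 5 is comparatively small) can be checked with a short Python sketch. The values are copied from Table 2; the function names and the idea of expressing the 1% limit as a "budget" are illustrative assumptions, not part of the original method.

```python
# (level, BD-Rate increase in %, IME complexity reduction in %), copied from Table 2.
RESULTS = [
    (1, 0.0376, 14.75),
    (2, 0.1341, 26.44),
    (3, 0.3325, 35.11),
    (4, 0.8756, 43.85),
    (5, 1.3405, 49.57),
]


def best_level(results, max_bd_rate):
    """Row with the largest complexity reduction whose BD-Rate stays below the budget."""
    candidates = [row for row in results if row[1] < max_bd_rate]
    return max(candidates, key=lambda row: row[2]) if candidates else None


def marginal_gains(results):
    """Extra complexity reduction (percentage points) of each level over the previous one."""
    gains = []
    previous = 0.0
    for level, _, reduction in results:
        gains.append((level, round(reduction - previous, 2)))
        previous = reduction
    return gains


print(best_level(RESULTS, max_bd_rate=1.0))
print(marginal_gains(RESULTS))
```

Under a 1% BD-Rate budget the selection returns the level 4 row, and the marginal gain of level 5 is the smallest of the five levels.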
The proposed approach resulted in three strategies, which made it possible to identify that the gains did not compensate for the large Integer Motion Estimation effort required to predict some block sizes. The best strategy, which eliminates the Integer Motion Estimation of nonsquare blocks with sizes between 8×8 and 64×64, as well as of all asymmetric ones, reduced the complexity of Integer Motion Estimation by 43.9%, on average, while the loss in coding efficiency was below 1% in BD-Rate.
Keywords: Video Coding. HEVC. Motion Estimation
ABSTRACT
Increasing video resolutions, combined with the growing demand for video streaming services on the internet, require advanced compression techniques to make it feasible to store and transfer the resulting huge amounts of data. High Efficiency Video Coding (HEVC), the state-of-the-art video coding standard, reduces the bitrate by 50% in comparison to its predecessor, AVC, while keeping the same perceptual visual quality. However, this compression gain comes with a 2× to 10× increase in complexity. Related works propose to reduce the encoding complexity by eliminating operations based on information from different depths of the recursive search that the encoder performs to find the best mode to encode each block of a frame. One such piece of information is the size of the block being evaluated, but because other variables influence the algorithms presented by related works, the role of the block size in complexity reduction is not clear. This work explores the impact of the block size as a parameter to reduce the encoding complexity. The exploration was done in the context of Integer Motion Estimation, which is one of the key steps of the encoding flow. Eliminating the Integer Motion Estimation based only on the block size highlights the effect of the block size on encoding. The proposed approach resulted in three elimination strategies, which showed that, for some block sizes, the payoff does not compensate for the large Integer Motion Estimation effort. The best strategy, which eliminates the Integer Motion Estimation of nonsquare blocks with sizes between 8×8 and 64×64, as well as of all asymmetric ones, reduces the Integer Motion Estimation complexity by 43.9%, on average, with an encoding efficiency loss lower than 1% in BD-Rate. Further explorations analyzed the best elimination strategy under different encoding parameters and in combination with fast algorithms.
LIST OF FIGURES
Figure 1 – Simplified hybrid model for video compression.
Figure 2 – Sequence of frames with POC, references and types identified.
Figure 3 – CTU partitioning and its quad-tree structure.
Figure 4 – Possible PU partitioning in HEVC.
Figure 5 – Block diagram of Prediction step.
Figure 6 – Integer Motion Estimation example.
Figure 7 – Fractional sample domain.
Figure 8 – AMVP candidates.
Figure 9 – Example of R-D curves.
Figure 10 – The elimination strategy based on Prediction Unit (PU) size.
Figure 11 – Example using strategy 1.
Figure 12 – IME complexity reduction and Bjøntegaard Delta Bitrate (BD-Rate) using strategy 1.
Figure 13 – Mode decision distribution using strategy 1.
Figure 14 – Inter decisions and block size distributions using strategy 1.
Figure 15 – Residual Motion Vector (MV) distribution.
Figure 16 – IME complexity reduction and BD-Rate using the first five levels of strategy 2.
Figure 17 – IME complexity reduction and BD-Rate using the last three levels of strategy 2.
LIST OF TABLES
Table 1 – Video sizes.
Table 2 – Characteristics of Common Test Conditions (CTC) video sequences.
Table 3 – Levels specification for strategy 1.
Table 4 – Levels specification for strategy 2.
Table 5 – Levels specification for strategy 3.
Table 6 – BD-Rate and IME complexity reduction using QPs 22 to 37.
Table 7 – BD-Rate and IME complexity reduction using QPs 12 to 27.
Table 8 – BD-Rate and IME complexity reduction results of 1080p and 4K video sequences.
Table 9 – BD-Rate and IME complexity reduction using HM fast algorithms.
Table 10 – Levels 1 and 2 results of strategy 1.
Table 11 – Levels 2 and 3 results of strategy 1.
Table 12 – Levels 1 and 2 results of strategy 2.
Table 13 – Levels 3 and 4 results of strategy 2.
Table 14 – Levels 5 and 6 results of strategy 2.
Table 15 – Levels 7 and 8 results of strategy 2.
Table 16 – Levels 1 and 2 results of strategy 3.
Table 17 – Levels 3 and 4 results of strategy 3.
LIST OF ABBREVIATIONS AND ACRONYMS
AMVP      Advanced Motion Vector Prediction
AVC       Advanced Video Coding
BD-Rate   Bjøntegaard Delta Bitrate
CFM       Coded block Fast Mode
CTC       Common Test Conditions
CTU       Coding Tree Unit
CU        Coding Unit
DCT       Discrete Cosine Transform
ECU       Early CU
ESD       Early Skip Detection
FME       Fractional Motion Estimation
GOP       Group of Pictures
HEVC      High Efficiency Video Coding
HM        Test Model
IME       Integer Motion Estimation
ME        Motion Estimation
MV        Motion Vector
MVP       Motion Vector Predictor
PMV       Predicted Motion Vector
POC       Picture Order Count
PSNR      Peak Signal-to-Noise Ratio
PU        Prediction Unit
QP        Quantization Parameter
RDO       Rate-Distortion Optimization
TU        Transform Unit
TZS       Test Zone Search
LIST OF SYMBOLS
B_can     Candidate block
B_ori     Original block
d         Distortion
j_cost    The Lagrangian rate-distortion cost of selecting a given candidate as reference
λ         The Lagrange multiplier
r         Rate
CONTENTS
1 INTRODUCTION
1.1 GOALS
1.2 CONTRIBUTIONS
1.3 ORGANIZATION
2 BASIC CONCEPTS
2.1 A SEQUENCE OF IMAGES
2.2 BLOCK STRUCTURE
2.3 RATE DISTORTION OPTIMIZATION MODEL
2.4 THE PREDICTION
2.4.1 Inter Mode
2.4.2 Advanced Motion Vector Prediction
2.4.3 Merge and Skip
2.5 TRANSFORM, QUANTIZATION, ENTROPY ENCODING AND RECONSTRUCTION
2.6 CODING EFFICIENCY
3 RELATED WORKS
3.1 AHN; SIM (2016)
3.2 PAN ET AL. (2016)
3.3 L. LI ET AL. (2013)
3.4 I. KIM ET AL. (2012)
3.5 SINANGIL ET AL. (2013)
3.6 SUMMARY
4 METHOD
5 ELIMINATION STRATEGIES AND RESULTS
5.1 ELIMINATION STRATEGY 1: BOTTOM-UP APPROACH
5.1.1 Complexity Reduction x Coding Efficiency
5.1.2 Mode, PU Size, and Residual MV Distributions
5.2 ELIMINATION STRATEGY 2: SQUARE LAST
5.3 STRATEGY 3: SQUARE LAST REFINEMENT
5.4 CONFIGURATION IMPACT
5.4.1 Low Delay P, Random Access 8 and Random Access Configurations
5.4.2 1080p and 4K Resolutions
5.4.3 Fast HM Algorithms
6 CONCLUSIONS
6.1 FUTURE WORKS
BIBLIOGRAPHY
APPENDIX
APPENDIX A – STRATEGIES RESULTS
APPENDIX B – PUBLISHED WORKS AND AWARDS
B.1 WORKS RELATED TO MASTER DISSERTATION THEME
B.1.1 25th IEEE International Conference on Electronics Circuits and Systems, 2018
B.2 OTHER WORKS
B.2.1 30th Symposium on Integrated Circuits and Systems Design, 2017
B.2.2 33rd South Symposium on Microelectronics, 2018
1 INTRODUCTION
Video is a widespread digital medium in television, cinema and the internet. According to Cisco (2017), video accounted for 73% of all internet traffic in 2016. This popularity of video on the internet, combined with increasing resolutions, raises two problems that can be deduced from the data in Table 1: real-time streaming of raw video is unfeasible with the current network infrastructure, and storing raw video is impractical given the storage limitations of mobile devices. Given these problems, video compression becomes mandatory.
Resolution (pixels)          Bitrate (Gbps)   Size (GB)
HD (1280×720)                0.66             24.88
Full HD (1920×1080)          1.49             55.99
Ultra HD (3840×2160)         5.97             223.95
Full Ultra HD (7680×4320)    23.88            895.79
Table 1: Video resolutions and their respective bitrates and file sizes. We consider videos with 30 frames per second, 3 bytes per pixel and 5 minutes long, in raw format.
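The figures in Table 1 follow directly from the parameters stated in the caption. The sketch below reproduces the arithmetic; it assumes decimal units (1 Gbit = 10^9 bits and 1 GB = 10^9 bytes), which is our assumption and is not stated in the text, so the last digit may differ from the table depending on rounding. The function name is illustrative.

```python
def raw_video_stats(width, height, fps=30, bytes_per_pixel=3, seconds=300):
    """Return (bitrate in Gbps, size in GB) of raw video, using decimal units (assumption)."""
    bytes_per_second = width * height * bytes_per_pixel * fps
    bitrate_gbps = bytes_per_second * 8 / 1e9
    size_gb = bytes_per_second * seconds / 1e9
    return bitrate_gbps, size_gb


RESOLUTIONS = {
    "HD": (1280, 720),
    "Full HD": (1920, 1080),
    "Ultra HD": (3840, 2160),
    "Full Ultra HD": (7680, 4320),
}

for name, (w, h) in RESOLUTIONS.items():
    bitrate, size = raw_video_stats(w, h)
    print(f"{name}: {bitrate:.2f} Gbps, {size:.2f} GB")
```

Each step up in resolution multiplies the pixel count by four (HD to Full HD by 2.25), so bitrate and size grow by the same factor.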
Compression techniques have been developed over the years to reduce the amount of data needed to represent a video. The Advanced Video Coding (AVC) (ITU-T, 2017) standard was proposed in 2003 aiming to improve HD video compression. In comparison to H.263, AVC reduces the bitrate by about 50% while keeping the same perceptual visual quality (WIEGAND et al., 2003). With the advent of Ultra HD and Full Ultra HD, High Efficiency Video Coding (HEVC) emerged as a new video coding standard. Once again, the new standard reduces the bitrate by 50% while keeping the same perceptual visual quality, in comparison to the AVC reference software (SULLIVAN; OHM, et al., 2012).
New standards are created by the ITU Telecommunication Standardization Sector (ITU-T) to achieve better encoding efficiency by extending the set of encoding tools. This tendency can be observed in the maximum block size supported by each standard. An encoder splits each video frame into smaller blocks to better exploit redundancies. AVC uses blocks of 16×16 samples, called macroblocks. HEVC, in turn, supports blocks of up to 64×64 samples and uses a quad-tree to signal smaller blocks.
A new standard is being developed, with conclusion planned for 2020 (SEGALL, 2017). As with its predecessors, the block size and the partitioning possibilities will increase. Blocks of 128×128 and 256×256 samples, together with more complex partitioning structures (such as a combination of quad-tree, binary-tree and ternary-tree), are being studied for inclusion in the final standard (ME et al., 2018).
The increasing number of tools that improve the compression rate for the same perceptual visual quality comes with an increase in coding complexity. The HEVC reference software, called Test Model (HM), is 2× to 10× slower than the AVC reference software (GROIS et al., 2013; T. S. KIM et al., 2017).
Every standard release is followed by many works that propose to reduce the encoding complexity while efficiently exploiting the new tools. Motion Estimation (ME) is among the most time-consuming steps of the encoding process, and therefore many works propose to reduce its complexity (BOSSEN et al., 2012). ME is responsible for predicting a block using data from already encoded frames. It is further divided into two steps: Integer Motion Estimation (IME) and Fractional Motion Estimation (FME). The first step searches among blocks at integer positions, whereas the second interpolates data to generate blocks at fractional positions.
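To make the idea of "searching among blocks at integer positions" concrete, the following Python sketch performs an exhaustive (full) search over a small window and keeps the candidate with the lowest cost. It is a simplification under stated assumptions: the sum of absolute differences (SAD) is used as the matching cost, the search is exhaustive, and all function names are hypothetical. It is not the search algorithm used by the HM.

```python
def sad(block_frame, bx, by, ref_frame, rx, ry, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(
        abs(block_frame[by + r][bx + c] - ref_frame[ry + r][rx + c])
        for r in range(size)
        for c in range(size)
    )


def integer_full_search(original, reference, x, y, size, search_range):
    """Return (motion_vector, cost) of the best integer-position candidate."""
    height, width = len(reference), len(reference[0])
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = x + dx, y + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(original, x, y, reference, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


def frame_with_patch(x, y, width=8, height=8):
    """Blank frame with a distinctive 2x2 patch whose top-left corner is (x, y)."""
    frame = [[0] * width for _ in range(height)]
    frame[y][x], frame[y][x + 1] = 9, 8
    frame[y + 1][x], frame[y + 1][x + 1] = 7, 6
    return frame


reference = frame_with_patch(5, 4)  # patch position in the already encoded frame
original = frame_with_patch(3, 3)   # same patch, moved, in the frame being encoded
print(integer_full_search(original, reference, x=3, y=3, size=2, search_range=2))
# prints ((2, 1), 0)
```

The cost of this search grows with the block area times the number of candidate positions, which gives some intuition for why the number of samples visited is a natural measure of IME effort.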
L. Li et al. (2013) propose to completely remove IME from the encoding for the AVC standard. They show that the encoder uses other tools to compensate for the lack of IME. Although that study is a contribution for the AVC standard, its conclusion does not hold for HEVC: completely eliminating IME from the HM results in more than 20% of encoding efficiency loss (measured with Bjøntegaard Delta Bitrate (BD-Rate) (BJØNTEGAARD, 2001)).
In HEVC, T. S. Kim et al. (2017) explore motion data from different depths of the recursive block partitioning to reduce the encoding complexity. Pan et al. (2016) and Ahn et al. (2016) also explore data from different depths, but they emphasize the use of the block size in their proposals. However, those works do not isolate the impact of the block size. I. K. Kim et al. (2012) and Sinangil et al. (2013) performed experiments that show the block size impact on HEVC. However, the former aims to report the improvements of HEVC block partitioning, whereas the latter focuses on hardware design decisions. We did not find works that isolate the block size parameter to identify its impact on the HEVC encoder.
The main contribution of this work is the exploration of the block size as a parameter to reduce the encoding complexity while minimizing the encoding efficiency loss. We restrict the scope of this work to the ME context. Our strategies use only the block size to decide whether the IME will be performed; when the IME is not performed, only FME is executed. The approach is simple yet pragmatic, because it highlights the impact of different block sizes in a pessimistic scenario for the HEVC encoder. We expect that the three strategies proposed to evaluate the block size impact can lead to ideas to further improve the existing algorithms.
1.1 GOALS
The main goal of this work is to identify whether the block size can be exploited as a parameter to reduce IME complexity, while keeping the efficiency loss as close to 0% as possible.
The following secondary goals are necessary to achieve the main one:
• To propose and evaluate IME elimination strategies based only on block size;
• To identify how the encoder adapts its decisions to compensate for the IME elimination;
• To identify the encoding parameters that affect the encoding efficiency and complexity reduction results of the elimination strategies.
1.2 CONTRIBUTIONS
Our research brought the following results:
• We propose three elimination strategies based on block size. We show how the results of a given strategy allowed us to devise an improved strategy. In the third improvement iteration, we identified that nonsquare blocks with sizes between 8×8 and 64×64 and all the asymmetric blocks as well can have their IME eliminated to reduce the Integer Motion Estimation complexity by 43.9%, on average, with encoding efficiency loss lower than 1% in BD-Rate.
• We provide a quantitative evaluation using three different HM configurations combined with six Quantization Parameters (QPs). We also performed experiments using three sequences with both 1080p and 4K resolutions. This set of experiments lets us check whether using the block size to eliminate IME gives consistent results across different encoding scenarios.
• We combine our best strategy with three fast algorithms present in HM. The evaluations allow us to present the impact of our elimination strategy in a scenario where the encoding efficiency is not a primary constraint.
1.3 ORGANIZATION
The remainder of this text is organized as follows. Chapter 2 is intended for readers not familiar with the basic video coding concepts and HEVC techniques necessary to follow the next chapters. In Chapter 3 we categorize the related works and detail those that provided insights that guided our research. We describe the method in Chapter 4, which includes the baseline setup and the evaluation metrics. Results are presented in Chapter 5, where we analyze three IME elimination strategies based only on block size. We also present the impact of IME elimination using different encoding configurations and combining our strategy with three fast algorithms in HM. Finally, Chapter 6 draws the conclusions and presents perspectives for future work.
2 BASIC CONCEPTS
This chapter explains the hybrid video encoding concepts and High Efficiency Video Coding (HEVC) techniques necessary to understand the following chapters. Video compression concepts rely on Richardson (2010), whereas HEVC techniques are based on Sullivan; Ohm, et al. (2012) and Sze et al. (2014).
Video compression uses the hybrid model depicted in Figure 1. It is called hybrid because transform and quantization of blocks, already used for image compression, are combined with the prediction step. The encoding starts by parsing the video frames and splitting each frame into blocks, called original blocks. The prediction step is responsible for selecting, from data already encoded, a reference block to encode the original block. The encoder transforms and quantizes the residue between the original and the reference blocks. The sequence of bits produced is processed with an entropy encoder to remove symbol redundancies from the final bitstream.
[Figure 1 diagram: original frame, original block, prediction, reference block, residue, transform, quantization, entropy coding, coded video, inverse quantization, inverse transform, in-loop filters, reconstructed frame, reference frame buffer.]
Figure 1: Simplified hybrid model for video compression. Adapted from (RICHARDSON, 2010)
The result of quantization is decoded through inverse quantization, inverse transform and addition of the reference block. The reconstructed block is then filtered to attenuate encoding errors and added to the reconstructed frame. After an entire frame has been reconstructed, it is added to a buffer of reference frames and a new frame can be encoded. In the following sections, we present more details of each encoding step.
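The encode and reconstruct path of the hybrid model can be illustrated with a toy Python sketch that keeps only the residue, quantization, inverse quantization and reconstruction steps. The transform, entropy coding and in-loop filters are deliberately omitted, and the uniform quantizer with a fixed step is an assumption chosen for brevity; it is not how HEVC quantization is specified. Function names and sample values are hypothetical.

```python
def encode_block(original, reference, step):
    """Residue between original and reference samples, then uniform quantization."""
    residue = [o - r for o, r in zip(original, reference)]
    return [round(v / step) for v in residue]  # lossy: information is discarded here


def reconstruct_block(levels, reference, step):
    """Inverse quantization followed by addition of the reference block."""
    return [r + level * step for level, r in zip(levels, reference)]


original = [101, 104, 110, 115]   # samples of the block being encoded
reference = [98, 105, 103, 110]   # samples chosen by the prediction step
step = 4                          # quantization step (assumption)

levels = encode_block(original, reference, step)
reconstructed = reconstruct_block(levels, reference, step)
print(levels)         # [1, 0, 2, 1]
print(reconstructed)  # [102, 105, 111, 114]
```

The decoder only ever sees the reconstructed samples, which is why the encoder must also reconstruct each frame before using it as a reference: both sides then predict from identical data.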
2.1 A SEQUENCE OF IMAGES
A video is a sequence of frames, each represented as a matrix of pixels. To represent the frame information, video compression adopts the YCbCr color format. YCbCr separates image brightness, represented in one luma component (Y), from color, represented in two chroma components (Cb and Cr). YCbCr is convenient for compression because human vision is more sensitive to the luma component than to the chroma components. If all the frame information is kept, then there is one sample from each color component at every pixel position. However, the encoder reduces the number of samples of the chroma components to improve the compression rate with negligible visual loss.
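The effect of reducing chroma samples can be quantified with a small sketch. The text does not state which subsampling scheme is used; the example below assumes the common 4:2:0 scheme, in which each chroma component is halved in both dimensions, purely as an illustration. The function name and its parameters are hypothetical.

```python
def samples_per_frame(width, height, chroma_div_x=1, chroma_div_y=1):
    """Total luma + chroma samples when chroma is divided by the given factors."""
    luma = width * height
    chroma = 2 * (width // chroma_div_x) * (height // chroma_div_y)  # Cb and Cr
    return luma + chroma


full = samples_per_frame(1920, 1080)              # one Y, Cb and Cr sample per pixel
subsampled = samples_per_frame(1920, 1080, 2, 2)  # 4:2:0 (assumption)
print(full, subsampled, subsampled / full)        # 6220800 3110400 0.5
```

Under this assumption, half of the samples are removed before any other compression tool is applied, while the luma component keeps its full resolution.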
The encoder indexes each frame with a Picture Order Count (POC) number identifying its position inside the sequence. Moreover, the encoder assigns a type to each frame, which can be Intra (I), Predicted (P), or Bi-Predicted (B). Figure 2 shows configurations using the three frame types. An I frame relies only on its own information to be encoded, hence the first frame of a sequence is always of type I. Once at least one I frame has been encoded, the others may be of type P or B. While P frames are encoded using one reference frame list, B frames are encoded using two reference frame lists. The differences between P and B frames will become clearer after we show the prediction for each of them.

[Figure 2 diagram. Frame type and POC per panel: (a) I0, I1, I2, I3, I4, I5, ...; (b) I0, P1, P2, P3, P4, P5, ...; (c) I0, B16, B8, B4, B2, B1, B3, ...]
Figure 2: Sequences based on HEVC reference software using the configurations All Intra (a), Low Delay P (b) and Random Access (c). Each frame is identified with Picture Order Count (POC) and type. An arrow from frame X to Y identifies that X can use Y as reference.
A set of frames that only references other frames within that same set is called Group of Pictures (GOP). To guarantee this property, the GOP must have at least one I frame to serve as a reference for the others. The GOP concept is very important in video streaming. If information is lost during transmission, a frame is reconstructed with error. In such a scenario, other frames that reference the erroneous one are also reconstructed with error, causing a snowball effect. A GOP confines the error propagation to its own scope, hence keeping the rest of the sequence unaffected.
2.2 BLOCK STRUCTURE
In this section, we explain the block representation of HEVC using I. K. Kim et al. (2012) as reference. Figure 3 shows a possible block partitioning according to the HEVC standard. The encoder divides a frame into non-overlapping blocks called Coding Tree Units (CTUs). A CTU has a maximum size of 64×64 samples and is the root of a quad-tree. A quad-tree node can be split into four new square partitions, and the smallest partition is limited to 8×8 samples. The quad-tree leaves are known as Coding Units (CUs). Each CU is further split into Prediction Units (PUs) using one of the eight partitions depicted in Figure 4. CTU, CU and PU are respectively associated with Coding Tree Block (CTB), Coding Block (CB) and Prediction Block (PB). The Blocks are the sample matrices, while the Units hold the partitioning representation.
[Figure 3 diagram. Labels: block sizes 64×64, 32×32 and 16×16; chosen PU partitions 2N×2N, 2N×N and N×N.]
Figure 3: Example of CTU and its quad-tree structure with the chosen PU identified on each CU.
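To give a feeling for the size of the quad-tree decision space described in this section, the sketch below lists the CU sizes reachable from a 64×64 CTU and counts how many distinct quad-tree structures exist when each node is either a leaf or split into four children. The function names are hypothetical, and the count considers only the CU quad-tree (PU partitions inside each CU are not included).

```python
def cu_sizes(ctu_size=64, min_cu_size=8):
    """CU sizes available at each quad-tree depth, from root to deepest leaf."""
    sizes = []
    size = ctu_size
    while size >= min_cu_size:
        sizes.append(size)
        size //= 2
    return sizes


def count_quadtrees(size=64, min_cu_size=8):
    """Number of distinct CU quad-trees rooted at a size x size node."""
    if size == min_cu_size:
        return 1  # the smallest CU cannot be split further
    # Either the node is a leaf (1 option) or it splits into four
    # independent sub-trees of half the size.
    return 1 + count_quadtrees(size // 2, min_cu_size) ** 4


print(cu_sizes())         # [64, 32, 16, 8]
print(count_quadtrees())  # 83522
```

Even before PU shapes are considered, a single CTU admits tens of thousands of CU structures, which is why the encoder relies on a recursive cost-based search.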
In the prediction step, the encoder evaluates different partitionings. To decide the best partitioning scheme, the prediction takes into account the cost of all PUs. To better understand the prediction, we first describe the cost calculation in the next section.

[Figure 4 diagram. Symmetric partitions: 2N×2N, N×2N, 2N×N, N×N. Asymmetric partitions: N/2×2N (L), N/2×2N (R), 2N×N/2 (U), 2N×N/2 (D).]
Figure 4: Possible PU partitions of a 2N×2N sized CU, where N ∈ {4, 8, 16, 32}. Adapted from (SULLIVAN; OHM, et al., 2012).
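The eight PU partitions of Figure 4 can be written out as lists of PU dimensions. The following is a minimal sketch under the assumption that sizes are given as (width, height) pairs; the function name and dictionary keys are hypothetical, and the sketch ignores any restriction the standard places on specific CU sizes.

```python
def pu_partitions(cu_size):
    """Map each partition name to the (width, height) of its PUs for a 2Nx2N CU."""
    s = cu_size        # 2N
    n = cu_size // 2   # N
    q = cu_size // 4   # N/2
    return {
        "2Nx2N": [(s, s)],
        "Nx2N": [(n, s), (n, s)],
        "2NxN": [(s, n), (s, n)],
        "NxN": [(n, n)] * 4,
        "N/2x2N (L)": [(q, s), (s - q, s)],   # narrow PU on the left
        "N/2x2N (R)": [(s - q, s), (q, s)],   # narrow PU on the right
        "2NxN/2 (U)": [(s, q), (s, s - q)],   # short PU on top
        "2NxN/2 (D)": [(s, s - q), (s, q)],   # short PU at the bottom
    }


for name, pus in pu_partitions(32).items():
    print(name, pus)
```

Every partition covers the whole CU, so the PU areas of each entry add up to the CU area.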
2.3 RATE DISTORTION OPTIMIZATION MODEL
Video compression aims to reduce the number of bits while keeping a similar perceived visual quality. To achieve this goal, the encoder tries to minimize a cost that takes into account the final bitrate and the quality loss in comparison to the original video. Sullivan; Wiegand (1998) propose the Rate-Distortion Optimization (RDO) model to evaluate the cost. The RDO needs the actual bitrate and quality values of the encoded video, and the high complexity of computing those values makes its exhaustive use unfeasible. To solve this problem, the encoder uses a simplified RDO for most evaluations.
The simplified RDO specifies the following equation:

$$j_{cost}(B_{ori}, B_{can}) = d(B_{ori}, B_{can}) + \lambda \times r(B_{ori}, B_{can}) \tag{2.1}$$

where the total cost ($j_{cost}$) to encode the original block ($B_{ori}$) with the candidate ($B_{can}$) is given by the sum of the distortion ($d$) and the rate ($r$), the latter weighted by a factor $\lambda$ commonly referred to as the Lagrange multiplier. The distortion measures the difference between two blocks. If the distortion is zero (ideal case), the candidate is equal to the original. The rate estimates the number of bits necessary to signal the decision in the final bit stream. From now on, we assume that the best candidate in a given search set is the one that minimizes $j_{cost}$.
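Equation 2.1 can be exercised with a small sketch. The distortion metric is assumed here to be the Sum of Absolute Differences (SAD), and the rate of each candidate is passed in as a plain number of bits; both are illustrative assumptions, since the text does not fix the metric or the rate estimator at this point. All names are hypothetical.

```python
def sad(original, candidate):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(o - c)
               for row_o, row_c in zip(original, candidate)
               for o, c in zip(row_o, row_c))


def j_cost(original, candidate, rate_bits, lam):
    """Simplified RDO cost: distortion plus lambda-weighted rate."""
    return sad(original, candidate) + lam * rate_bits


def best_candidate(original, candidates, lam):
    """Index of the (block, rate_bits) pair that minimizes j_cost."""
    return min(range(len(candidates)),
               key=lambda i: j_cost(original, candidates[i][0],
                                    candidates[i][1], lam))


original = [[10, 10], [10, 10]]
near = ([[10, 10], [10, 12]], 20)   # small distortion, cheap to signal
exact = ([[10, 10], [10, 10]], 40)  # zero distortion, expensive to signal
print(best_candidate(original, [near, exact], lam=0.0))
print(best_candidate(original, [near, exact], lam=0.5))
```

With $\lambda = 0$ only the distortion matters and the exact match wins; with a larger $\lambda$ the cheaper candidate is preferred even though it is not identical to the original.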
2.4 THE PREDICTION
A video has a lot of redundant information. Homogeneous textures inside a frame, like solid colors and motion blur, are examples of spatial redundancy. When frames close in time have similar areas that are spatially displaced due to scene movement, this is called temporal redundancy. The role of prediction is to exploit both spatial and temporal redundancies by using previously encoded data to represent the block being encoded.
Figure 5 illustrates the prediction step. The encoder may use either Intra or Inter mode to predict a PU. While the former generates a set of candidates from encoded neighboring pixels, the latter searches among candidates from already encoded frames. We explain only the Inter mode, which is the scope of this work.
[Figure 5 diagram. Block labels: Reference frame buffer, Reconstructed frame, Original block, Prediction (Inter, Intra, Mode decision), Reference block.]
Figure 5: Block diagram of Prediction step. The Inter prediction receives as input the Original block and frames from the reference frame buffer. The Intra prediction receives as input the Original block and the already reconstructed data from the frame being encoded. The Prediction chooses the best mode to generate the reference block that the encoder uses to continue the process.
2.4.1 Inter Mode
To capture temporal redundancies, Inter mode prediction performs a process called Motion Estimation (ME). The ME searches for a block that minimizes $j_{cost}$ and relies on Integer Motion Estimation (IME) and Fractional Motion Estimation (FME) to achieve this goal. Figure 6 shows the IME process. It is called integer because the candidates are composed of samples at integer positions. IME searches among the candidates within a search area; after it selects the best block, FME starts.
[Figure 6 diagram. Labels: Candidate Block, Original Block, Candidate Frame, Original Frame, Encoding Order, S.]
Figure 6: IME compares the Original block with candidate blocks at integer positions. In this example, S is the set of all possible candidates within the search area delimited by the rectangle in the candidate frame.
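The IME search over the set S can be sketched as an exhaustive scan of the search area. This is a simplified illustration, not the search used by any particular encoder: it assumes SAD as the distortion, ignores the rate term of the cost, and skips candidates that fall outside the frame. All function and parameter names are hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def block_at(frame, x, y, width, height):
    """Extract the width x height block whose top-left sample is (x, y)."""
    return [row[x:x + width] for row in frame[y:y + height]]


def full_search(original, ref_frame, center_x, center_y, search_range):
    """Return (cost, (dx, dy)) of the best integer candidate in the search area."""
    height, width = len(original), len(original[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x, y = center_x + dx, center_y + dy
            inside = (0 <= x <= len(ref_frame[0]) - width and
                      0 <= y <= len(ref_frame) - height)
            if not inside:
                continue  # candidate would fall outside the frame
            cost = sad(original, block_at(ref_frame, x, y, width, height))
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best
```

The number of evaluated candidates grows with the square of the search range, which is the cost that fast search algorithms and the elimination strategies studied in this work try to avoid.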
The FME searches within a fractional space that is artificially created by applying an interpolation filter to the integer samples around and within the block selected by IME. An interpolation filter expression is given by:

$$\sum_{i=1}^{n} X_i \times c_i \tag{2.2}$$

where $n$ is the total number of samples $X_i$ used to interpolate the new sample and $c_i$ is a filter constant defined by the standard. Figure 7 illustrates the fractional space with integer, half and quarter sample precision. The ME output is a Motion Vector (MV) indicating the movement of the original block relative to the reference block position, with quarter-sample precision.

Figure 7: Representation of fractional sample domain. Circles, squares and triangles respectively represent samples at integer, half and quarter positions.
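As an illustration of Equation 2.2, the sketch below interpolates a half-sample value along one row. The eight coefficients are assumed to be the HEVC luma half-sample filter taps (they are not listed in this text), and the single rounding step with clipping to 8 bits is a simplification of what a real encoder does with intermediate precision. The names are hypothetical.

```python
# Assumed 8-tap half-sample luma coefficients; they add up to 64.
HALF_SAMPLE_TAPS = (-1, 4, -11, 40, 40, -11, 4, -1)


def interpolate_half(samples, pos):
    """Half-sample value between samples[pos] and samples[pos + 1]."""
    if pos - 3 < 0 or pos + 5 > len(samples):
        raise ValueError("not enough integer samples around the position")
    window = samples[pos - 3:pos + 5]           # the n = 8 samples X_i
    acc = sum(x * c for x, c in zip(window, HALF_SAMPLE_TAPS))
    value = (acc + 32) >> 6                     # normalize by 64 with rounding
    return max(0, min(255, value))              # clip to the 8-bit range


row = [0, 10, 20, 30, 40, 50, 60, 70]
print(interpolate_half(row, 3))  # value halfway between 30 and 40
```

On a linear ramp the filter returns the midpoint of the two neighboring integer samples, and on a flat region it returns the flat value unchanged.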
Similarly to the frame classification, there are three types of PUs, and the frame type restricts the PU types that a frame can have. An I frame has only I-type PUs. A P frame can have both I- and P-type PUs, and a B frame can have B-type PUs in addition to the other two types.
In a P frame, the encoder performs ME for each frame in the reference list and selects only the best candidate. Encoding a PU in a B frame follows a similar flow, but two candidates are selected, one for each list of reference frames. While a P-type PU is signaled with only the best of the two, a B-type PU is predicted from the averages of the samples and MVs of both candidates.
Besides the ME, HEVC uses two other processes in Inter mode: the first provides a starting point around which ME defines the search area, and the second tries to predict the original block by signaling the repetition of motion information already encoded. These two processes are detailed in the following subsections.
2.4.2 Advanced Motion Vector Prediction
An important decision made in Inter prediction is the position of the search area from which IME extracts the candidates. HEVC encoders center the search area on a Predicted Motion Vector (PMV), which attempts to predict the movement of the original block based on the MVs of neighboring PUs already encoded. The closer the PMV is to the final MV, the fewer bits are needed to encode the residual MV (the difference between the PMV and the final MV). The PMV concept was already used in previous standards, but the decision algorithm was improved and named Advanced Motion Vector Prediction (AMVP) in HEVC (LAROCHE et al., 2008; JCT-VC, 2010).
Initially, the AMVP creates a list with two PMV candidates. Figure 8 illustrates the seven MVs that can be chosen to fill the list. The MVs from the frame being encoded are called spatial and there are five of them: A0, A1, B0, B1 and B2. The two remaining MVs, C0 and C1, come from reference frames and are called temporal.
The AMVP starts by selecting one A MV: the A MVs are evaluated in ascending index order and the first available one is assigned as one of the two candidates. The algorithm selects one B MV using the same process. If candidates A and B are redundant or either of them is unavailable, then a C candidate is chosen in the same way as the others. If the list is still incomplete, the AMVP inserts MVs indicating zero movement to complete it.
After filling the list, the encoder selects the best PMV based on the $j_{cost}$ used to signal the decision in the final bit stream.

[Figure 8 diagram. MV labels around the original block: A0, A1, B0, B1, B2, C0, C1.]
Figure 8: MVs evaluated to fill the AMVP list of the original block (in gray). ‘A’s and ‘B’s are spatial MVs, while ‘C’s are temporal MVs. Adapted from (HELLE et al., 2012).
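The AMVP list construction described in this subsection can be sketched as follows. This is a simplified reading of the text, assuming that an unavailable MV is represented by `None` and that two candidates are redundant when they are equal; details of the actual standard (such as MV scaling) are left out, and all names are hypothetical.

```python
def first_available(mvs):
    """First MV that is not None, scanning in ascending index order."""
    return next((mv for mv in mvs if mv is not None), None)


def amvp_list(a_mvs, b_mvs, c_mvs):
    """Build the two-entry PMV candidate list.

    a_mvs = [A0, A1], b_mvs = [B0, B1, B2], c_mvs = [C0, C1];
    each entry is an (x, y) tuple or None when unavailable.
    """
    candidates = []
    for mv in (first_available(a_mvs), first_available(b_mvs)):
        if mv is not None and mv not in candidates:
            candidates.append(mv)
    if len(candidates) < 2:                 # A/B redundant or unavailable
        mv = first_available(c_mvs)
        if mv is not None and mv not in candidates:
            candidates.append(mv)
    while len(candidates) < 2:              # complete with zero movement
        candidates.append((0, 0))
    return candidates


print(amvp_list([None, (1, 0)], [(2, 1), None, None], [(5, 5), None]))
print(amvp_list([(1, 0), None], [(1, 0), None, None], [None, (3, 3)]))
```

The first call fills the list from the spatial candidates alone; the second falls back to a temporal candidate because the A and B selections are identical.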
2.4.3 Merge and Skip
Helle et al. (2012) developed a method called merge to reuse information between neighboring PUs that share the same residual MV. HEVC introduces the merge as an alternative for the P and B PU decisions. If the encoder decides for the merge, the decision is signaled with an index to a list of merge candidates instead of the information associated with the MV. When the quad-tree divides a block into several small partitions, the merge decision increases compression by avoiding unnecessary MV replication among those partitions.
The merge candidate list is limited to five candidates, selected among the same candidates evaluated in AMVP. The insertion priority in the merge list is as follows: A1, B1, B0, A0, B2, C0, and C1. Redundant candidates are replaced by the next available ones. If there are no more candidates to evaluate and the list is incomplete, new B candidates are created by combining vectors already listed. In the worst case, the encoder completes the list using MVs indicating zero movement. The best merge decision is made based on $j_{cost}$ minimization, using the samples at the merge candidate position as reference.
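The insertion order of the merge list can be sketched in the same spirit. The sketch assumes that candidates are given in a dictionary keyed by position name, with `None` or a missing key meaning unavailable, and it deliberately omits the combined B candidates mentioned above, going straight to zero-movement MVs when the list is incomplete. Names are hypothetical.

```python
MERGE_ORDER = ("A1", "B1", "B0", "A0", "B2", "C0", "C1")


def merge_list(mvs, max_candidates=5):
    """Build the merge candidate list following the insertion priority."""
    merged = []
    for name in MERGE_ORDER:
        mv = mvs.get(name)
        if mv is None or mv in merged:
            continue                      # unavailable or redundant
        merged.append(mv)
        if len(merged) == max_candidates:
            break
    # Simplification: skip combined candidates and pad with zero movement.
    while len(merged) < max_candidates:
        merged.append((0, 0))
    return merged


print(merge_list({"A1": (1, 0), "B1": (1, 0), "A0": (2, 2), "C0": (3, 1)}))
```

The decision is then signaled as an index into this list, which is what makes merge cheaper than transmitting MV information.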
Besides avoiding MV replication, the encoder can identify that no residue between the original and the candidate needs to be signaled. This specific case of merge is called skip. The skip is not only efficient to signal but also exempts the PU from the transform and quantization steps. Because of the advantages of the skip decision, the encoder can force a skip even if the residual
2.5 TRANSFORM, QUANTIZATION, ENTROPY ENCODING AND RECONSTRUCTION
After the prediction step, the encoder has all the information about decisions and residues to encode a CTU. So far, the encoder represented frames and blocks data in a 2D space called spatial domain. However, the Transform step changes the samples from the spatial domain to the frequency domain – usually using Discrete Cosine Transform (DCT). In the frequency domain, the block data is represented with weights of different frequency components.
The main advantage of the frequency domain is that human vision is less sensitive to high-frequency elements. If no chroma component is sub-sampled, up to this point the encoding process could be performed without error¹, but this may change in the Quantization step. Quantization scales down the elements in the frequency domain. Elements at high frequencies are quantized more heavily than elements at low frequencies, which reduces the total amount of data to encode while keeping the perceived visual error low. The scaling values are calculated based on a Quantization Parameter (QP). A larger QP results in a more aggressive quantization, and thus in a higher compression rate with more quality loss.
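The effect of the QP can be illustrated with a scalar quantizer. The sketch assumes the commonly cited relation in which the quantization step doubles every 6 QP units, $Q_{step} = 2^{(QP-4)/6}$; this relation is not stated in the text, and the sketch applies the same step to every coefficient, so it does not model the heavier quantization of high frequencies. Names are hypothetical.

```python
def q_step(qp):
    """Assumed quantization step: doubles every 6 QP units."""
    return 2.0 ** ((qp - 4) / 6.0)


def quantize(coeffs, qp):
    """Map transform coefficients to integer levels (round half away from zero)."""
    step = q_step(qp)
    levels = []
    for c in coeffs:
        magnitude = int(abs(c) / step + 0.5)
        levels.append(magnitude if c >= 0 else -magnitude)
    return levels


def dequantize(levels, qp):
    """Inverse quantization: scale the levels back by the step."""
    step = q_step(qp)
    return [level * step for level in levels]


coeffs = [100, -20, 3]
print(quantize(coeffs, 22), dequantize(quantize(coeffs, 22), 22))
print(quantize(coeffs, 37), dequantize(quantize(coeffs, 37), 37))
```

With the larger QP more coefficients collapse to zero, which lowers the amount of data to encode but increases the difference between the original and the reconstructed coefficients.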
The quantization result and the standard signaling are enough to represent the encoded video, but there is a lot of symbol repetition (e.g. long sequences of 0's due to the quantization of the higher frequency components). To attenuate symbol redundancies before generating the final bitstream, an entropy encoder is used. The entropy encoder performs a lossless compression considering only the sequence of bits generated by the video encoder. After the entropy encoding is performed, the output is the final encoded video.
The encoder must reconstruct the block to feed it back into the process. The first two steps consist of performing inverse quantization and inverse transform to reconstruct the residual block in the spatial domain, which includes the error introduced by the encoder. The residual block is summed with the reference to generate the block that composes the reconstructed frame. The Advanced Video Coding (AVC) standard introduced in-loop filters. In-loop filters process the block to smooth the edges and reduce block artifacts², improving the perceived visual quality. The decoded block can be used as a reference for intra prediction of other blocks in the same frame. When the encoding of a frame ends, that frame is added to a buffer to be used as a reference for inter prediction of the next frames.

¹ Usually, the standard specifies just the bitstream and how to decode it, but there are some exceptions. The standard constrains the number of bits of any internal value in the encoding. If some value exceeds this limit, it is clipped and an error is introduced before the quantization.
² In video compression, artifacts are visually perceptible errors in the decoded video. A block artifact
2.6 CODING EFFICIENCY
When a new coding technique is proposed, it is necessary to measure its impact considering both the bitrate and the quality results. The bitrate is the number of bits per time unit (usually per second); the smaller the bitrate, the higher the compression achieved. The quality can be measured with subjective or objective methods. Objective methods are commonly used in scientific works because they make experiments reproducible. The HEVC reference software reports the quality with the objective metric called Peak Signal-to-Noise Ratio (PSNR), measured in decibels (dB); the higher the PSNR, the better the video quality.
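For reference, PSNR is usually computed from the mean squared error between the original and the decoded samples. The sketch below uses the standard definition for 8-bit samples (peak value 255), which is an assumption here since the formula is not given in the text; the function name is hypothetical and the inputs are flat lists of samples.

```python
import math


def psnr(original, decoded, peak=255):
    """Peak signal-to-noise ratio in dB between two equally long sample lists."""
    if len(original) != len(decoded) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((o - d) ** 2 for o, d in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf  # identical signals: no noise
    return 10.0 * math.log10(peak ** 2 / mse)


print(psnr([10, 20, 30, 40], [11, 19, 31, 39]))  # error of 1 on every sample
```

Because the scale is logarithmic, halving the mean squared error raises the PSNR by about 3 dB.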
It is hard to assess the encoding efficiency evaluating the bitrate and quality metrics separately. To solve this problem, Bjøntegaard (2001) created a metric called Bjøntegaard Delta Bitrate (BD-Rate). By encoding the same video with two configurations combined with four different QPs it is possible to plot two Rate-Distortion curves (R-D curves) as presented in Figure 9. In this example we want to compare the encoding efficiency of curve A with respect to Ref (the baseline configuration).
Figure 9: Example of R-D curves using QPs 22, 27, 32, and 37. l and h are, respectively, the lower and the higher values of PSNR shared by A and Ref R-D curves. Adapted from: Bjøntegaard (2001).
Notice that the bitrate has to be scaled logarithmically as follows:

$$R = \log_{10}(\mathit{bitrate}) \tag{2.3}$$
Each R-D curve is then approximated by a third-degree polynomial function like the following one:

$$f(D) = R = a + b \times D + c \times D^2 + d \times D^3 \tag{2.4}$$

where $a$, $b$, $c$ and $d$ are fitting constants used to describe the scaled bitrate ($R$) as a function of PSNR ($D$).
Given $f_A$ and $f_{Ref}$ as the functions that respectively approximate the A and Ref R-D curves, the BD-Rate is calculated as follows:

$$\text{BD-Rate} = \left(10^{\frac{1}{h-l}\int_{l}^{h}\left(f_{A}(D) - f_{Ref}(D)\right)dD} - 1\right) \times 100\% \tag{2.5}$$

where $l$ and $h$ are, respectively, the lower and the higher values of PSNR shared by the A and Ref R-D curves. A positive BD-Rate indicates that A has worse encoding efficiency than Ref, because A results in, on average, a higher bitrate to achieve the same quality (in terms of PSNR) as Ref. The opposite happens when the BD-Rate is negative (as occurs in the example depicted in Figure 9), indicating that A requires, on average, a lower bitrate to achieve the same quality as Ref.
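The BD-Rate computation described in this section can be sketched in a few lines. The sketch assumes a base-10 logarithm for the bitrate scaling and exactly four (bitrate, PSNR) points per curve, in which case the fitted third-degree polynomial is the unique cubic through the points and can be evaluated in Lagrange form. Since Simpson's rule is exact for cubics, the integral needs only three evaluations. The sign follows the convention stated in the text (positive when A needs more bitrate than Ref). All names are hypothetical.

```python
import math


def cubic_through(points):
    """Return f(D): the unique cubic through four (D, R) points (Lagrange form)."""
    def f(d):
        total = 0.0
        for i, (d_i, r_i) in enumerate(points):
            term = r_i
            for j, (d_j, _) in enumerate(points):
                if j != i:
                    term *= (d - d_j) / (d_i - d_j)
            total += term
        return total
    return f


def bd_rate(ref, test):
    """BD-Rate in percent; ref and test are lists of four (bitrate, psnr) pairs."""
    f_ref = cubic_through([(psnr, math.log10(rate)) for rate, psnr in ref])
    f_a = cubic_through([(psnr, math.log10(rate)) for rate, psnr in test])
    low = max(min(p for _, p in ref), min(p for _, p in test))    # l
    high = min(max(p for _, p in ref), max(p for _, p in test))   # h

    def diff(d):
        return f_a(d) - f_ref(d)

    # Simpson's rule gives the exact mean of a cubic over [l, h].
    mean_diff = (diff(low) + 4.0 * diff((low + high) / 2.0) + diff(high)) / 6.0
    return (10.0 ** mean_diff - 1.0) * 100.0


ref = [(1000.0, 32.0), (2000.0, 35.0), (4000.0, 38.0), (8000.0, 41.0)]
worse = [(rate * 1.1, psnr) for rate, psnr in ref]  # 10% more bits, same PSNR
print(bd_rate(ref, worse))
```

In this constructed example the tested curve spends 10% more bits at every quality level, so the reported BD-Rate is approximately +10%.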
3 RELATED WORKS
In this chapter, we focus on works that aim to reduce the ME complexity. Based on T. S. Kim et al. (2017), we categorize these works into three groups.
1. Works proposing to reduce the number of calculations or memory accesses performed by the simplified RDO.
The distortion metric evaluated in the simplified RDO model is one of the most time-consuming operations of the encoder (BOSSEN et al., 2012). One solution to reduce the distortion complexity is to sub-sample the original and candidate blocks. However, although using less data reduces the number of arithmetic operations and memory accesses, it also degrades the encoding efficiency (SEIDEL; BRÄSCHER, et al., 2015). The Partial Distortion Elimination proposed by Eckart et al. (1995) and the Successive Elimination Algorithm of W. Li et al. (1995) exploit mathematical properties to avoid unnecessary distortion calculations. The Partial Distortion Elimination evaluates partial distortion results and stops the operation whenever the partial value is larger than the current minimum. The Successive Elimination Algorithm, in turn, evaluates each candidate with an elimination metric that is simpler to compute than the target distortion metric itself: if the elimination metric result is greater than the current minimum, then the distortion computation is eliminated. Considering the same input, if the elimination metric result is always less than or equal to the target distortion metric, then the best reference in the search set is never eliminated. Both the Partial Distortion Elimination and the Successive Elimination Algorithm are still being improved and used in state-of-the-art video encoders (SEIDEL; GÜNTZEL, et al., 2016; TRUDEAU et al., 2016; 2018), reducing the complexity without deteriorating the encoding efficiency.
2. Works proposing to choose a better Motion Vector Predictor (MVP). Finding a better MVP may reduce the number of bits used to signal the residual MV. The HEVC improves the MVP selection over the previous standards using the AMVP (JCT-VC, 2010). Because the AMVP is part of the standard, changing the AMVP implies changing the standard. Therefore, works in this group, such as (LAROCHE et al., 2008; CHIEN et al., 2014), have their contributions restricted to the standardization process.
3. Works proposing to better select the search set. We further divide this group in three different approaches:
• Find precise MVPs: A strategy that finds a predictor closer to the best MV, even though it cannot be used to signal the MVP, can reduce the number of candidates evaluated in IME (PAN et al., 2016). T. S. Kim et al. (2017) show that a bottom-up approach better estimates the movement of larger PUs using the information of smaller ones, which allows reducing the search set size of large blocks with low impact on video encoding efficiency (0.39% of BD-Rate increase).
• Define a more adequate search range: Instead of using a fixed range from which the search algorithm selects candidates, the range varies under certain conditions (LEE et al., 2017). This method is called Adaptive Search Range, and may reduce the number of candidates by using a more suitable range for different scenarios. The HEVC Test Model (HM) adopts an Adaptive Search Range that scales the search range according to the distance between the original and the reference frame: longer distances between those frames result in a larger search range (ROSEWARNE et al., 2011).
• Reduce the number of evaluated candidates: There are works proposing fast search algorithms that select a subset of candidates among all the ones in a search area. Among those approaches, we can mention: New Three-Step Search (R. LI et al., 1994), Four-Step Search (PO et al., 1996), Diamond Search (S. ZHU et al., 2000), Cross-Diamond Search (CHEUNG et al., 2002) and Hexagon-based Search (C. ZHU et al., 2002). The HEVC encoder reference software uses the Test Zone Search (TZS) as fast search algorithm (X. LI et al., 2014).
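For the first group above, the two elimination techniques can be sketched on top of a SAD-based search. The sketch assumes SAD as the target distortion and, as the elimination metric, the absolute difference between the sample sums of the two blocks, which never exceeds the SAD. It also ignores the rate term of the cost. These choices and all names are illustrative assumptions rather than the formulation of the cited works.

```python
def sad_with_pde(original, candidate, current_min):
    """SAD with Partial Distortion Elimination, checked after each row.

    Returns None as soon as the partial sum exceeds the current minimum.
    """
    partial = 0
    for row_o, row_c in zip(original, candidate):
        partial += sum(abs(o - c) for o, c in zip(row_o, row_c))
        if partial > current_min:
            return None  # this candidate can no longer be the best one
    return partial


def elimination_metric(original, candidate):
    """Lower bound on the SAD: |sum(original) - sum(candidate)|."""
    return abs(sum(map(sum, original)) - sum(map(sum, candidate)))


def search(original, candidates):
    """Return (index, SAD) of the best candidate, skipping needless work."""
    best_index, best_cost = None, float("inf")
    for index, candidate in enumerate(candidates):
        if elimination_metric(original, candidate) > best_cost:
            continue  # Successive Elimination: SAD cannot beat the minimum
        cost = sad_with_pde(original, candidate, best_cost)
        if cost is not None and cost < best_cost:
            best_index, best_cost = index, cost
    return best_index, best_cost
```

Neither shortcut changes the result of the search: a candidate is discarded only when its distortion is already known to be larger than the current minimum.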
In the following sections, we detail the works that inspired this research by presenting algorithms to eliminate ME operations or by evaluating the block size impact.
3.1 AHN; SIM (2016)
HM uses a top-down approach to evaluate, in the following order, the merge mode, the symmetric PUs (square ones first), the asymmetric PUs, and the intra modes, followed by the recursive calls to split the CU. Ahn; Sim (2016) show that square PUs are selected more often than non-square ones, and exploit this characteristic to reduce the encoder complexity. They propose to first evaluate square PUs using the top-down approach and then the non-square PUs using a bottom-up approach. The algorithm only evaluates non-square PUs under one condition: the cost of splitting a CU is smaller than the cost of not splitting it. The authors state that if it is better to split a CU, then a non-square PU is more likely to be used.
In addition to the algorithm, the authors show how to reduce the complexity even further by combining the proposed algorithm with an early termination criterion. The criterion checks three conditions: whether skip is the best mode at the current depth, whether skip is the best mode at the previous depth, and whether the 2N×2N PU has no residue. Only when all three conditions are true are the non-square PUs and the recursive calls not evaluated. Their algorithm combined with the termination criterion achieves average reductions of 57.54% and 57.82% in total encoding time, with increases of 1.39% and 2.05% in BD-Rate when using, respectively, the Random Access and Low Delay configurations.
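The decision flow described in this section can be summarized in a small sketch. It is only a reading of the description above, not the authors' implementation: the costs are assumed to be already available as numbers, the cost of splitting is obtained through a callback so that it is computed only when needed, and all names are hypothetical.

```python
def plan_cu_evaluation(skip_best_here, skip_best_previous_depth,
                       no_residue_2Nx2N, cost_no_split, cost_split_fn):
    """Decide whether to recurse into sub-CUs and whether to test non-square PUs."""
    # Early termination: all three conditions must hold.
    if skip_best_here and skip_best_previous_depth and no_residue_2Nx2N:
        return {"recurse": False, "non_square": False}
    # Bottom-up part: the split cost is known only after the recursive calls.
    cost_split = cost_split_fn()
    # Non-square PUs are evaluated only when splitting is the cheaper option.
    return {"recurse": True, "non_square": cost_split < cost_no_split}


print(plan_cu_evaluation(True, True, True, 100, lambda: 50))
print(plan_cu_evaluation(True, False, True, 100, lambda: 50))
print(plan_cu_evaluation(False, False, False, 100, lambda: 150))
```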
3.2 PAN ET AL. (2016)
Pan et al. (2016) proposed a method to skip the ME execution. They center the ME search area on the best of two MVs: the zero MV, and the median of the MVs of the up, left and up-left neighboring PUs. The experiments using HM combined with the Random Access and Low Delay configurations show that, after ME, at least 55.9% of the references are at the center position. Moreover, if the ME of the 2N×2N PU selects the reference at the center position, at least 89.67% of the other partitions select the reference at the center position too.
The method to eliminate the ME works as follows. After performing the ME of the 2N×2N PU, if the reference is at the center position, then only the initial search positions of the remaining PUs are evaluated. Otherwise, the encoder performs the ME of the remaining PUs. On average, the proposed method reduces the total encoding time by 15.04% and increases the BD-Rate by 0.55% when using the Low Delay configuration, and reduces the total encoding time by 12.29% and increases the BD-Rate by 0.8% when using the Random Access configuration.
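A minimal sketch of this method is given below. It assumes that the median of the three neighboring MVs is taken component-wise and that a cost function is supplied by the caller to pick the better of the two center options; both are assumptions made for illustration, and all names are hypothetical.

```python
def median_mv(up, left, up_left):
    """Component-wise median of three (x, y) motion vectors."""
    xs = sorted(mv[0] for mv in (up, left, up_left))
    ys = sorted(mv[1] for mv in (up, left, up_left))
    return (xs[1], ys[1])


def search_center(up, left, up_left, cost):
    """Best of the zero MV and the median MV according to the given cost."""
    return min([(0, 0), median_mv(up, left, up_left)], key=cost)


def me_for_remaining_pus(best_mv_2Nx2N, center):
    """Decide how the PUs other than 2Nx2N are handled."""
    if best_mv_2Nx2N == center:
        return "initial-position-only"  # ME is skipped for the other PUs
    return "full-ME"


center = search_center((1, 5), (3, -2), (2, 0),
                       cost=lambda mv: abs(mv[0] - 2) + abs(mv[1]))
print(center, me_for_remaining_pus((2, 0), center))
```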
3.3 L. LI ET AL. (2013)
Given the variety of possible partitions and modes brought by AVC, the standard that preceded HEVC, the relevance of IME was called into question by L. Li et al. (2013). Using the x264 encoder (MERRITT et al., 2004) to encode four 1080p videos, they show that at least half of the MVs are less than one pixel away from the predictor. They completely remove the IME step from the encoding flow based upon three arguments: 1) The combination of MV prediction with FME may find a good enough block predictor, due to the strong spatio-temporal correlation between neighboring blocks. 2) Even when such correlation is weak, the FME residue surrounding the PMV is similar to the ones surrounding nearby blocks. 3) If the optimal candidate is too far from the PMV, IME will not improve Rate-Distortion significantly; choosing a different Macroblock partition or Intra mode may alleviate the impact when IME is disabled.
They show that 22.5% of the blocks that were more than four pixels away from the predictor changed their modes. Because the encoder uses its range of possibilities to compensate for the lack of IME, the bitrate increases by only 0.18% and the PSNR decreases by only 0.01 dB. Despite these encoding efficiency losses, they reduce the total encoding time by 57.9%. Unfortunately, no BD-Rate or BD-PSNR results were given.
3.4 I. KIM ET AL. (2012)
I. K. Kim et al. (2012) evaluate the encoding efficiency improvement brought by the more flexible block partitioning scheme of HEVC. They ran experiments using the All Intra, Random Access and Low Delay encoding configurations, varying the maximum CTU size, the minimum CU size, the PU size types (rectangular and asymmetric), the Transform Unit (TU) size, and the transform tree depth. They evaluate the encoding complexity considering the total execution time, and the encoding efficiency considering the BD-Rate.
According to the authors, reducing the maximum CTU size from 64×64 to 16×16 negatively affects the encoding efficiency. The impact is prominent in higher resolution sequences, resulting in a BD-Rate increase of up to 21.9%. Large blocks are better suited to exploit redundancies in higher resolution sequences, and the encoder takes advantage of blocks with 64×64 samples. Increasing the minimum CU size from 8×8 to 32×32 also negatively affects the encoding efficiency. In contrast to the previous experiment, the impact is prominent for lower resolution sequences, resulting in a BD-Rate increase of up to 29%. In lower resolutions, it is hard to exploit redundancies with large blocks because each block covers a large portion of the frame. Both experiments show that the flexible block partitioning of the HEVC standard allows it to encode video in different resolutions.
They also ran experiments based on the TU representation, but we do not discuss those results because they are out of the scope of this work. Their exploration of the PU sizes was very brief and only concludes that enabling non-square and asymmetric partitions reduces the BD-Rate by, on average, 3.5% and 4.3% for Random Access and Low Delay, respectively.
3.5 SINANGIL ET AL. (2013)
Sinangil et al. (2013) propose a hardware design for ME in HEVC. Most of the discussion presented in the work relies on hardware exploration. However, they evaluated the block size impact on HEVC by enabling different combinations of PU sizes on the encoder. In the experiments, the authors do not consider the use of asymmetric PUs.
Considering the bitrate increase, the best configuration uses all PU sizes except those smaller than 8×8, which increases the bitrate by only 2%. A better trade-off was achieved by enabling only square sizes or by enabling the sizes between 32×32 and 8×8, which significantly eased the hardware constraints with only 3% of bitrate increase. Enabling only square PUs except the 8×8 increases the bitrate by 12%, but this was the configuration chosen by Sinangil et al. (2013) for further evaluations because of power and memory constraints. Unfortunately, the encoding efficiency was evaluated using the average bitrate of four encoding configurations (All Intra, Random Access, Low Delay and Low Delay P). Therefore, it is not possible to draw conclusions regarding the visual quality results or the PU size impact in a particular configuration.
3.6 SUMMARY
There are works that successfully reduce the encoder complexity by eliminating ME operations. Algorithms that act on different depths of the recursive search for the best mode to encode the CTU are implicitly evaluating different PU sizes. However, those works lack a detailed analysis of the PU size impact on ME.
I. K. Kim et al. (2012) show detailed experiments using different block size configurations, but their goal is to highlight the HEVC improvements on block partitioning and they do not explore the PU size as a parameter to reduce the encoding complexity. Sinangil et al. (2013) explore the PU size impact on encoding, but they do not consider encoding quality and only present averages over different configurations, which makes the results difficult to interpret.
Inspired by the conclusions of L. Li et al. (2013), in this work we eliminate the IME process from the encoding flow based only on the PU size being evaluated in ME. This very pessimistic scenario may result in an encoding efficiency loss that exceeds the values accepted in the literature (related works do not exceed, on average, 3% of BD-Rate increase), but by doing so, we expect to highlight the impact of PU size on the encoding process. We explore three elimination strategies combining different PU sizes, aiming to reduce the BD-Rate loss and to identify a set of sizes being overestimated by the encoder. Moreover, we also explore different encoding configurations, aiming to distinguish the parameters that affect the results.
4 METHOD
To evaluate the block size impact on encoding, we defined three different elimination strategies. Given $U$ as the set of all PU sizes, each elimination strategy defines:

$$\begin{aligned} P_0 &= \emptyset \\ P_1, P_2, \ldots, P_n &\subset U \mid P_i \cap P_j = \emptyset \;\; \forall\, i \neq j \end{aligned} \tag{4.1}$$

Given $p \in U$, an elimination strategy categorizes $p$ using the following equation:

$$f(p) = \ell \iff p \in P_\ell \tag{4.2}$$

where $\ell \in \mathbb{N}^*$ is the level of $p$. Each coding execution is configured with a parameter called elimination level, which defines a lower bound on the level a PU must exceed for its IME to be executed. Figure 10 illustrates the encoding flow with an elimination strategy. At elimination level 0 no IME is eliminated and this is the baseline for all strategies. At elimination level 1, the PUs categorized as level 1 have their IME eliminated. At elimination level $n$, the PUs categorized as level $n$ have their IME eliminated, as well as the PUs categorized as any level $\ell < n$. The results of each elimination level using different strategies allow us to identify the PU sizes most suitable to be used as a parameter to reduce the encoding complexity.

[Figure 10 diagram. Labels: ME, IME, FME, decision "ℓ ≤ elimination level" with Yes and No branches.]
Figure 10: Block diagram of an elimination strategy based on PU size.

We used the Common Test Conditions (CTC) specifications (BOSSEN, 2013) to define our experimental setup. We encoded 22 video sequences with 8-bit depth (described in Table 2) using the HM encoder, version 16.16 (ROSEWARNE et al., 2011)¹, running under the Random Access configuration with four QPs: 22, 27, 32 and 37.
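Returning to the elimination rule of Equations 4.1 and 4.2, the check performed for each PU can be sketched as follows. The strategy shown in the example is purely hypothetical and does not correspond to any of the three strategies evaluated in this work. PU sizes are assumed to be (width, height) tuples, and a PU size that belongs to no set is assumed to always run IME.

```python
def level_of(pu_size, strategy):
    """f(p): the level whose set contains pu_size, or None if there is none."""
    for level, sizes in strategy.items():
        if pu_size in sizes:
            return level
    return None


def runs_ime(pu_size, strategy, elimination_level):
    """IME is eliminated when the PU level is <= the elimination level."""
    level = level_of(pu_size, strategy)
    return level is None or level > elimination_level


# Hypothetical strategy with two disjoint sets P_1 and P_2.
EXAMPLE_STRATEGY = {
    1: {(8, 8), (8, 4), (4, 8)},
    2: {(16, 16)},
}

for elimination_level in (0, 1, 2):
    print(elimination_level,
          runs_ime((8, 8), EXAMPLE_STRATEGY, elimination_level),
          runs_ime((16, 16), EXAMPLE_STRATEGY, elimination_level))
```

Because levels start at 1, elimination level 0 never removes any IME and reproduces the baseline encoder.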
We evaluated the experiments in terms of coding efficiency and computational complexity. The coding efficiency has to consider the compression rate and the image quality, measured in PSNR relative to the luma component (Y).
CTC class  Video sequence       Resolution  Number of frames  Frames per second
A          Traffic              2560×1600   150               30
A          PeopleOnStreet       2560×1600   150               30
B          Kimono               1920×1080   240               24
B          ParkScene            1920×1080   240               24
B          Cactus               1920×1080   500               50
B          BQTerrace            1920×1080   600               60
B          BasketballDrive      1920×1080   500               50
C          RaceHorsesC          832×480     300               30
C          BQMall               832×480     600               60
C          PartyScene           832×480     500               50
C          BasketballDrill      832×480     500               50
D          RaceHorses           416×240     300               30
D          BQSquare             416×240     600               60
D          BlowingBubbles       416×240     500               50
D          BasketballPass       416×240     500               50
E          FourPeople           1280×720    600               60
E          Johnny               1280×720    600               60
E          KristenAndSara       1280×720    600               60
F          BasketballDrillText  832×480     500               50
F          ChinaSpeed           1024×768    500               30
F          SlideEditing         1280×720    300               30
F          SlideShow            1280×720    500               20
Table 2: Characteristics of CTC video sequences.
Therefore, the BD-Rate metric (BJØNTEGAARD, 2001) is adequate: it not only takes both factors into account but also allows a relative comparison of the encoding efficiency of two distinct encoding configurations. That is why we chose BD-Rate as the encoding efficiency metric.
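As an illustration of how a BD-Rate value is obtained, the sketch below follows the commonly used formulation of the metric: the logarithm of the bitrate is modelled as a cubic polynomial of the PSNR, both curves are integrated over their common PSNR interval, and the average difference is converted to a percentage. It is not the tool used in this work. The function names are assumptions, each curve is assumed to have exactly four rate-distortion points (one per QP) with distinct PSNR values, and the sample numbers in the usage note are made up.

```python
import math


def _lagrange(xs, ys, x):
    """Evaluate at x the polynomial that passes through (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def _simpson(f, lo, hi):
    """Simpson's rule on one interval; exact for polynomials up to degree 3."""
    mid = (lo + hi) / 2.0
    return (hi - lo) / 6.0 * (f(lo) + 4.0 * f(mid) + f(hi))


def bd_rate(anchor, test):
    """Average bitrate difference (in percent) of `test` relative to `anchor`.

    Both arguments are lists of four (bitrate, psnr_y) pairs.
    A positive result means `test` needs more bits for the same quality.
    """
    a_psnr = [p for _, p in anchor]
    t_psnr = [p for _, p in test]
    a_log = [math.log10(r) for r, _ in anchor]
    t_log = [math.log10(r) for r, _ in test]

    # Integrate only over the PSNR range covered by both curves.
    lo = max(min(a_psnr), min(t_psnr))
    hi = min(max(a_psnr), max(t_psnr))
    if hi <= lo:
        raise ValueError("the two curves do not overlap in PSNR")

    int_a = _simpson(lambda d: _lagrange(a_psnr, a_log, d), lo, hi)
    int_t = _simpson(lambda d: _lagrange(t_psnr, t_log, d), lo, hi)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (10.0 ** avg_log_diff - 1.0) * 100.0
```

For example, with the made-up anchor curve `[(1000, 32.0), (2000, 35.0), (4000, 38.0), (8000, 41.0)]` and a test curve that needs twice the bitrate at every PSNR value, the function returns a BD-Rate of approximately 100%.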
Many works evaluate the computational complexity using the total coding time. However, this work relied on a heterogeneous computing environment to execute the experiments, which makes the total coding time unsuitable, since results obtained on different processing units cannot be compared fairly. Even on a single processing unit, with a minimal Linux installation and the hwloc tool to avoid process migration between cores, our experiments presented a 10% variation in time under the same coding configuration. We could repeat the experiments until achieving statistical confidence, but such a procedure would not fit in the time schedule of this work.
To handle this infrastructure limitation, we adopted the solution presented by T. S. Kim et al. (2017) and estimated the IME complexity through the following equation:

$$\sum_{x \in \{\text{PU size}\}} \mathit{search\_points}(x) \times \mathit{area}(x) \tag{4.3}$$

, where $\mathit{search\_points}(x)$ is the number of searched candidates with size $x$ and $\mathit{area}(x)$ is the number of samples in a PU of size $x$. We report only the average IME complexity over the four evaluated QPs. Despite its simplicity, this metric takes candidate sizes into account, is not affected by software or hardware architectures, and allows for the reproducibility of our experiments.
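A minimal Python sketch of Equation 4.3 follows. The data layout is an assumption made for illustration: `search_points` is taken to be a dictionary mapping each PU size, given as (width, height), to the number of candidates searched with that size, and the counts in the example are made up.

```python
def ime_complexity(search_points):
    """Equation 4.3: sum over PU sizes of searched candidates times PU area.

    `search_points` maps (width, height) -> number of searched candidates.
    """
    return sum(count * width * height
               for (width, height), count in search_points.items())


def average_ime_complexity(search_points_per_qp):
    """Average the IME complexity over the evaluated QPs.

    `search_points_per_qp` maps a QP value -> search-point dictionary.
    """
    values = [ime_complexity(points)
              for points in search_points_per_qp.values()]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Made-up candidate counts for two PU sizes at two QPs.
    runs = {
        22: {(8, 8): 10, (16, 16): 2},
        37: {(8, 8): 4, (16, 16): 1},
    }
    print(average_ime_complexity(runs))
```

Because the metric counts samples rather than seconds, the same inputs always produce the same value, regardless of the machine on which the encoder ran.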
To better understand the effect of eliminating the IME, we analyzed the decision results for both modes and PU sizes. The decisions of a given mode $m$ are computed as follows:

$$\sum_{x \in \{\text{PU decided with } m\}} \mathit{area}(x) \tag{4.4}$$
Similarly, we compute the decisions for each PU size. We present all results relative to the total video sequence area, calculated as follows:

$$H \times W \times \mathit{fps} \times t \tag{4.5}$$

, where $H$ and $W$ are, respectively, the height and width of a frame, $\mathit{fps}$ is the number of frames per second, and $t$ is the sequence length in seconds. We did this to highlight how representative a decision is within the sequence. Counting only the number of decisions would bias the results toward small blocks. One 64×64 block, for example, can be split into 128 8×4 blocks while covering the same area.
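The area-weighted accounting of Equations 4.4 and 4.5 can be sketched as follows. The record layout (mode, width, height) and the mode labels are assumptions made for illustration, and the toy sequence consists of a single 64×64 frame.

```python
def total_sequence_area(height, width, fps, seconds):
    """Equation 4.5: number of samples in the whole sequence."""
    return height * width * fps * seconds


def mode_decision_area(decided_pus, mode):
    """Equation 4.4: summed area of the PUs decided with `mode`.

    `decided_pus` is an iterable of (mode, width, height) records.
    """
    return sum(w * h for m, w, h in decided_pus if m == mode)


def mode_decision_share(decided_pus, mode, total_area):
    """Decision area of `mode` relative to the total sequence area."""
    return mode_decision_area(decided_pus, mode) / total_area


if __name__ == "__main__":
    # Toy sequence: one 64x64 frame, covered by one 64x32 PU decided with
    # mode "A" and sixty-four 8x4 PUs decided with mode "B".
    pus = [("A", 64, 32)] + [("B", 8, 4)] * 64
    total = total_sequence_area(64, 64, 1, 1)

    by_count = sum(1 for m, _, _ in pus if m == "B") / len(pus)
    by_area = mode_decision_share(pus, "B", total)
    print(by_count, by_area)
```

In this toy case mode "B" accounts for 64 of the 65 decisions but for only half of the area, which is the bias toward small blocks that the area-based accounting avoids.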
We also saved the MV decisions and classified them into regions. An MV is included in a region $r$ calculated as follows:

$$r = \max(|x|, |y|) \tag{4.6}$$

, where $x$ and $y$ are the coordinates of the residual MV (the difference between the PMV and the final MV). This classification is similar to the one used by L. Li et al. (2013), but our regions have quarter-pixel precision. For instance, region 0 means no movement in comparison to the PMV, whereas region 3 means that the final vector is at a distance of 3/4 of a pixel from the PMV.
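The region classification of Equation 4.6 can be sketched as below. The sketch assumes that both the final MV and the PMV are given as integer (x, y) pairs in quarter-pixel units; the function names and sample vectors are illustrative only.

```python
from collections import Counter


def mv_region(final_mv, pmv):
    """Equation 4.6: region of an MV, in quarter-pixel units.

    The residual MV is the component-wise difference between the final MV
    and the PMV; the region is the larger absolute component.
    """
    dx = final_mv[0] - pmv[0]
    dy = final_mv[1] - pmv[1]
    return max(abs(dx), abs(dy))


def region_histogram(decisions):
    """Count how many (final_mv, pmv) decisions fall in each region."""
    return Counter(mv_region(mv, pmv) for mv, pmv in decisions)


if __name__ == "__main__":
    # Made-up decisions: (final MV, PMV), both in quarter-pixel units.
    sample = [((4, 4), (4, 4)), ((7, -2), (4, 0)), ((0, 3), (0, 0))]
    print(region_histogram(sample))
```

A decision whose final MV equals the PMV falls in region 0, and a residual of (3, -2) quarter pixels falls in region 3, that is, 3/4 of a pixel away from the PMV.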