Motion compensated permutation-based video encryption

Caio César Sabino Silva

MOTION COMPENSATED PERMUTATION-BASED VIDEO ENCRYPTION

Federal University of Pernambuco
www.cin.ufpe.br/~posgraduacao

RECIFE 2015

A M.Sc. Dissertation presented to the Center for Informatics of Federal University of Pernambuco in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Advisor: Tsang Ing Ren Co-Advisor: George Darmiton da Cunha Cavalcanti

Cataloging at source

Librarian Monick Raquel Silvestre da S. Portes, CRB4-1217

S586m Silva, Caio César Sabino

Motion compensated permutation-based video encryption / Caio César Sabino Silva. – 2015.

72 leaves: ill., fig., tab.

Advisor: Tsang Ing Ren.

Dissertation (Master's) – Universidade Federal de Pernambuco. CIn, Computer Science, Recife, 2015.

Includes references.

1. Computer science. 2. Multimedia data security. I. Ren, Tsang Ing (advisor). II. Title.

004 CDD (23. ed.) UFPE- MEI 2017-35

Master's dissertation presented by Caio César Sabino Silva to the Graduate Program in Computer Science of the Centro de Informática of the Universidade Federal de Pernambuco, under the title “Motion Compensated Permutation-based Video Encryption”, advised by Prof. Tsang Ing Ren and approved by the Examining Board formed by the professors:

______________________________________________
Prof. Carlos Alexandre Barros de Mello
Centro de Informática / UFPE

______________________________________________
Profa. Vanessa Testoni
Samsung Research Brazil

______________________________________________
Prof. Tsang Ing Ren
Centro de Informática / UFPE

Seen and approved for printing. Recife, August 25, 2015.

______________________________________________
Profa. Edna Natividade da Silva Barros
Coordinator of the Graduate Program in Computer Science of the Centro de Informática of the Universidade Federal de Pernambuco.

Acknowledgements

I would like to thank everyone who assisted me directly or indirectly in the development of this dissertation project:

to my parents, José Antonio and Valdeci Sabino, above all, for giving me all the support I needed and for encouraging me to always do my best and not to give up under any circumstances.

to my brother, Eduardo, for helping me find interest in science, giving me the support and the attention I needed and always keeping me pushing forward in whatever activity I was involved in.

to professor Carlos Alexandre and researcher Vanessa Testoni for kindly accepting the invitation to participate in the examining board. It is an honor for me to have both of you evaluating this work.

to my advisor Tsang Ing Ren, who helped me in so many ways, not only in the master's degree program but also in the undergraduate program. I feel especially grateful to him since he cares about me as a friend and not only as a student, and I always appreciate his opinions and career advice. He also motivates me to become more active in the research community and always finds new interesting problems for us to work on.

to my co-advisor George Darmiton, who always had interesting insights about the work even when it was not exactly the field he has been working on, and who contributed to the work developed in this dissertation.

to my friend Lais Sousa, who started working on this research with me when we were still undergraduate students. I feel a big part of this work is also hers, since she contributed to it from the very beginning. I also appreciate all of her support, algorithmic vision and theoretical insights, which contributed a lot to my Computer Science background.

to my friend Ruan Carvalho, who has always been with me through my undergraduate and master's programs. I feel especially grateful to him for the countless times he went out of his way just to help me. I also thank him for helping me with my career decisions, always listening with an open mind and advising me.

to my friends Amora, Anália and Lorena, who have been quite close to me through this time, having my back and giving me friendly support. It feels good to know I have these friends to count on if something bad happens. It was also always enjoyable to spend time with them, which at times helped relieve a lot of the stress of working on this project.

Rather than love, than money, than fame, give me truth. —HENRY DAVID THOREAU (WALDEN)

Resumo

In the context of multimedia application security, video encryption techniques have been developed to ensure the confidentiality of the information contained in this type of media. Compression and encryption used to be regarded as opposing fields in terms of how they exploit data entropy; in recent decades, however, the volume of data handled by video encryption applications has increased significantly, which has demanded better compression of encrypted video. To this end, several techniques have been developed, such as entropy coding schemes that provide encryption and compression simultaneously.

An existing cryptographic scheme, introduced by Socek et al., is based on permutation transformations and applies encryption before the compression stage. The encryption applied by this technique may be considered less secure than a conventional cryptographic scheme, but it is still acceptable for most video applications. It is able to improve the spatial correlation of the original video when consecutive frames are sufficiently similar, possibly making the encrypted video more compressible than the original.

However, the original cryptographic scheme was designed to exploit only the spatial correlation of each frame, whereas encoders can also exploit non-trivial temporal correlation. Moreover, the improvements in spatial correlation brought by the permutation transformations rely heavily on the natural temporal correlation of the video. The performance of the scheme is therefore strongly tied to the amount of motion in the video. The work developed in this dissertation aims to extend this cryptographic scheme by including motion compensation concepts in the permutation-based transformations used for video encryption, in order to improve its performance and make the scheme more resilient to videos with a lot of motion.

Keywords: Video encryption. Video coding. Multimedia data security. Motion compensation. Data correlation. Compression

Abstract

In the context of multimedia application security, digital video encryption techniques have been developed to ensure the confidentiality of the information contained in this media type. Compression and encryption used to be considered opposites in terms of how they exploit the data’s entropy. In recent decades, however, the volume of data handled by video encryption applications has increased, which demanded better compressibility of encrypted video. Accordingly, many techniques have been developed in which entropy coding provides encryption and compression simultaneously.

An existing cryptographic scheme, introduced by Socek et al., is based on permutation transformations and applies encryption prior to the compression stage. The encryption applied by this technique may not be as secure as a conventional encryption technique, but its security is still considered acceptable for most video applications. It can improve the original data’s spatial correlation when consecutive frames are similar, possibly making the encrypted video even more compressible than the original video.

However, the original cryptographic scheme was designed to exploit only the spatial correlation inside each frame, whereas codecs can also exploit non-trivial temporal correlation. Moreover, the improvements in the data’s spatial correlation that come from the permutation transformations rely heavily on the natural temporal correlation in the video. Hence its performance is strongly tied to the amount of motion in the video. The work developed in this dissertation aims to extend this cryptographic scheme by adding motion compensation concepts to the permutation-based transformations used in the video encryption technique, in order to improve its performance and make it more resilient to high motion videos.

Keywords: Video encryption. Video coding. Multimedia data security. Motion compensation. Data correlation. Compression

List of Figures

1.1 Cryptography scenario
2.1 Video notation
2.2 Example of image histogram plot
2.3 Spatial correlation illustration
2.4 Temporal correlation
2.5 Video coding: frame types
2.6 Video coding scheme
2.7 Permutation and compressibility
3.1 Overview of cryptographic scheme
3.2 Block-based approach
3.3 Histogram hiding extension
4.1 Example of Three Step Search algorithm’s convergence: the numbers indicate the center of candidate blocks at each execution step.
4.2 Example of Two Dimensional Logarithmic Search algorithm’s convergence: the numbers indicate the center of candidate blocks at each execution step.
4.3 Linear motion estimation principle for frame prediction
4.4 Example of block motion vector parameters in a frame
4.5 Consecutive frames residual difference - Flower sequence
4.6 Consecutive residual frames with motion compensation - Flower sequence
6.1 Frame examples of the video sequences dataset
6.2 ‘Almost-sorting’ permutation quality in Flower sequence frame 4
6.3 Bitstream size by frame plot comparing original method versus motion compensated extended algorithm (using TDL motion estimation).
6.4 PSNR-QP plots comparing the original Socek method with the extended motion compensation encryption version for each high motion sequence
6.5 PSNR-QP plots comparing the original Socek method with the extended motion compensation residual encryption version in each high motion sequence

List of Tables

6.1 Video sequences information
6.2 Bitstream size comparison of encrypted video (in KB) for different unique sorting permutation algorithms for MPNG codec
6.3 Bitstream size comparison of encrypted video (in KB) for different unique sorting permutation algorithms for H.264 codec with QP = 4
6.4 Average PSNR comparison of encrypted video (in dB) for different unique sorting permutation algorithms for H.264 codec with QP = 4
6.5 Average MSSIM comparison of encrypted video for different unique sorting permutation algorithms for H.264 codec with QP = 4
6.6 Bitstream size comparison (in KB) for the motion compensation extension using H.264 codec with QP = 4
6.7 Average MSSIM comparison for the motion compensation extension using H.264 codec with QP = 4
6.8 Average PSNR comparison (in dB) for the motion compensation extension using H.264 codec with QP = 4
6.9 Bitstream size reduction relative to Socek method for the extended methods in high motion sequences for different motion estimation algorithms using H.264 codec with QP = 4
6.10 Bitstream size comparison (in KB) for the motion compensated histogram hiding extension encryption using H.264 codec with QP = 4
6.11 Average MSSIM comparison for the motion compensated histogram hiding extension encryption using H.264 codec with QP = 4
6.12 Average PSNR comparison (in dB) for the motion compensated histogram hiding extension encryption using H.264 codec with QP = 4
6.13 Bitstream size reduction relative to Socek method for motion compensated

List of Algorithms

3.1 Unique sorting permutation method based on Quicksort
3.2 Video encryption algorithm
3.3 Video decryption algorithm
3.4 Constant camera translation adjustment for sorting permutation
4.1 Unique sorting permutation method based on Counting Sort
4.2 Motion compensated video encryption algorithm
4.3 Motion compensated video decryption algorithm
4.4 Motion compensated video encryption algorithm on the frame residuals

1 Introduction

This chapter introduces the context of multimedia data security and encryption, which is the subject of the research developed in this dissertation. It also presents an overview of the scope of this work and highlights the objective of the research.

1.1 Multimedia data security

In recent years, there has been considerable improvement in computer and networking technologies, which have provided simple methods for processing, distributing and storing data to most user services and applications. However, due to the openness of wired and wireless networks, the data handled by such applications can be easily copied or modified (JAKIMOSKI; SUBBALAKSHMI, 2008). In parallel with the growth of these technologies, digital rights management emerged as an important research area to guarantee the protection and authentication of copyrighted multimedia data (GRANGETTO; MAGLI; OLMO, 2006).

Cryptography is defined as the study of mathematical techniques related to aspects of information security, such as confidentiality, data integrity, entity authentication and data origin authentication (MENEZES; VANSTONE; OORSCHOT, 1996). Hence, this research area is useful to guarantee the secrecy of information that must be readable only by authorized individuals. This is usually done by applying a series of transformations that convert the original data into a format from which it can be restored only by individuals holding a given secret key. The process that makes the data illegible is referred to as encryption, while the reverse process, which recovers the original message, is called decryption.

The cryptography scenario is illustrated in Figure 1.1. Two entities (sender and receiver) attempt to establish secure communication in order to transmit a message over a public channel. This channel is insecure and can be eavesdropped on by an unauthorized entity, i.e., an attacker, who tries to recover the original message. The original message to be transmitted is called the plaintext, while the encrypted message sent through the insecure channel is called the ciphertext.

Figure 1.1: Cryptography scenario

In terms of information security, there are four main goals of cryptographic systems (MENEZES; VANSTONE; OORSCHOT, 1996):

Confidentiality: a service designed to prevent the content of the information from becoming legible to unauthorized users. The techniques that provide it usually apply mathematical functions that make the data unreadable.

Data integrity: a service used to detect whether the message was modified by an unauthorized user. This is very important in the context of the Internet, since the connection between sender and receiver usually goes through many intermediate points, any of which could maliciously manipulate the data before delivering the message to the receiver.

Authentication: a service concerned with the identification of entities and of the information itself. In communication, it is important that both sides are able to identify each other. It can be provided as entity authentication, which identifies the source and the endpoint of the communication, or as data origin authentication, which essentially checks the data integrity and assumes that any unauthorized modification of the content means the data origin may have changed.

Non-repudiation: a service that prevents an entity from denying previous actions or commitments. For instance, no entity can later deny having sent a message that it did send. Resolving this kind of dispute usually requires the presence of a trusted third party.

1.2 Video encryption

In the context of video applications, due to the existing public networks and the wide use of the Internet, information security has become an important issue (LIU; KOENIG, 2005), especially with regard to the secrecy of the users’ information. The use of cryptographic techniques has therefore become a necessity in this area.

However, these applications have some specific requirements that are not considered by regular encryption systems, which demanded the adaptation of conventional cryptographic schemes. Some of these requirements are highlighted below (SOCEK et al., 2007):

Codec standard and video format compliance: the encrypted data must preserve the video compression format in such a way that standard decoders are able to decode it without any errors, even without decryption, although the decoded video is unrecognizable. In other words, the encryption system must require no changes in the encoding and decoding modules, which should be used as a black box.

Perception quality control: the encrypted video should not reveal any information related to the original video. Usually the encryption algorithm has a mechanism to control how much the perception quality is degraded in the encrypted data. In some applications of partial encryption, the perception quality is only partially degraded, so the video is still perceivable but some specific details cannot be recovered.

Processing speed: most video applications are real time, like video conferencing and video streaming, so they need an encryption algorithm that is fast enough to be applied under these conditions. It is also important to note that videos usually involve large amounts of data, which makes this task even more complicated.

Video compression: it is very desirable that the encrypted video has a bitstream size similar to that of the original video when both use the same coding modules under the same encoding settings.

Video encryption research is basically split into two main approaches. Selective Encryption applies conventional encryption techniques to specific small parts of the video bitstream. The principle is that these encrypted parts are crucial for the perception of the video and small enough for a conventional encryption algorithm such as AES (Advanced Encryption Standard) or DES (Data Encryption Standard) to be applied (SINGH; SUPRIYA, 2013). Spanos and Maples, for instance, proposed an algorithm that encrypts only the I-frames of every MPEG group of pictures. Since the decoding of P and B frames depends on the associated decrypted I-frames, they cannot be recovered by an unauthorized user either (MAPLES; SPANOS, 1995).

The second methodology is Full Encryption, which applies to the whole bitstream an entropy coding that also provides encryption (LI; CHEN; ZHENG, 2004). The principle here is to design an efficient technique that encrypts and encodes at the same time. This methodology is more promising, since it usually does not require any modifications in the codec modules. However, it is a big challenge to design an algorithm that is efficient enough to process large bitstreams and still provides secure encryption with good compression performance. As a result, techniques of this kind have often been proved insecure against some types of cryptographic attacks, and many of them are meant for limited security scenarios.

Some encryption algorithms are codec specific (SHI; BHARGAVA, 1998a,b), while others do not depend on the video codec used by the application. The codec specific ones have greater potential for optimization towards better compressibility or video quality, but their usage is more limited and they restrict the possibility of using different codecs in the application. Most of the codec independent algorithms apply transformations to each video frame, which is then encoded and decoded by the codec module used as a black box.

1.3 Objective of the research

The focus of this work is to propose a secure, codec independent video encryption algorithm for a generic video application, aiming also at solid compression and quality performance with an acceptable video processing speed.

In order to be codec independent and to optimize compression performance, some techniques apply the encryption before the coding stage in such a way that the encrypted video is highly compressible. In this context, the technique proposed by Socek exploits the duality of permutation in data encryption and compression (SOCEK et al., 2007). This technique is codec independent, with only a few codec restrictions, and exploits the compressibility potential of a generic spatial codec.

This cryptographic scheme exploits the spatial correlation within each video frame and uses permutations to enhance it in the encrypted data, assuming there is a trivial and natural temporal correlation between consecutive frames. The compression performance of the scheme is very sensitive to the amount of motion in the scene, and exploiting non-trivial temporal correlation was not in the scope of that work.
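The role of temporal correlation in this idea can be illustrated with a minimal sketch. It is not the scheme's actual algorithm, which the third chapter describes; it only assumes that the permutation that sorts the previous frame is applied to the current frame, so that similar consecutive frames give an almost sorted, and therefore spatially smooth, result. The synthetic frames, the noise model and the helper names (`sorting_permutation`, `apply_permutation`, `invert_permutation`, `roughness`) are hypothetical.

```python
import random

def sorting_permutation(pixels):
    """Indices that sort `pixels` in non-decreasing order (ties keep their order)."""
    return sorted(range(len(pixels)), key=lambda i: pixels[i])

def apply_permutation(pixels, perm):
    """Output position j receives the pixel found at input position perm[j]."""
    return [pixels[i] for i in perm]

def invert_permutation(perm):
    """Permutation that undoes `perm` when passed to apply_permutation."""
    inverse = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inverse[old_pos] = new_pos
    return inverse

def roughness(pixels):
    """Mean absolute difference between neighbouring samples."""
    return sum(abs(a - b) for a, b in zip(pixels, pixels[1:])) / (len(pixels) - 1)

# Assumption: frames are flattened to 1-D lists of 8-bit intensities, and the
# current frame is the previous frame plus small noise (very little motion).
rng = random.Random(0)
previous = [rng.randrange(256) for _ in range(4096)]
current = [min(255, max(0, p + rng.randint(-3, 3))) for p in previous]

perm = sorting_permutation(previous)          # derived from the previous frame
permuted = apply_permutation(current, perm)   # almost sorted version of current
restored = apply_permutation(permuted, invert_permutation(perm))

print(roughness(current), roughness(permuted))
print(restored == current)
```

In this toy setting the permuted frame is much smoother than the unpermuted one, and the inverse permutation restores the frame exactly. If `current` differed strongly from `previous` (high motion), the permuted frame would no longer be close to sorted, which is the sensitivity to motion mentioned above.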

However, many applications deal with high motion videos, and most codecs nowadays also exploit non-trivial temporal correlation in order to obtain a better compression rate. The original method is studied and extended in this dissertation to include more complex video coding concepts widely used by modern codecs in order to improve its performance.

This research can be considered an extension of the original method. The cryptographic scheme proposed by Socek focused on improving the spatial correlation of every video frame with permutations. This work expands the research by also exploiting non-trivial temporal correlation, making the scheme more suitable for high motion sequences.

1.4 Organization

This work is organized in seven chapters. The first one introduces the context and motivation and highlights the objective of the research. The second chapter explains basic digital video encryption and compression concepts. The third describes and analyzes the original method studied and extended by this work. The fourth chapter details the temporal correlation extension proposed by this work. The fifth chapter analyzes the cryptographic system in terms of security and performance. The sixth chapter reports experiments conducted to evaluate the performance of the methods proposed in this dissertation. Finally, the seventh chapter presents final considerations about this dissertation as well as possible future work.

2 Digital video processing

This chapter introduces some basic concepts related to digital video processing, focusing on the coding and encryption areas. It also defines the notation used in this document.

2.1 Definitions

A digital video can be described as a sequence of frames. Each frame is a digital image with a specific width and height. As defined in (GONZALEZ; WOODS, 2006), a digital image is a two dimensional function f(x, y), where the variables x and y are spatial coordinates in the image and f(x, y) is the intensity or grayscale level at the point identified by the x and y coordinates. Since it is a digital image, x, y and f(x, y) are discrete values. The term pixel is used to represent the elements of the digital image and is equivalent to a point in the space defined by the image function.

The definition above refers to grayscale images, but it can be easily extended to color images by considering f(x, y) as a tuple that represents the color itself, i.e., usually the coordinates in the color space used. In the RGB color space, for instance, each value of f(x, y) would be a tuple (R, G, B).

The notation used in this work, especially in the description of algorithms and procedures, is the following (also shown in Figure 2.1):

The total number of frames in a video is represented by the variable N.

The variables W and H represent, respectively, the width and height of each video frame. These two parameters are the same for every frame of the video.

The n-th frame is denoted by F_n. The frames are indexed from 1 to N.

A pixel at the coordinates (x, y) of a frame F_i has intensity denoted by F_i(x, y). The coordinate x is defined in the interval [0, W − 1], while y is defined in [0, H − 1]. The point (0, 0) denotes the upper leftmost point in the image.

Figure 2.1: Video notation. (a) Frame sequence; (b) frame parameters.

The pixel intensity in the image varies from 0 to I_max, according to the intensity resolution of the image. In practice, in an n-bit resolution image, I_max = 2^n − 1. The histogram of an image is the distribution of the pixel intensities in it. Mathematically, the histogram h(x) of an image can be defined as a function of the grayscale level x, where h(x) represents the number of pixels with intensity x in the image. A visual example of a histogram plot is shown in Figure 2.2.

Figure 2.2: Example of image histogram plot
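As a minimal sketch of the histogram definition above, the function below counts the pixels of each intensity in a small grayscale frame. The function name `histogram`, the row-major layout `frame[y][x]` and the sample frame are assumptions made for illustration only.

```python
def histogram(frame, i_max=255):
    """Return h, where h[x] is the number of pixels of intensity x (0..i_max)."""
    h = [0] * (i_max + 1)
    for row in frame:
        for value in row:
            h[value] += 1
    return h

# Hypothetical frame with H = 3 rows and W = 4 columns, stored as frame[y][x],
# using a 2-bit intensity resolution, so I_max = 2**2 - 1 = 3.
frame = [
    [0, 0, 1, 3],
    [2, 1, 0, 3],
    [3, 3, 3, 0],
]
h = histogram(frame, i_max=3)
print(h)
```

For this frame the histogram is `[4, 2, 1, 5]`, and its entries always add up to W × H, here 12.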

2.2 Video coding

A codec is a processing module responsible for encoding and decoding a signal. This term is used in the video, image and audio processing fields. There are two main types of codecs:

Lossless: assures that the encoded signal can be decoded with content identical to the original signal. In general, this implies a much lower compression rate, since no change can be made to the original signal to make it more compressible and the codec can only exploit the statistical redundancy in it.

Lossy: allows the encoding process to lose some information from the original signal. It is based on the principle that some information in the signal (minor details) can be discarded without significantly affecting how the signal is perceived. For instance, in the audio area, some frequencies are barely perceptible to human hearing and hence, if removed, do not degrade the quality of perception too much. Most codecs of this type have a quality control mechanism so that the user can increase or reduce the amount of information the codec may discard to make the signal more compressible.

2.2.1 Data redundancy

With respect to data encoding, it is important to be able to represent the desired information with as few information units as possible. Some types of information require fewer bits than others to be encoded under the same coding system. For example, a video in which all frames are identical to one another is much more easily coded with few units of information than a video with very distinct frames.

In order to formalize the concept of information “complexity”, a metric is needed that is able to quantify the amount of information in a signal. With this purpose, Shannon introduced the concept of entropy (SHANNON, 1948) in the information research area, although it was originally proposed for communication theory (IHARA, 1993).
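As a minimal sketch, the entropy of a signal can be estimated from the relative frequencies of its symbols using Shannon's formula H = −Σ p(s) log2 p(s). The function name `shannon_entropy` and the three sample sequences are hypothetical, and the estimate is a first-order one: it treats symbols as independent and ignores any correlation between neighbouring samples.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """First-order entropy, in bits per symbol, of the empirical distribution."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

constant = [128] * 64           # fully predictable signal
two_level = [0, 255] * 32       # two equally likely intensities
all_levels = list(range(256))   # every 8-bit intensity equally likely

for name, data in [("constant", constant),
                   ("two_level", two_level),
                   ("all_levels", all_levels)]:
    print(name, shannon_entropy(data))
```

The three sequences evaluate to 0, 1 and 8 bits per symbol respectively: the more unpredictable the symbols, the more bits an ideal coder needs on average for each of them.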

Entropy is associated with the level of uncertainty or unpredictability of the information. In other words, it measures the lack of redundancy in the information. In a video, there are four types of data redundancy or data correlation that can be exploited in the encoding (ESAKKIRAJAN; VEERAKUMAR; NAVANEETHAN, 2009; GONZALEZ; WOODS, 2006):

Coding redundancy: refers to the average length of the words used to encode the symbols that occur in the data to be encoded. It is applicable to many types of data, such as image, audio and text. For example, if a grayscale image containing only black and white pixels is coded with an intensity resolution of 8 bits, the representation is highly redundant, since a single bit per pixel would suffice to distinguish black from white pixels all over the image.

Entropy coders in general are capable of eliminating much of the coding redundancy. The classic Huffman coding (HUFFMAN, 1952) builds a codeword table that assigns the shortest codewords to the symbols that are most frequent in the data.

Spatial correlation: applies to each video frame separately (intra-frame) and comes from the image compression area. Also known as interpixel redundancy, the idea is that in most images pixels spatially close to one another have a high probability of having similar intensities, especially in the interior of regions. This tends to be true in the whole image, except for the edge areas. It is also related to human visual perception: when an image with very low spatial correlation is seen, its shapes and details are barely recognizable by the human eye in most cases. This principle is also exploited by encryption in order to explicitly degrade the quality of perception of an image, as shown in Figure 2.3.

Figure 2.3: Spatial correlation illustration. (a) Original Lena image; (b) random permutation of the Lena image.

Spatial correlation can also be easily exploited by image compression algorithms. A simple example is the Run Length algorithm, which encodes a sequence of pixels of the same intensity as a pair of parameters: the intensity value and the number of pixels with that intensity.

Temporal correlation: this concept involves more than a single frame (inter-frame). It is based on the video perception principle that consecutive frames tend to be very similar, differing from one another only by small object movements, unless there is a change of scene. This principle is justified by the fact that, if it did not hold, the video would cause a strange sensation of discontinuous movement of the elements of the scene, which would greatly degrade the video’s perception quality.

Hence, the transition between two consecutive frames in the same scene usually consists of small object movements (motion), camera translation, zoom or minor intensity variation. Figure 2.4 shows, for example, two consecutive frames and the residual difference between them (the difference is centered on the medium grayscale level).

Figure 2.4: Temporal correlation. (a) Previous frame; (b) current frame; (c) residual difference of (a) and (b).

Psychovisual redundancy: consists in the fact that human vision does not respond equally to all frequencies in a signal (or to all intensities and colors of pixels), so some of them can be discarded while barely degrading the perception of the signal. This type of redundancy requires a more specific study of human vision, and it can be more subjective and complicated to estimate properly.
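As a minimal sketch, the spatial and temporal correlation concepts listed above can be illustrated numerically on one-dimensional rows of pixels. The helper names (`roughness`, `run_length_encode`, `run_length_decode`, `centered_residual`) and the sample data are hypothetical, and clipping the centered residual to the valid intensity range is an assumption made here so that it can be displayed as an image, not something stated in the text.

```python
import random

def roughness(pixels):
    """Mean absolute difference between adjacent samples (low = correlated)."""
    return sum(abs(a - b) for a, b in zip(pixels, pixels[1:])) / (len(pixels) - 1)

def run_length_encode(pixels):
    """Encode a sequence as (intensity, run length) pairs."""
    runs = []
    for value in pixels:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]

def run_length_decode(runs):
    """Expand (intensity, run length) pairs back into the pixel sequence."""
    return [value for value, count in runs for _ in range(count)]

def centered_residual(previous, current, i_max=255):
    """Per-pixel difference shifted so that 'no change' maps to mid-gray."""
    mid = (i_max + 1) // 2
    return [min(i_max, max(0, c - p + mid)) for p, c in zip(previous, current)]

# Spatial correlation: a smooth ramp versus a random permutation of it.
row = [x // 4 for x in range(256)]
shuffled = row[:]
random.Random(1).shuffle(shuffled)

# Run Length coding of a row with long constant runs.
flat_row = [7] * 5 + [9] * 3 + [7] * 2
runs = run_length_encode(flat_row)

# Temporal correlation: a bright object moves one pixel to the right.
previous = [10, 10, 10, 200, 200, 10, 10, 10]
current = [10, 10, 10, 10, 200, 200, 10, 10]
residual = centered_residual(previous, current)

print(roughness(row), roughness(shuffled))
print(runs)
print(residual)
```

The shuffled row has exactly the same histogram as the ramp but a much higher roughness, which is why a random permutation hides the image content and also hurts spatial compression. The ten-pixel row collapses to `[(7, 5), (9, 3), (7, 2)]`, and the residual is mid-gray (128) everywhere except at the two positions where the object edge moved.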

2.2.2 Video frame types

With respect to exploiting data redundancy in video, there are spatial-only codecs that exploit only the data redundancy that occurs spatially in each frame separately. However, most codecs also exploit temporal correlation. Hence the frame coding process can occur in three main ways, which define the three main frame types (MAYER-PATEL; LE; CARLE, 2002):

I-frame: its compression is intra-frame, which means it is encoded and decoded by itself. However, its compression potential is more limited, since the only redundancy exploited is its own. Spatial-only codecs only have frames of this type.

P-frame: also known as forward predicted picture. It is compressed based on small changes relative to an earlier coded picture. It is therefore not self-decodable and requires the reference frame to be decoded first.

B-frame: also known as bidirectionally predicted picture. It is compressed based on predictions or interpolations of earlier and/or later pictures. Its compression potential is the best among the three types, but its decoding depends on the reference frames. In a long chain of B and P frames, the decoding speed of a given frame can be degraded significantly.

2.2.3 Encoding and decoding module

In order to decode a video consistently and efficiently, when a P or B frame is being decoded, its reference frames must already have been decoded, so they must appear earlier in the bitstream. Codecs usually read the bitstream linearly and store the decoded frames in a frame buffer in memory. As the next frames are read and decoded from the bitstream, the decoding of a P or B frame can only depend on frames that are already stored in the frame buffer. This implies that the order of the coded frames is not necessarily the same as the order of the original video, as can be seen in Figure 2.5.

(a) original frame order

(b) coded frame order in bitstream

Figure 2.5: Video coding: frame types
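The reordering between original and coded frame order can be illustrated with a minimal sketch. It assumes, for illustration only, that every B-frame depends on the nearest I or P frame before and after it in display order; the function name coded_order and the input string are hypothetical and not part of any codec.

```python
def coded_order(display_types):
    """Return display indices in the order they would be written to the bitstream."""
    coded, pending_b = [], []
    for index, frame_type in enumerate(display_types):
        if frame_type == "B":
            # A B-frame must wait for the next I/P frame it depends on.
            pending_b.append(index)
        else:
            coded.append(index)       # the reference frame goes first
            coded.extend(pending_b)   # then the B-frames that needed it
            pending_b.clear()
    coded.extend(pending_b)  # trailing B-frames with no later reference
    return coded


print(coded_order("IBBPBBP"))  # [0, 3, 1, 2, 6, 4, 5]
```

In this example the P frame at display position 3 is written before the two B frames that precede it, so that the decoder already holds both references when it reaches them.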

To reduce the complexity of decoding a video and the amount of memory it requires, and also to allow the decoding process to “jump” to a given frame, the concept of group of pictures (GOP) was created. A group of pictures is a self-decodable unit of frames; it can be interpreted as if the large video were divided into a set of multiple independent small videos. Naturally these units have to start with an I-frame. So whenever a given frame (which is not an I-frame) needs to be decoded from scratch, the decoding module must find which GOP that frame belongs to and decode that GOP from its start until the desired frame is found in the bitstream. In an open GOP, it is possible that a few frames in the current GOP reference frames from a previous one, but such cross-GOP references are usually avoided so as not to degrade the decoding speed.

In B or P frames, the frame prediction mentioned is basically frame “alignment” (since there may have been some object movement or camera translation between such frames), and its compressibility potential comes from coding the residual difference between the current frame and its prediction. The goal of frame alignment is to minimize the differences (prediction error) between the current and reference frames when computing the difference of pixel intensities at the same coordinates.

To perform frame alignment, motion parameters are first computed with motion estimation techniques and then applied to the desired frame using motion compensation techniques. The motion estimation parameters are usually computed in the encoding stage and encoded in the bitstream so that the decoder module is able to reconstruct the frame properly.

Codecs, in general, operate on a block basis, applying the encoding process to blocks separately. Some of them allow some blocks of a frame to be intra-coded (like an I-frame) and others to be inter-frame coded, like a P or B frame. In this sense, the concept of reference frames is extended to reference blocks, each of which addresses a block in one of the reference frames stored in the frame buffer.

Hence the encoding process for each block is usually the following:

1. Decide if the block should be intra-coded or inter-coded. This decision is usually based on the existence of a similar block in one of the reference frames. Block similarity is usually measured with a block difference metric, detailed later in Chapter 4. In case it is intra-coded, encode the block itself with the process defined in step 3.

2. Compute motion vector parameters for this block by finding a good matching block. The methods for finding such blocks efficiently in the reference frame are detailed later in Chapter 4. The motion vectors are basically the vertical and horizontal translations from the position of the current block to that of the reference one. Encode these motion vectors in the bitstream using an entropy coder.

3. Compute the residual block, which is the aligned difference between the current block and the predicted one. Encode the residual block like a spatial encoder, using an algorithm that exploits spatial correlation. The most common approach for such encoding is based on the discrete cosine transform (DCT), which outputs a matrix where the most significant coefficients (those of the lower frequencies) are concentrated in one portion of it. After that, a zig-zag scan is performed on this matrix to generate an output sequence in which the larger coefficients are close to one another and the values gradually decrease until they eventually reach runs of zeros. The output sequence is then processed by run-length encoding to remove the redundancy of repeated values in the sequence, and its output is finally given to an entropy coder.

Figure 2.6: Video coding scheme

The basic coding module scheme can be seen in Figure 2.6. The temporal module is responsible for storing the reference frames, performing motion estimation and computing the residual image after motion compensation. The spatial module exploits the spatial correlation in the residual and outputs it to the entropy coder. The entropy coder also takes the motion vectors from the temporal module and outputs both into the bitstream.
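The zig-zag scan and run-length stages of step 3 can be sketched as follows. This is only an illustration under stated assumptions: the 4 × 4 block of already transformed and quantized coefficients is made up, its largest values are assumed to sit in the top-left corner, and the function names zigzag and run_length_encode are hypothetical. The DCT and the entropy coder are left out.

```python
from itertools import groupby


def zigzag(block):
    """Scan a square coefficient block along its anti-diagonals."""
    n = len(block)
    coords = sorted(
        ((r, c) for r in range(n) for c in range(n)),
        # Primary key: anti-diagonal index. Secondary key alternates the
        # direction of travel on odd and even diagonals.
        key=lambda rc: (rc[0] + rc[1], rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )
    return [block[r][c] for r, c in coords]


def run_length_encode(sequence):
    """Collapse runs of equal values into (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(sequence)]


block = [
    [12, 5, 1, 0],
    [4, 2, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
scanned = zigzag(block)
print(scanned)                     # [12, 5, 4, 1, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(run_length_encode(scanned))  # [(12, 1), (5, 1), (4, 1), (1, 1), (2, 1), (1, 1), (0, 10)]
```

The scan groups the ten zero coefficients into a single run, which is what makes the run-length stage effective before entropy coding.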

2.3 Permutations in digital videos

A sequence s of length N is an ordered collection of N elements allowing repetition. Formally, we define:

$$s = \langle x_0 \; x_1 \; x_2 \; \ldots \; x_{N-1} \rangle$$

We denote by s[i] the i-th element of the sequence s. A permutation of a sequence s is a bijection from s to itself (SOCEK et al., 2007), i.e., a mapping of each element to an element of the same sequence, in which no target element is repeated and every element is mapped. Formally, a permutation P of a sequence s is defined as a matrix:

$$P = \langle i_0 \; i_1 \; i_2 \; \ldots \; i_{N-1} \rangle$$

where $i_j$, with $0 \le j < N$, is the index into which the j-th element is mapped. The notation P[j] is equivalent to $i_j$.

Since P is a bijection, the permutation matrix has the following properties:

$$\forall x, y \in \{0, \ldots, N-1\},\; x \neq y \rightarrow i_x \neq i_y \quad \text{and} \quad \forall x \in \{0, \ldots, N-1\},\; \exists y \in \{0, \ldots, N-1\} \mid i_y = x.$$

The permutation matrix is defined unidimensionally, but it can be applied to a multidimensional entity (such as a video frame) by just defining a unidimensional way of traversing the multidimensional entity. In an image or video frame, this could be done by traversing the image row by row. The notation P(s) used in this work represents the application of the permutation P to the sequence s, which can be a frame for instance. The application of a permutation to a sequence s is defined as the sequence:

$$P(s) = \langle s[P[0]] \; s[P[1]] \; s[P[2]] \; \ldots \; s[P[N-1]] \rangle$$

Given a permutation P, the notation $P^{-1}$ is used for its inverse permutation, i.e., the unique permutation $P'$ such that $P(P'(s)) = s$.
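As a minimal sketch of these definitions, the application of a permutation and the computation of its inverse can be written as below. Sequences and permutations are represented as Python lists, and the function names apply_permutation and invert_permutation are hypothetical helpers chosen for this illustration.

```python
def apply_permutation(p, s):
    """P(s): position k of the result receives the element s[P[k]]."""
    return [s[index] for index in p]


def invert_permutation(p):
    """Return the permutation P' such that P(P'(s)) == s."""
    inverse = [0] * len(p)
    for k, index in enumerate(p):
        inverse[index] = k
    return inverse


p = [2, 0, 1]
s = ["a", "b", "c"]
print(apply_permutation(p, s))                                            # ['c', 'a', 'b']
print(invert_permutation(p))                                              # [1, 2, 0]
print(apply_permutation(p, apply_permutation(invert_permutation(p), s)))  # ['a', 'b', 'c']
```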

Permutations have been widely used in encryption techniques for a long time. Permutation-based transformations are the fundamental basis of modern symmetric key cryptography, such as in the AES or DES encryption systems (SINGH; SUPRIYA, 2013).

Permutations have also been used as primitives in compression techniques. The Burrows-Wheeler Transform (BWT) (BURROWS et al., 1994) is an example of a permutation-based transform. It operates on blocks of text, generating a block with the same characters but in a different order, with improved spatial correlation.

In the context of encryption, permutations can be applied in different ways: to each video frame separately or to the bitstream itself. These permutations are usually generated from a secret key, which must be shared between both sides of the communication. Such a secret permutation can then be applied to pixels or blocks of the frames to completely degrade the video’s perceptual quality while keeping the bitstream decodable (pure permutation video encryption). Another way is to permute important parts of the bitstream, making the permuted bitstream impossible to decode properly unless it is decrypted.

Among the existing approaches, one of them consists in scrambling randomly the coefficients of the DCT in an MPEG video frame based on a secret permutation (TANG, 1996). Another known strategy is to apply the permutation on the codewords table used by the Huffman coding algorithm in the entropy coder module (BHARGAVA; SHI; WANG, 2004). Despite being optimized, both of them are highly invasive and specific to the codec.

Applying a permutation to the pixels of an image affects the spatial correlation of their neighborhood. This fact is used by encryption techniques to destroy the spatial correlation, making the image unrecognizable to non-authorized users. However, this drastically degrades the compression rate of the image. On the other hand, if a sorting permutation is used, for instance, bringing close together pixels with similar intensities that were distant from each other in the image, the permuted frame becomes highly compressible. This would require the transmission of such a permutation so that the frame could be reconstructed properly, which can be expensive.

According to (SOCEK et al., 2007), sorted and ‘almost-sorted’ frames, are strongly spatially correlated and, hence, can be even more compressible than the original frame, when using a spatial-only codec, as it is shown in Figure 2.7. The ‘almost-sorted’ concept mentioned by the authors refers to the application of the previous frame’s sorting permutation into the current frame. For this purpose, it is assumed that consecutive frames are very similar, such that this application results in an almost-sorted image with a few unsorted pixels. This result looks like a gradient-like image with some noise in it.

(a) original image PNG 176x144 22.3KB (b) ‘almost-sorted’ image PNG 176x144 12.5KB

3 Permutation-based video encryption

This chapter presents a detailed description of the video encryption method proposed by Socek et al. (SOCEK et al., 2007). This permutation-based video encryption method was designed to create an efficient and highly compressible solution for video encryption suitable for real time applications.

3.1 Cryptographic scheme

The cryptographic scheme assumes the existence of two communication channels. The first one, ChS, is a secure channel, where the data transmitted are encrypted with a safe communication protocol, which is generally a conventional encryption technique. The second channel, ChR, allows free data transfer, without any protection or security protocol executed on either end.

Both channels can be eavesdropped, but the secure channel requires that the attacker breaks the security of the encrypted data transmitted through such channel, which is assumed to be extremely computationally expensive and unfeasible. However any data transferred through the secure channel ends up in a much bigger bitstream size, because of its encryption, so this channel should be avoided as much as possible since it would affect drastically the video transmission rate.

This cryptographic scheme operates with a spatial-only codec module as a black-box, without depending on how it works or how the video is represented in the bitstream. Two functions are assumed to be available for the codec: encode and decode. The following notation is used for the codec module functions used in the system:

The expression E(F) denotes the output frame of the encoding of frame F to be written in the bitstream.

The expression D(F) denotes the decoded version of the frame F read from the bitstream.

Note that in case of a lossless codec, D(E(F)) = F, but this may or may not hold true for a lossy codec.


Figure 3.1: Overview of cryptographic scheme

The encryption stage in this cryptographic scheme occurs before the encoding stage, in such a way that the encrypted data usually preserves or improves the spatial correlation of the pixels in each frame. An overview of the cryptographic system’s architecture is shown in Figure 3.1.

The scheme is a symmetric key cryptography type and the protocol includes a setup for the transmission of the first key, which is the unique sorting permutation of the first frame, through the secure channel. The ownership of the key determines the authenticity of a user. The next frames are transmitted through ChR and are encrypted with the current key, which is the previous frame’s sorting permutation. At each step, the current key is updated to the current frame’s sorting permutation. Hence, the second frame is encrypted with the first frame’s sorting permutation, the third one with the second frame’s sorting permutation, and so on.

3.1.1 Unique sorting permutation

The scheme’s encryption process is based on generating a key which is the unique sorting permutation of a frame. The unique permutation restriction comes from the fact that a frame can have multiple sorting permutations and since this permutation is the cryptographic key, both sides must be able to obtain the exact same sorting permutation so that the frame reconstruction works correctly.

The sorting permutation is computed with respect to the pixel intensities in the frame. There are many ways this permutation can be computed. The authors indicated a Quicksort variation, shown in Algorithm 3.1, in which the permutation is computed as a copy of the frame is sorted. The procedure takes the frame F, the permutation P, which is adjusted, and two integer parameters, which are the beginning and ending indices of the array to be sorted. Note that both F and P are seen as unidimensional entities of size W × H.

Algorithm 3.1: Unique sorting permutation method based on Quicksort

procedure unique_sorting_permutation(F, P, left, right):
    Input: Frame F, unique sorting permutation P, integers left and right
    Initialize i ← left − 1, j ← right, v ← F[left]
    if right ≤ left then
        return
    end
    repeat
        i ← i + 1
        while F[i] < v do
            i ← i + 1
        end
        j ← j − 1
        while j > left and F[j] > v do
            j ← j − 1
        end
        if i < j then
            swap F[i] and F[j]
            swap P[i] and P[j]
        end
    until i ≥ j
    unique_sorting_permutation(F, P, left, i − 1)
    unique_sorting_permutation(F, P, i + 1, right)
end

A sorted frame (gradient-like image) is obtained when applying the sorting permutation to the frame itself. However, when this permutation is applied to the next frame, it tends to generate an ‘almost-sorted’ frame, since consecutive frames are usually very similar. The authors suggest that sorted and ‘almost-sorted’ frames are strongly spatially correlated, hence they can be highly compressible by a spatial encoder.

3.1.2 Encryption and decryption algorithms

In Algorithm 3.2, the pseudocode for the encryption method is shown. It is applicable to both lossless and lossy codecs. Notice that, in the case of lossy codecs, the encrypter side needs to compute the unique sorting permutation on the frames as they will be decrypted by the receiver, to make sure both sides obtain the same key for the next frames. In the case of a lossless codec, the sorting permutation can be computed directly on the input frames.

In this process, the first frame is sent through ChS, as part of the protocol setup, and the unique sorting permutation of its decoded version is set as the first key. At each step, the next frame is encrypted with the current permutation key and sent to the encoder module before being transmitted through ChR, and the key is updated to the sorting permutation of the currently decrypted frame.

Algorithm 3.2: Video encryption algorithm

Input: Stream of video frames F_1, ..., F_N
begin
    F_1^transmitted ← E(F_1)
    F_1^decrypted ← D(F_1^transmitted)
    P_1 ← unique_sorting_permutation(F_1^decrypted)
    Send F_1^transmitted through ChS
    for each F_i, where i = 2, ..., N do
        F_i^transmitted ← E(P_{i−1}(F_i))
        F_i^decrypted ← P_{i−1}^{−1}(D(F_i^transmitted))
        Send F_i^transmitted through ChR
        if i < N then
            P_i ← unique_sorting_permutation(F_i^decrypted)
        end
    end
end

The decryption process is simpler as described in Algorithm 3.3. This algorithm is also valid for both lossless and lossy codecs. Notice that the input of this procedure is the frames transmitted in the encryption process described earlier. Since the notation used in both algorithms is precisely the same, it can be verified that the expression found for the unique sorting permutation (which is the cryptographic key of the system) is exactly the same for both sides, which assures the correctness of the method.
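The key update on both sides can be traced with a small sketch of Algorithms 3.2 and 3.3. Several simplifications are assumed here and are not part of the original scheme: frames are flat lists of intensities, the codec functions encode and decode are identity stand-ins for a lossless codec, the two channels are modeled as plain return values, and the sorting permutation is obtained from Python's stable built-in sort instead of Algorithm 3.1. All function names are hypothetical.

```python
def sorting_permutation(frame):
    # Stable sort: ties keep their original order, so both sides get the same key.
    return sorted(range(len(frame)), key=frame.__getitem__)


def apply_permutation(p, s):
    return [s[index] for index in p]


def invert_permutation(p):
    inverse = [0] * len(p)
    for k, index in enumerate(p):
        inverse[index] = k
    return inverse


def encode(frame):  # identity stand-in for the codec function E
    return list(frame)


def decode(frame):  # identity stand-in for the codec function D
    return list(frame)


def encrypt_stream(frames):
    first = encode(frames[0])                    # goes through the secure channel
    key = sorting_permutation(decode(first))
    regular = []                                 # goes through the regular channel
    for frame in frames[1:]:
        sent = encode(apply_permutation(key, frame))
        regular.append(sent)
        # Derive the next key from what the receiver will actually see.
        seen = apply_permutation(invert_permutation(key), decode(sent))
        key = sorting_permutation(seen)
    return first, regular


def decrypt_stream(first, regular):
    frames = [decode(first)]
    key = sorting_permutation(frames[0])
    for sent in regular:
        frame = apply_permutation(invert_permutation(key), decode(sent))
        frames.append(frame)
        key = sorting_permutation(frame)
    return frames


video = [[3, 1, 2, 1], [3, 2, 2, 1], [0, 2, 3, 1]]
secure, regular = encrypt_stream(video)
print(regular[0])                                # [2, 1, 2, 3], an 'almost-sorted' frame
print(decrypt_stream(secure, regular) == video)  # True
```

With a lossy codec the round trip would no longer be exact, but both sides would still derive identical keys because each key is computed from the decoded frame.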

3.2 Advanced extensions

In order to make the scheme more adapted to specific requirements or to improve certain aspects of the cryptographic scheme, the authors proposed a set of extensions, which are detailed below.

3.2.1 Perception quality control

One of the requirements of many video applications is to be able to control how much the perception quality is degraded in the encrypted video and also to allow partial encryption. From this perspective, Socek et al. designed a block-based extension.

Algorithm 3.3: Video decryption algorithm

Input: Stream of video frames F_1^transmitted, F_2^transmitted, ..., F_N^transmitted
begin
    Receive F_1^transmitted through ChS
    F_1^decrypted ← D(F_1^transmitted)
    P_1 ← unique_sorting_permutation(F_1^decrypted)
    for each F_i^transmitted, where i = 2, ..., N do
        Receive F_i^transmitted through ChR
        F_i^decrypted ← P_{i−1}^{−1}(D(F_i^transmitted))
        if i < N then
            P_i ← unique_sorting_permutation(F_i^decrypted)
        end
    end
end

In this extension, basically instead of computing the sorting permutation for the whole frame, it is computed for each block of fixed size separately. The quality perception control in this method is determined by the block size parameter used.

As the block size increases, the quality of perception decreases. In extreme cases, when block size is 1 × 1, each block sorting permutation is the identity and hence no actual encryption is performed, while when the block size is the same as the frame’s dimensions, the encryption is total and it performs just like the original method.

Notice that when the sorting permutation is reduced to the block space instead of the whole frame, each pixel is potentially permuted to a smaller region, which means that this also reduces the visual impact of unsorted pixels in the almost-sorted permutation coming from the motion between consecutive frames. This can be verified in the high motion sequence example in Figure 3.2, where the almost-sorted permutation errors are much more noticeable in higher block sizes. Hence, this parameter also affects the compressibility of the method, since it potentially affects the spatial correlation of the encrypted frames sent to the encoder modules.
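A minimal sketch of the block-based idea is given below. It assumes a frame stored as a flat list in row-by-row order, square blocks of side block, and Python's stable built-in sort in place of the authors' sorting procedure; the function name block_sorting_permutation is hypothetical.

```python
def block_sorting_permutation(frame, width, height, block):
    """Sorting permutation in which pixels only move inside their own block."""
    permutation = list(range(width * height))
    for top in range(0, height, block):
        for left in range(0, width, block):
            # Flat indices of the pixels that belong to this block.
            positions = [
                y * width + x
                for y in range(top, min(top + block, height))
                for x in range(left, min(left + block, width))
            ]
            # Stable sort of the block's pixels by intensity.
            ordered = sorted(positions, key=frame.__getitem__)
            for destination, source in zip(positions, ordered):
                permutation[destination] = source
    return permutation


frame = [4, 3, 8, 7,
         2, 1, 6, 5]  # 4 x 2 frame, row by row
print(block_sorting_permutation(frame, 4, 2, 1))  # [0, 1, 2, 3, 4, 5, 6, 7]
print(block_sorting_permutation(frame, 4, 2, 2))  # [5, 4, 7, 6, 1, 0, 3, 2]
```

With 1 × 1 blocks the result is the identity, matching the extreme case described above, while a block covering the whole frame gives the same permutation as sorting the full frame.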

3.2.2 Handling constant camera translation

The original algorithm is highly sensitive to global camera translation, since consecutive frames do not match properly in this situation and the almost-sorting permutation errors are bigger in such frames. In order to minimize this issue, the authors proposed a method of adjusting the sorting permutation with global camera translation parameters.

The idea of this extension is that the sender, before encrypting the video, detects the global camera translation, adjusts the sorting permutation and transmits the translation parameters (tx, ty) to the receiver, which also adjusts the sorting permutation once they are received, to ensure both sides obtain the same key.


(a) original frame (b) 8 x 8 block-based method

(c) 16 x 16 block-based method

Figure 3.2: Block-based approach

Algorithm 3.4 shows the proposed implementation for adjusting the permutation. In a simplified description, the pseudocode adjusts the permutation entry of a given pixel to the one shifted by the translation parameters. Since some of these pixels may fall outside the boundaries of the image, their entries are wrapped around and moved to the end of the permutation, i.e., the bottom of the image.
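A direct transcription of Algorithm 3.4 into Python is sketched below, assuming a permutation stored as a flat list over a frame of width × height pixels traversed row by row. The explicit width and height arguments are an addition made only to keep the example self-contained.

```python
def adjust_sorting_permutation(p, tx, ty, width, height):
    """Shift every entry of p by (tx, ty); out-of-frame entries go to the end."""
    adjusted = list(p)
    begin, end = 0, width * height
    for k in range(width * height):
        i = tx + p[k] % width    # shifted column
        j = ty + p[k] // width   # shifted row
        if 0 <= j < height and 0 <= i < width:
            adjusted[begin] = j * width + i
            begin += 1
        else:
            end -= 1
            # Python's % is non-negative for a positive modulus, so negative
            # coordinates wrap around to the opposite side of the frame.
            adjusted[end] = (j % height) * width + (i % width)
    return adjusted


print(adjust_sorting_permutation([0, 1, 2], 1, 0, 3, 1))  # [1, 2, 0]
```

Because the wrapped shift is a bijection on pixel coordinates, the result is still a valid permutation, which the receiver can reproduce from the same (tx, ty) pair.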

3.2.3 Histogram hiding method

One of the biggest security issues of the original method is that the encrypted frame completely reveals the histogram of the original image. Some applications cannot tolerate such information being revealed. The histogram can hint at some aspects of the image and expose what type of image is being encrypted, since some image types (darker, brighter, cartoon) usually have specific histogram patterns, not to mention that it can make it easier for an attacker to reconstruct the original image.

The authors proposed the histogram hiding method, which basically consists in applying the unique sorting permutation to the difference between consecutive frames, instead of the plain current frame. This way, unless the attacker has access to a previously decrypted frame, they cannot obtain any direct histogram information from the encrypted frame.

Algorithm 3.4: Constant camera translation adjustment for sorting permutation

function adjust_sorting_permutation(P, t_x, t_y):
    Input: Sorting permutation P and translation parameters t_x and t_y
    Output: Adjusted sorting permutation P'
    Initialize begin ← 0 and end ← W × H
    Initialize P' as a copy of P
    for each 0 ≤ k < W × H do
        i ← t_x + (P[k] mod W)
        j ← t_y + ⌊P[k]/W⌋
        if 0 ≤ j < H and 0 ≤ i < W then
            P'[begin] ← j × W + i
            begin ← begin + 1
        else
            end ← end − 1
            P'[end] ← (j mod H) × W + (i mod W)
        end
    end
    return P'
end

To allow proper frame reconstruction, the difference between two frames F and G is formalized as in Equations 3.1 and 3.2.

$$\Delta(F, G)[x, y] = \mathrm{clip}\left(F[x, y] - G[x, y] + \frac{I_{max}}{2}\right) \qquad (3.1)$$

$$\mathrm{clip}(x) = \begin{cases} I_{max}, & \text{if } x > I_{max} \\ x, & \text{if } 0 \le x \le I_{max} \\ 0, & \text{if } x < 0 \end{cases} \qquad (3.2)$$

The point of this difference function is to center the pixel intensity difference at $I_{max}/2$, in such a way that intensities above this level represent a positive pixel intensity difference between F and G. The closer it is to a zero frame difference, the more the resulting image looks like plain average gray. This clipping function is only feasible for the lossy scenario though, because pixel differences above the $I_{max}/2$ level are clipped to $I_{max}/2$, which can greatly impact the video quality.
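The effect of the clipping can be checked with a small numeric sketch of Equations 3.1 and 3.2. It assumes 8-bit intensities with I_max = 255 and uses the integer half I_max // 2 = 127 as the offset, which is an assumption made here for integer arithmetic; the reconstruct helper is also hypothetical and simply undoes the offset.

```python
I_MAX = 255


def clip(x, i_max=I_MAX):
    """Equation 3.2: saturate x to the range [0, i_max]."""
    return max(0, min(i_max, x))


def frame_difference(f, g, i_max=I_MAX):
    """Equation 3.1, with the integer offset i_max // 2 (assumption)."""
    return [clip(a - b + i_max // 2, i_max) for a, b in zip(f, g)]


def reconstruct(delta, g, i_max=I_MAX):
    """Hypothetical inverse: recover F from the difference and the reference G."""
    return [clip(d + b - i_max // 2, i_max) for d, b in zip(delta, g)]


f = [100, 250, 0]
g = [90, 10, 200]
delta = frame_difference(f, g)
print(delta)                  # [137, 255, 0]
print(reconstruct(delta, g))  # [100, 138, 73]
```

Only the first pixel, whose difference is small, is recovered exactly; the other two differences exceed the representable range and are clipped, which is why the method is restricted to the lossy scenario.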

Also, this extension deeply affects the video compressibility. This is justified by the fact that a trivial way to exploit temporal correlation and convert it into spatial correlation in videos is to encode the frame difference, as in the process described in the previous chapter. This can be noticed in the homogeneity of the encrypted frames shown in Figure 3.3 when using this extension.

(a) original encrypted algorithm (b) histogram hiding method

(c) original frame

Figure 3.3: Histogram hiding extension

3.2.4 Dealing with frame losses

The original scheme was built under the assumption that no frame is lost in transmission, since the decryption of a given frame is entirely dependent on its previous frame. However, this scenario is unrealistic for online video applications. Therefore, the scheme needs an adaptation to deal with this issue. It should also be adapted to allow a more efficient decryption of a given frame, without needing to decrypt all previous frames first.

Socek et al. propose to encrypt each group of pictures separately, which means transmitting the initial frame of every GOP through ChS, as in the protocol setup. Since the GOP size is usually small enough, the decryption should allow efficient random frame access and, in case a frame is lost in transmission, the decryption will fail only for the remaining frames of the current GOP; as soon as a new GOP starts, the decryption process resumes normally. A more convenient protocol can be implemented from this perspective that could, for instance, detect the loss of frames and start a new GOP immediately.

4 Extended encryption

The scheme proposed by Socek et al., described in the previous chapter, was intended to be used with a spatial-only codec. A deeper study of the impact of motion on the algorithm was outside the scope of the original work, and the technique was designed especially for low motion video sequences. In this chapter, the impact of motion on the scheme’s performance is investigated and an extension is designed to enhance the scheme’s performance for high motion video sequences.

4.1 Unique sorting permutation

One of the main aspects of the algorithm is the computation of the unique sorting permutation. A single frame can have multiple sorting permutations and different sorting permutations impact on how the next frame will be almost-sorted.

A sorting algorithm is classified as stable if it guarantees the order of same value elements in the sequence is preserved in the ordered vector (HORVATH, 1978). Algorithms like Quicksort are usually unstable, just like the procedure described earlier. Among different sorting permutations, a stable sorting permutation is preferred because of the spatial correlation principle, since keeping the order of same value elements is likely to reduce the average distance between the pixel original position and its permuted one.

Counting Sort is a stable sorting algorithm, with running time linear in W × H and I_max (CORMEN et al., 2001). The proposed procedure described in Algorithm 4.1 computes the permutation based on the histogram of the frame. Given the cumulative histogram, the offset vector is derived to define the position to which a pixel of a given intensity should be shifted in the permutation. This offset is incremented each time a pixel of that intensity is found.

By examining the pseudocode, it is easy to notice that the order of pixels with a given intensity is preserved from the original frame. The algorithm performs faster than the Quicksort method described in the previous chapter, and it also requires less space overhead in most cases, since it does not need a copy of the frame, but only a cumulative histogram of the frame.

Algorithm 4.1: Unique sorting permutation method based on Counting Sort

function unique_sorting_permutation(F):
    Input: Frame F
    Output: Unique sorting permutation P
    Initialize histogram H₁ to a vector of zeros for each pixel intensity
    Initialize cumulative histogram H₂ to a vector of size I_max + 1
    Initialize permutation P to a vector of size W × H
    for each 0 ≤ x < W do
        for each 0 ≤ y < H do
            H₁[F[x][y]] ← H₁[F[x][y]] + 1
        end
    end
    H₂[0] ← 0
    for each 0 ≤ x < I_max do
        H₂[x + 1] ← H₁[x] + H₂[x]
    end
    for each 0 ≤ x < W do
        for each 0 ≤ y < H do
            P[x][y] ← H₂[F[x][y]]
            H₂[F[x][y]] ← H₂[F[x][y]] + 1
        end
    end
    return P
end
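Algorithm 4.1 can be sketched in Python as follows, under a few assumptions made for illustration: the frame is a flat list traversed in a fixed order, intensities lie in 0..i_max, and the returned vector holds, for each pixel, its destination position in the sorted frame, as in the listing. The helper scatter, which moves pixels to those destinations, is hypothetical.

```python
def unique_sorting_permutation(frame, i_max=255):
    """Stable counting-sort permutation: destination index of every pixel."""
    histogram = [0] * (i_max + 1)
    for value in frame:
        histogram[value] += 1
    # offset[v] = number of pixels with intensity lower than v
    offset = [0] * (i_max + 1)
    for value in range(i_max):
        offset[value + 1] = offset[value] + histogram[value]
    destination = [0] * len(frame)
    for k, value in enumerate(frame):
        destination[k] = offset[value]
        offset[value] += 1  # next pixel with this intensity goes one slot later
    return destination


def scatter(destination, frame):
    """Place each pixel at its destination position."""
    result = [0] * len(frame)
    for k, position in enumerate(destination):
        result[position] = frame[k]
    return result


frame = [3, 1, 2, 1]
destination = unique_sorting_permutation(frame, i_max=3)
print(destination)                  # [3, 0, 2, 1]
print(scatter(destination, frame))  # [1, 1, 2, 3]
```

The two pixels of intensity 1 keep their relative order (positions 0 and 1 of the sorted frame), which is the stability property discussed above. Inverting this vector gives the source-index form used in the definition of P(s).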

4.2 Motion sensitivity

The spatial correlation of the encrypted frame is directly dependent on the almost-sorting permutation quality. The quality of the almost-sorting permutation is associated to the frame differences, which are essentially histogram difference and motion (unaligned objects between consecutive frames, because of object movements, zoom or camera translation). Therefore, the compressibility and video quality performance of the cryptographic scheme depends on the amount of motion in the video.

One of the extensions proposed by the authors considers a global camera translation; however, the motion in a video usually consists of local motion from object movements as well as global motion, such as camera translation, zooming and rotation. Also, the extension designed by the authors requires the transmission of the translation parameters aside from the video bitstream, since that information cannot be encoded into the bitstream in a codec-independent way. An additional security issue is that information about the scene is exposed by these translation parameters, which would require concern about the encryption and encoding of such parameters.

4.3 Motion estimation

Interframe prediction is very useful in video coding to exploit the large amount of temporal and spatial correlation existing in video sequences. Most video codecs essentially encode the differences between the current and predicted frames, which are based on the previous frames. The more accurate the prediction, the smaller the prediction error to be encoded and the higher the compression rate.

In still scenes, the previous frame is usually a very good prediction for the next frame. However, when there is a significant amount of motion, a better prediction would be a frame where the elements that moved between both frames are aligned. This concept of adjusting or aligning displaced objects in frames is referred to as motion compensation. This process usually involves the detection of the motion parameters, which is known as motion estimation, for later compensation.

Motion estimation techniques are basically classified into two main methods (ARMITANO; FLORENCIO; SCHAFER, 1996):

forward motion estimation (FME) : bases the motion estimation on both the current frame and a previously transmitted frame. Since the current frame is not known by the receiver side, the motion parameters need to be transmitted.

backward motion estimation (BME) : bases motion estimation only on frames previously transmitted.

Motion estimation methods fall under two main categories (IRANI; ANANDAN, 2000; TORR; ZISSERMAN, 2000):

Direct methods : compute motion parameters for two aspects at once: the camera motion and the correspondence of every pixel in the frames.

Indirect methods : work on a feature basis. Instead of computing parameters for every pixel in the image, they focus on key features which are simpler to track in the frames. On the other hand, they need additional care in selecting good features from which to extract motion parameters, and they usually involve a bigger computational cost.

With respect to the motion parameters semantics, the techniques usually can be classified into one of the four main types:

Global motion estimation : the parameters indicate usually camera related motion, such as translation, rotation and zoom. Hence, these parameters affect every pixel in the image. It is more suited for video sequences with essentially camera motion.

Region-based motion estimation : segments the frame into a set of regions and, for every region, computes motion parameters by attempting to find a correspondence in the reference image. It usually involves image segmentation, which can make it very computationally expensive.

Block-based motion estimation : splits the frame into blocks of fixed or variable size and, for each of them, indicates a matching block in the reference frame, hence computing motion parameters for every block in the image. The computational cost of these approaches is usually very low and they are also very parallelizable. Most video codecs rely on this type of motion estimation to perform interframe prediction coding.

Pixel-based motion estimation : is essentially a pixel correspondence problem and computes motion parameters for every pixel in the image. The motion parameters in this approach are usually large and can be very redundant among pixels in the same region with similar motion.

Considering the level of granularity of the motion parameters, the general purpose type, low computational cost and the intensive research on the area, block-based methods were preferred in the development of the extension.

4.4 Block-based motion estimation algorithms

The block-based techniques are essentially based on the block matching problem. This problem can be defined as locating, for each macroblock in the current frame, the best matching block in a reference frame. The blocks can be defined by dividing the image frame into non-overlapping rectangular regions of a given size W × H. The motion parameters defined by these algorithms are referred to as motion vectors, which model the horizontal and vertical movement or position displacement between the matched blocks.

4.4.1 Minimization of dissimilarity function

The block matching problem can be treated as a minimization problem. The objective function to be minimized is usually related to the dissimilarity of the blocks. So an optimal solution is to find for each block in the image, the block in the reference image with minimum dissimilarity value.

There are many dissimilarity functions used in the literature and it is one of the distinct aspects among the different block matching algorithms. The most commonly used functions are Sum of Absolute Differences(SAD), Sum of Squared Differences or Sum of Squared Errors (SSD or SSE), Mean Absolute Difference (MAD) and Mean Squared Difference or Mean Squared Error(MSD or MSE), which are defined in Equations 4.1, 4.2, 4.3 and 4.4.

$$\mathrm{SAD}(F, G) = \sum_{x=0}^{W-1} \sum_{y=0}^{H-1} |F(x, y) - G(x, y)| \tag{4.1}$$

$$\mathrm{SSE}(F, G) = \sum_{x=0}^{W-1} \sum_{y=0}^{H-1} (F(x, y) - G(x, y))^2 \tag{4.2}$$

$$\mathrm{MAD}(F, G) = \frac{\mathrm{SAD}(F, G)}{W \times H} \tag{4.3}$$

$$\mathrm{MSE}(F, G) = \frac{\mathrm{SSE}(F, G)}{W \times H} \tag{4.4}$$

Finding the optimal solution to this problem is usually computationally infeasible in a real-time context, even with a high degree of parallel computing. One of the most straightforward block matching algorithms is Full Search, which considers all the candidate blocks in a limited rectangular region, computes the dissimilarity function for each of them and returns the best matching block in the region.
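A minimal Full Search sketch is given below, under several assumptions that are not taken from the text: frames are lists of rows of integers, SAD is the dissimilarity function, the search region is the set of displacements within ±p pixels of the block position, and candidates falling outside the frame are skipped. The names `full_search`, `get_block` and the parameter `p` are hypothetical.

```python
def sad(block_a, block_b):
    """Sum of Absolute Differences between two equally sized blocks."""
    return sum(abs(a - b) for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def get_block(frame, x, y, w, h):
    """Extract the w x h block whose top-left corner is (x, y)."""
    return [row[x:x + w] for row in frame[y:y + h]]


def full_search(current, reference, x, y, w, h, p):
    """Return (dx, dy, cost) minimizing SAD over all displacements in [-p, p]^2."""
    target = get_block(current, x, y, w, h)
    frame_h, frame_w = len(reference), len(reference[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            rx, ry = x + dx, y + dy
            # Assumption: candidates partially outside the frame are skipped.
            if rx < 0 or ry < 0 or rx + w > frame_w or ry + h > frame_h:
                continue
            cost = sad(target, get_block(reference, rx, ry, w, h))
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


# Synthetic frames: every reference pixel has a unique value, and the
# current frame is the reference content shifted by 2 columns and 1 row.
SIZE = 16
reference = [[y * SIZE + x for x in range(SIZE)] for y in range(SIZE)]
current = [[reference[min(y + 1, SIZE - 1)][min(x + 2, SIZE - 1)]
            for x in range(SIZE)] for y in range(SIZE)]
match = full_search(current, reference, 4, 4, 4, 4, 3)
print(match)  # (2, 1, 0)
```

In this sketch the returned vector points from the block position in the current frame to the matching block in the reference frame. With p = 3, up to (2p + 1)² = 49 candidate blocks are evaluated for a single block, and this count grows quadratically with p.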

Despite guaranteeing the global minimum in the region, the Full Search algorithm is extremely inefficient, which motivated the proposal of many alternative algorithms (PO; MA, 1996; JAMKAR et al., 2002; JAIN; JAIN, 1981; LI; ZENG; LIOU, 1994; ZHU; MA, 2000; KOGA et al., 1981). These algorithms are suboptimal and usually settle in an acceptable local minimum, but they run much faster. Two of them are detailed here: Three Step Search and Two Dimensional Logarithmic Search.

4.4.2 Three Step Search (TSS) algorithm

Proposed in 1981 (KOGA et al., 1981), this algorithm is efficient, quite simple and obtains a near optimal block in most scenarios. The algorithm was intended to be used in video conference applications. Figure 4.1 shows a visual example of the algorithm’s process, which can be summarized in the following steps:

1. Choose an initial step size, which is usually 4 pixels (so that the algorithm ends up performing only three steps) and let the current search point be the same as the block’s location.

2. Take the eight pixels located at a horizontal and/or vertical distance of one step size from the current search point (diagonals included) and consider the block centered on each of these pixels as a candidate block.

3. If the minimum dissimilarity is in one of the blocks of the eight surrounding pixels, move the search point to it. The step size is then halved and step 2 is repeated until step size is smaller than 1.
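The three steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: the search is written against a caller-supplied function `cost(dx, dy)` that returns the dissimilarity for a displacement, the search point is kept when no neighbor is strictly better, and the helper `block_sad_cost` (which builds such a function from two frames using SAD and treats out-of-frame candidates as infinitely costly) is hypothetical.

```python
import math


def three_step_search(cost, step=4):
    """Return ((dx, dy), cost) reached by the Three Step Search.

    `cost(dx, dy)` is the dissimilarity between the current block and the
    reference block displaced by (dx, dy).
    """
    best, best_cost = (0, 0), cost(0, 0)
    while step >= 1:
        cx, cy = best  # search point for this step
        for oy in (-step, 0, step):
            for ox in (-step, 0, step):
                if ox == 0 and oy == 0:
                    continue  # the search point itself was already evaluated
                c = cost(cx + ox, cy + oy)
                if c < best_cost:
                    best, best_cost = (cx + ox, cy + oy), c
        step //= 2  # 4 -> 2 -> 1 -> 0 (stop)
    return best, best_cost


def block_sad_cost(current, reference, x, y, w, h):
    """Build a SAD cost function for the w x h block at (x, y)."""
    target = [row[x:x + w] for row in current[y:y + h]]
    frame_h, frame_w = len(reference), len(reference[0])

    def cost(dx, dy):
        rx, ry = x + dx, y + dy
        if rx < 0 or ry < 0 or rx + w > frame_w or ry + h > frame_h:
            return math.inf  # assumption: out-of-frame candidates never win
        return sum(abs(target[j][i] - reference[ry + j][rx + i])
                   for j in range(h) for i in range(w))

    return cost


# Toy cost surface with a single minimum near displacement (5, -3).
def toy_cost(dx, dy):
    return abs(dx - 5.3) + abs(dy + 2.6)


vector, _ = three_step_search(toy_cost)
print(vector)  # (5, -3)
```

On this toy surface the search visits (4, -4), then (6, -2), then (5, -3), evaluating 25 candidates in total. On real frames the cost surface is generally not this well behaved, so the result may be a local minimum.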

The main weakness of this method usually appears with small motion parameters: it only starts to move in the correct motion direction once the step size has become small, so the algorithm will most likely stop (step size dropping below 1) before it has the chance to locate the precise direction of the optimal block.

Figure 4.1: Example of the Three Step Search algorithm’s convergence: the numbers indicate the centers of the candidate blocks at each execution step.

4.4.3 Two Dimensional Logarithmic Search (TDL) algorithm

This algorithm (JAIN; JAIN, 1981) is conceptually similar to the TSS algorithm, but it is usually more accurate, at a slightly higher average cost. An example of the algorithm’s convergence is shown in Figure 4.2 and it can be described by the following steps:

1. Choose an initial step size and let the current search point be the original block’s location.

2. Consider the following five pixels: the current search point and the ones with vertical or horizontal distance equal to the step size (diagonals excluded). The five candidate blocks are the ones with center in these five chosen pixels.

3. If the best matching block is the one in the center, the step size is halved. Otherwise, move the search point to the best match position and repeat step 2. When the step size becomes 1, execute step 2 including the diagonals, as in TSS, and return the best matching block among the nine candidates.
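The steps above can be sketched as follows. As with any such sketch, several details are assumptions rather than part of the description: the search is expressed against a caller-supplied function `cost(dx, dy)` giving the dissimilarity for a displacement, the center is kept when no neighbor is strictly better, the initial step is assumed to be a power of two, and the function and variable names are hypothetical.

```python
def two_d_log_search(cost, step=4):
    """Return ((dx, dy), cost) reached by the Two Dimensional Logarithmic Search.

    `cost(dx, dy)` is the dissimilarity between the current block and the
    reference block displaced by (dx, dy).
    """
    center, center_cost = (0, 0), cost(0, 0)
    while step > 1:
        cx, cy = center
        best, best_cost = center, center_cost
        # Five candidates: the center plus its horizontal/vertical neighbors.
        for ox, oy in ((-step, 0), (step, 0), (0, -step), (0, step)):
            c = cost(cx + ox, cy + oy)
            if c < best_cost:
                best, best_cost = (cx + ox, cy + oy), c
        if best == center:
            step //= 2  # center is best: refine with a smaller step
        else:
            center, center_cost = best, best_cost  # move, keep the step size
    # Final stage (step size 1): nine candidates, diagonals included.
    cx, cy = center
    best, best_cost = center, center_cost
    for oy in (-1, 0, 1):
        for ox in (-1, 0, 1):
            if ox == 0 and oy == 0:
                continue
            c = cost(cx + ox, cy + oy)
            if c < best_cost:
                best, best_cost = (cx + ox, cy + oy), c
    return best, best_cost


# Toy cost surface with a single minimum near displacement (5, -3).
def toy_cost(dx, dy):
    return abs(dx - 5.3) + abs(dy + 2.6)


vector, _ = two_d_log_search(toy_cost)
print(vector)  # (5, -3)
```

Unlike TSS, the step size here is only reduced when the center is already the best candidate, so the number of iterations depends on the content rather than being fixed in advance.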
