Analysis and coding of visual objects: new concepts and new tools

UNIVERSIDADE TÉCNICA DE LISBOA

INSTITUTO SUPERIOR TÉCNICO

Analysis and Coding of Visual Objects:

New Concepts and New Tools

Manuel Pinto da Silva Menezes de Sequeira

(MSc)

Dissertation for the degree of Doctor in Electrical and Computer Engineering

Supervisor:

Dr. Carlos Eduardo do Rego da Costa Salema,

Full Professor, Instituto Superior Técnico, Universidade Técnica de Lisboa

Co-supervisor:

Dr. Augusto Afonso de Albuquerque,

Full Professor, Instituto Superior das Ciências do Trabalho e da Empresa

Jury:

Chairman:

Rector of the Universidade Técnica de Lisboa

Members:

Dr. Augusto Afonso de Albuquerque,

Full Professor, Instituto Superior das Ciências do Trabalho e da Empresa

Dr. José Manuel Nunes Leitão,

Full Professor, Instituto Superior Técnico, Universidade Técnica de Lisboa

Dr. Luis António Pereira Menezes Corte-Real,

Assistant Professor, Faculdade de Engenharia da Universidade do Porto

Dr. Mário Alexandre Teles de Figueiredo,

Assistant Professor, Instituto Superior Técnico, Universidade Técnica de Lisboa

Dr. Jorge dos Santos Salvador Marques,

Assistant Professor, Instituto Superior Técnico, Universidade Técnica de Lisboa

Dr. Maria Paula dos Santos Queluz Rodrigues,

Assistant Professor, Instituto Superior Técnico, Universidade Técnica de Lisboa


Acknowledgments

It is sometimes said that the order in which names are mentioned in the acknowledgments has no special meaning. Not in this case. Without Prof. Augusto Albuquerque's friendship, without his encouragements, and without his supervision, this thesis would not have been possible. Thanks.

To João Luis Sobrinho and Carlos Pires, for their friendship and support.

To all my colleagues in ISCTE, for their good will. Special thanks to Luis Nunes and Filipe Santos, they know why.

To the staff of the Image Group of the Universitat Politècnica de Catalunya, for accepting me as their own during a month of valuable discussions. It was a fundamental month.

To Prof. Carlos Salema, for his willingness to supervise this thesis all this time. For his patience.

To the Image Group of IST, for their friendship and for the pleasant lunch discussions, always stimulating, seldom technical...

To the Instituto de Telecomunicações, for the office and equipment I used for so long.

To Prof. Fernando Pereira who, by allowing me to work for the CEC RACE MAVT project, greatly facilitated my contacts with the international video coding community.

To my friends, who always trusted me more than I did myself.

This thesis was partially funded by the CIENCIA and PRAXIS programs of JNICT. A substantial part of this work was done while I was working for the CEC MAVT RACE project,


Agradecimentos (Acknowledgments, translated from the Portuguese)

It is often said in acknowledgments that the order in which names appear has no meaning. That is not the case here. Without the friendship of Prof. Augusto Albuquerque, without his encouragement, and without his supervision, this thesis would truly not have been possible. My heartfelt thanks.

To João Luis Sobrinho and Carlos Pires, for their support and friendship.

To my colleagues at ISCTE, I am grateful for their enormous good will. Special thanks to Luis Nunes and Filipe Santos, they know why.

To all the researchers of the Image Group of the Universitat Politècnica de Catalunya, who welcomed me into their midst during a month of fruitful discussions. It was a fundamental month.

To Prof. Carlos Salema, for being willing to supervise this thesis throughout all this time. For his patience.

To the image group, for the friendship and camaraderie, and for the lunchtime discussions, always stimulating, seldom technical...

To the Instituto de Telecomunicações, for the infrastructure and computing equipment it allowed me to use.

To Prof. Fernando Pereira, for having allowed me, through participation in the MAVT project, to maintain international scientific contacts that would otherwise have been much more difficult.

To my friends, who always believed in me much more than I did myself.

This thesis was partially funded by JNICT, through the CIENCIA and PRAXIS programs. A substantial part of this work was carried out within the RACE MAVT project, and also in my


Abstract

Video coding has been under intense scrutiny during the last years. The published international standards rely on low-level vision concepts, thus being first-generation. Recently, standardization started in second-generation video coding, supported on mid-level vision concepts such as objects.

This thesis presents new architectures for second-generation video codecs and some of the required analysis and coding tools.

The graph theoretic foundations of image analysis are presented and algorithms for generalized shortest spanning tree problems are proposed. In this light, it is shown that basic versions of several region-oriented segmentation algorithms address the same problem. Globalization of information is studied and shown to confer different properties to these algorithms, and to transform region merging into recursive shortest spanning tree segmentation (RSST). RSST algorithms attempting to minimize global approximation error and using affine region models are shown to be very effective. A knowledge-based segmentation algorithm for mobile videotelephony is proposed.

A new camera movement estimation algorithm is developed which is effective for image stabilization and scene cut detection. A camera movement compensation technique for first-generation codecs is also proposed.

A systematization of partition types and representations is performed, with which partition coding tools are overviewed. A fast approximate closed cubic spline algorithm is developed with applications in partition coding.

Keywords: visual coding, second-generation video coding, image analysis, image segmentation,


Resumo (Abstract, translated from the Portuguese)

Video coding has been intensely studied in recent years. The international standards already published are based on low-level vision concepts and are therefore first-generation. Standardization of second-generation coding techniques, supported on mid-level vision concepts such as objects, has recently begun.

This thesis presents new architectures for second-generation video codecs and some of the corresponding analysis and coding tools.

The foundations of graph theory applied to image analysis are presented, and algorithms are proposed for generalizations of the shortest spanning tree problem. It is shown that basic versions of several region-oriented segmentation algorithms solve the same problem. Globalization of information is studied and shown to confer different properties on these algorithms, transforming the region merging algorithm into the recursive shortest spanning tree (RSST) algorithm. RSST algorithms that attempt to minimize the global approximation error and that use affine region models are shown to be effective. A segmentation algorithm based on prior knowledge is proposed for mobile videotelephony.

A camera movement estimation algorithm is developed that is effective for image stabilization and for scene change detection. A camera movement compensation technique for first-generation codecs is also proposed.

Region types and representations are systematized, and partition coding techniques are then reviewed. A fast approximate algorithm for computing closed cubic splines is developed.

Keywords: visual coding, second-generation video coding, analysis of


Acknowledgments
Agradecimentos
Abstract
Resumo

1 Introduction
  1.1 Structure of the thesis
2 Video and multimedia communications
  2.1 Trends of multimedia communications
    2.1.1 Distribution methods
    2.1.2 Activity paradigms
    2.1.3 Convergence tendencies
    2.1.4 A distributed database
  2.2 Media representation
  2.3 Visual analysis
    2.3.1 Levels of visual analysis
    2.3.2 Tools for visual analysis
  2.4 Visual coding
    2.4.1 Objectives
    2.4.3 Generations
  2.5 Standards
    2.5.1 Standardization challenges
    2.5.2 Evolution of visual coding standards
    2.5.3 Consequences of standardization
    2.5.4 Standards and generations
  2.6 Analysis and coding tools

3 Graph theoretic foundations for image analysis
  3.1 Color perception
    3.1.1 Color spaces
  3.2 Images and sequences
    3.2.1 Analog images
    3.2.2 Digital images
    3.2.3 Lattices, sampling lattices and aspect ratio
  3.3 Grids, graphs, and trees
    3.3.1 Graph operations
    3.3.2 Walks, trails, paths, circuits, and connectivity in graphs
    3.3.3 Euler trails and graphs
    3.3.4 Subgraphs, complements, and connected components
    3.3.5 Rank and nullity
    3.3.6 Cut vertices, separability, and blocks
    3.3.7 Cuts and cutsets
    3.3.8 Isomorphism, 2-isomorphism, and homeomorphism
    3.3.9 Trees and forests
    3.3.10 Graphs and lattices
  3.4 Planar graphs and duality
    3.4.3 Four-color theorem
  3.5 Maps
    3.5.1 Operations on the dual RAMG and RBPG graphs
    3.5.2 Definition of map
    3.5.3 Algorithms
  3.6 Partitions and contours
    3.6.1 Partitions and segmentation
    3.6.2 Classes and regions
    3.6.3 Region and class graphs
    3.6.4 Equivalence and equality of partitions
  3.7 Conclusions

4 Spatial analysis
  4.1 Introduction to segmentation
  4.2 Hierarchizing the segmentation process
    4.2.1 Operator level
    4.2.2 Technique level
    4.2.3 Algorithm level
    4.2.4 Conclusions
    4.2.5 Pre-processing
  4.3 Region- and contour-oriented segmentation algorithms
    4.3.1 Contour-oriented segmentation
    4.3.2 Region-oriented segmentation
    4.3.3 SSTs as a framework of segmentation algorithms
    4.3.4 Globalization strategies
    4.3.5 Algorithms and the dual graphs
    4.3.6 Conclusions
    4.4.2 Algorithm description
    4.4.3 Computational effort
    4.4.4 Results
    4.4.5 Conclusions
  4.5 RSST segmentation algorithms
    4.5.1 Pre-processing
    4.5.2 Segmentation algorithm
    4.5.3 Results and discussion
    4.5.4 Conclusions
  4.6 Supervised segmentation
    4.6.1 RSST extension using seeds
    4.6.2 Results
    4.6.3 Conclusions
  4.7 Time-coherent analysis
    4.7.1 Extension of RSST to moving images
    4.7.2 Results
    4.7.3 Conclusions

5 Time analysis
  5.1 Camera movements
    5.1.1 Transformations on the digital image
  5.2 Block matching estimation
    5.2.1 Error metrics
    5.2.2 Algorithms
  5.3 Estimating camera movement
    5.3.1 Least squares estimation
    5.3.2 Outlier detection
    5.4.1 Results
  5.5 Image stabilization
    5.5.1 Viewing window
    5.5.2 Viewing window displacement
    5.5.3 Results

6 Coding
  6.1 Camera movement compensation
    6.1.1 Quantizing camera motion factors
    6.1.2 Extensions to the H.261 recommendation
    6.1.3 Encoding control
    6.1.4 Results and conclusions
  6.2 Taxonomy of partition types and representations
    6.2.1 Partition type
    6.2.2 Partition representation
    6.2.3 Representation properties
  6.3 Overview of partition coding techniques
    6.3.1 Lossless or lossy coding
    6.3.2 Mosaic vs. binary partitions
    6.3.3 Partition models
    6.3.4 Class coding
    6.3.5 Label coding
    6.3.6 Contour coding
  6.4 A quick cubic spline implementation
    6.4.1 2D closed spline definition
    6.4.2 Determination of the spline coefficients
    6.4.5 Results

7 Conclusions: Proposal for a new codec architecture
  7.1 Proposal for a second-generation codec architecture
    7.1.1 Source model
    7.1.2 Codec architecture
    7.1.3 Conclusions
  7.2 Suggestions for further work
    7.2.1 Codec architecture
    7.2.2 Graph theoretic foundations for image analysis
    7.2.3 Spatial analysis
    7.2.4 Time analysis
    7.2.5 Coding
  7.3 List of contributions
    7.3.1 Graph theoretic foundations for image analysis
    7.3.2 Spatial analysis
    7.3.3 Time analysis
    7.3.4 Coding

A Test sequences
  A.1 Video formats
    A.1.1 Aspect ratios
  A.2 Test sequences
B The Frames video coding library
  B.1 Library modules
    B.1.1 types.h
    B.1.4 mem.h
    B.1.5 matrix.h
    B.1.6 sequence.h
    B.1.7 contour.h
    B.1.8 dct.h
    B.1.9 filters.h
    B.1.10 graph.h
    B.1.11 heap.h
    B.1.12 motion.h
    B.1.13 splitmerge.h
  B.2 An example of use


2D... Two-dimensional
3D... Three-dimensional
ADSL... Asymmetric Digital Subscriber Line
ANSI... American National Standards Institute (a standards body)
CAG... Class Adjacency Graph
CATV... Cable Television (formerly Community Antenna Television)
CCIR... Comité Consultatif International des Radiocommunications (now ITU-R)
CCITT... Comité Consultatif International Téléphonique et Télégraphique (now ITU-T)
CD... Compact Disk
CD... Committee Draft (of an ISO standard)
CEC... European Community Commission
CERN... Conseil Européen pour la Recherche Nucléaire
CIE... Commission Internationale de l'Éclairage
CIF... Common Intermediate Format
CPU... Central Processing Unit
CRT... Cathode-Ray Tube
DCT... Discrete Cosine Transform
DFS... Depth First Search
DFS... Discrete Fourier Series

DSig... Digital Signatures (of W3C)
    http://www.w3.org/DSig/Overview.html
FCT... Four-Color Theorem
FIR... Finite Impulse Response
FLC... Fixed Length Code
FSF... Free Software Foundation
FTP... File Transfer Protocol
GOB... Group of Blocks (a syntax element in ITU-T H.261)
GPL... GNU General Public License (of FSF)
HDTV... High Definition Television
HSV... Hue, Saturation, and Value (a color space)
HTML... Hypertext Markup Language (of W3C)
HTTP... Hypertext Transfer Protocol (of IETF and of W3C)
HVS... Human Visual System
IEC... International Electrotechnical Commission (a standards body)
    http://www.iec.ch/
IEEE... The Institute of Electrical and Electronics Engineers, Inc.
    http://www.ieee.org/
IETF... Internet Engineering Task Force (a standards body)
    http://www.ietf.org/
IIR... Infinite Impulse Response
IP... Internet Protocol
IPR... Intellectual Property Rights
IRC... Internet Relay Chat
ISDN... Integrated Services Digital Network (of ITU-T)
ISO... International Organization for Standardization (a standards body)
    http://www.iso.ch/
ITU... International Telecommunication Union (a standards body)
    http://www.itu.ch/

ITU-T... ITU Telecommunication Standardization Sector (a standards body, see ITU)
    http://www.itu.ch/ITU-T/
JPEG... Joint Photographic Experts Group (of ISO and IEC)
LMedS... Least Median of Squares
LoG... Laplacian of Gaussian
LSF... Longest Spanning Forest
LST... Longest Spanning Tree
MAVT... Mobile Audio-Visual Terminal
MB... MacroBlock (a block of 16 × 16 pixels, in ITU-T and ISO/IEC video coding standards)
MBA... MacroBlock Address (a syntax element in ITU-T H.261)
MC... Motion Compensated
MF... Model Failure
MMREAD... Modified MREAD (see MREAD)
MPEG... Moving Picture Experts Group (of ISO and IEC)
MREAD... Modified Relative Element Address Designate
MTYPE... Macroblock Type (a syntax element in ITU-T H.261)
MVD... Motion Vector Data (a syntax element in ITU-T H.261)
NTSC... National Television Systems Committee (a standards body)
OALDCE... Oxford Advanced Learner's Dictionary of Current English
OCR... Optical Character Recognition
PAL... Phase Alternating Line
PC... Personal Computer
PEI... Picture Extra Insertion Information (a syntax element in ITU-T H.261)
PICS... Platform for Internet Content Selection (of W3C)
    http://www.w3.org/PICS/
PNG... Portable Network Graphics (of W3C)

PSNR... Peak Signal to Noise Ratio
PSPARE... Picture Spare Information (a syntax element in ITU-T H.261)
PSTN... Public Switched Telephone Network
PTYPE... Picture Type (a syntax element in ITU-T H.261)
QCIF... Quarter-CIF (see CIF)
QPT... Quartic Picture Tree
RAG... Region Adjacency Graph
RAMG... Region Adjacency MultiGraph
RBPG... Region Border PseudoGraph
RGB... Red, Green, and Blue (a color space)
RLE... Run-Length Encoding
RM8... Reference Model 8 (a reference model for ITU-T H.261)
RSST... Recursive SST (see SST)
SIF... Standard Interchange Format
SMIL... Synchronized Multimedia Integration Language (of W3C)
    http://www.w3.org/TR/WD-smil
SSF... Shortest Spanning Forest
SSkT... Shortest Spanning k-Tree
SSSkT... Shortest Seeded Spanning k-Tree
SSSSkT... Smallest Shortest Seeded Spanning k-Tree
SST... Shortest Spanning Tree
TCP... Transmission Control Protocol
TR-RSST... Time-Recursive RSST (see RSST)
TV... Television
UMTS... Universal Mobile Telecommunication Service
UPC... Universitat Politècnica de Catalunya
URC... Uniform Resource Characteristics (of IETF)

URN... Uniform Resource Name (of IETF)
VLC... Variable Length Code
VO... Video Object (a syntax element in ISO/IEC MPEG-4)
VOD... Video-On-Demand
VOL... Video Object Layer (a syntax element in ISO/IEC MPEG-4)
VOP... Video Object Plane (a syntax element in ISO/IEC MPEG-4)
VQ... Vector Quantization
VRML... Virtual Reality Modeling Language (a standard of ISO and IEC)
W3C... World Wide Web Consortium (a standards body)

1 Introduction

"The time has come," the Walrus said,
"To talk of many things:"

Lewis Carroll

The performance of classical video coding algorithms, in terms of the classical coding criteria (bitrate, distortion, and cost), seems to be reaching a plateau [161, 3]. That is, the marginal performance gains of tuning these algorithms are now nearly negligible. According to Adelson et al. [3, 195], the classical approaches use concepts usually related to low-level vision, such as luminance, color, spatial frequency, temporal frequency, local motion, and low-level operators such as linear filtering and transforms. New approaches, using mid-level visual concepts, such as regions, textures, surfaces, depth, global motion, and lighting, are deemed necessary for a breakthrough in video coding performance. This need has been recognized for some time now [144, 141, 96], though limited computing capabilities have somewhat hindered advances towards the implementation of complete mid-level vision (second-generation) video coding algorithms.

During the last years, and following the ever increasing advances of technology, the use of image and video in everyday life has been growing continuously. This has sparked new needs among users: interactivity, content editing, and content-based indexing are just a few examples. These needs require access to the content of video sequences. This access may, in some cases, be done after encoding and decoding, i.e., by performing analysis at the receiver side. In most cases, though, it is essential to have this capability directly at bit stream level. Content access should thus be done with a minimum of effort: a "fourth [coding] criterion" has been identified, coined by Picard [162] as "content access effort." This criterion is related to the complexity or effort required to access the video content, and hence to provide content-based facilities.

The results obtained until now by mid-level vision video coding algorithms, though extremely

However, this apparent lack of success is truly a misjudgment, since the performance has been measured, until now, using only the bitrate, distortion, and cost criteria. When the fourth criterion is introduced, the newly developed algorithms certainly have a leading edge over the classical ones: objects and regions, rather than square blocks, are what a user wants to interact with.

The new users' needs have also been recognized by MPEG-4. These ideas were introduced in MPEG-4 [138] by asking for some "new or improved functionalities" [139]: content-based manipulation and bit stream editing, content-based multimedia data access tools, and content-based scalability.

This thesis summarizes a series of proposals towards coding of visual objects. The work has progressed over a number of years and can be seen as a contribution to the development of second-generation visual coding standards, of which MPEG-4 is an example.

1.1 Structure of the thesis

Chapter 2, "Video and multimedia communications", contains a brief overview of multimedia, the Internet, and video communications. It can be seen as a motivation for the work developed. Video codecs are classified as first-, second-, or third-generation according to the analysis tools required: first-generation for low-level vision analysis, second-generation for mid-level vision analysis, and third-generation for high-level vision analysis. A brief summary of the analysis and coding tools proposed in this thesis, organized according to the presented structure, can be found in Section 2.6.
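The classification of codecs by generation can be illustrated with a short sketch. It is not taken from the thesis: the names `VisionLevel`, `GENERATION` and `codec_generation` are hypothetical, and the rule that the most demanding analysis tool decides the generation is an assumption made only for this illustration.

```python
from enum import IntEnum


class VisionLevel(IntEnum):
    """Level of visual analysis a tool relies on (illustrative names)."""
    LOW = 1   # e.g., luminance, color, local motion
    MID = 2   # e.g., regions, textures, global motion
    HIGH = 3  # high-level vision analysis


# Mapping from the required level of analysis to the codec generation.
GENERATION = {
    VisionLevel.LOW: "first-generation",
    VisionLevel.MID: "second-generation",
    VisionLevel.HIGH: "third-generation",
}


def codec_generation(tool_levels):
    """Classify a codec from the vision levels of its analysis tools.

    Assumption: the most demanding tool decides the generation.
    """
    levels = list(tool_levels)
    if not levels:
        raise ValueError("a codec needs at least one analysis tool")
    return GENERATION[max(levels)]


# A block-based codec using only low-level tools, and a codec that
# also needs region segmentation (a mid-level tool).
print(codec_generation([VisionLevel.LOW]))
print(codec_generation([VisionLevel.LOW, VisionLevel.MID]))
```

Under this assumption, a codec that adds region segmentation to block-based motion estimation would be labelled second-generation.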

Chapter 3, "Graph theoretic foundations for image analysis", defines most of the theoretical concepts that are used throughout. In this chapter the important theory of spanning trees, a branch of graph theory, and related concepts using seeds, are discussed together with the corresponding algorithms. An amortized linear time algorithm is also presented for an important class of spanning tree problems.

Chapter 4, "Spatial analysis", contains proposals for a knowledge-based mobile videotelephony segmentation algorithm, an extended RSST (Recursive SST) segmentation algorithm using an affine region model, a supervised RSST segmentation algorithm, i.e., an RSST algorithm using seeds, and a time-recursive version of the RSST algorithm providing time-coherent segmentation of moving images. The classical segmentation algorithms, such as region growing, region merging, and edge detection followed by contour closing, are all described in the framework of the theory of spanning trees introduced in the previous chapter. The relations between these algorithms are discussed in the common framework of spanning trees. The effects on these algorithms of globalization of information are also discussed.

Chapter 5, "Time analysis", proposes a simple algorithm for estimating camera movement in moving images and a method for its cancellation (image stabilization) to improve image quality

Chapter 6, "Coding", proposes a method of encoding camera movement information using a simple extension to the H.261 standard (the discussions on quantization are general and transposable to any other codec using motion vector fields with reduced resolution relative to that of the underlying images) and reviews the important issue of partition representation and coding. A fast approximation to the calculation of closed cubic splines is also proposed. The analysis and coding tools presented in this and the previous two chapters can be seen as steps towards the building of tools for a new codec architecture.

Chapter 7, "Conclusions: Proposal for a new codec architecture", proposes a new second-generation codec architecture, makes some suggestions for future work, and lists the thesis contributions.

Finally, Appendix A describes the test sequences used and their formats, and Appendix B

2 Video and multimedia communications

It is supposed that because a thing is the rule it is right.

Oscar Wilde

2.1 Trends of multimedia communications

"Medium" literally means "middle". According to the OALDCE (Oxford Advanced Learner's Dictionary of Current English) [71], it means "that by which something is expressed," i.e., that by which a message is expressed, since, according to Negroponte [142], "the medium is not the message." Messages can be expressed using a variety of media. Multimedia is the process of expressing a message using several media. In this sense, multimedia is not new. Multimedia has existed since there have been books with images,¹ actually even before that, since humans communicate by speech and gestures.

Until last century our ability to store and transmit messages was very limited. Only text and still images and diagrams could be stored for future use (e.g., in books), and long range transmission was limited to physical transport of printed or handwritten material, with rare exceptions. The telegraph, for long range transmission of text, the telephone, for long range transmission of voice, and the radio, for long range transmission of sound, changed that picture considerably. But perhaps the most important inventions of the last century were the phonograph, for storing sounds, and the cinematograph, by which storing of moving images became possible.

¹ Different media can share the same sense (or "channel") into the human brain. Text and imagery, though different media, are both sensed using vision.

In the beginning of this century it was possible, at least in principle, to express messages using multimedia as we know it today and store them for future use. In practice, this happened only in the thirties, with the introduction of sound synchronized with image in the cinema. Stereoscopic imagery was also available at that time.

2.1.1 Distribution methods

A message, as expressed through moving images and sound in a film, is meant to be conveyed to a receptor. Although movie theaters are still a very successful and profitable way of doing it, they involve considerable delay and trouble. Using Negroponte's [142] "bits" and "atoms" definitions, the producer distributes the film cartridges (atoms) containing encoded images and sounds (bits), which are then broadcast from a screen and speakers to a restricted audience.² A new distribution paradigm was clearly necessary.

TV (Television) partially solved the distribution problem, by using radio broadcast of analogically encoded moving images and sound. However, TV also introduced some new problems: being broadcast, anybody with a TV set could enjoy it. Who should then pay for the content conveyed, and how? From TV taxes (virtually unchargeable), to income taxes (in the case of subsidized television), through advertisements and mixtures thereof, several solutions have been proposed, most of which are still being used to this day. These solutions were not enough. Point-to-point communication, such as that provided by the telephone, was necessary.

Computer networks, providing point-to-point communications in a different framework, were also an important development. In the 1970s the TCP (Transmission Control Protocol)/IP (Internet Protocol) protocols were developed and put to use mostly by the government and educational institutions in the USA. By the eighties they had spread all over the world, though mostly restricted to the academic world. In the beginning of the nineties, following the development by the CERN (Conseil Européen pour la Recherche Nucléaire) of the suite of WWW (World Wide Web) protocols and formats, viz. UR*, HTTP (Hypertext Transfer Protocol), and HTML (Hypertext Markup Language), the Web exploded: it became attractive to the common user, and hence economically viable.

In the late forties, TV started to be distributed by cable in areas where the broadcast signal could not be received with normal antennas (community antenna television). Cable television was soon found to offer considerable advantages relative to broadcast television: increased quality, increased number of channels through a larger available bandwidth, no need for antennas and thus lower visual impact (important in certain urban areas), etc. Recently, CATV (Cable Television) operators, typically diffusion oriented, realized they had deployed over the years an almost ubiquitous broadband network which could be improved with small investments to provide up-links to the user. Thus, with the help of cable modems, providers started building a sort of "residential area networks", connecting users in the neighborhood to the cable head-end and thence to the world.

² Images and sounds in film are mostly encoded in an analog format, even though digital sound is expanding quickly. These images and sounds contain information, which can be measured in bits, even if digital encoding is not used.

The explosion of the Web in the nineties, together with the personal computer and the almost ubiquitous wide band CATV networks, suddenly allowed different content to be delivered to different consumers. Consumers could now choose and even interact with the material delivered (and pay accordingly): the age of the Web, teleshopping, PPV (Pay-Per-View), VOD (Video-On-Demand) and WebTV was born.

2.1.2 Activity paradigms

There are essentially two activity paradigms for information provided to the consumers: the push paradigm, in which the information provider pushes the information to a passive user, and the pull paradigm, in which the active user requests information from the service provider.

TV broadcast is push, since the information is pushed to the consumer without requiring any action on his part (besides turning the TV on and choosing a channel). However, VOD is pull, since the user requests whatever interests her.

The Web, until recently, exhibited only the pull behavior. All the action was on the part of the end user, who would always make specific requests as to what information should be delivered to him. Nowadays, the push paradigm has been implemented by most browsers, through the concept of automatically updated channels, in a clear parallel with TV diffusion.
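The difference between the two paradigms can be made concrete with a minimal sketch. The classes `Provider` and `Consumer` and their methods are hypothetical names chosen for this illustration; they do not model any real broadcast or Web protocol.

```python
class Consumer:
    """A user that stores whatever content reaches it."""

    def __init__(self):
        self.received = []

    def deliver(self, item):
        self.received.append(item)


class Provider:
    """An information provider supporting both activity paradigms."""

    def __init__(self, catalog):
        self.catalog = dict(catalog)  # title -> content
        self.subscribers = []

    def subscribe(self, consumer):
        self.subscribers.append(consumer)

    def push(self, item):
        # Push: the provider decides what is sent, and every
        # subscribed (passive) consumer receives it, as in TV broadcast.
        for consumer in self.subscribers:
            consumer.deliver(item)

    def pull(self, consumer, title):
        # Pull: the (active) consumer asks for a specific item,
        # as in VOD or a Web page request.
        consumer.deliver(self.catalog[title])


provider = Provider({"film": "film data"})
viewer_a, viewer_b = Consumer(), Consumer()
provider.subscribe(viewer_a)
provider.subscribe(viewer_b)

provider.push("evening news")     # both viewers receive it
provider.pull(viewer_a, "film")   # only the requesting viewer receives it
```

In the push call the initiative lies with the provider and all subscribers get the same content. In the pull call the initiative lies with one consumer, who alone gets the requested item.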

2.1.3 Convergence tendencies

Convergence of distribution methods and technologies

A wealth of communication services exist today. Most of the channels involved in these services are slowly being enhanced to provide bidirectional communications and improved bandwidth. For instance, CATV networks now provide bidirectional data channels through cable modems, satellite constellations are being deployed for personal mobile communications, and the UMTS (Universal Mobile Telecommunication Service), providing a wider bandwidth than today's cellular phones, is expected in the near future. Also, the analog channels offered by the old PSTN (Public Switched Telephone Network) are slowly being digitized to provide ISDN (Integrated Services Digital Network). Recently, ADSL (Asymmetric Digital Subscriber Line) started to be used to establish wideband data channels on the telephonic copper loop.

At the same time, all fields of multimedia and communications are being enhanced through the use of digital technology. Digital recorded sound is already used in the movie theaters (probably to be followed soon by digital moving images) and digital TV will soon be available,

Convergence of services and activity paradigms

The services available are also converging. There is a tendency to support both broadcast and point-to-point distribution, different media, and both push and pull paradigms. Cable modems allow point-to-point communications where formerly broadcast was the rule, and modems over POTS (Plain Old Telephone Service) allow broadcast (or at least the Web equivalent of broadcast, multicast) where formerly only point-to-point communications were used. Videotelephony over POTS is now possible (and will soon also be possible on cellular phones), and the TV service was long ago upgraded to include teletext. Videotelephony, on the other hand, is also possible on the Web, and supplements the old peer-to-peer communications services of the Internet such as email and (electronic) talk, and, more recently, IRC (Internet Relay Chat).

Convergence of contents

On the demand side, consumers require high quality content. The production of multimedia content, in which the entertainment industries (TV, cinema and games) excel, is thus thriving. Consumers are also demanding more and more interactive control over the information they receive, an issue which is a specialty of the informatics (software/computer) industries. Hence the tendency for mergers and acquisitions between companies in the entertainment, informatics and network businesses.

Consumers also require mobility and compatibility. Thus large informatics companies are also investing in global, satellite based, mobile networks, and more and more care is taken nowadays with standardization and compatibility by content providers, TV companies, and informatics companies.

2.1.4 A distributed database

It seems reasonable to expect that the convergence process will lead to universal access to information. There will probably be little difference between TV, phone, fax, and the PC (Personal Computer). In fact, the PC is already doubling as a TV, a phone, and a fax. The Web will connect almost everything and everyone. It is expected to provide people with a complete leisure, work, and social environment, accessed through a wealth of different interfaces, such as screens together with remote controls, desktop screens, keyboards, pointing devices, microphones and speakers, voice controlled hand-held devices with handwritten character recognition (e.g., evolutions of the PalmPilot™), data-gloves, etc. Such a network of information can be seen a

2.2 Media representation

Take a CD (Compact Disc) of an orchestra playing Mozart's symphony 41, Jupiter. What is the essential part: the score or the sound (a unique interpretation)? Although 600 megabytes are used in a typical CD to store the sound, the corresponding score may be stored in much less space. CD audio does not encode the structure: it encodes, as faithfully as possible, a copy of the original sound. The same thing happens with fax: I still find it frustrating to explain to users of fax modems why they cannot import the received faxes (mostly text) directly into their word processors, without using the (still) error-prone OCR (Optical Character Recognition) software.

Consider, however, that faxes were made intelligent: they would analyze the input page, detect text zones, recognize the text, and encode it as text, instead of black and white raster images of characters.⁴ This would clearly lead to improved usability, if not also to a reduction of transmission time.

Visual data, especially video (taken here as sequences of images sampled from the natural world scene), is a very important part of today's multimedia, and its importance tends to increase with the convergence of entertainment and informatics industries. However, video is still encoded with the same "blindness" that affects fax and CD sound: the structured contents of video scenes are simply ignored in the encoding process, leading to a representation which is not at all structural [142], faithful as it may be to the original.

Visual analyzers would do the same for video as the hypothetical fax analyzer for a black and white image: from a sequence of video images, they would extract a structural representation of the scene therein, the scene's "score" plus "interpretation nuances". Such a structural representation, aside from the expected economies in encoded size, would allow the user to manipulate the scene at will: a big step towards complete interactivity.

The exponential growth of digital technology, where clock frequencies double almost every year and memory densities (bits per volume) almost triple in the same period of time, has led to an ever increasing use of computers by content providers (such as film producers and TV companies). Synthetic imagery corresponds nowadays to an important part of the bits exchanged worldwide. However, not much effort has been put until now into the efficient (soon to be defined) representation of synthetic data, which is inherently structural.

Hence, two important problems must be solved urgently: how to obtain structural representations from natural data (the score and the interpretation nuances from a symphony recording, the text from a printed document, the description of the scene seen in a video sequence) and how to efficiently encode structural representations, either synthetic or obtained from natural data.

The first of these problems is analysis. In the case of visual data, analysis is addressed by computer vision which, according to Haralick and Shapiro [68], is "the science that develops the theoretical and algorithmic basis by which useful information about the world can be automatically extracted and analyzed from an observed image, image set, or image sequence from computations made by special-purpose or general-purpose computers."

The second problem is related to the encoding of the structural description of the data. In the case of visual data, several encoding methods have been devised in the past, ranging from the analog television standards such as NTSC (National Television Systems Committee) and PAL (Phase Alternating Line), to the digital video coding standards ITU-T (ITU Telecommunication Standardization Sector) H.261 [62], ISO (International Organization for Standardization)/IEC (International Electrotechnical Commission) MPEG-1 [136], and, more recently, ITU-T H.263 [63] and ISO/IEC MPEG-2 [137] (also ITU-T H.262). These standards have typically dealt with non-structural representations of imagery. The first standard to address structured moving image representations will be ISO/IEC MPEG-4.

2.3 Visual analysis

Even though synthetic data amounts to a relevant part of the available multimedia material, natural data will always be present. Natural data corresponds to data which is obtained, usually through sampling, from the real world. While it is reasonable to expect that sensors, such as video cameras, will increase in complexity over the years, for instance by incorporating distance or depth sensors, it is unlikely that they will ever provide a structural representation of the sampled data at their output.

Hence, analysis, that is, the decomposition of the input data into a meaningful set of some model parameters, is a very important task. Automatic visual analysis, as stated before, is almost the same as computer vision: "building a description of the shapes and positions of things from images" [107]. With one difference, however. The purpose of computer vision is ultimately the comprehension of the scene captured by the camera, through an emulation of the HVS (Human Visual System), while analysis usually has more modest objectives.

Analysis, as stated, is the identification of some model parameters. This makes modeling one of the most important tasks in research leading to automatic analysis of video sequences, since it seems clear that sophisticated models can lead to a very accurate representation of the world, but only at the cost of a very sophisticated, or even impossible, analysis: visual analysis is often an ill-posed problem [187, 9].

Visual analysis can have several purposes [29]:

Analysis for coding

The obtaining of a parametric description of the observed scene. The description can later be used to reconstruct the scene so that little or no information is lost. The description can also be encoded and decoded efficiently (analysis for bandwidth saving), and can also be manipulated (analysis for easy access), so that the user can interact with the represented world. The analogy with fax helps here. With "blind" fax, such as exists today, to edit

Analysis for description or indexing

The obtaining of a parametric description, though in this case it is not necessary to be able to reconstruct the observed scene or at least the original sampled (or sensed) data. The parameters of the description have mostly a semantic meaning, which may help the task of searching visual data in a database. The model parameters, or features, estimated or identified, will be used as keys of the database.

Analysis for understanding

The process leading to understanding of the observed scene. While visual analysis tools in general are tools leading to artificial intelligence, or so one expects, analysis for understanding is artificial intelligence proper.

Analysis can be manual, automatic, or partially automatic, when an automatic algorithm is guided by user input (hints). A usual path in the research in this area, which, though it progresses very quickly, still has a long way to go, is to allow the algorithms to be supervised and then attempt to make them automatic. This is a polemic issue, however, as can be seen in the article "Ignorance, myopia, and naiveté in computer vision systems" [81] and in the subsequent dialogue in [7] and [94].

2.3.1 Levels of visual analysis

Some authors divide the vision process into levels [107, 195] which are related to the types of models or primitives assumed:

Low-level⁵ vision

The model is a sequence of pixel matrices. The correlation between pixels is assumed to be high. Evolution from one image to the next is described by a simple motion field, uniform almost everywhere.

Mid-level⁶ vision

The model is a possibly hierarchical set of edge segments, blobs, uniformly textured regions (or equivalently boundaries) or regions of uniform motion. Surfaces and their relative position may also be used. Motion can be associated with segments and/or edges or boundaries.

High-level⁷ vision

The model is a set of 3-D objects arranged hierarchically. Objects are semantically identified. Each object has an associated complex motion.

Understanding

The role, class or identity of (almost all of) the objects is known.

⁵ Or image.

⁶ Or primal plus 2½-D sketches.

This division is here a mere matter of convenience. It is also somewhat arbitrary, since feedback mechanisms seem to exist between the upper and the lower levels of the vision process. Visual analysis will be classified in the following according to the first three levels, since understanding is not one of the purposes here. The terms low-level, mid-level and high-level analysis will be used throughout this thesis.

2.3.2 Tools for visual analysis

Analysis can be seen as being done at three levels: low-, mid-, and high-level. Different image analysis tools have been developed over the years which can be classified as belonging to each of these levels. Restricting attention to those tools more closely related to analysis for coding, the following (rather incomplete) classification can be used:

Low-level vision analysis

Linear transformations (transforms), frequency analysis, motion estimation (optical flow, block matching), etc.

Mid-level vision analysis

Edge detection, contour detection, segmentation into syntactically uniform regions, motion estimation (motion of edges and regions), etc.

High-level vision analysis

3D (Three-dimensional) structure from shading and motion, 3D structure from disparity (stereo vision), etc.
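Of the low-level tools listed above, block matching is perhaps the easiest to illustrate. The following is a minimal sketch, not the method used in this thesis: it assumes a full search over a small window and the sum of absolute differences (SAD) as matching criterion. The function names, the toy 8×8 ramp image and the chosen displacement are all hypothetical.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between a block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total


def block_match(cur, ref, bx, by, size, search):
    """Full-search block matching within +/- `search` pixels.
    Returns (dx, dy, cost) of the best candidate."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates that fall outside the reference image.
            if not (0 <= by + dy and by + dy + size <= height):
                continue
            if not (0 <= bx + dx and bx + dx + size <= width):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2], best[0]


# Toy data (assumed): a ramp image, and the same image moved one pixel
# to the right and two pixels down (borders filled with zeros).
ref = [[10 * y + x for x in range(8)] for y in range(8)]
cur = [[ref[y - 2][x - 1] if y >= 2 and x >= 1 else 0 for x in range(8)]
       for y in range(8)]

dx, dy, cost = block_match(cur, ref, bx=3, by=3, size=3, search=2)
print(dx, dy, cost)
```

The returned displacement points from the block in the current image to its best match in the reference image, so content that moved right and down yields a negative displacement.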

2.4 Visual coding

Coding⁸ is the process of translating a sequence of symbols belonging to a given alphabet, the message,⁹ into a sequence of symbols of a different alphabet (usually the binary alphabet). Coding is said to be lossless if the original message can be recovered exactly from the encoded one.
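The definition of lossless coding can be made concrete with a small sketch. It assumes a fixed-length binary code over a hypothetical four-symbol alphabet; the function names are illustrative only, and practical source coders use more elaborate (e.g., variable-length) codes.

```python
def build_code(alphabet):
    """Assign each source symbol a fixed-length binary codeword."""
    width = max(1, (len(alphabet) - 1).bit_length())
    code = {symbol: format(index, "0{}b".format(width))
            for index, symbol in enumerate(alphabet)}
    return code, width


def encode(message, code):
    """Translate a sequence of source symbols into a string of bits."""
    return "".join(code[symbol] for symbol in message)


def decode(bits, code, width):
    """Recover the source symbols from the bit string."""
    inverse = {word: symbol for symbol, word in code.items()}
    return [inverse[bits[i:i + width]] for i in range(0, len(bits), width)]


code, width = build_code(["a", "b", "c", "d"])  # hypothetical alphabet
message = list("abca")
bits = encode(message, code)

# Lossless: decoding the encoded message gives back the original exactly.
assert decode(bits, code, width) == message
```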

Visual coding is the process by which the parameters of the structural representation of a visual scene obtained either by analysis or directly, in the case of synthetic imagery, are encoded. When the representation is obtained by analysis of natural data, the term video coding is often used.

⁸ Coding should always be understood as referring to source coding throughout this thesis, as opposed to channel coding.

⁹ Note the different meanings of the word "message". In a communications framework, it is the set of ideas expressed using a given medium or ensemble of media. In the context of information theory, it is a sequence of

2.4.1 Objectives

Encoding, the translation between one alphabet and another, can have several objectives. It can be seen as the process of minimizing a cost functional given some constraints. There are several measures which can be used to express both the cost functional and the constraints, and which, weighted differently, reflect the objectives of each particular coding scheme:

Compression ratio (or, inversely, bitrate)

The size of the original message divided by the size of the encoded message, both expressed in bits. By maximizing compression, the bandwidth or space requirements are reduced, according to whether the data is transmitted or stored.

Quality (or, inversely, distortion)

A measure of the difference between the original message and the one obtained by decoding the encoded message. Error resilience is accounted for in this measure by allowing errors to affect the encoded data.

Cost

The cost of the encoder and decoder (weighted appropriately).

Content access effort

A measure of the ease with which only specified parts of the original message can be recovered from the encoded message. By maximizing ease of access, simple terminals can still allow the user to manipulate the scene. Video trick modes can also be seen as requiring easy access to contents (in this case to single video images).

Delay

The interval between the instant a symbol of the original message is input to the encoder and the instant the corresponding symbol is output from the decoder, assuming no channel delay.

Quality is perhaps the most difficult measure to make, in the case of visual coding. How can an objective measure of quality reflect the quality of the reconstructed scene as perceived by humans? Even though studies have been conducted over the years to develop such a measure, based on the properties of the HVS, no single universally accepted measure exists. Two measures of quality are typically used today in the case of video coding: a simple objective measure, called PSNR (Peak Signal to Noise Ratio), and subjective quality measures based on evaluation by a significant set of persons.
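As an illustration of two of the measures above, the sketch below computes a compression ratio and the PSNR of a decoded image. It assumes the usual definition of PSNR, 10 log10(peak² / MSE), with 8-bit samples (peak value 255) and images stored as lists of rows; the function names and the sample values are hypothetical.

```python
import math


def compression_ratio(original_bits, encoded_bits):
    """Size of the original message divided by the size of the encoded one."""
    return original_bits / encoded_bits


def mse(original, decoded):
    """Mean squared error between two images of equal dimensions."""
    total = 0
    count = 0
    for row_o, row_d in zip(original, decoded):
        for o, d in zip(row_o, row_d):
            total += (o - d) ** 2
            count += 1
    return total / count


def psnr(original, decoded, peak=255):
    """Peak signal to noise ratio in dB (infinite for identical images)."""
    error = mse(original, decoded)
    if error == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / error)


original = [[100, 100], [100, 100]]
decoded = [[101, 99], [100, 100]]

print(compression_ratio(original_bits=32, encoded_bits=8))
print(round(psnr(original, decoded), 2))
```

PSNR is simple to compute, which explains its popularity, but it only loosely tracks quality as perceived by humans, which is precisely the difficulty discussed above.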

Cost is related mostly to implementation of encoders and decoders, though it can be related also to the required bandwidth, which is dependent on the compression ratio, and thus already considered through that measure. Implementation costs can be related to the memory and CPU (Central Processing Unit) power required for both encoders and decoders.

The cost functional and constraints can be constructed from the measures above so as to reflect the different requirements of an application. Some applications may require quality as high

applications where content is encoded once and decoded many times put a larger weight on the cost of decoders.

2.4.2 Main codec blocks

Figure 2.1 shows a typical block structure of a codec. The encoder part consists of an analysis block, which obtains a structural scene representation from given natural data, followed by the encoder, which encodes this representation so as to be sent down a logical channel (either a real channel or some physical storage medium). If synthetic data is available, it is input directly to the encoder without being analyzed, provided it is already described in an appropriately structured way. The decoder performs the opposite tasks. The encoded data is decoded so as to obtain the structural scene representation which is then used by the renderer to create appropriate stimuli to the human receivers, which can have different levels of interactivity with the system.

Often some processing is performed on the natural data before the analysis proper. This processing usually intends to filter or condition the data so as to render the analysis simpler or more effective. Since it takes place before analysis and encoding, it is called pre-processing. It is often taken as being part of the analysis itself.

The word encoder is used here with two different meanings: in the case of natural data, which requires analysis, encoder can both mean the complete system, from natural data representation to the resulting encoded message, or simply the block which translates the structural representation into the encoded message, which is the strict meaning. In the sequel the exact meaning will be evident from the context.

An encoder, in the broad sense of the word, serves two main purposes. Firstly, it is supposed to strip irrelevant information (from the point of view of the assumed receiver of the information, usually the HVS) from the input. Irrelevancy removal is done by the analysis block, since, according to Marr [107], "vision is a process that produces from images of the external world a description that is useful to the viewer and not cluttered with irrelevant information [our emphasis]," and to emulate vision is the ultimate purpose of analysis. Secondly, the encoder, again in the broad sense, is supposed to remove redundancy. This is a role which is shared by the analysis and the encoder blocks, though the kind of redundancy removed is different. The analysis block removes representation redundancy by fitting the input data to a structural model. For instance, the highly redundant image of a sphere can be described, with an appropriate model, by the position and size of the sphere, its surface characteristics, and a set of light sources. Such a description is much less redundant than the original array of pixels. The encoder block, on the other hand, removes statistical redundancy from the sequence of symbols corresponding to the structural representation. It must be stressed here that removal

[Figure 2.1: Basic block structure of a codec. (a) Encoder. Blocks: Pre-processing, Analysis, Encoding. Inputs: natural unstructured data, synthetic structured data, user supervision. Output: to channel or storage device. (b) Decoder. Blocks: Decoding, Rendering. Input: from channel or storage device. Outputs: to display devices, to uplink channel. User interaction is also shown.]

2.4.3 Generations

In the case of natural scenes, i.e., video coding, analysis is performed before encoding proper, as can be seen in Figure 2.1. Video encoding techniques can thus be classified according to the level of analysis typically required. The terms first- and second-generation video coding were coined by Kunt et al. [96], and correspond approximately to the first two levels of analysis presented before. The requirements in terms of analysis of these two generations of video coders, plus a third one related to high-level analysis, are as follows:

First-generation

Coders which require low-level analysis. Hybrid coders [64] and motion compensated hybrid coders [145] belong to this generation. The fundamental tools used in these coders,

belong to this generation.

Second-generation

Coders which require mid-level analysis. This type of analysis is typically more complex than low-level analysis. Even though a lot of effort has been put into this field, no truly reliable set of mid-level analysis tools is available yet. This thesis contributes mainly to the problem of developing tools at this level.

Third-generation

Coders which require high-level analysis. No truly reliable set of automatic analysis tools exists at this level. Most tools still rely on human supervision, and this will probably remain so for a few more years: most of the semantic features/descriptors can only be extracted by humans at the present time [29].

This classification, though useful, is somewhat artificial. For instance, a mid- or high-level analysis tool can be used to enhance a first-generation video encoder. This often happens when video encoding algorithms are being enhanced.

2.5 Standards

Standards are fundamental for universality of service and interworking, both of which are of paramount importance for the end consumer. Standardization, however, is a time-critical process: if done too soon, it may not benefit from the ongoing research in the area; if done too late, it may have to face proprietary solutions proposed by industries of sufficient weight to make the standard useless.

Standards may be of two very different natures. English is a de facto language standard in most of the western world. French, on the other hand, is a de jure standard, at least in France: it is standardized by the Académie Française and imposed by the French state in official documents. The case with technologies is similar.

Standards, whether de facto or de jure, can be created in different ways. Some are developed by an open group of companies, universities and individuals which work towards the standard under some national, e.g., ANSI (American National Standards Institute), or international, e.g., ISO, standardization body. Others are developed by similar groups, though working in the framework of non-official organizations such as the W3C (World Wide Web Consortium) or the IETF (Internet Engineering Task Force). Others still are developed by single institutions, and their specifications are made public and accepted as de facto standards by the rest of the members of the market. Often de facto standards are later accepted as de jure standards by official standardization bodies.

In the world of multimedia, examples can be found in each of these cases. The video coding standards MPEG-1 and MPEG-2, and H.261 and H.263, were developed under international standard organizations, viz. ISO and ITU (International Telecommunication Union), and thus are de jure standards. The Java™

been proposed to ISO to become a de jure standard). The Web standards, such as HTTP and HTML, are being developed in the framework of the IETF and W3C non-official organizations.

Convergence can also be found in the world of standards: the MPEG (Moving Picture Experts Group) community, traditionally video-oriented, and the WWW community, more multimedia oriented, are converging. The MPEG community is finalizing the first version of MPEG-4. MPEG-4 version 1 will be much more than video and audio coding with a multiplexing layer, as MPEG-1 and MPEG-2 were: MPEG-4 will standardize audio-visual 3D scene description methods, by inclusion of the ISO/IEC 14772 VRML (Virtual Reality Modeling Language) standard. The WWW community, on the other hand, is issuing documents, which will probably become de facto standards, that address similar subjects: PNG (Portable Network Graphics) for encoding of still images, support of VRML for 3D virtual worlds (which includes video nodes), and SMIL (Synchronized Multimedia Integration Language) for synchronizing different multimedia objects in a single presentation. More than a convergence, what is being witnessed is an overlap, a competition. The future will tell whether the minimalist, text-based, W3C and IETF standards or the overwhelming MPEG standards will win. The market does not always choose the best technology: often timing, as mentioned before, is the critical factor.

2.5.1 Standardization challenges

Nowadays standardization of multimedia communications faces several challenges. Different technologies (some of them standards), by different organizations, will address distinct subsets of the challenges listed below:

Content

Interesting contents will soon include complex 3D scenes, containing a mixture of synthetic and natural dynamic objects, which can be manipulated by the end user. Who will provide this type of information or content? How? I.e., using what tools?

Bandwidth

Network bandwidth and mass storage capacities both continue to grow exponentially. Even in the unlikely event that they will continue to increase exponentially forever, "technological malthusianism" tells us that the bandwidth/capacity will never be enough, since content will always grow at a faster pace. Hence, there will always be money to be gained, or spared, through compression of the multimedia data transmitted or stored.

The issue of compression has typically been much more of a concern for the video rather than the multimedia people. A number of standards, aiming at different applications, have been developed for the compression of video and still images: H.261 and H.263, MPEG-1, MPEG-2, MPEG-4 (soon to be born), and ISO/IEC JPEG (Joint Photographic Experts Group). From the multimedia world, less concerned, unfortunately, with bandwidth waste, little more than the W3C PNG exists today.

Access

community has only recently started to address in a thorough way, in MPEG-4. There are good reasons for the late convergence: compression and easy access are quite incompatible, and for some time bandwidth was more important than interactivity. The balance is likely to change.

Classification

The Web is a huge, distributed database, whose size tends to increase exponentially. How can users navigate through this apparent chaos in a useful way? How can multimedia information such as text, 2D (Two-dimensional) pictures, 2D drawings, 2D videos, sound clips, movies, TV programs, 3D objects, and mixtures thereof, be indexed, searched, and retrieved in a meaningful way? Will the indexing, or classification, be done automatically?

This is an issue which is being simultaneously addressed by W3C and by MPEG, the latter through the recently born MPEG-7 effort. W3C is working on Metadata, or information about information, while MPEG-7 aims at standardizing multimedia indexing methods.

Rights protection

Providers of interesting content, individual authors or companies, will be interested in getting paid. How can IPR (Intellectual Property Rights) be protected on the Web? What will the network economics be like? How will IPR information be included in multimedia objects?

Access control and rating

Should all information on the Web be available to all? Who should control? How to control? How to rate information? How to cipher sensitive information?

W3C has addressed this question through a type of Metadata called PICS (Platform for Internet Content Selection), which aims at standardizing the method of including rating information (labels) into Web content.

Trust

Is the information available on the Web trustworthy? How to ascertain its real origin? How can information be certified? How can one assure that a signature certifies a given piece of information and that this information has not changed in any way?

W3C is also working on DSig (Digital Signatures), and there are some CEC (European Community Commission) funded projects working on watermarking of visual information.

Interworking

How to avoid needless duplication of hardware/software needed to access information of the same type stored in different formats? This is the basic objective of standardization efforts.

Evolution

How to produce standards that encourage, rather than prevent, competition and technical evolution?

MPEG-4 had the provision for evolution as one of its objectives. However, due to timing problems, MPEG-4 was divided into two phases. Phase 1, which is scheduled for the

2.5.2 Evolution of visual coding standards

Section 2.4.1 presented the various measures which can be used to construct the cost functional that video coders minimize (or at least attempt to minimize). Most of them have been used in one form or another by encoders compliant with the available video coding standards. However, ease of access to content was first considered only in MPEG-1 and MPEG-2, in the form of provision for quick access to anchor images. These images, known as I images (I for Intra), are independently encoded and spread evenly in time, thus allowing the so-called trick modes of video recorders: fast-forward, backtrack, etc. This allowed only for a rather terse access to content. It was only MPEG-4 which started to consider a more useful form of content, objects, and which provided means for expressing complex 3D audio-visual scenes with mixtures of 2D and 3D objects, natural or synthetic. The real revolution was from MPEG-2 to MPEG-4. MPEG-2 was essentially a revamped version of MPEG-1, using the same basic tools, but allowing for increased resolution [95]: HDTV (High Definition Television) required it. True breakthroughs in the video coding area have been quite rare. Most of the tools used by encoders compliant with MPEG standards, even MPEG-4, are small variants, however well-engineered, of tools developed decades ago [149], e.g., DCT and block matching motion compensation. However, the integral of all the incremental hardware and software technology advances over the last decades corresponds to an impressive evolution.

2.5.3 Consequences of standardization

Standards don't specify encoders: they specify a bitstream syntax and a decoder. Hence, they implicitly define a model for the structural data to be encoded. In this sense, video coding standards can also be classified as belonging to one of the three generations presented before.

In a slightly more formal way, let $B$ be the space of bitstreams compliant with a given standard. Let $E$ be the space of the encoders compliant with the same standard. Then, a given encoder $e(\cdot)$, in the broad sense, is a function from the space $R$, of scene representations, to $B$, i.e., $e(\cdot): R \to B$. Space $E$ is thus clearly limited by the nature of $B$. Standards specify decoders, that is, they specify a function $d(\cdot)$ from $B$ back to $R$. Typically, space $E$, though restricted by the nature of $B$, is very large. Even if one restricts it to the space of compliant encoders providing appropriate reconstruction, that is, such that $d(e(\cdot))$ is approximately the identity, the space is too large.

One can pose the encoding problem mathematically, though the complexity of the solution usually leads to heuristic solutions: how to encode a given scene representation $r$? This question can be answered by finding $\arg\min_{b \in B} z(d(b), r)$, where $z$ is a distortion measure. However, this includes only a distortion, or quality, measure. One may be interested in minimizing other measures. The generic problem is to find a generic encoder, i.e., an encoder leading to good decoding. A possibility is to find $\arg\min_{e \in E} \max_{r \in R} z(d(e(r)), r)$.

Whatever the approach taken, heuristic or optimizing, it is clear that standards introduce restrictions into the space of possible encoders. It is also clear that they also leave a lot of room

data. The design of decoders with good error concealment strategies and the design of encoders providing for good error resilience at the decoder is open to competition.

2.5.4 Standards and generations

Standards can be classified as first-, second- or third-generation, according to the character of the compliant encoders. However, nothing prevents the building of a second-generation encoder (i.e., an encoder using mid-level analysis) which generates bitstreams compliant with first-generation standards. For instance, MPEG-1 and MPEG-2 belong clearly to the first generation, while MPEG-4, which requires more sophisticated analysis tools but still uses a classical approach to encode the texture of the objects, can be said to be a step towards second-generation standards. Actually, this has been the typical road of evolution, as some of the work in this thesis demonstrates. When tools aimed at being used in one of these transition encoders are developed, one may classify them as belonging to transitions between generations.

2.6 Analysis and coding tools

Figure 2.2 shows the analysis, pre-processing and coding tools proposed or discussed in this thesis. The figure classifies these tools into the three generations, with two transition layers added. The tools are also listed below, together with pointers to the sections where they are described:

Analysis tools:

  – Transition to second-generation:

    1. Knowledge-based segmentation [123, 125, 124] (Section 4.4).

    2. Camera movement estimation [129, 127, 130, 128, 113, 122] (Section 5.3).

  – Second-generation:

    1. RSST segmentation [32, 33] (Section 4.5).

    2. TR-RSST (Time-Recursive RSST) segmentation [119] (Section 4.7).

  – Transition to third-generation:

    1. RSST with human supervision [33] (Section 4.6).

Pre-processing tools:

    1. Image stabilization [127, 130, 128, 113, 122] (Section 5.5).

Coding tools:

  – Second-generation:

    1. Shape coding: a taxonomy and an overview of coding techniques [120, 121] (Sections 6.2 and 6.3), parametric curve coding tools [116, 114, 79, 78] (Section 6.4).

[Figure 2.2: Analysis, pre-processing and coding tools, classified as first-generation, transition to second-generation, second-generation, and transition to third-generation. Labels recovered from the figure: camera movement estimation, camera movement compensation, knowledge-based segmentation, image stabilization, fast closed cubic splines, RSST segmentation, TR-RSST segmentation, supervised RSST segmentation, taxonomy of partition types and representations, codec architecture.]

Graph theoretic foundations for image analysis

We should never try to be more precise and exact than the problem at hand requires.

Karl Popper

This chapter defines the main concepts used throughout this thesis. It is divided into sections dealing with images, image lattices, image graphs, etc. Concepts are introduced, whenever possible, in a bottom-up manner: concepts are defined by using previously defined concepts.

Often the efficiency of algorithms known to solve problems related to the definitions given here is discussed: the usual $O(\cdot)$ notation of algorithmics is used [28].

3.1 Color perception

There are two types of light sensor cells in the retina: rods and cones. Rods are used for night (scotopic) vision, while cones are used for daylight (photopic) vision. Both are known to be used in twilight (mesopic) vision.

Rods greatly outnumber cones. However, the distribution of the rod cells is such that its density is nearly zero in the fovea, that is, the zone on the retina corresponding to the center of attention. In this zone cones are densely packed.¹ Rods are much more sensitive to light than cones: a single quantum is known to be sufficient to excite a rod. The different density distribution of rods and cones seems to be an evolutionary compromise between accuracy of vision (fundamental during daytime) and ability to detect threats (fundamental during dusk).

¹ A simple experiment confirms the absence of rods in the fovea. Look directly at a dim star and then look slightly to its side: its apparent lightness will increase.

While rods are sensitive to a wide range of light frequencies, they all have the same type of response, hence scotopic vision is essentially "black and white": colors are not discriminated. Cones, on the other hand, are really three different types of cells with different frequency responses. One type of cones, say "red" cones, is especially sensitive to frequencies around pure red, another, "green" cones, to frequencies around pure green, and the last, "blue" cones, to frequencies around pure blue, where "pure" means consisting of a single frequency. The overall response of the cones spans the visible light spectrum. However, the maximum sensitivity of the combined action of cones occurs at a slightly higher wavelength (towards red) than that of rods (towards blue): it is the so-called Purkinje wavelength shift. This seems to be related to the fact that during twilight light is more bluish than during daytime, since it is mostly indirect light diffracted by the atmosphere particles.

In the framework of image communications and multimedia, photopic (daytime) vision is the rule, so that the response of rods can be mostly ignored. The response of cones can be modeled as a nonlinear function of the inner product of a spectral sensitivity function, which is a characteristic of the given type of sensor cell in a Standard Observer, and the power spectrum of the light attaining the sensors (see for instance [189]). "Red", "green", and "blue" cones have different spectral sensitivity functions which partially overlap in frequency.

Further information on color perception may be found in [164, 1, 27].

3.1.1 Color spaces

Color reproduction uses the fact that the HVS has only three types of cones. In order for two light sources to be perceived as having equal color it is not necessary for their power spectra to be equal: they only have to produce the same response for each of the three types of cones. Hence, most image data is available in a three component format.
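The fact that different power spectra can produce the same cone responses can be checked numerically. The sketch below is a toy model only: the four wavelength bins, the sensitivity values and the power-law nonlinearity are assumptions chosen to keep the arithmetic obvious, not measured Standard Observer data. Each response is computed as a nonlinear function of the inner product of a sensitivity function and the power spectrum.

```python
# Toy spectral sensitivities over four wavelength bins (assumed values,
# ordered from short to long wavelengths; neighbouring cones overlap).
SENSITIVITY = {
    "red":   [0.0, 0.0, 1.0, 1.0],
    "green": [0.0, 1.0, 1.0, 0.0],
    "blue":  [1.0, 1.0, 0.0, 0.0],
}


def cone_responses(spectrum, exponent=0.5):
    """Response of each cone type: a nonlinear function (here a power law,
    an assumption) of the inner product of sensitivity and power spectrum."""
    responses = {}
    for cone, sensitivity in SENSITIVITY.items():
        inner = sum(s * p for s, p in zip(sensitivity, spectrum))
        responses[cone] = inner ** exponent
    return responses


# Two different power spectra that yield the same three inner products.
spectrum_a = [2.0, 2.0, 2.0, 2.0]
spectrum_b = [2.5, 1.5, 2.5, 1.5]

print(cone_responses(spectrum_a))
print(cone_responses(spectrum_b))
```

In this toy model the two spectra differ by a vector that all three sensitivity functions map to zero, so no combination of cone responses can tell them apart; three components are therefore enough to describe the perceived color.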

Color data is often presented on a CRT (Cathode-Ray Tube). Since the power emitted by such screens is typically proportional to a (arithmetic) power of the input voltage (the exponent being the so-called gamma value), cameras are usually designed to perform gamma correction. The correction specified by ITU-R (ITU Radiocommunication Sector)² Recommendation BT.709-2 [80] follows

$$
I' = \begin{cases}
4.5\,I & \text{if } 0 \le I \le 0.018\text{, and} \\
1.099\,I^{0.45} - 0.099 & \text{if } 0.018 < I \le 1,
\end{cases}
\tag{3.1}
$$

which is the inverse of the ideal monitor power function

$$
I = \begin{cases}
\dfrac{I'}{4.5} & \text{if } 0 \le I' \le 0.081\text{, and} \\
\left( \dfrac{I' + 0.099}{1.099} \right)^{\frac{1}{0.45}} & \text{if } 0.081 < I' \le 1,
\end{cases}
$$