UNIVERSIDADE TÉCNICA DE LISBOA
INSTITUTO SUPERIOR TÉCNICO
Analysis and Coding of Visual Objects:
New Concepts and New Tools
Manuel Pinto da Silva Menezes de Sequeira
(Mestre)
Dissertation submitted for the degree of Doctor in
Electrical and Computer Engineering
Supervisor:
Dr. Carlos Eduardo do Rego da Costa Salema,
Full Professor, Instituto Superior Técnico, Universidade Técnica de Lisboa
Co-supervisor:
Dr. Augusto Afonso de Albuquerque,
Full Professor, Instituto Superior das Ciências do Trabalho e da Empresa
Jury:
Chair:
Rector of the Universidade Técnica de Lisboa
Members:
Dr. Augusto Afonso de Albuquerque,
Full Professor, Instituto Superior das Ciências do Trabalho e da Empresa
Dr. José Manuel Nunes Leitão,
Full Professor, Instituto Superior Técnico, Universidade Técnica de Lisboa
Dr. Luis António Pereira Menezes Corte-Real,
Assistant Professor, Faculdade de Engenharia da Universidade do Porto
Dr. Mário Alexandre Teles de Figueiredo,
Assistant Professor, Instituto Superior Técnico, Universidade Técnica de Lisboa
Dr. Jorge dos Santos Salvador Marques,
Assistant Professor, Instituto Superior Técnico, Universidade Técnica de Lisboa
Dr. Maria Paula dos Santos Queluz Rodrigues,
Assistant Professor, Instituto Superior Técnico, Universidade Técnica de Lisboa
It is sometimes said that the order by which names are mentioned in the acknowledgments has no special meaning. Not in this case. Without Prof. Augusto Albuquerque's friendship, without his encouragements, and without his supervision, this thesis would not have been possible. Thanks.
To João Luis Sobrinho and Carlos Pires, for their friendship and support.
To all my colleagues in ISCTE, for their good will. Special thanks to Luis Nunes and Filipe Santos, they know why.
To the staff of the Image Group of the Universitat Politècnica de Catalunya, for accepting me as their own during a month of valuable discussions. It was a fundamental month.
To Prof. Carlos Salema, for his willingness to supervise this thesis all this time. For his patience.
To the Image Group of IST, for their friendship and for the pleasant lunch discussions, always stimulating, seldom technical...
To the Instituto de Telecomunicações, for the office and equipment I used for so long.
To Prof. Fernando Pereira who, by allowing me to work for the CEC RACE MAVT project, greatly facilitated my contacts with the international video coding community.
To my friends, who always trusted me more than I did myself.
This thesis was partially funded by the CIENCIA and PRAXIS programs of JNICT. A substantial part of this work was done while I was working for the CEC MAVT RACE project,
It is customary to say in acknowledgements that the order in which the names appear has no meaning whatsoever. That is not the case here. Without the friendship of Prof. Augusto Albuquerque, without his encouragement, and without his supervision, this thesis would truly not have been possible. My heartfelt thanks.
To João Luis Sobrinho and Carlos Pires, for their support and friendship.
To my colleagues at ISCTE, my thanks for their enormous good will. Special thanks to Luis Nunes and Filipe Santos, they know why.
To all the researchers of the Image Group of the Universitat Politècnica de Catalunya, who welcomed me into their midst during a month of fruitful discussions. It was a fundamental month.
To Prof. Carlos Salema, for being willing to supervise this thesis throughout all this time. For his patience.
To the image group, for the friendship and camaraderie, and for the lunchtime discussions, always stimulating, rarely technical...
To the Instituto de Telecomunicações, for the infrastructure and computing equipment it allowed me to use.
To Prof. Fernando Pereira, for having allowed me, through participation in the MAVT project, to maintain international scientific contacts that would otherwise have been much more difficult.
To my friends, who always believed in me much more than I did myself.
This thesis was partially funded by JNICT, through the CIENCIA and PRAXIS programs. A substantial part of this work was carried out within the RACE MAVT project, and also in my
Video coding has been under intense scrutiny during the last years. The published international standards rely on low-level vision concepts, thus being first-generation. Recently standardization started in second-generation video coding, supported on mid-level vision concepts such as objects.
This thesis presents new architectures for second-generation video codecs and some of the required analysis and coding tools.
The graph theoretic foundations of image analysis are presented and algorithms for generalized shortest spanning tree problems are proposed. In this light, it is shown that basic versions of several region-oriented segmentation algorithms address the same problem. Globalization of information is studied and shown to confer different properties to these algorithms, and to transform region merging into recursive shortest spanning tree (RSST) segmentation. RSST algorithms attempting to minimize global approximation error and using affine region models are shown to be very effective. A knowledge-based segmentation algorithm for mobile videotelephony is proposed.
A new camera movement estimation algorithm is developed which is effective for image stabilization and scene cut detection. A camera movement compensation technique for first-generation codecs is also proposed.
A systematization of partition types and representations is performed with which partition coding tools are overviewed. A fast approximate closed cubic spline algorithm is developed with applications in partition coding.
Keywords: visual coding, second-generation video coding, image analysis, image segmentation,
Video coding has been intensively studied in recent years. The international standards already published are based on low-level vision concepts and are therefore first-generation. Standardization of second-generation coding techniques, supported on mid-level vision concepts such as objects, has recently begun.
This thesis presents new architectures for second-generation video codecs and some of the corresponding analysis and coding tools.
Foundations of graph theory applied to image analysis are presented, and algorithms are proposed for generalizations of the shortest spanning tree problem. It is shown that basic versions of several region-oriented segmentation algorithms solve the same problem. Globalization of information is studied and shown to confer different properties on these algorithms, transforming the region merging algorithm into the recursive shortest spanning tree (RSST) algorithm. RSST algorithms that attempt to minimize the global approximation error and that use affine region models are shown to be effective. A segmentation algorithm based on prior knowledge is proposed for mobile videotelephony.
A camera movement estimation algorithm is developed that is effective for image stabilization and scene change detection. A camera movement compensation technique for first-generation codecs is also proposed.
The types and representations of regions are systematized, and partition coding techniques are then reviewed. A fast approximate algorithm for computing closed cubic splines is developed.
Keywords: visual coding, second-generation video coding, analysis of
Acknowledgements
Agradecimentos
Abstract
Resumo
1 Introduction
1.1 Structure of the thesis
2 Video and multimedia communications
2.1 Trends of multimedia communications
2.1.1 Distribution methods
2.1.2 Activity paradigms
2.1.3 Convergence tendencies
2.1.4 A distributed database
2.2 Media representation
2.3 Visual analysis
2.3.1 Levels of visual analysis
2.3.2 Tools for visual analysis
2.4 Visual coding
2.4.1 Objectives
2.4.3 Generations
2.5 Standards
2.5.1 Standardization challenges
2.5.2 Evolution of visual coding standards
2.5.3 Consequences of standardization
2.5.4 Standards and generations
2.6 Analysis and coding tools
3 Graph theoretic foundations for image analysis
3.1 Color perception
3.1.1 Color spaces
3.2 Images and sequences
3.2.1 Analog images
3.2.2 Digital images
3.2.3 Lattices, sampling lattices and aspect ratio
3.3 Grids, graphs, and trees
3.3.1 Graph operations
3.3.2 Walks, trails, paths, circuits, and connectivity in graphs
3.3.3 Euler trails and graphs
3.3.4 Subgraphs, complements, and connected components
3.3.5 Rank and nullity
3.3.6 Cut vertices, separability, and blocks
3.3.7 Cuts and cutsets
3.3.8 Isomorphism, 2-isomorphism, and homeomorphism
3.3.9 Trees and forests
3.3.10 Graphs and lattices
3.4 Planar graphs and duality
3.4.3 Four-color theorem
3.5 Maps
3.5.1 Operations on the dual RAMG and RBPG graphs
3.5.2 Definition of map
3.5.3 Algorithms
3.6 Partitions and contours
3.6.1 Partitions and segmentation
3.6.2 Classes and regions
3.6.3 Region and class graphs
3.6.4 Equivalence and equality of partitions
3.7 Conclusions
4 Spatial analysis
4.1 Introduction to segmentation
4.2 Hierarchizing the segmentation process
4.2.1 Operator level
4.2.2 Technique level
4.2.3 Algorithm level
4.2.4 Conclusions
4.2.5 Pre-processing
4.3 Region- and contour-oriented segmentation algorithms
4.3.1 Contour-oriented segmentation
4.3.2 Region-oriented segmentation
4.3.3 SSTs as a framework of segmentation algorithms
4.3.4 Globalization strategies
4.3.5 Algorithms and the dual graphs
4.3.6 Conclusions
4.4.2 Algorithm description
4.4.3 Computational effort
4.4.4 Results
4.4.5 Conclusions
4.5 RSST segmentation algorithms
4.5.1 Pre-processing
4.5.2 Segmentation algorithm
4.5.3 Results and discussion
4.5.4 Conclusions
4.6 Supervised segmentation
4.6.1 RSST extension using seeds
4.6.2 Results
4.6.3 Conclusions
4.7 Time-coherent analysis
4.7.1 Extension of RSST to moving images
4.7.2 Results
4.7.3 Conclusions
5 Time analysis
5.1 Camera movements
5.1.1 Transformations on the digital image
5.2 Block matching estimation
5.2.1 Error metrics
5.2.2 Algorithms
5.3 Estimating camera movement
5.3.1 Least squares estimation
5.3.2 Outlier detection
5.4.1 Results
5.5 Image stabilization
5.5.1 Viewing window
5.5.2 Viewing window displacement
5.5.3 Results
5.6 Conclusions
6 Coding
6.1 Camera movement compensation
6.1.1 Quantizing camera motion factors
6.1.2 Extensions to the H.261 recommendation
6.1.3 Encoding control
6.1.4 Results and conclusions
6.2 Taxonomy of partition types and representations
6.2.1 Partition type
6.2.2 Partition representation
6.2.3 Representation properties
6.3 Overview of partition coding techniques
6.3.1 Lossless or lossy coding
6.3.2 Mosaic vs. binary partitions
6.3.3 Partition models
6.3.4 Class coding
6.3.5 Label coding
6.3.6 Contour coding
6.4 A quick cubic spline implementation
6.4.1 2D closed spline definition
6.4.2 Determination of the spline coefficients
6.4.5 Results
6.5 Conclusions
7 Conclusions: Proposal for a new codec architecture
7.1 Proposal for a second-generation codec architecture
7.1.1 Source model
7.1.2 Codec architecture
7.1.3 Conclusions
7.2 Suggestions for further work
7.2.1 Codec architecture
7.2.2 Graph theoretic foundations for image analysis
7.2.3 Spatial analysis
7.2.4 Time analysis
7.2.5 Coding
7.3 List of contributions
7.3.1 Graph theoretic foundations for image analysis
7.3.2 Spatial analysis
7.3.3 Time analysis
7.3.4 Coding
A Test sequences
A.1 Video formats
A.1.1 Aspect ratios
A.2 Test sequences
B The Frames video coding library
B.1 Library modules
B.1.1 types.h
B.1.4 mem.h
B.1.5 matrix.h
B.1.6 sequence.h
B.1.7 contour.h
B.1.8 dct.h
B.1.9 filters.h
B.1.10 graph.h
B.1.11 heap.h
B.1.12 motion.h
B.1.13 splitmerge.h
B.2 An example of use
2D... Two-dimensional
3D... Three-dimensional
ADSL... Asymmetric Digital Subscriber Line
ANSI... American National Standards Institute (a standards body)
CAG... Class Adjacency Graph
CATV... Cable Television (formerly Community Antenna Television)
CCIR... Comité Consultatif International des Radiocommunications (now ITU-R)
CCITT... Comité Consultatif International Télégraphique et Téléphonique (now ITU-T)
CD... Compact Disk
CD... Committee Draft (of an ISO standard)
CEC... European Community Commission
CERN... Conseil Européen pour la Recherche Nucléaire
CIE... Commission Internationale de l'Éclairage
CIF... Common Intermediate Format
CPU... Central Processing Unit
CRT... Cathode-Ray Tube
DCT... Discrete Cosine Transform
DFS... Depth First Search
DFS... Discrete Fourier Series
DSig... Digital Signatures (of W3C)
http://www.w3.org/DSig/Overview.html
FCT... Four-Color Theorem
FIR... Finite Impulse Response
FLC... Fixed Length Code
FSF... Free Software Foundation
FTP... File Transfer Protocol
GOB... Group of Blocks (a syntax element in ITU-T H.261)
GPL... GNU General Public Licence (of FSF)
HDTV... High Definition Television
HSV... Hue, Saturation, and Value (a color space)
HTML... Hypertext Markup Language (of W3C)
HTTP... Hypertext Transfer Protocol (of IETF and of W3C)
HVS... Human Visual System
IEC... International Electrotechnical Commission (a standards body)
http://www.iec.ch/
IEEE... The Institute of Electrical and Electronics Engineers, Inc.
http://www.ieee.org/
IETF... Internet Engineering Task Force (a standards body)
http://www.ietf.org/
IIR... Infinite Impulse Response
IP... Internet Protocol
IPR... Intellectual Property Rights
IRC... Internet Relay Chat
ISDN... Integrated Services Digital Network (of ITU-T)
ISO... International Organization for Standardization (a standards body)
http://www.iso.ch/
ITU... International Telecommunication Union (a standards body)
http://www.itu.ch/
ITU-T... ITU Telecommunication Standardization Sector (a standards body, see ITU)
http://www.itu.ch/ITU-T/
JPEG... Joint Photographic Experts Group (of ISO and IEC)
LMedS... Least Median of Squares
LoG... Laplacian of Gaussian
LSF... Longest Spanning Forest
LST... Longest Spanning Tree
MAVT... Mobile Audio-Visual Terminal
MB... MacroBlock (a block of 16×16 pixels, on ITU-T and ISO/IEC video coding standards)
MBA... MacroBlock Address (a syntax element in ITU-T H.261)
MC... Motion Compensated
MF... Model Failure
MMREAD... Modified MREAD (see MREAD)
MPEG... Moving Picture Experts Group (of ISO and IEC)
MREAD... Modified Relative Element Address Designate
MTYPE... Macroblock Type (a syntax element in ITU-T H.261)
MVD... Motion Vector Data (a syntax element in ITU-T H.261)
NTSC... National Television Systems Committee (a standards body)
OALDCE... Oxford Advanced Learner's Dictionary of Current English
OCR... Optical Character Recognition
PAL... Phase Alternating Line
PC... Personal Computer
PEI... Picture Extra Insertion Information (a syntax element in ITU-T H.261)
PICS... Platform for Internet Content Selection (of W3C)
http://www.w3.org/PICS/
PNG... Portable Network Graphics (of W3C)
PSNR... Peak Signal to Noise Ratio
PSPARE... Picture Spare Information (a syntax element in ITU-T H.261)
PSTN... Public Switched Telephone Network
PTYPE... Picture Type (a syntax element in ITU-T H.261)
QCIF... Quarter-CIF (see CIF)
QPT... Quartic Picture Tree
RAG... Region Adjacency Graph
RAMG... Region Adjacency MultiGraph
RBPG... Region Border PseudoGraph
RGB... Red, Green, and Blue (a color space)
RLE... Run-Length Encoding
RM8... Reference Model 8 (a reference model for ITU-T H.261)
RSST... Recursive SST (see SST)
SIF... Standard Interchange Format
SMIL... Synchronized Multimedia Integration Language (of W3C)
http://www.w3.org/TR/WD-smil
SSF... Shortest Spanning Forest
SSkT... Shortest Spanning k-Tree
SSSkT... Shortest Seeded Spanning k-Tree
SSSSkT... Smallest Shortest Seeded Spanning k-Tree
SST... Shortest Spanning Tree
TCP... Transmission Control Protocol
TR-RSST... Time-Recursive RSST (see RSST)
TV... Television
UMTS... Universal Mobile Telecommunication Service
UPC... Universitat Politècnica de Catalunya
URC... Uniform Resource Characteristics (of IETF)
URN... Uniform Resource Name (of IETF)
VLC... Variable Length Code
VO... Video Object (a syntax element in ISO/IEC MPEG-4)
VOD... Video-On-Demand
VOL... Video Object Layer (a syntax element in ISO/IEC MPEG-4)
VOP... Video Object Plane (a syntax element in ISO/IEC MPEG-4)
VQ... Vector Quantization
VRML... Virtual Reality Modeling Language (a standard of ISO and IEC)
W3C... World Wide Web Consortium (a standards body)
1 Introduction
"The time has come," the Walrus said,
"To talk of many things:"
Lewis Carroll
The performance of classical video coding algorithms, in terms of the classical coding criteria (bitrate, distortion, and cost), seems to be reaching a plateau [161, 3]. That is, the marginal performance gains of tuning these algorithms are now nearly negligible. According to Adelson et al. [3, 195], the classical approaches use concepts usually related to low-level vision, such as luminance, color, spatial frequency, temporal frequency, local motion, and low-level operators such as linear filtering and transforms. New approaches, using mid-level visual concepts, such as regions, textures, surfaces, depth, global motion, and lighting, are deemed necessary for a breakthrough in video coding performance. This need has been recognized for some time now [144, 141, 96], though limited computing capabilities have somewhat hindered progress towards the implementation of complete mid-level vision (second-generation) video coding algorithms.
During the last years, and following the ever increasing advances of technology, the use of image and video in everyday life has been growing continuously. This has sparked new needs among users: interactivity, content editing, and content-based indexing are just a few examples. These needs require access to the content of video sequences. This access may, in some cases, be done after encoding and decoding, i.e., by performing analysis at the receiver side. In most cases, though, it is essential to have this capability directly at bit stream level. Content access should thus be done with a minimum of effort: a "fourth [coding] criterion" has been identified, coined by Picard [162] as "content access effort." This criterion is related to the complexity or effort required to access the video content, and hence to provide content-based facilities.
The results obtained until now by mid-level vision video coding algorithms, though extremely
However, this apparent lack of success is truly a misjudgment, since the performance has been measured, until now, using only the bitrate, distortion, and cost criteria. When the fourth criterion is introduced, the newly developed algorithms certainly have a leading edge over the classical ones: objects and regions, rather than square blocks, are what a user wants to interact with.
The new users' needs have also been recognized by MPEG-4. These ideas were introduced in MPEG-4 [138] by asking for some "new or improved functionalities" [139]: content-based manipulation and bit stream editing, content-based multimedia data access tools, and content-based scalability.
This thesis summarizes a series of proposals towards coding of visual objects. The work has progressed over a number of years and can be seen as a contribution to the development of second-generation visual coding standards, of which MPEG-4 is an example.
1.1 Structure of the thesis
Chapter 2, "Video and multimedia communications", contains a brief overview of multimedia, the Internet and video communications. It can be seen as a motivation for the work developed. Video codecs are classified as first-, second-, or third-generation according to the analysis tools required: first-generation for low-level vision analysis, second-generation for mid-level vision analysis, and third-generation for high-level vision analysis. A brief summary of the analysis and coding tools proposed in this thesis, organized according to the presented structure, can be found in Section 2.6.
Chapter 3, "Graph theoretic foundations for image analysis", defines most of the theoretical concepts that are used throughout. In this chapter the important theory of spanning trees, a branch of graph theory, and related concepts using seeds, is discussed together with the corresponding algorithms. An amortized linear time algorithm is also presented for an important class of spanning tree problems.
Chapter 4, "Spatial analysis", contains proposals for a knowledge-based mobile videotelephony segmentation algorithm, an extended RSST (Recursive SST) segmentation algorithm using an affine region model, a supervised RSST segmentation algorithm, i.e., an RSST algorithm using seeds, and a time-recursive version of the RSST algorithm providing time-coherent segmentation of moving images. The classical segmentation algorithms, such as region growing, region merging, and edge detection followed by contour closing, are all described in the framework of the theory of spanning trees introduced in the previous chapter. The relations between these algorithms are discussed in the common framework of spanning trees. The effects on these algorithms of globalization of information are also discussed.
Chapter 5, "Time analysis", proposes a simple algorithm for estimating camera movement in moving images and a method for its cancellation (image stabilization) to improve image quality
Chapter 6, "Coding", proposes a method of encoding camera movement information using a simple extension to the H.261 standard (the discussions on quantization are general and transposable to any other codec using motion vector fields with reduced resolution relative to that of the underlying images) and reviews the important issue of partition representation and coding. A fast approximation to the calculation of closed cubic splines is also proposed. The analysis and coding tools presented in this and the previous two chapters can be seen as steps towards the building of tools for a new codec architecture.
Chapter 7, "Conclusions: Proposal for a new codec architecture", proposes a new second-generation codec architecture, makes some suggestions for future work, and lists the thesis contributions.
Finally, Appendix A describes the test sequences used and their formats, and Appendix B
2 Video and multimedia communications
It is supposed that because a thing is the rule it is right.
Oscar Wilde
2.1 Trends of multimedia communications
"Medium" literally means "middle". According to the OALDCE (Oxford Advanced Learner's Dictionary of Current English) [71], it means "that by which something is expressed," i.e., that by which a message is expressed, since, according to Negroponte [142], "the medium is not the message." Messages can be expressed using a variety of media. Multimedia is the process of expressing a message using several media. In this sense, multimedia is not new. Multimedia has existed ever since there have been books with images,¹ actually even before that, since humans communicate by speech and gestures.
Until last century our ability to store and transmit messages was very limited. Only text and still images and diagrams could be stored for future use (e.g., in books), and long range transmission was limited to physical transport of printed or handwritten material, with rare exceptions. The telegraph, for long range transmission of text, the telephone, for long range transmission of voice, and the radio, for long range transmission of sound, changed that picture considerably. But perhaps the most important inventions of the last century were the phonograph, for storing sounds, and the cinematograph, by which storing of moving images became possible.
¹ Different media can share the same sense (or "channel") into the human brain. Text and imagery, though different media, are both sensed using vision.
In the beginning of this century it was possible, at least in principle, to express messages using multimedia as we know it today and store them for future use. In practice, this happened only in the thirties, with the introduction of sound synchronized with image in the cinema. Stereoscopic imagery was also available at that time.
2.1.1 Distribution methods
A message, as expressed through moving images and sound in a film, is meant to be conveyed to a receptor. Although movie theaters are still a very successful and profitable way of doing it, they involve considerable delay and trouble. Using Negroponte's [142] "bits" and "atoms" definitions, the producer distributes the film cartridges (atoms) containing encoded images and sounds (bits), which are then broadcast from a screen and speakers to a restricted audience.² A new distribution paradigm was clearly necessary.
TV (Television) partially solved the distribution problem, by using radio broadcast of analogically encoded moving images and sound. However, TV also introduced some new problems: being broadcast, it could be enjoyed by anybody with a TV set. Who should then pay for the content conveyed, and how? From TV taxes (virtually unchargeable), to income taxes (in the case of subsidized television), through advertisements and mixtures thereof, several solutions have been proposed, most of which are still being used to this day. These solutions were not enough. Point-to-point communication, such as that provided by the telephone, was necessary.
Computer networks, providing point-to-point communications in a different framework, were also an important development. In the 1970s the TCP (Transmission Control Protocol)/IP (Internet Protocol) protocols were developed and put to use mostly by the government and educational institutions in the USA. By the eighties they had spread all over the world, though mostly restricted to the academic world. In the beginning of the nineties, following the development by CERN (Conseil Européen pour la Recherche Nucléaire) of the suite of WWW (World Wide Web) protocols and formats, viz. UR*,³ HTTP (Hypertext Transfer Protocol), and HTML (Hypertext Markup Language), the Web exploded: it became attractive to the common user, and hence economically viable.
In the late forties, TV started to be distributed by cable in areas where the broadcast signal could not be received with normal antennas (community antenna television). Cable television was soon found to offer considerable advantages relative to broadcast television: increased quality, increased number of channels through a larger available bandwidth, no need for antennas and thus lower visual impact (important in certain urban areas), etc. Recently, CATV (Cable Television) operators, typically diffusion oriented, realized they had deployed over the years an almost ubiquitous broadband network which could be improved with small investments to provide up-links to the user. Thus, with the help of cable modems, providers started building a sort of "residential area networks", connecting users in the neighborhood to the cable head-end and thence to the world.
² Images and sounds in film are mostly encoded in an analog format, even though digital sound is expanding quickly. These images and sounds contain information, which can be measured in bits, even if digital encoding is not used.
³
The explosion of the Web in the nineties, together with the personal computer and the almost ubiquitous wide band CATV networks, suddenly allowed different content to be delivered to different consumers. Consumers could now choose and even interact with the material delivered (and pay accordingly): the age of the Web, teleshopping, PPV (Pay-Per-View), VOD (Video-On-Demand) and WebTV was born.
2.1.2 A tivity paradigms
There are essentially two activity paradigms for information provided to the consumers: the push paradigm, in which the information provider pushes the information to a passive user, and the pull paradigm, in which the active user requests information from the service provider.
TV broadcast is push, since the information is pushed to the consumer without requiring any action on his part (besides turning the TV on and choosing a channel). However, VOD is pull, since the user requests whatever interests her.
The Web, until recently, exhibited only the pull behavior. All the action was on the part of the end user, who would always make specific requests as to what information should be delivered to him. Nowadays, the push paradigm has been implemented by most browsers, through the concept of automatically updated channels, in a clear parallel with TV diffusion.
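The distinction between the two paradigms can be illustrated with a minimal sketch. The `Provider` class and its `request`, `subscribe`, and `publish` methods are hypothetical names introduced only for this illustration; they do not correspond to any real broadcast or Web protocol.

```python
from typing import Callable, Dict, List


class Provider:
    """Hypothetical information provider holding named items of content."""

    def __init__(self, catalog: Dict[str, str]) -> None:
        self.catalog = dict(catalog)
        self.subscribers: List[Callable[[str, str], None]] = []

    def request(self, name: str) -> str:
        # Pull: the active user asks for one specific item (as in VOD).
        return self.catalog[name]

    def subscribe(self, deliver: Callable[[str, str], None]) -> None:
        # Push: a passive user registers once (like tuning to a channel).
        self.subscribers.append(deliver)

    def publish(self, name: str, content: str) -> None:
        # Push: the provider delivers new content on its own initiative.
        self.catalog[name] = content
        for deliver in self.subscribers:
            deliver(name, content)


if __name__ == "__main__":
    provider = Provider({"film": "feature film"})
    inbox = []
    provider.subscribe(lambda name, content: inbox.append((name, content)))
    provider.publish("news", "evening news")  # pushed to every subscriber
    print(inbox)
    print(provider.request("film"))           # pulled on explicit request
```

In this sketch the only difference between the paradigms is which side initiates the transfer, which is the point made in the text.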
2.1.3 Convergence tendencies
Convergence of distribution methods and technologies
A wealth of communication services exist today. Most of the channels involved in these services are slowly being enhanced to provide bidirectional communications and improved bandwidth. For instance, CATV networks now provide bidirectional data channels through cable modems, satellite constellations are being deployed for personal mobile communications, and the UMTS (Universal Mobile Telecommunication Service), providing a wider bandwidth than today's cellular phones, is expected in the near future. Also, the analog channels offered by the old PSTN (Public Switched Telephone Network) are slowly being digitized to provide ISDN (Integrated Services Digital Network). Recently, ADSL (Asymmetric Digital Subscriber Line) started to be used to establish wide band data channels on the telephonic copper loop.
At the same time, all fields of multimedia and communications are being enhanced through the use of digital technology. Digital recorded sound is already used in the movie theaters (probably to be followed soon by digital moving images) and digital TV will soon be available,
Convergence of services and activity paradigms
The services available are also converging. There is a tendency to support both broadcast and point-to-point distribution, different media, and both push and pull paradigms. Cable modems allow point-to-point communications where formerly broadcast was the rule, and modems over POTS (Plain Old Telephone Service) allow broadcast (or at least the Web equivalent of broadcast, multicast) where formerly only point-to-point communications were used. Videotelephony over POTS is now possible (and will soon be possible on cellular phones as well), and the TV service was long ago upgraded to include teletext. Videotelephony, on the other hand, is also possible on the Web, and supplements the old peer-to-peer communications services of the Internet such as email and (electronic) talk, and, more recently, IRC (Internet Relay Chat).
Convergence of contents
On the demand side, consumers require high quality content. The production of multimedia content, in which the entertainment industries (TV, cinema and games) excel, is thus thriving. Consumers are also demanding more and more interactive control over the information they receive, an issue which is a specialty of the informatics (software/computer) industries. Hence the tendency for mergers and acquisitions between companies in the entertainment, informatics and network businesses.
Consumers also require mobility and compatibility. Thus large informatics companies are also investing in global, satellite based, mobile networks, and more and more care is taken nowadays with standardization and compatibility by content providers, TV companies, and informatics companies.
2.1.4 A distributed database
It seems reasonable to expect that the convergence process will lead to universal access to information. There will probably be little difference between TV, phone, fax, and the PC (Personal Computer). In fact, the PC is already doubling as a TV, a phone, and a fax. The Web will connect almost everything and everyone. It is expected to provide people with a complete leisure, work, and social environment, accessed through a wealth of different interfaces, such as screens together with remote controls, desktop screens, keyboards, pointing devices, microphones and speakers, voice controlled hand-held devices with handwritten character recognition (e.g., evolutions of the PalmPilot™), data-gloves, etc. Such a network of information can be seen a
2.2 Media representation
Take a CD (Compact Disk) of an orchestra playing Mozart's symphony 41, Jupiter. What is the essential part: the score or the sound (a unique interpretation)? Although 600 megabytes are used in a typical CD to store the sound, the corresponding score may be stored in much less space. CD audio does not encode the structure: it encodes, as faithfully as possible, a copy of the original sound. The same thing happens with fax: I still find it frustrating to explain to users of fax modems why it is they cannot import the received faxes (mostly text) directly into their word processors, without using the (still) error prone OCR (Optical Character Recognition) software.
Consider, however, that faxes were made intelligent: they would analyze the input page, detect
text zones, recognize the text, and encode it as text, instead of black and white raster images
of characters.^4 This would clearly lead to improved usability, if not also to a reduction of
transmission time.
Visual data, especially video (taken here as sequences of images sampled from the natural world
scene), is a very important part of today's multimedia, and its importance tends to increase with
the convergence of entertainment and informatics industries. However, video is still encoded
with the same "blindness" that affects fax and CD sound: the structured contents of video
scenes are simply ignored in the encoding process, leading to a representation which is not at
all structural [142], faithful as it may be to the original.
Visual analyzers would do the same for video as the hypothetical fax analyzer for a black and
white image: from a sequence of video images, they would extract a structural representation of
the scene therein, the scene's "score" plus "interpretation nuances". Such a structural
representation, aside from the expected economies in encoded size, would allow the user to manipulate
the scene at will: a big step towards complete interactivity.
The exponential growth of digital technology, where clock frequencies double almost every
year and memory densities (bits per volume) almost triple in the same period of time, has
led to an ever increasing use of computers by content providers (such as film producers and
TV companies). Synthetic imagery corresponds nowadays to an important part of the bits
exchanged worldwide. However, not much effort has been put until now into the efficient (soon to
be defined) representation of synthetic data, which is inherently structural.
Hence, two important problems must be solved urgently: how to obtain structural
representations from natural data (the score and the interpretation nuances from a symphony recording,
the text from a printed document, the description of the scene seen in a video sequence) and
how to efficiently encode structural representations, either synthetic or obtained from natural
data.
The first of these problems is analysis. In the case of visual data, analysis is addressed by
computer vision which, according to [68, Haralick and Shapiro], is "the science that develops
the theoretical and algorithmic basis by which useful information about the world can be
automatically extracted and analyzed from an observed image, image set, or image sequence from
computations made by special-purpose or general-purpose computers."
The second problem is related to the encoding of the structural description of the data. In
the case of visual data, several encoding methods have been devised in the past, ranging from
the analog television standards such as NTSC (National Television Systems Committee) and
PAL (Phase Alternating Line), to the digital video coding standards ITU-T (ITU
Telecommunication Standardization Sector) H.261 [62], ISO (International Organization for
Standardization)/IEC (International Electrotechnical Commission) MPEG-1 [136], and, more recently,
ITU-T H.263 [63] and ISO/IEC MPEG-2 [137] (also ITU-T H.262). These standards have
typically dealt with non-structural representations of imagery. The first standard to address
structured moving image representations will be ISO/IEC MPEG-4.
2.3 Visual analysis
Even though synthetic data amounts to a relevant part of the available multimedia material,
natural data will always be present. Natural data corresponds to data which is obtained, usually
through sampling, from the real world. While it is reasonable to expect that sensors, such as
video cameras, will increase in complexity over the years, for instance by incorporating distance
or depth sensors, it is unlikely that they will ever provide a structural representation of the
sampled data at their output.
Hence, analysis, that is, the decomposition of the input data into a meaningful set of some
model parameters, is a very important task. Automatic visual analysis, as stated before, is
almost the same as computer vision: "building a description of the shapes and positions of
things from images" [107]. With one difference, however. The purpose of computer vision is
ultimately the comprehension of the scene captured by the camera, through an emulation of
the HVS (Human Visual System), while analysis usually has more modest objectives.
Analysis, as stated, is the identification of some model parameters. This makes modeling one of
the most important tasks in research leading to automatic analysis of video sequences, since it
seems clear that sophisticated models can lead to a very accurate representation of the world,
but only at the cost of a very sophisticated, or even impossible, analysis: visual analysis is often
an ill-posed problem [187, 9].
Visual analysis can have several purposes [29]:
Analysis for coding
The obtaining of a parametric description of the observed scene. The description can later
be used to reconstruct the scene so that little or no information is lost. The description can
also be encoded and decoded efficiently (analysis for bandwidth saving), and can also be
manipulated (analysis for easy access), so that the user can interact with the represented
world. The analogy with fax helps here. With "blind" fax, such as exists today, to edit
Analysis for description or indexing
The obtaining of a parametric description, though in this case it is not necessary to be
able to reconstruct the observed scene or at least the original sampled (or sensed) data.
The parameters of the description have mostly a semantic meaning, which may help the
task of searching visual data in a database. The model parameters, or features, estimated
or identified, will be used as keys of the database.
Analysis for understanding
The process leading to understanding of the observed scene. While visual analysis tools
in general are tools leading to artificial intelligence, or so one expects, analysis for
understanding is artificial intelligence proper.
Analysis can be manual, automatic, or partially automatic, when an automatic algorithm is
guided by user input (hints). A usual path in the research in this area, which, though it
progresses very quickly, still has a long way to go, is to allow the algorithms to be supervised
and then attempt to make them automatic. This is a polemic issue, however, as can be seen
in the article "Ignorance, myopia, and naiveté in computer vision systems" [81] and in the
subsequent dialogue in [7] and [94].
2.3.1 Levels of visual analysis
Some authors divide the vision process into levels [107, 195] which are related to the types of
models or primitives assumed:
Low-level^5 vision
The model is a sequence of pixel matrices. The correlation between pixels is assumed
to be high. Evolution from one image to the next is described by a simple motion field,
uniform almost everywhere.
Mid-level^6 vision
The model is a possibly hierarchical set of edge segments, blobs, uniformly textured regions
(or equivalently boundaries) or regions of uniform motion. Surfaces and their relative
position may also be used. Motion can be associated with segments and/or edges or
boundaries.
High-level^7 vision
The model is a set of 3-D objects arranged hierarchically. Objects are semantically
identified. Each object has an associated complex motion.
Understanding
The role, class or identity of (almost all of) the objects is known.
^5 Or image.
^6 Or primal plus 2½-D sketches.
This division is here a mere matter of convenience. It is also somewhat arbitrary, since feedback
mechanisms seem to exist between the upper and the lower levels of the vision process. Visual
analysis will be classified in the following according to the first three levels, since understanding
is not one of the purposes here. The terms low-level, mid-level and high-level analysis will be
used throughout this thesis.
2.3.2 Tools for visual analysis
Analysis can be seen as being done at three levels: low-, mid-, and high-level. Different image
analysis tools have been developed over the years which can be classified as belonging to each
of these levels. Restricting attention to those tools more closely related to analysis for coding,
the following (rather incomplete) classification can be used:
Low-level vision analysis
Linear transformations (transforms), frequency analysis, motion estimation (optical flow,
block matching), etc.
Mid-level vision analysis
Edge detection, contour detection, segmentation into syntactically uniform regions, motion
estimation (motion of edges and regions), etc.
High-level vision analysis
3D (Three-dimensional) structure from shading and motion, 3D structure from disparity
(stereo vision), etc.
2.4 Visual coding
Coding^8 is the process of translating a sequence of symbols belonging to a given alphabet, the
message,^9 into a sequence of symbols of a different alphabet (usually the binary alphabet).
Coding is said to be lossless if the original message can be recovered exactly from the encoded
one.
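The lossless property can be checked mechanically. The sketch below is only an illustration, not a tool from this thesis: it assumes a byte-oriented message and uses Python's standard zlib module as a stand-in for a generic lossless source coder; the names encode, decode, and size_ratio are hypothetical.

```python
import zlib


def encode(message: bytes) -> bytes:
    """Translate the message into a binary sequence (zlib is an assumed stand-in coder)."""
    return zlib.compress(message, 9)


def decode(encoded: bytes) -> bytes:
    """Recover the message from its encoded form."""
    return zlib.decompress(encoded)


def size_ratio(message: bytes, encoded: bytes) -> float:
    """Size of the original divided by size of the encoded message, both in bits."""
    return (8 * len(message)) / (8 * len(encoded))


# A highly repetitive toy message: 512 bytes built from a two-symbol pattern.
message = b"ABABABAB" * 64
encoded = encode(message)

# Lossless coding: decoding returns exactly the original message.
is_lossless = decode(encoded) == message
ratio = size_ratio(message, encoded)
```

For a repetitive message such as this one the encoded form is shorter than the original; for messages without exploitable redundancy a lossless coder may produce no gain at all.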
Visual coding is the process by which the parameters of the structural representation of a visual
scene obtained either by analysis or directly, in the case of synthetic imagery, are encoded.
When the representation is obtained by analysis of natural data, the term video coding is often
used.
^8 Coding should always be understood as referring to source coding throughout this thesis, as opposed to channel coding.
^9 Note the different meanings of the word "message". In a communications framework, it is the set of ideas expressed using a given medium or ensemble of media. In the context of information theory, it is a sequence of
2.4.1 Objectives
Encoding, the translation between one alphabet and another, can have several objectives. It
can be seen as the process of minimizing a cost functional given some constraints. There are
several measures which can be used to express both the cost functional and the constraints, and
which, weighted differently, reflect the objectives of each particular coding scheme:
Compression ratio (or, inversely, bitrate)
The size of the original message divided by the size of the encoded message, both expressed
in bits. By maximizing compression, the bandwidth or space requirements are reduced,
according to whether the data is transmitted or stored.
Quality (or, inversely, distortion)
A measure of the difference between the original message and the one obtained by decoding
the encoded message. Error resilience is accounted for in this measure by allowing errors
to affect the encoded data.
Cost
The cost of the encoder and decoder (weighted appropriately).
Content access effort
A measure of the easiness with which only specified parts of the original message can
be recovered from the encoded message. By maximizing ease of access, simple terminals
can still allow the user to manipulate the scene. Video trick modes can also be seen as
requiring easy access to contents (in this case to single video images).
Delay
The interval between the instant a symbol of the original message is input to the encoder
and the corresponding symbol is output from the decoder, assuming no channel delay.
Quality is perhaps the most difficult measure to make, in the case of visual coding. How can
an objective measure of quality reflect the quality of the reconstructed scene as perceived by
humans? Even though studies have been conducted over the years to develop such a measure,
based on the properties of the HVS, no single universally accepted measure exists. Two measures
of quality are typically used today in the case of video coding: a simple objective measure, called
PSNR (Peak Signal to Noise Ratio), and subjective quality measures based on evaluation by a
significant set of persons.
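As an illustration of the objective measure, the following sketch computes PSNR under its common definition, 10 log10(peak^2 / MSE), where MSE is the mean squared error between original and decoded samples. The text does not give the formula, so the definition, the 8-bit peak value of 255, and the function name psnr are assumptions made for this example.

```python
import math


def psnr(original, decoded, peak=255):
    """Peak signal-to-noise ratio, in dB, between two equally sized sample sequences.

    Assumes the common definition 10*log10(peak**2 / MSE); `peak` defaults to
    the maximum value of 8-bit samples.
    """
    if len(original) != len(decoded) or len(original) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf  # identical signals: no distortion at all
    return 10 * math.log10(peak ** 2 / mse)


# Toy example: every sample is off by one grey level, so MSE = 1.
value = psnr([100, 100, 100, 100], [101, 99, 101, 99])
```

Being a purely sample-wise measure, PSNR ignores the properties of the HVS, which is one reason why subjective evaluation is still used alongside it.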
Cost is related mostly to implementation of encoders and decoders, though it can be related
also to the required bandwidth, which is dependent on the compression ratio, and thus already
considered through that measure. Implementation costs can be related to the memory and
CPU (Central Processing Unit) power required for both encoders and decoders.
The cost functional and constraints can be constructed from the measures above so as to reflect
the different requirements of an application. Some applications may require quality as high
applications where content is encoded once and decoded many times put a larger weight on the
cost of decoders.
2.4.2 Main codec blocks
Figure 2.1 shows a typical block structure of a codec. The encoder part consists of an analysis
block, which obtains a structural scene representation from given natural data, followed by the
encoder, which encodes this representation so as to be sent down a logical channel (either a real
channel or some physical storage medium). If synthetic data is available, it is input directly
to the encoder without being analyzed, provided it is already described in an appropriately
structured way. The decoder performs the opposite tasks. The encoded data is decoded so
as to obtain the structural scene representation which is then used by the renderer to create
appropriate stimuli to the human receivers, which can have different levels of interactivity with
the system.
Often some processing is performed on the natural data before the analysis proper. This
processing usually intends to filter or condition the data so as to render the analysis simpler or
more effective. Since it takes place before analysis and encoding, it is called pre-processing. It
is often taken as being part of the analysis itself.
The word encoder is used here with two different meanings: in the case of natural data, which
requires analysis, encoder can both mean the complete system, from natural data representation
to the resulting encoded message, or simply the block which translates the structural
representation into the encoded message, which is the strict meaning. In the sequel the exact meaning
will be evident from the context.
An encoder, in the broad sense of the word, serves two main purposes. Firstly, it is supposed to
strip irrelevant information (from the point of view of the assumed receiver of the information,
usually the HVS) from the input. Irrelevancy removal is done by the analysis block, since,
according to Marr [107], "vision is a process that produces from images of the external world
a description that is useful to the viewer and not cluttered with irrelevant information [our
emphasis]," and to emulate vision is the ultimate purpose of analysis. Secondly, the encoder,
again in the broad sense, is supposed to remove redundancy. This is a role which is shared
by the analysis and the encoder blocks, though the kind of redundancy removed is different.
The analysis block removes representation redundancy by fitting the input data to a structural
model. For instance, the highly redundant image of a sphere can be described, with an
appropriate model, by the position and size of the sphere, its surface characteristics, and a set
of light sources. Such a description is much less redundant than the original array of pixels.
The encoder block, on the other hand, removes statistical redundancy from the sequence of
symbols corresponding to the structural representation. It must be stressed here that removal
[Figure 2.1: Basic block structure of a codec. (a) Encoder: natural unstructured data goes through pre-processing and analysis, under user supervision, and then encoding; synthetic structured data is input directly to encoding; the output goes to a channel or storage device. (b) Decoder: data from a channel or storage device goes through decoding and rendering to the display devices; user interaction goes to an uplink channel.]
2.4.3 Generations
In the case of natural scenes, i.e., video coding, analysis is performed before encoding proper, as
can be seen in Figure 2.1. Video encoding techniques can thus be classified according to the level
of analysis typically required. The terms first- and second-generation video coding were coined
by Kunt et al. [96], and correspond approximately to the two first levels of analysis presented
before. The requirements in terms of analysis of these two generations of video coders, plus a
third one related with high-level analysis are as follows:
First-generation
Coders which require low-level analysis. Hybrid coders [64] and motion compensated
hybrid coders [145] belong to this generation. The fundamental tools used in these coders,
belong to this generation.
Second-generation
Coders which require mid-level analysis. This type of analysis is typically more complex
than low-level analysis. Even though a lot of effort has been put into this field, a truly
reliable set of mid-level analysis tools is not yet mature. This thesis contributes mainly
to the problem of developing tools at this level.
Third-generation
Coders which require high-level analysis. No truly reliable automatic analysis set of tools
exists at this level. Most tools still rely on human supervision, and it probably will remain
so for a few more years: most of the semantic features/descriptors can only be extracted
by humans at the present time [29].
This classification, though useful, is somewhat artificial. For instance, a mid- or high-level
analysis tool can be used to enhance a first-generation video encoder. This often happens when
video encoding algorithms are being enhanced.
2.5 Standards
Standards are fundamental for universality of service and interworking, both of which are of
paramount importance for the end consumer. Standardization, however, is a time-critical
process: if done too soon, it may not benefit from the ongoing research in the area, if done too late,
it may have to face proprietary solutions proposed by industries of sufficient weight to make
the standard useless.
Standards may be of two very different natures. English is a de facto language standard in most
of the western world. French, on the other hand, is a de jure standard, at least in France: it is
standardized by the Académie Française and imposed by the French state in official documents.
The case with technologies is similar.
Standards, whether de facto or de jure, can be created in different ways. Some are developed
by an open group of companies, universities and individuals which work towards the standard
under some national, e.g., ANSI (American National Standards Institute), or international, e.g.,
ISO, standardization body. Others are developed by similar groups, though working on the
framework of non-official organizations such as the W3C (World Wide Web Consortium) or the
IETF (Internet Engineering Task Force). Others still are developed by single institutions and
their specification made public and accepted as de facto standards by the rest of the members
of the market. Often de facto standards are later accepted as de jure standards by official
standardization bodies.
In the world of multimedia, examples can be found in each of these cases. The video coding
standards MPEG-1 and MPEG-2, and H.261 and H.263, were developed under international
standard organizations, viz. ISO and ITU (International Telecommunication Union), and thus
are de jure standards. The Java™
been proposed to ISO to become a de jure standard). The Web standards, such as HTTP and
HTML, are being developed in the framework of the IETF and W3C non-official organizations.
Convergence can also be found in the world of standards: the MPEG (Moving Picture Experts
Group) community, traditionally video-oriented, and the WWW community, more multimedia
oriented, are converging. The MPEG community is finalizing the first version of MPEG-4.
MPEG-4 version 1 will be much more than video and audio coding with a multiplexing layer,
as MPEG-1 and MPEG-2 were: MPEG-4 will standardize audio-visual 3D scene description
methods, by inclusion of the ISO/IEC 14772 VRML (Virtual Reality Modeling Language)
standard. The WWW community, on the other hand, is issuing documents, which will probably
become de facto standards, that address similar subjects: PNG (Portable Network Graphics)
for encoding of still images, support of VRML for 3D virtual worlds (which includes video
nodes), and SMIL (Synchronized Multimedia Integration Language) for synchronizing different
multimedia objects in a single presentation. More than a convergence, what is being witnessed
is an overlap, a competition. The future will tell whether the minimalist, text-based, W3C
and IETF standards or the overwhelming MPEG standards will win. The market does not always
choose the best technology: often timing, as mentioned before, is the critical factor.
2.5.1 Standardization challenges
Nowadays standardization of multimedia communications faces several challenges. Different
technologies (some of them standards), by different organizations, will address distinct subsets
of the challenges listed below:
Content
Interesting contents will soon include complex 3D scenes, containing a mixture of synthetic
and natural dynamic objects, which can be manipulated by the end user. Who will provide
this type of information or content? How? I.e., using what tools?
Bandwidth
Network bandwidth and mass storage capacities both continue to grow exponentially.
Even in the unlikely event that they will continue to increase exponentially forever,
"technological malthusianism" tells us that the bandwidth/capacity will never be enough, since
content will always grow at a faster pace. Hence, there will always be money to be gained,
or spared, through compression of the multimedia data transmitted or stored.
The issue of compression has typically been much more of a concern for the video rather
than the multimedia people. A number of standards, aiming at different applications, have
been developed for the compression of video and still images: H.261 and H.263, MPEG-1,
MPEG-2, MPEG-4 (soon to be born), and ISO/IEC JPEG (Joint Photographic Experts
Group). From the multimedia world, less concerned, unfortunately, with bandwidth waste,
little more than the W3C PNG exists today.
Access
community has only recently started to address in a thorough way, in MPEG-4. There are
good reasons for the late convergence: compression and easy access are quite incompatible,
and for some time bandwidth was more important than interactivity. The balance is likely
to change.
Classification
The Web is a huge, distributed database, whose size tends to increase exponentially. How
can users navigate through this apparent chaos in a useful way? How can multimedia
information such as text, 2D (Two-dimensional) pictures, 2D drawings, 2D videos, sound
clips, movies, TV programs, 3D objects, and mixtures thereof, be indexed, searched, and
retrieved in a meaningful way? Will the indexing, or classification, be done automatically?
This is an issue which is being simultaneously addressed by W3C and MPEG, through
the recently born MPEG-7 effort. W3C is working on Metadata, or information about
information, while MPEG-7 aims at standardizing multimedia indexing methods.
Rights protection
Providers of interesting content, individual authors or companies, will be interested in
getting paid. How can IPR (Intellectual Property Rights) be protected on the Web?
What will the network economics be like? How will IPR information be included on
multimedia objects?
Access control and rating
Should all information on the Web be available to all? Who should control? How to
control? How to rate information? How to cipher sensitive information?
W3C has addressed this question through a type of Metadata called PICS (Platform for
Internet Content Selection), which aims at standardizing the method of including rating
information (labels) into Web content.
Trust
Is the information available on the Web trustworthy? How to ascertain its real origin?
How can information be certified? How can one assure that a signature certifies a given
piece of information and that this information has not changed in any way?
W3C is also working on DSig (Digital Signatures), and there are some CEC (European
Community Commission) funded projects working on watermarking of visual information.
Interworking
How to avoid needless duplication of hardware/software needed to access information of
the same type stored in different formats? This is the basic objective of standardization
efforts.
Evolution
How to produce standards that encourage, rather than prevent, competition and technical
evolution?
MPEG-4 had the provision for evolution as one of its objectives. However, due to timing
problems, MPEG-4 was divided into two phases. Phase 1, which is scheduled for the
2.5.2 Evolution of visual coding standards
Section 2.4.1 presented the various measures which can be used to construct the cost functional
that video coders minimize (or at least attempt to minimize). Most of them have been used in
one form or another by encoders compliant with the available video coding standards. However,
easiness of access to content was first considered only in MPEG-1 and MPEG-2, in the form
of provision for quick access to anchor images. These images, known as I images (I of Intra),
are independently encoded and spread evenly in time, thus allowing the so-called trick modes
of video recorders: fast-forward, backtrack, etc. This allowed only for a rather terse access to
content. It was only MPEG-4 which started to consider a more useful form of content, objects,
and which provided means for expressing complex 3D audio-visual scenes with mixtures of
2D and 3D objects, natural or synthetic. The real revolution was from MPEG-2 to
MPEG-4. MPEG-2 was essentially a revamped version of MPEG-1, using the same basic tools, but
allowing for increased resolution [95]: HDTV (High Definition Television) required it. True
breakthroughs in the video coding area have been quite rare. Most of the tools used by encoders
compliant to MPEG standards, even MPEG-4, are small variants, however well-engineered,
of tools developed decades ago [149], e.g., DCT and block matching motion compensation.
However, the integral of all the incremental hardware and software technology advances over
the last decades corresponds to an impressive evolution.
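The role of I images as access points can be illustrated with a small sketch. It is not taken from any standard: the image-type string, the use of "P" for predicted (non-anchor) images, and the function names are assumptions made for the example, and a real decoder works on the bitstream rather than on a list of labels.

```python
from bisect import bisect_right


def anchor_indices(image_types):
    """Indices of the independently encoded (I) images in a sequence."""
    return [i for i, kind in enumerate(image_types) if kind == "I"]


def random_access_point(anchors, target):
    """Latest anchor at or before `target`: decoding must start there."""
    k = bisect_right(anchors, target)
    if k == 0:
        raise ValueError("no anchor image at or before the target")
    return anchors[k - 1]


# Hypothetical sequence with an I image every four images.
sequence = list("IPPPIPPPIPPP")
anchors = anchor_indices(sequence)

# Fast-forward may display only the anchors; seeking to image 6 starts at image 4.
fast_forward = anchors
start = random_access_point(anchors, 6)
```

The granularity of access is thus the spacing of the I images, which is why this form of access to content is rather limited.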
2.5.3 Consequences of standardization
Standards don't specify encoders: they specify a bitstream syntax and a decoder. Hence, they
implicitly define a model for the structural data to be encoded. In this sense, video coding
standards can also be classified as belonging to one of the three generations presented before.
In a slightly more formal way, let $B$ be the space of bitstreams compliant with a given standard.
Let $E$ be the space of the encoders compliant with the same standard. Then, a given encoder
$e(\cdot)$, in the broad sense, is a function from the space $R$, of scene representation, to $B$, i.e.,
$e(\cdot)\colon R \to B$. Space $E$ is thus clearly limited by the nature of $B$. Standards specify decoders,
that is, they specify a function $d(\cdot)$ from $B$ back to $R$. Typically, space $E$, though restricted
by the nature of $B$, is very large. Even if one restricts it to the space of compliant encoders
providing appropriate reconstruction, that is, such that $d(e(\cdot))$ is approximately the identity,
the space is too large.
One can pose the encoding problem mathematically, though the complexity of the solution
usually leads to heuristic solutions: how to encode a given scene representation $r$? This question
can be answered by finding $\arg\min_{b \in B} z(d(b), r)$, where $z$ is a distortion measure. However,
this includes only a distortion, or quality, measure. One may be interested in minimizing other
measures. The generic problem is to find a generic encoder, i.e., an encoder leading to good
decoding. A possibility is to find $\arg\min_{e \in E} \max_{r \in R} z(d(e(r)), r)$.
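When the space of bitstreams is tiny, the first minimization can be carried out by exhaustive search. The sketch below is a toy built on assumptions that are not in the text: bitstreams are all 3-bit strings, scene representations are real numbers in [0, 1], the standardized decoder d is a uniform reconstruction rule, and z is the squared error. It also evaluates the inner maximum of the second expression for two candidate encoders over a sampled R.

```python
from itertools import product

BITS = 3
# B: the (toy) space of compliant bitstreams, here all 3-bit strings.
B = ["".join(bits) for bits in product("01", repeat=BITS)]


def d(b: str) -> float:
    """Toy standardized decoder: bitstream -> reconstruction level in [0, 1]."""
    return (int(b, 2) + 0.5) / 2 ** BITS


def z(r_hat: float, r: float) -> float:
    """Distortion measure (squared error, an assumed choice)."""
    return (r_hat - r) ** 2


def e(r: float) -> str:
    """Encoder found by exhaustive search: arg min over b in B of z(d(b), r)."""
    return min(B, key=lambda b: z(d(b), r))


def worst_case(encoder, R) -> float:
    """Inner maximum of the minimax criterion: max over r in R of z(d(e(r)), r)."""
    return max(z(d(encoder(r)), r) for r in R)


# A sampled set of scene representations and two compliant encoders to compare.
R = [i / 100 for i in range(101)]
best = worst_case(e, R)
naive = worst_case(lambda r: "000", R)  # always emits the same bitstream
```

Both encoders produce compliant bitstreams, yet their worst-case distortions differ widely: the standard fixes d and B, not the quality of the encoder. Exhaustive search is only feasible here because B has eight elements.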
Whatever the approach taken, heuristic or optimizing, it is clear that standards introduce
restrictions into the space of possible encoders. It is also clear that they also leave a lot of room
data. The design of decoders with good error concealment strategies and the design of encoders
providing for good error resilience at the decoder is open to competition.
2.5.4 Standards and generations
Standards can be classified as first-, second- or third-generation, according to the character
of the compliant encoders. However, nothing prevents the building of a second generation
encoder (i.e., an encoder using mid-level analysis) which generates bitstreams compliant with
first-generation standards. For instance, MPEG-1 and MPEG-2 belong clearly to the first
generation, while MPEG-4, which requires more sophisticated analysis tools but still uses a
classical approach to encode the texture of the objects, can be said to be a step towards
second-generation standards. Actually, this has been the typical road of evolution, as some of the work
in this thesis demonstrates. When tools aimed at being used in one of these transition encoders
are developed, one may classify them as belonging to transitions between generations.
2.6 Analysis and coding tools
Figure 2.2 shows the analysis, pre-processing and coding tools proposed or discussed in this
thesis. The figure classifies these tools into the three generations, with two transition layers
added. The tools are also listed below, together with pointers to the sections where they are
described:
Analysis tools:
- Transition to second-generation:
1. Knowledge-based segmentation [123, 125, 124] (Section 4.4).
2. Camera movement estimation [129, 127, 130, 128, 113, 122] (Section 5.3).
- Second-generation:
1. RSST segmentation [32, 33] (Section 4.5).
2. TR-RSST (Time-Recursive RSST) segmentation [119] (Section 4.7).
- Transition to third-generation:
1. RSST with human supervision [33] (Section 4.6).
Pre-processing tools:
- Transition to second-generation:
1. Image stabilization [127, 130, 128, 113, 122] (Section 5.5).
Coding tools:
- Transition to second-generation:
- Second-generation:
1. Shape coding: a taxonomy and an overview of coding techniques [120, 121] (Sections 6.2 and 6.3), parametric curve coding tools [116, 114, 79, 78] (Section 6.4).
[Figure 2.2 (diagram). Rows: analysis, coding, pre-processing. Columns: first-generation, transition to second-generation, second-generation, transition to third-generation. Tools shown: camera movement compensation, knowledge-based segmentation, image stabilization, fast closed cubic splines, RSST segmentation, taxonomy of partition types and representations, codec architecture, camera movement estimation, TR-RSST segmentation, supervised RSST segmentation.]
Graph theoretic foundations for
image analysis
We should never try to be more precise
and exact than the problem at hand requires.
Karl Popper
This chapter defines the main concepts used throughout this thesis. It is divided into sections
dealing with images, image lattices, image graphs, etc. Concepts are introduced, whenever
possible, in a bottom-up manner: concepts are defined by using previously defined concepts.
Often the efficiency of algorithms known to solve problems related to the definitions given here
is discussed: the usual $O(\cdot)$ notation of algorithmics is used [28].
3.1 Color perception
There are two types of light sensor cells in the retina: rods and cones. Rods are used for night
(scotopic) vision, while cones are used for daylight (photopic) vision. Both are known to be
used in twilight (mesopic) vision.
Rods greatly outnumber cones. However, the distribution of the rod cells is such that its
density is nearly zero in the fovea, that is, the zone on the retina corresponding to the center
of attention. In this zone cones are densely packed.^1 Rods are much more sensitive to light
than cones: a single quantum is known to be sufficient to excite a rod. The different density
distribution of rods and cones seems to be an evolutionary compromise between accuracy of
vision (fundamental during daytime) and ability to detect threats (fundamental during dusk).
^1 A simple experiment confirms the absence of rods in the fovea. Look directly at a dim star and then look slightly to its side: its apparent lightness will increase.
While rods are sensitive to a wide range of light frequencies, they all have the same type of
response, hence scotopic vision is essentially "black and white": colors are not discriminated.
Cones, on the other hand, are really three different types of cells with different frequency
responses. One type of cones, say "red" cones, is especially sensitive to frequencies around pure
red, another, "green" cones, to frequencies around pure green, and the last, "blue" cones, to
frequencies around pure blue, where "pure" means consisting of a single frequency. The overall
response of the cones spans the visible light spectrum. However, the maximum sensitivity of
the combined action of cones occurs at a slightly higher wavelength (towards red) than that of
rods (towards blue): it is the so-called Purkinje wavelength shift. This seems to be related to
the fact that during twilight light is more bluish than during daytime, since it is mostly indirect
light diffracted by the atmosphere particles.
In the framework of image communications and multimedia, photopic (daytime) vision is the
rule, so that the response of rods can be mostly ignored. The response of cones can be modeled
as a nonlinear function of the inner product of a spectral sensitivity function, which is a
characteristic of the given type of sensor cell in a Standard Observer, and the power spectrum of
the light attaining the sensors (see for instance [189]). "Red", "green", and "blue" cones have
different spectral sensitivity functions which partially overlap in frequency.
Further information on color perception may be found in [164, 1, 27].
3.1.1 Color spaces
Color reproduction uses the fact that the HVS has only three types of cones. In order for two
light sources to be perceived as having equal color it is not necessary for their power spectra
to be equal: they only have to produce the same response for each of the three types of cones.
Hence, most image data is available in a three component format.
Color data is often presented in a CRT (Cathode-Ray Tube). Since the power emitted by such
screens is typically proportional to a (arithmetic) power of the input voltage (the exponent being
the so-called gamma value), cameras are usually designed to perform gamma correction. The
correction specified by ITU-R (ITU Radiocommunication Sector)^2 Recommendation BT.709-2 [80] follows
\[
I' = \begin{cases}
4.5\,I & \text{if } 0 \le I \le 0.018\text{, and} \\
1.099\,I^{0.45} - 0.099 & \text{if } 0.018 < I \le 1,
\end{cases}
\tag{3.1}
\]
which is the inverse of the ideal monitor power function
\[
I = \begin{cases}
\dfrac{I'}{4.5} & \text{if } 0 \le I' \le 0.081\text{, and} \\
\left(\dfrac{I' + 0.099}{1.099}\right)^{\frac{1}{0.45}} & \text{if } 0.081 < I' \le 1,
\end{cases}
\]
2