Multimodal social scenario perception model for initial human-robot interaction : Modelo de percepção de cenário social multimodal para interação inicial humano-robô


Faculdade de Engenharia Elétrica e de Computação

Diego Cardoso Alves

Multimodal social scenario perception model

for initial human-robot interaction

Modelo de percepção de cenário social

multimodal para interação inicial humano-robô

Campinas

2019


Dissertation presented to the Department of Electrical Engineering and Computer Engineering of Universidade Estadual de Campinas as part of the requirements to obtain the title of Master in Electrical Engineering, in the area of Computer Engineering.


Advisor: Prof.a Dra. Paula Dornhofer Paro Costa

This work corresponds to the final dissertation presented by the student Diego Cardoso Alves and supervised by Prof.a Dra. Paula Dornhofer Paro Costa.


Biblioteca da Área de Engenharia e Arquitetura
Luciana Pietrosanto Milla - CRB 8/8129

Alves, Diego Cardoso,
AL87m Multimodal social scenario perception model for initial human-robot interaction / Diego Cardoso Alves. – Campinas, SP : [s.n.], 2019.

Advisor: Paula Dornhofer Paro Costa.

Dissertation (master's) – Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação.

1. Robotics - Social aspects. 2. Social interaction. 3. Robot vision. I. Costa, Paula Dornhofer Paro, 1978-. II. Universidade Estadual de Campinas. Faculdade de Engenharia Elétrica e de Computação. III. Title.

Information for the Digital Library

Title in another language: Modelo de percepção de cenário social multimodal para interação inicial humano-robô

Keywords in English: Robotics - Social aspects; Social interaction; Robot vision

Area of concentration: Computer Engineering
Degree: Master in Electrical Engineering

Examination committee:
Paula Dornhofer Paro Costa [Advisor]
Esther Luna Colombini
Eric Rohmer

Defense date: 13-08-2019

Graduate program: Electrical Engineering

Author identification and academic information
- Author ORCID: https://orcid.org/0000-0002-2306-7430
- Author Lattes CV: http://lattes.cnpq.br/7531367319867132

EXAMINATION COMMITTEE - MASTER'S DISSERTATION

Candidate: Diego Cardoso Alves RA: 189729

Defense date: August 13, 2019

Dissertation title: Modelo de percepção de cenário social multimodal para interação inicial humano-robô (Multimodal social scenario perception model for initial human-robot interaction).

Prof.ª Dr.ª Paula Dornhofer Paro Costa (Chair, FEEC/Unicamp)

Prof.ª Dr.ª Esther Luna Colombini (IC/Unicamp)

Prof. Dr. Eric Rohmer (FEEC/Unicamp)

The defense minutes, with the respective signatures of the Examination Committee members, are available in SIGA (Sistema de Fluxo de Dissertação/Tese) and at the Graduate Office of the Faculdade de Engenharia Elétrica e de Computação.

indirectly, with my personal growth. For this reason, I would like to dedicate this work to everyone who, at some point in the past years, helped me to go through this process and


My thanks to my parents, for all the support, affection and belief, and for raising me with enough freedom to choose my own path. I am grateful to my family too, for the pride they have always taken in my achievements, always motivating me to go further. To my wife, Leiliane Valeriano de Souza, for the love, attention and support in the delicate moments of this trajectory.

To Paula Dornhofer Paro Costa, my supervisor and friend, for believing in me, trusting in my work, and always bringing out my best.

Among the collaborators, I would like to thank the participants of the recorded scenes, who helped me not only with the data required for the project but also with relevant advice about the workflow.

Finally, my sincere acknowledgments to the people who collaborated with ideas or implementation suggestions.


anyone could see very simply see how any application of AI developed and why” (Ginni Rometty)


Human-robot interaction imposes many challenges, and artificial intelligence researchers are called upon to improve scene perception, social navigation and engagement. Great attention is being dedicated to the development of computer vision and multimodal sensing approaches that are focused on the evolution of social robotic systems and the improvement of social model accuracy. Most recent works related to social robotics address the engagement process with a focus on maintaining a previously established conversation. This work studies initial human-robot interaction contexts, proposing a system that is able to analyze a social scenario through the detection and analysis of persons and surrounding features in a scene. RGB and depth frames, as well as audio data, were used in order to achieve better performance in indoor scene monitoring and human behavior analysis.

Keywords: Social Human-Robot Interaction; Human Detection and Tracking; RGB-D



Figure 1.1 – Robots Market Forecast: 2019-2024
Figure 1.2 – Pepper (left) and Nao (right) social robots
Figure 2.1 – Sequences of human-robot interactions
Figure 2.2 – Proximity zones during interpersonal relations. The concept is also applied to human-robot interactions to model people availability and robot awareness.
Figure 2.3 – User detection and approximation considering the safe margin concept.
Figure 3.1 – Demonstration of the stereoscopic concept.
Figure 3.2 – Structured light system containing one projector, one camera, and an object.
Figure 3.3 – Demonstration of the time-of-flight concept.
Figure 3.4 – Intel® RealSense™ R200 camera components.
Figure 3.5 – Intel® RealSense™ R200: Use of the cpp-config-ui application to analyze camera settings.
Figure 3.6 – Examples of features extracted with OpenFace
Figure 3.7 – RGB Feature extraction workflow.
Figure 3.8 – Representation of the Energy and RMS of an audio sample included in the database.
Figure 3.9 – Representation of the number of zero crossings for a spectrum of 100 audio samples.
Figure 4.1 – The modules of the social robotic system architecture.
Figure 4.2 – World-Camera coordinates transformation.
Figure 4.3 – Vector directions of human attention.
Figure 4.4 – Data aggregation and attribute creation process.
Figure 5.1 – Multi Layer Perceptron (MLP) architecture
Figure 5.2 – Early stopping: General rule demonstration
Figure 5.3 – Extreme Learning Machine (ELM) architecture
Figure 5.4 – Neural Network ensemble architecture
Figure 5.5 – Filter approach for attributes selection.
Figure 5.6 – Wrapper approach for attributes selection.

Table 3.1 – Depth cameras: Features comparison
Table 3.2 – R200 camera: Parameters configuration
Table 3.3 – Facial Action Units features
Table 3.4 – Description of the Action Units features.
Table 4.1 – Feature data types of the dataset.
Table 4.2 – Features with more than 30 percent of missing values.
Table 5.1 – ANOVA correlation p-values of the attributes.
Table 5.2 – Classifiers tuning parameters
Table 5.3 – Classifiers performance

ANOVA Analysis of Variance
CSV Comma Separated Values
DLT Direct Linear Transformation
ELM Extreme Learning Machine
ETL Extract, Transform, Load
FOV Field of View
FPGA Field Programmable Gate Array
FPS Frames per Second
GOMS Goals, Operators, Methods and Selectors
HCI Human-Computer Interaction
HRI Human-Robot Interaction
LDA Linear Discriminant Analysis
LIDAR Laser Illuminated Detection And Ranging
MLP Multi Layer Perceptron
RGB Red, Green, Blue
RGB-D Red, Green, Blue - Depth
RMS Root-Mean-Square
SLAM Simultaneous Localization and Mapping
SSD Solid State Drive
SVM Support Vector Machines
WAV Waveform Audio File Format

1 Introduction
1.1 Motivation
1.2 Research Problem
1.3 Methodology
1.4 Contributions
1.5 Work Organization
2 Related Works
2.1 Human-Robot Interaction
2.2 Social Robotics
2.2.1 Affective Trust
2.2.2 Human-aware interaction systems
2.2.3 Interruption context
2.3 Multimodal robot perception
2.3.1 Audio data
2.3.2 Visual data
2.3.3 Multimodal data
2.4 Concluding Remarks
3 Training database
3.1 Depth Cameras
3.1.1 Stereoscopic Cameras
3.1.2 Structured Light Cameras
3.1.3 Time-of-Flight Cameras
3.1.4 Hybrid Cameras
3.2 Camera Choice
3.3 Data Capturing Module
3.4 Human-Robot Initial Interaction Multimodal Database
3.4.1 Image information
3.4.2 Audio information
3.4.3 Dataset Consolidation
3.5 Concluding Remarks
4 Feature Engineering
4.1 Model Development Methodology
4.2 ETL process
4.2.3 Group-Robot Interaction Intensities
4.2.4 Data Aggregation
4.3 Concluding Remarks
5 Classification Model
5.1 Classification module
5.1.1 Multi Layer Perceptron
5.1.2 Extreme Learning Machine
5.1.3 Model Structure
5.2 Attribute selection
5.3 Performance results
5.4 Concluding Remarks
Conclusion
Bibliography


1 Introduction

The robotics market was valued at USD 31.78 billion in 2018 and is expected to register a compound annual growth rate (CAGR) of 25% over the forecast period of 2019-2024 (MORDOR, 2018) (Figure 1.1). Moreover, according to market intelligence firm Tractica (2016), the number of consumer robots shipped will grow to approximately 66 million units annually by 2025. Such consumer robots can be roughly divided into automated robots and social robots. Automated robots execute tasks in accordance with pre-established action planning, without adapting to changes, while social robots are required to interact with humans to fulfill their purpose (IEEE, 2015).

Source: (MORDOR, 2018)

Figure 1.1 – Robots Market Forecast: 2019-2024

Some of the most common and recent social robots on the market include Pepper and Nao (Figure 1.2), both manufactured by SoftBank Robotics. Nao and Pepper are robots that are able to recognize shapes, people or voices, creating personalized experiences and performing complex motions and tasks. In addition, Pepper is capable of recognizing some basic human emotions and responding appropriately to moods.

In this scenario, there is an urgent call to address the challenges in social robotics that cross disciplinary borders and require researchers from diverse backgrounds to contribute their perspectives, including computer science (artificial intelligence, computer vision and natural language processing) and social science (ethics, psychology, cognitive science and anthropology) (GOODRICH; SCHULTZ, 2007).


Source: (FUTURIST, 2019)

Figure 1.2 – Pepper (left) and Nao (right) social robots

1.1 Motivation

A central question in social robotics is how to promote a comfortable, engaging and long-lasting interaction between humans and intelligent robots, which are capable of performing tasks by sensing the environment, interacting with external sources and adapting their behaviour (STANDARDIZATION, 2012). In particular, recent advances in computer graphics, hardware performance and artificial intelligence are helping social robots to interact with people in a human-centered way.

Numerous social robotics applications focus on the collaboration between humans and robots, their verbal interaction and robot navigation (CAMPA, 2016). In such applications, even common situations faced by humans in their daily lives, such as talking while walking, pose a challenging problem, since they require the robot to navigate the scene, avoid physical obstacles, and understand a person's behavior while participating in a particular conversation. As a complicating factor, humans have an innate tendency to anthropomorphize surrounding entities, especially those that seem to present emotional, sensitive and communicative abilities (HUTSON, 2012). Robots that do not meet human expectations make the interaction extremely frustrating.

Considering this, social robotics motivates research in Human-Robot Interaction (HRI) and Affective Computing. HRI is the research field dedicated to the design, understanding and evaluation of robotic systems during their interaction with humans (GOODRICH; SCHULTZ, 2007). HRI studies people's behavior and actions related to the robot, with respect to their physical characteristics and interactive possibilities (DAUTENHAHN, 2013). On the other hand, Affective Computing is an interdisciplinary field focused on giving machines the ability to interpret the emotional state of humans, to adapt and to react to them accordingly (PICARD et al., 1995). In addition, Affective Computing encompasses topics related to group-level attention and context-aware systems.

The present work explores the intersection between HRI and Affective Computing, focusing on very common but complex social situations experienced by humans and social robots in our daily lives.

1.2 Research Problem

Human groups can be resilient or resistant to starting an interaction, especially with an unknown entity. Understanding a social situation is a key aspect of enhancing the social robot's ability to react according to people's behavior, while also improving the perceived naturalness of its initial actions. Recent state-of-the-art models and algorithms allow the robot to interactively learn from human behaviors and determine which action to take at a given moment. However, the analysis of the initial interaction context is still an under-explored problem in social robotics.

Another challenging problem is the analysis of the social context involving a group of people. In fact, most works in social robotics refer to a controlled scenario composed of only one or two persons, in which the main goal is to detect and monitor one-to-one conversation and emotional behavior (ESPOSITO et al., 2018; YOUSSEF et al., 2019; DEVILLERS et al., 2018).

Our approach consists of providing a scene analysis algorithm that is capable of deriving scene affective labels to be used by the robot to make decisions about the moment and the manner to start an interaction. Our model is based on artificial intelligence and social psychology concepts involving the context of interpersonal relationships.

In this broader context, our work focuses on solving the following research problems:

∙ Which social context labels are more suitable when dealing with an initial interaction approach?

∙ Is it possible to define interaction metrics between humans and robots?

∙ Which classification algorithm is more effective in handling different situations?

∙ How can multimodal data contribute to this application?


1.3 Methodology

Methods in the human-robot interaction field usually rely on RGB features such as person characteristics (gender, age, emotion), facial analysis (gaze tracking, face and eye landmarks) and body analysis (joint detection, activity estimation). Recent applications also use multimodal data, such as depth and audio information, to achieve better results when classifying complex situations.

Many applications make use of deep learning techniques to classify person behavior or scene situation, focusing on low-level features and an abstract representation of the model scheme. Although these methodologies achieve satisfactory results, they hardly generalize to complex situations, and their features cannot be compared to the way that humans detect social engagement opportunities.

The methodology adopted in this work is based on macro features, whose design and mapping are based on the main characteristics observed by a human during initial interactions. In this way, the system can be used in unknown environments with similar accuracy, since it is not restricted by the number of persons or the type of scene.

We present a flexible and robust methodology to detect social scene characteristics, including individual or group relation features, with the goal of predicting the appropriate strategy to be modeled by the robot intelligence in order to perform an assertive initial interaction. The use of a multimodal dataset (RGB-D and audio data) increases the effectiveness of the model when dealing with situations in which computer vision alone is not sufficient due to severe occlusion in the scene or poor-quality images.

The final results show the data modelling techniques adopted, including some transformations to derive representative attributes. These attributes improve the social robot's scene perception based on human group interaction. Moreover, the multimodal dataset obtained as a consequence of this work is described with respect to its structure and recording steps.

Regarding the classification problem, the model combines interaction metrics with distance measurements and pose attributes to classify scenes and help the robot avoid undesired interactions. Spatio-temporal neural networks were used to work with a previously defined time window, achieving considerable results over the dataset.
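As an illustration of the time-window idea mentioned above, the following minimal Python sketch groups per-frame macro features into fixed-length windows and summarizes each window by its mean and standard deviation. The feature names, the window length and the choice of summary statistics are assumptions made for illustration only; they are not taken from this work, and the sketch covers only the windowing step, not the neural network itself.

```python
from statistics import mean, pstdev
from typing import Dict, List

# Hypothetical per-frame macro features, e.g. {"distance_m": 1.8, "interaction_level": 0.4}.
Frame = Dict[str, float]


def aggregate_windows(frames: List[Frame], window: int, step: int) -> List[Dict[str, float]]:
    """Summarize consecutive frames into one attribute vector per time window."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    vectors = []
    # Only complete windows are kept; trailing frames that do not fill a window are dropped.
    for start in range(0, len(frames) - window + 1, step):
        chunk = frames[start:start + window]
        summary = {}
        for name in chunk[0]:
            values = [frame[name] for frame in chunk]
            summary[f"{name}_mean"] = mean(values)
            summary[f"{name}_std"] = pstdev(values)  # population standard deviation
        vectors.append(summary)
    return vectors


if __name__ == "__main__":
    # Four illustrative frames, grouped into two non-overlapping windows of two frames each.
    demo = [{"distance_m": d} for d in (1.0, 2.0, 3.0, 4.0)]
    for vector in aggregate_windows(demo, window=2, step=2):
        print(vector)
```

Each resulting vector could then be fed to a classifier as one sample; using a step smaller than the window would give overlapping windows, which is a design choice left open here.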

1.4 Contributions


∙ The development of a novel macro feature concept that models the interaction level between a group of humans and their relationship with the robot;

∙ The database creation of post-processed RGB-D and audio data captured from typical social scenarios;

∙ The implementation of a social robot classification model based on temporal information that is able to determine the affective situation of a scene according to short-time interaction analyses.

In addition, the present work also presents:

∙ An overview of the state of the art of social robotics applications for the initial interaction context;

∙ An investigation of the feature sets that are appropriate to the classification problem;

∙ A methodology to determine group interaction intensities according to the attributes extracted;

∙ The evaluation of the model based on predefined test data and a comparison of the results of the different techniques used.

1.5 Work Organization

This work begins with a conceptual introduction and an overview of state-of-the-art projects, before diving into more specific topics such as feature engineering and neural networks. After that, it focuses on the explanation of the methods used to evaluate the system, as well as the results of these assessments. Finally, the thesis summarizes the work with conclusions and future work.

The chapters are structured as follows:

∙ Chapter 2 - Related Works: This chapter introduces a brief description of human-robot interaction, social robotics and multimodal robot perception, presenting a review of relevant related works and their contributions.

∙ Chapter 3 - Training Database: This chapter includes the description of the structure and the process used to capture the RGB-D and audio information of the social scenes. Moreover, it explains some feature transformation steps regarding images and audio for the construction of the final database.


∙ Chapter 4 - Feature Engineering: This chapter details the methodologies used to structure the data pipeline and select the most appropriate features.

∙ Chapter 5 - Classification Module: This chapter gives an overview of the model chosen to classify the affective social situation. It also presents the selected features according to their relevance to the model and explains the evaluation methods used to guarantee the model's effectiveness.

∙ Conclusion: This chapter discusses the results obtained in the experiments. Besides that, the next steps are also presented in order to map the continuity of this work.


2 Related Works

The endeavor of creating social robotic systems has been influenced by the fields of human-robot interaction, social psychology, human-computer interaction (HCI) and affective computing. This chapter presents the research of related works that provided the scientific basis for the present work.

Section 2.1 presents the human-robot interaction concepts and goals. It also explains the main requirements to build a social robotic system that is able to understand the social cues while preserving the human integrity and following ethical principles.

Since these aspects are complex to analyze and require the social perception of the scene, Section 2.2 covers the works involving affective and adaptation characteristics of social robotics. First, it describes social psychology aspects that are relevant to human interaction, including theory of mind and the different levels of human perspective in communication processes. Second, it explains the importance of cognitive and affective trust to provide confidence during communication. Some works regarding the human-aware and interruption context are also described, including the main contributions to the area.

While recent works related to the automated monitoring of social scenes are still based on the analysis of audio or image information alone, Section 2.3 brings into perspective that multimodal processing represents a challenging opportunity to improve the modeling of human perception.

Finally, Section 2.4 presents a final discussion regarding the components and influential disciplines related to social robotic systems.

2.1 Human-Robot Interaction

Human-Robot Interaction is a field of research dedicated to designing, understanding and evaluating robotic systems and their relationship with humans (GOODRICH; SCHULTZ, 2007). HRI studies people's behavior and attitudes towards robots with respect to their physical, technological and interactive possibilities (DAUTENHAHN, 2013).

The research trajectory of the HRI field goes back to 1990, when significant advances were made in Autonomous Robotics technology, with works dedicated to the development of behavior-based robots and hybrid control architectures (BROOKS, 1986). Research from that decade focused on robot mobility. After the 2000s, studies were devoted to the development of robots with more realistic and socially accepted behaviors, and with an anthropomorphic physical appearance: the so-called humanoids (AMBROSE et al., 2000). The most important authors of HRI works consider that the research carried out in assistive robotics, rescue and exploration led to the consolidation of human-robot interaction as a scientific field (GOODRICH; SCHULTZ, 2007).

The primary goal of HRI is to develop robots whose interactions with people are efficient according to criteria previously established by their control architectures and systems. Also, the interaction should be acceptable in terms of behaviors, social cues and emotional aspects (DAUTENHAHN, 2013).

Based on this context, human-robot interaction can be considered the confluence of five attributes: task configuration, autonomy, communication, participant structure and scene adaptation (GOODRICH; SCHULTZ, 2007). These attributes consider the objectives and the strategy regarding the activity performed by the robot, as well as the analysis of people's hierarchy and behavior in each situation.

The contact during interactions depends on the level of communication between humans and robots, which is influenced by the proximity between them. The relation between people's locations in the scene and how they relate to the robot, in terms of cognitive and emotional aspects, is determinant for the correct choice of social interaction criteria.

Therefore, the incorporation of social factors in the design of a robot (behaviors, conventions and people's cultural traits) is an indispensable requirement to improve these interactions, bringing social robots closer to humans. Moreover, in order to develop robots capable of naturally interacting with people, it is necessary to understand certain social cues, which are a set of verbal and nonverbal aspects: facial expressions, body posture, proximity, and physiological activities (RIOS-MARTINEZ et al., 2015).
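To make the role of proximity as a social cue more concrete, the sketch below maps a measured human-robot distance to a proxemic zone. The zone boundaries are approximate values commonly attributed to Hall's proxemics and are assumptions here, as are the function and constant names; this text does not specify thresholds.

```python
# Approximate upper bounds in metres for each zone (assumed values, not taken from this text).
ZONE_UPPER_BOUNDS = [
    (0.45, "intimate"),
    (1.2, "personal"),
    (3.6, "social"),
]


def proximity_zone(distance_m: float) -> str:
    """Return the name of the proxemic zone that contains the given distance."""
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    for upper_bound, zone in ZONE_UPPER_BOUNDS:
        if distance_m < upper_bound:
            return zone
    # Anything beyond the last bound is treated as the public zone.
    return "public"


if __name__ == "__main__":
    for distance in (0.3, 1.0, 2.0, 5.0):
        print(distance, proximity_zone(distance))
```

A robot could use such a label as one input among several, for instance to avoid approaching a person who is only within the public zone without first checking other cues.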

These application aspects usually require the collection and analysis of personal information. For this reason, there is also a growing concern to establish legislation that safeguards the integrity and privacy of people involved in human-robot interaction. Some institutions, such as Euron (European Robotics Research Network), the technical committee for robotic ethics of the IEEE Robotics and Automation Society, and Open Roboethics, have committed to developing guidelines that govern ethical robot development, called Roboethics (ALVES; FILHO, 2016).

In addition to the discussions around Roboethics, an internal HRI movement has also been dedicated to creating a psychometric scale that measures the connection between human and robot. The existing scales measure the confidence of the human in relation to a robot (confidence perception scale) (SCHAEFER et al., 2012) and human assessments of the behavior of a robot (negative attitude robot scale) (SYRDAL et al., 2009). Although different, these scales share the same general scope, which is the measurement and identification of possible improvements to the robot, to make it more acceptable, functional, and incorporated into social conventions shared by humans.

We recall that the present work focuses on detecting affective aspects of human groups and assessing the group's openness to new interactions. In this sense, our work potentially contributes to the coexistence with robots and to compliance with ethical rules.

The following section provides a review of psychological factors regarding human relationships and of systems that focus on interaction considering the adaptability to human behavior changes throughout the scene. Moreover, the works related to the interruption context are based on situations that simulate the first contact with humans.

2.2 Social Robotics

2.2.1 Affective Trust

To understand how the robot should interact with humans, it is important to study how humans interact with each other. During an interaction, it is necessary to attribute a mental state (in terms of thoughts, feelings, desires, motivations and intentions) to others with whom we interact. The notion of perspective taking, which allows the individual to always consider the point of view of another, is the ability known as theory of mind, a topic studied by developmental psychology.

Developmental psychology is one of the fields that aims to understand human development. It concerns both the study of mental, behavioral and performance processes and the evolution of skills during human life.

The human ability to hypothesize what the other person understands of the world in terms of visual perception, spatial description, affordances and beliefs is a key aspect in the interaction with others (FLAVELL, 1977; TVERSKY et al., 1999). Studies carried out on individuals who do not have the cognitive mechanisms necessary for perspective taking, such as young children or people with autism, highlighted the difficulties these people have in their daily social relationships, confirming the importance of this ability to interact adequately with other humans (FRICK et al., 2014).

Flavell (1977) describes two levels of perspective taking: perceptual perspective and conceptual perspective. The perceptual perspective refers to the ability of a human to understand that others have a different perception of the world. The conceptual perspective refers to the human ability to attribute beliefs and feelings to others (BARON-COHEN et al., 1985).

Therefore, being able to perceive and reason about the surrounding environment are necessary abilities for a social robot, but they are not sufficient when it interacts with humans. In order to understand social scenarios, researchers attempt to implement theory of mind models, giving the robot the ability to have a general perspective of the environment, while also seeking to assimilate the human social relations under analysis (MARCHETTI et al., 2018; LAZZERI et al., 2018; DEVIN; ALAMI, 2016).

In fact, the application of theory of mind can improve the relation between humans and machines. However, to build confidence in the human-robot interaction, it is also necessary to analyze other aspects. According to Azevedo et al. (2017), during an interaction, three elements are required in sequence, namely, explanation capability, mutual understanding and mutual trust.

The capability of explaining the social situation is based on the surroundings and human characteristics. Through explanation it is possible to reach understanding among participants, which in turn allows for interpersonal trust.

Regarding the general concepts of interpersonal trust, social psychology has studied the main factors that contribute to its efficiency. Lewis and Weigert (1985) describe two relevant interpersonal aspects: cognitive and affective trust. Cognitive trust focuses on judgements of capability and reliability, while affective trust is based on interpersonal bonds and expected responses to the behavior of individuals.

Existing studies relate the increase of cognitive trust to the repetition of expected cycles during an interaction, while affective trust is based on initial impressions (HANCOCK et al., 2011; SCHAEFER et al., 2012). In this way, casual and spontaneous communication can improve the user's affective confidence regarding a previously unknown entity, especially when it is a robot.

As a way to increase the human's trust in the robot during an interaction, two topics must be taken into account: understanding the way humans behave in the scene (human-aware perception) and the right time to interrupt them (interruption context). In this way, applications based on the surroundings and emotional state of humans in a scene should focus on affective trust concepts when dealing with initial interactions. Thus, to minimize disturbance and maximize action timing response rates, robot perception should be able to detect the most convenient social group situation.


2.2.2 Human-aware interaction systems

Social robots will progressively coexist in our environment in the years to come. The robot's understanding of human behavior in complex and even simple situations impacts the effectiveness of interaction systems. Current approaches have focused on methods to maintain human-robot engagement and improve social navigation, leaving the analysis of primary actions as an important field to explore.

In Ahmad et al. (2017), the authors discuss different studies related to the use of adaptive robots designed for health care, education and private purposes. Most of these works share the main goal of increasing engagement by monitoring person features such as facial expressions, gaze behavior and body language, and choosing the appropriate robot speech during pre-established conversations. In these cases, the interaction between the person and the robot can be considered limited, since it presupposes a conversation already started, without a preliminary analysis of human intention. Moreover, these approaches do not make use of multimodal data, restricting their applications to the analysis of either video frames or audio signals.

Procedures that verify and interpret human movements in populated environments are also important to increase human trust in social robots. The systems should be able to re-plan and design a collision-free action model, adapting to current people activities. These methods usually rely on Simultaneous Localization and Mapping (SLAM), which maps the environment while simultaneously keeping track of an agent's location within it (BAILEY; DURRANT-WHYTE, 2006).

A robot navigation approach for large-scale maps proposed by Charalampous et al. (2016) is an example that performs dynamic identification of person and group activities (walking, working, conversation) in order to select the proper action for the robot, improving social reliability. Despite this, the method is limited to a containment scheme, so it does not allow social robots to check whether humans intend to talk or move towards them. Furthermore, a scenario with complex situations and a high number of persons could cause artificial robot movements, reducing the human-aware adaptivity.

A project that attempted to provide a better understanding of social situations before initiating interactions with human groups suggests that this approach is effective in recognizing social cues and improving the engagement between individuals and the robot (CHAO et al., 2016). The application detects the human interrelationship during a first approach, based on categories such as to-individual, individual-to-robot, robot-to-individual, group-individual-to-robot, robot-to-group, confidential discussion and group discussion. The social situations robot-to-individual and robot-to-group indicate that the robot has the intention to interact with people. The other two situations, individual-to-robot and group-individual-to-robot, mean that people may have the intention to interact with the robot. The remaining social situations indicate that the robot should not bother the individuals.

Figure 2.1 illustrates two examples of categories encountered by the robot during scene perception: the robot-to-group (left) and the group-to-robot (right) situations. The system was able to extract group target features and infer the appropriate robot action, namely whether or not to interact with the respective group. However, despite the use of RGB-D data to improve the location of the human group and the recognition of the social context, this work did not use multimodal data. In addition, the data were collected in only one environment, restricted to the same lighting conditions and scene characteristics, which may have biased the model.

Given the human-aware proposal of identifying social cues and primary actions of people in the scene, an understanding of interruption context theory is necessary. The study of the types of interruption and of the different techniques applied in psychological experiments therefore contributes to the system development.

2.2.3 Interruption context

Some works have studied different ways to model human availability during initial interactions. The first category of techniques relies on task and experiential knowledge, while the second category explicitly estimates availability and leverages people's behavior, focusing on immediate social cues.

One of the known techniques is based on Goals, Operators, Methods and Selectors (GOMS) structures (CARD et al., 1980). A GOMS model is composed of methods that are used to achieve specific goals. These methods are composed of operators at the lowest level. The operators are specific steps that a user performs, and each is assigned a specific execution time. If a goal can be achieved by more than one method, then selection rules are used to determine the proper method. This method is widely used by usability specialists in human-computer interaction to produce predictions of how people will interact with a proposed system.
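The GOMS structure described above can be illustrated with a minimal Python sketch. It is not taken from the cited work: the goal ("delete a file"), the operator names, their durations and the selection rule are all hypothetical values chosen only to show how methods, operators and selection rules fit together.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Operator:
    """Lowest-level step a user performs, with an assigned execution time."""
    name: str
    seconds: float


@dataclass
class Method:
    """Ordered sequence of operators that achieves a goal."""
    name: str
    operators: List[Operator]

    def execution_time(self) -> float:
        # The predicted time of a method is the sum of its operator times.
        return sum(op.seconds for op in self.operators)


def select_method(methods: List[Method],
                  rule: Callable[[Dict], str],
                  context: Dict) -> Method:
    """Apply a selection rule that names the method to use in this context."""
    chosen_name = rule(context)
    return next(m for m in methods if m.name == chosen_name)


# Hypothetical operators and durations, for illustration only.
KEY = Operator("press key", 0.2)
POINT = Operator("point with mouse", 1.1)
CLICK = Operator("click button", 0.2)

# Two alternative methods for the same hypothetical goal "delete a file".
methods = [
    Method("keyboard shortcut", [KEY, KEY]),
    Method("menu", [POINT, CLICK, POINT, CLICK]),
]


def rule(context: Dict) -> str:
    # Hypothetical selection rule: prefer the shortcut when hands are on the keyboard.
    return "keyboard shortcut" if context["hands_on_keyboard"] else "menu"


chosen = select_method(methods, rule, {"hands_on_keyboard": False})
predicted = chosen.execution_time()
print(chosen.name, round(predicted, 2))
```

In this sketch the prediction depends entirely on the operator times supplied by the analyst, which is the task knowledge that this category of techniques relies on.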

Another technique, related to cognitive architectures, is ACT-R/E. Cognitive architectures are computational systems based on theories of how human reasoning works, capturing known facts and constraints about the functioning of the mind, and connecting neuroscience data and psychological experiments (NRL, 2019). This method has been used to predict whether humans will need assistance during task execution (TRAFTON et al., 2013). However, these approaches require previous knowledge of the state of the human tasks and constant monitoring of their execution.

Source: Adapted from (CHAO et al., 2016)

Figure 2.1 – Sequences of human-robot interactions

Robot-to-group situation (left): (a) Approach to interrupt the communication of a human group; (b) Ask whether people need drink service; (c) A person answers the question. Group-to-robot situation (right): (a) Approach to ask their need; (b) One person asks the robot to introduce itself; (c) The robot starts the introduction.

In robotics, there are also methods to estimate metrics of people's intentions and robot awareness in applications such as companion robots (CHIANG et al., 2014), bartenders (FOSTER et al., 2017) and shopping mall assistants (BRŠČIĆ et al., 2017). Proxemics, which concerns the distance between two agents during an interaction, is a fundamental principle of social interaction. Proxemics contributes to these metric estimations through appropriate features such as the distance, angle and speed of a robot while approaching a person. These parameters should change dynamically as the robot gets closer to the human, bringing the concept of proximity zones to the scene analysis. Figure 2.2 shows the work of Argyle (2013), which formalized four proximity zones and described how people modulate their behavior in each zone. A human who is talking with another person from across the room (Public Zone) may gesture and talk loudly in order to communicate, but as she enters the Personal Zone, her gestures and volume would significantly decrease in intensity. As another example, head poses and facial expressions which are acceptable in the Social Zone may be perceived as threatening or alarming in the Intimate Zone (HENKEL et al., 2014).
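As an illustration of how the four proximity zones could be used by a perception module, the sketch below maps a measured distance to a zone name. The zone boundaries (0.45 m, 1.2 m and 3.6 m) are an assumption based on commonly cited proxemics values; the text above does not state them, and a real system may need different or dynamically adjusted limits.

```python
import bisect

# Assumed upper bounds of the Intimate, Personal and Social zones, in meters.
# These are commonly cited proxemics values, not values from the text above.
ZONE_LIMITS_M = [0.45, 1.2, 3.6]
ZONE_NAMES = ["Intimate", "Personal", "Social", "Public"]


def proximity_zone(distance_m: float) -> str:
    """Return the proximity zone for a person at the given distance."""
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    # bisect_right counts how many limits are <= distance, which is the zone index.
    return ZONE_NAMES[bisect.bisect_right(ZONE_LIMITS_M, distance_m)]


for d in (0.3, 1.0, 2.0, 5.0):
    print(d, proximity_zone(d))
```

A distance that falls exactly on a limit is assigned to the outer zone in this sketch; that tie-breaking choice is arbitrary.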

Source: (HENKEL et al., 2014)

Figure 2.2 – Proximity zones during interpersonal relations. The concept is also applied to human-robot interactions to model people availability and robot awareness.

Existing works have also modeled people's availability through methods known as contingency detection (KATO et al., 2015). These methods start from the premise that a person is available, perform a sequence of actions, and then reassess the person's state based on their response. Assessing a person's availability through contingency detection is complementary to estimating primary availability. Since the focus of this work is recognizing the initial social situation, assuming the analysis of a completely unknown scene, we did not make use of this technique.

Moreover, interruption context has been described through features that are frequently captured to characterize the user (FOSTER et al., 2017; SYKES, 2014), the environment (SYKES, 2014), the task (IQBAL; BAILEY, 2006), the interruption (HORVITZ; APACIBLE, 2003), and the relationships between them when an interruption occurs. In robotics, interruption context has also been studied using global audio-visual descriptors, such as GIST (OLIVA; TORRALBA, 2001), together with audio frequency and volume features, as a proposal to classify interruptibility.


The present work adopts a methodology based on measuring people's intentions and awareness of the robot. However, instead of using a limited number of human attention categories based on only one individual, we created a method that detects people's receptivity to start an interaction as a continuous feature. Moreover, we took advantage of computer vision advances to obtain an explicit and high-level social situation context based not only on the behavior of individuals, but also on their group-level relationships during interactions.

In order to detect the location of humans in the environment and understand the way they behave among themselves and with respect to the robot, it is necessary to collect complex metrics related to body position, head pose, and scene audio characteristics. In this context, the use of multimodal data composed of RGB and depth frames, as well as audio information, allowed the development of an application able to determine how and when to interact with humans.

The following section reviews works that used only audio or visual data, as well as works that made use of multimodal data. Since the choice of the type of data to be collected depends on the problem to be addressed, the different applications are also described in the next section.

2.3 Multimodal robot perception

2.3.1 Audio data

The use of the hearing sense in robots was driven by the need to interact with humans through speech comprehension. The use of language in robotic systems can be observed in medicine (NEUSTEIN AMY; BEER, 2014), rehabilitation (ROSATI et al., 2011; ARDIANSYAH, 2016) and service robots (WANG et al., 2016).

Regarding social robotics, sound signals are frequently used for communication with humans. The related work of Alameda-Pineda and Horaud (2015) involves identifying the source position in an environment with different persons, using a hybrid model to associate the sounds with the correct persons. Speech recognition projects are also used to detect human location and verbal intention (MARTIN; SALICHS, 2011). One of the most difficult challenges in speech analysis is the presence of noise such as alarms, vehicles and animals (MARTINSON; BROCK, 2013). However, in some situations even noise must be considered, since it carries valuable information for scene perception, especially while tracking social situations inside heterogeneous environments.


describe the state of the environment, recognizing whether it is turbulent or quiet according to the level of noise and conversation. Since localizing people using audio signals is not as accurate as using visual information, this work did not use this technique.

2.3.2 Visual data

The visual analysis process starts with the acquisition of data through sensors. Traditional optical cameras and RGB-D cameras are some of the technologies used for image capture. According to Lachat et al. (2015), in the last decade RGB-D cameras have been widely used in robotic applications. Part of this growth is related to the emergence of low-cost cameras such as Kinect (Microsoft) and RealSense (Intel).

While RGB-D cameras embrace a greater number of application opportunities, they also add complexity to the data structure to be processed. RGB-D image information consists of a multidimensional matrix with the values of pixels and the depth information, which is a measure of the distance from the object to the camera.

The complexity of the collected data requires fast processing by hardware and algorithms that identify specific regions of interest in the environment. In Dias and Osório (2015), a hardware structure based on a Field Programmable Gate Array (FPGA) increases the performance of visual navigation in autonomous vehicles, reaching processing rates of 43 FPS (frames per second). In fact, exceeding the 30 FPS barrier associated with the human eye is a specific requirement of some areas, such as sensing of vibration dynamics, three-dimensional visual inspection and biomedical applications (GU; ISHII, 2016). However, many social robotics applications based on general contexts require only a limited number of frames per second, especially when analyzing macro person features.

Although several applications use only RGB-D data to detect the location of people in the scene and how they interact with each other, the combined use of audio information can contribute to improve the model effectiveness when there is insufficient visual data in the sample or a high complexity due to problems of occlusion or luminosity in the scene.

2.3.3 Multimodal data

As mentioned before, in order to increase the reliability of results, information derived from different sources (RGB-D cameras, GPS, sonar, audio, text, laser, etc.) is used together. In Hussein et al. (2016), an alternative method with multimodal information determines the vehicle displacement using common images to associate the general scenario as well as stereo vision and laser data to detect obstacles. These results are also currently applied to other applications related to assistive robotics, educational assistance and entertainment robotics.

Regarding social robotics, current academic projects have obtained good results using a multimodal model. Puente et al. (2018) present the experiences of operating a mobile robot with manipulation capabilities and an open set of tasks during contact with real users in home environments. The use of RGB-D data improved the perception and evaluation of the integrated system in terms of navigation, since the depth information tracked the person's location more reliably. Figure 2.3 illustrates the importance of using the depth sensor to retrieve the person's distance in the scene and determine a safe margin (a concept defined by the authors as the minimal distance that guarantees a sense of safety between users) during interaction. First, the robot is in the predefined position, waiting to start an interaction (left). After detecting the user, the robot rotates its body to face her (middle). Then, the robot approaches the user considering the safety margin, ensuring that the user is comfortable with the approach (right).

Source: (PUENTE et al., 2018)

Figure 2.3 – User detection and approximation considering the safe margin concept. Left: robot in predefined position. Middle: robot detects the user and rotates its body to face her. Right: robot approaches the user considering safety margin.

Applications using RGB-D camera images are frequently compared to approaches focused exclusively on color or depth images. In Zimmermann et al. (2018), experiments in real-world settings demonstrated that the use of RGB-D information enabled a robot to imitate human pose actions observed from a human teacher with significant performance over other works.

The fusion of audio and image data is also mentioned in many HRI studies, since their complementary usage improves the accuracy of models. Zlatintsi et al. (2018) introduce a framework for a real-life scenario in which elderly subjects are supported by an assistive bathing robot, addressing health and hygiene care issues. RGB-D data were collected for body pose estimation and visual tracking, as well as commands for audio-gestural recognition. The results showed that audio information added relevant information to the feature set, since visual data was affected by object occlusions in the scene.

This work used RGB-D data to identify characteristics of people in the scene, as well as their distance from the robot. Moreover, the audio data was used to assign a surrounding and conversational level to the scene, increasing the number of possibilities during the exploratory analysis. Through these features, it was possible to detect social relations among the individuals and assimilate the receptivity level to initiate interactions with the social robot.

2.4 Concluding Remarks

This chapter brought a historical perspective on human-robot interaction concepts and principles related to the modeling of social robotic systems (Section 2.1). It also presented an overview of affective and adaptive interactions regarding social psychology aspects (Section 2.2).

In Section 2.3, works focused on audio and image data were presented, as well as multimodal systems. The importance of using different data sources and merging them to increase model effectiveness was demonstrated by the application examples provided in Section 2.3.3.

Most of the social robotics works mentioned have employed techniques and analyses to determine human behavior and robot navigation. Few current studies have stressed the need to analyze the initial interaction context to improve general social robot actions, due to the complexity of extracting the features that indicate an initial engagement and the social scene situation.

Previous works on human-robot interaction either required pre-established conversations or were limited to dyadic interactions, that is, interactions between a pair of individuals. Instead, this work presents the implementation of a social robot action planning approach based on multi-person interactions and initial engagement analysis.

This work also shows that multimodal data can improve the model results for complex situations. The fusion of the hearing and vision senses, represented by audio and RGB-D features, demonstrates the importance of the data collected and their complementarity. Based on this, the following chapter presents an overview of the data capture methodology and the reasons for choosing the RGB-D camera and audio information to develop our initial interaction model for social robots.


3 Training database

This chapter describes our approach to the construction of a training database for the modeling process described in Chapter 4. The details of each type of information included in the database are explained to facilitate the development of future improvements by interested researchers.

A first step in our methodology was to choose the appropriate camera for our target application. In Section 3.1, we describe and compare different depth camera technologies, contributing to the conceptual understanding of their main characteristics. We justify the adoption of the Intel RealSense R200 in Section 3.2.

Section 3.3 details each step performed to record the scenes, according to the capture protocol. Finally, Section 3.4 presents the structure of the final database regarding the information stored about each participant and the surrounding audio context. The chapter’s concluding remarks are in Section 3.5.

3.1 Depth Cameras

Some imaging sensors are capable of providing 3D images by measuring the distance from the sensor to the objects, rather than the reflected or diffused intensity of the light on the surface. The output provided by these sensors is known as a depth image. In this case, the value of each pixel represents the distance between a reference point and a point on a surface. These sensors provide RGB-D data, since the RGB components associated with each pixel in the image are recorded along with Distance (D) information. There are two types of sensors for depth data acquisition: sensors with and without a controlled source of energy that is used to extract 3D information from objects. They are called active and passive sensors, respectively.

Passive sensors depend on an external light source to record the electromagnetic information reflected by the physical surface. Such systems are based on stereo cameras (HARTLEY; ZISSERMAN, 2003), whose distance metric is obtained with a triangulation technique that compares images captured by two cameras placed a few centimeters apart.

Regarding active sensors, the electromagnetic energy is generated and emitted to interact with the surface. The emission and reception times of the energy pulse are determined and, given the speed of light, the distance between the sensor and the surface is calculated.


One example of an active sensor is based on structured light pattern systems (MAAS, 1993). Basically, the system projects light in the form of a grid or lines onto the object’s surface. The resulting reflection is recorded by a digital camera. An image correlation algorithm is then used to calculate the desired information (REISS; TOMMASELLI, 2011).

Across active and passive sensors, there are four different categories: stereoscopic, structured light, time-of-flight and hybrid sensors, described as follows.

3.1.1 Stereoscopic Cameras

Stereoscopic vision refers to the biological configuration where, in some species, the relative displacement of the eyes makes it possible to infer the depth of objects and to develop a three-dimensional perception of the world, through a process of triangulation from two slightly displaced images of the same object generated by each eye (TIPPETTS et al., 2016).

In computer vision, stereoscopic vision refers to the algorithms dedicated to obtain the depth information of the objects and people in the scene, regarding their distance to a reference point. This is similar to the biological process, but it uses two cameras, displaced horizontally one from another, to get two differing views of a scene. Figure 3.1 represents the picture of an object taken from two field points, to demonstrate how the stereoscopic vision works.

The cameras must be aligned on two of the coordinate axes and separated from each other by a known distance along the remaining axis. The procedure captures two images of the same scene, one with each camera. Once both images are taken, a correspondence technique is executed to find the same parts of the scene in both images, allowing the distance to an object to be computed (WALT, 2008).
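For an ideal, rectified pair of horizontally displaced cameras, the triangulation step reduces to the standard relation Z = f · B / d, where f is the focal length in pixels, B is the baseline between the cameras and d is the disparity of a matched point. The sketch below applies this relation; the focal length, baseline and pixel coordinates are hypothetical values, and the correspondence search that produces the matched coordinates is assumed to have been done already.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth Z = f * B / d for a rectified, horizontally displaced camera pair."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity; negative is a bad match.
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Hypothetical calibration and a hypothetical matched point.
focal_px = 600.0      # focal length expressed in pixels
baseline_m = 0.07     # distance between the two cameras, in meters
x_left, x_right = 340.0, 319.0  # column of the same scene point in each image

disparity_px = x_left - x_right
depth_m = depth_from_disparity(focal_px, baseline_m, disparity_px)
print(f"disparity = {disparity_px:.0f} px, depth = {depth_m:.2f} m")
```

Because depth is inversely proportional to disparity, distant objects produce small disparities and their depth estimates become more sensitive to matching errors, which is consistent with the range limitation discussed next.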

Stereoscopic cameras usually work in both indoor and outdoor settings, and traditional ones cost up to $500. However, they suffer from poor low-light performance and require high processing power to derive the depth maps. Moreover, the capture range depends on the spacing between the cameras, and they are best suited to medium ranges (LEE, 2017).


Source: (TIPPETTS et al., 2016)

Figure 3.1 – Demonstration of the stereoscopic concept.

The relative separation between the pictures, represented by the triangles, is called disparity and is the main aspect of this method. First, two cameras displaced by B take a picture of an object (left). Each camera's point of view and the relative angles of the object to each reference point (middle and right) are measured, and the distance D is computed by triangulation.

3.1.2 Structured Light Cameras

The structured light camera is based on the concept of projecting known patterns of light onto a surface. Each pattern is deformed by the geometric shape of the object. Knowing the original pattern of these projections, the camera compares the displacement between the known pattern and the one observed, computing the depth information (HERAKLEOUS; POULLIS, 2014).

It is common for another RGB camera to work in parallel with the projector and the camera. In this case, an association with the second camera, positioned on the opposite side, can establish more reliable distance metrics. In this way, three-dimensional maps with a high level of detail are created from the resolution of both cameras.

The example shown in Figure 3.2 demonstrates the use of horizontal and vertical sequences of patterns to establish a one-to-one mapping between camera points and projector points. This type of pattern is generally used in the Phase Shift projection method, which is characterized by the measurement of depth variations based on the displacement between the camera and the projector. Other methods include rainbow pattern variation, gray scale indexing, and hybrid methods that combine more than one technique to optimize results for a given surface (LI; ZHANG, 2014).
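To make the Phase Shift idea more concrete, the sketch below recovers the phase of a projected sinusoidal pattern at a single pixel from three captured intensities. It assumes the three-step variant, in which the pattern is shifted by −2π/3, 0 and +2π/3; this variant, the function names and the ambient and modulation values are assumptions for illustration, not details given in the text. Converting the recovered phase into depth additionally requires phase unwrapping and camera-projector calibration, which are not shown.

```python
import math

SHIFT = 2.0 * math.pi / 3.0  # phase step between the three projected patterns


def simulate_intensities(phase: float, ambient: float = 0.5, modulation: float = 0.4):
    """Intensities seen at one pixel for pattern shifts of -SHIFT, 0 and +SHIFT."""
    return tuple(ambient + modulation * math.cos(phase + k * SHIFT) for k in (-1, 0, 1))


def wrapped_phase(i1: float, i2: float, i3: float) -> float:
    """Three-step phase-shift formula; the result lies in (-pi, pi]."""
    # i1 - i3 is proportional to sin(phase); 2*i2 - i1 - i3 to cos(phase).
    return math.atan2(math.sqrt(3.0) * (i1 - i3), 2.0 * i2 - i1 - i3)


true_phase = 1.0  # hypothetical phase at this pixel, in radians
recovered = wrapped_phase(*simulate_intensities(true_phase))
print(f"true = {true_phase:.3f} rad, recovered = {recovered:.3f} rad")
```

The ambient and modulation terms cancel out in the ratio, so the recovered phase does not depend on the surface brightness as long as the pixel is neither saturated nor too dark.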

Source: (LI; ZHANG, 2014)

Figure 3.2 – Structured light system containing one projector, one camera, and an object.

This method offers accurate data. In addition, these cameras perform well even without ambient light. However, they also have weaknesses, such as poor outdoor performance under sunlight and failures caused by the reflective properties of some surfaces. Some common cameras on the market that integrate this technology are the Microsoft© Kinect™ v1 and the Occipital© Structure Core™.

3.1.3 Time-of-Flight Cameras

The time-of-flight camera is based on a methodology that consists of emitting an infrared laser and measuring the time elapsed between the emission of the light ray and its return due to reflection. The measured time interval determines the distance between camera and object. Foix et al. (2011) describe different time-of-flight camera devices, which can be based on laser diodes, photonic mixers or infrared.

Time-of-flight methods can be divided into two classes depending on whether their light emission is continuous or pulsed. In the modulated continuous wave technique, an amplitude modulated light carrier is emitted and the distance information is extracted from the received signal by comparing its modulation phase to the respective emitted signal phase. In the pulsed time-of-flight method the distance is obtained by measuring the time interval between the transmitted and received light pulses (KOSKINEN et al., 1992).
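Both classes can be summarized by simple relations. In the pulsed case the light travels to the object and back, so d = c · Δt / 2. In the continuous-wave case, a phase difference φ at modulation frequency f corresponds to d = c · φ / (4π f), and the measurement is unambiguous only up to c / (2f). The sketch below evaluates these relations; the round-trip time and the 30 MHz modulation frequency are hypothetical values.

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0


def pulsed_distance(round_trip_s: float) -> float:
    """Pulsed method: the pulse covers the distance twice (out and back)."""
    return SPEED_OF_LIGHT_M_S * round_trip_s / 2.0


def cw_distance(phase_rad: float, mod_freq_hz: float) -> float:
    """Continuous-wave method: distance from the phase difference between signals."""
    return SPEED_OF_LIGHT_M_S * phase_rad / (4.0 * math.pi * mod_freq_hz)


def cw_unambiguous_range(mod_freq_hz: float) -> float:
    """Largest distance measurable before the phase wraps around (phase = 2*pi)."""
    return SPEED_OF_LIGHT_M_S / (2.0 * mod_freq_hz)


# Hypothetical measurements.
print(f"pulsed, 20 ns round trip: {pulsed_distance(20e-9):.3f} m")
print(f"CW, phase pi at 30 MHz:   {cw_distance(math.pi, 30e6):.3f} m")
print(f"CW range limit at 30 MHz: {cw_unambiguous_range(30e6):.3f} m")
```

The wrap-around limit illustrates a trade-off in the continuous-wave technique: a higher modulation frequency gives finer distance resolution per unit of phase but a shorter unambiguous range.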


Figure 3.3 demonstrates the different operation classes regarding the time-of-flight methodology.

Source: (WIKIPEDIA, 2019)

Figure 3.3 – Demonstration of the time-of-flight concept.

Principle of operation of a time-of-flight camera: scheme (1) represents the pulsed method and scheme (2) represents the continuous-wave method.

The Laser Illuminated Detection And Ranging (LiDAR) sensor is a known example of a pulsed time-of-flight sensor. These sensors can generate high-precision and high-resolution depth maps. The main current application of LiDAR is related to autonomous vehicle technology, which requires fast and accurate depth mapping of surrounding areas. However, these sensors are typically more expensive and bulky, and have a lower refresh rate.

Continuous-wave time-of-flight cameras are compact, require low processing power and have a high refresh rate. However, they can also be affected by the reflective properties of some materials and by the presence of other time-of-flight cameras. The most common camera of this kind on the market is the Microsoft© Kinect™ v2.

3.1.4 Hybrid Cameras

Since different depth computations suit different applications, some vendors are integrating methodologies to improve and extend the use of RGB-D cameras in both indoor and outdoor environments. These cameras are usually known as hybrid cameras.


depth metrics and a projector to send non-uniform light patterns to the scene (based on structured light methods). These cameras include stereo sensors as well as the RGB camera. The color camera provides the final images and the two depth sensors provide the data for distance computation.

The implementation consists of a left stereo camera, a right stereo camera, and an infrared projector. The infrared projector projects a non-visible static IR pattern to improve depth accuracy in scenes with low texture. The left and right camera sensors capture the scene and send raw image data to the processor unit, which computes depth values for each pixel in the image by correlating points in the left image with points in the right image. The depth pixel values are processed to generate a depth frame. Subsequent depth frames create a depth video stream.

Provided that minimal ambient light is available, hybrid cameras are usually compact and deliver better image quality than stereoscopic or structured light cameras. However, they also require different configurations to work in different environments. The camera behavior in indoor scenes is based on the structured light methodology, while in outdoor scenes it is based on the stereoscopic methodology.

3.2 Camera Choice

Based on the characteristics of each type of RGB-D camera (summarized in Table 3.1), we took into consideration the details that most influence our system requirements. Since the main focus of the work is to analyze different indoor scenes containing multiple persons with respect to their distance and individual features, the RGB image must have good quality and the depth sensors should be able to capture information at medium distances.

Even in indoor environments, some lighting variations are expected, caused by reflective materials and incoming sunlight. Moreover, we considered the camera size and price range on the market. A more compact device could facilitate the recording steps and future embedded prototypes, while a cheaper camera (with a price up to $200) can benefit the system expansion when more devices need to be acquired to improve the performance.

Based on these features and the system requirements described above, the hybrid cameras were chosen as possible candidates for use in data collection. Among these cameras, the Intel© RealSense™ F200 and R200 cameras were selected for initial performance tests in indoor environments.


Feature                   Stereoscopic  Structured Light  Time-of-flight  Hybrid
Latency                   ≈ 1 frame     ≈ 1 frame         < 1 frame       < 1 frame
Cost                      < $500        < $500            > $500          < $200
Active illumination       No            Yes               Yes             Yes
Low light performance     Low           Medium            Medium          Medium
Bright light performance  High          Medium            Medium          Medium
Depth accuracy            Low           High              Medium          High
Scanning speed            Medium        Medium            High            High
Power consumption         Low           Medium            Medium          Low

Table 3.1 – Depth cameras: Features comparison

face recognition, immersive and video conferencing. The focus of this camera is on depth data capture in the range between 0.2 and 1.2 meters. Therefore, during the tests, the F200 camera ended up being discarded due to its low accuracy when measuring depths greater than 1.2 meters.

The R200 camera (Figure 3.4) was chosen for data capture, since its indoor range is approximately 0.5-3.5 meters and its outdoor range reaches up to 10 meters. It has 3 cameras providing RGB and stereoscopic IR to produce depth. The color camera is capable of 32-bit RGB at 1080p and 60 FPS (frames per second) using fixed focus with a 16:9 aspect ratio. The RGB camera has a slightly larger FOV (field of view) than the dual cameras, but it should not be used as a standalone camera. The dual depth sensors use fixed focus and a 4:3 aspect ratio with a 70x59x46 degree field of view (CULBERTSON, 2015).

Source: (CULBERTSON, 2015)

Figure 3.4 – Intel© RealSense™ R200 camera components.


It brings the camera and audio specifications that were configured based on the indoor scenario and the details about the storage of each sample.

3.3 Data Capturing Module

The database collected for this work consists of RGB images, depth images and audio samples, captured in indoor environments with one or more persons present. The RGB-D images were collected using the Intel© RealSense™ R200 camera, which was connected to a Lenovo© IdeaPad™ laptop for power and data transfer via a USB 3.0 connection. The ambient audio was collected in synchrony with the video frames through threads that control the capture process.
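As an illustration of how two capture threads can be started together, the following is a minimal sketch and not the code used in this work. The function names, the use of a barrier and a stop event, and the short 0.3 second demo duration are all assumptions. The actual frame and audio grabbing is replaced by a counter.

```python
import threading
import time


def capture_worker(name, start_barrier, stop_event, period_s, log):
    """Wait for the shared start signal, then 'capture' until told to stop."""
    start_barrier.wait()  # both workers are released at the same moment
    log[name] = {"start": time.monotonic(), "count": 0}
    while not stop_event.is_set():
        log[name]["count"] += 1  # placeholder for grabbing a frame or an audio chunk
        time.sleep(period_s)


def record_sample(duration_s=0.3):
    """Run a video worker and an audio worker side by side for duration_s seconds."""
    log = {}
    start_barrier = threading.Barrier(2)  # one slot per capture thread
    stop_event = threading.Event()
    workers = [
        # Hypothetical periods: one frame at 30 fps, one 1024-sample chunk at 44100 Hz.
        threading.Thread(target=capture_worker,
                         args=("video", start_barrier, stop_event, 1 / 30, log)),
        threading.Thread(target=capture_worker,
                         args=("audio", start_barrier, stop_event, 1024 / 44100, log)),
    ]
    for worker in workers:
        worker.start()
    time.sleep(duration_s)
    stop_event.set()  # ask both workers to finish their current iteration
    for worker in workers:
        worker.join()
    return log


log = record_sample()
offset = abs(log["video"]["start"] - log["audio"]["start"])
print(f"start offset between streams: {offset * 1000:.2f} ms")
```

The barrier only aligns the start of the two streams. Keeping them aligned over a longer recording would additionally require timestamps on every frame and chunk.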

Pilot tests revealed some relevant requirements for the recording process, such as the need to obtain RGB images at a higher resolution than the depth images (whose resolution is limited by the camera) and the choice of a suitable camera height. The resolution chosen was 640x480 for the depth frames and 1920x1080 for the RGB frames. The camera height was set to one meter to reduce occlusion problems and to increase the camera's field of view.
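The two resolutions do not share an aspect ratio, which the short calculation below makes explicit. It is only an arithmetic illustration with assumed variable names. It does not describe how RGB and depth pixels were aligned in this work, which would normally rely on the camera calibration.

```python
from fractions import Fraction

RGB_RES = (1920, 1080)   # width, height of the RGB frames
DEPTH_RES = (640, 480)   # width, height of the depth frames


def aspect_ratio(width, height):
    """Return the aspect ratio as a reduced fraction (width / height)."""
    return Fraction(width, height)


rgb_ratio = aspect_ratio(*RGB_RES)
depth_ratio = aspect_ratio(*DEPTH_RES)

# Per-axis scale factors between the two image sizes.
scale_x = Fraction(RGB_RES[0], DEPTH_RES[0])
scale_y = Fraction(RGB_RES[1], DEPTH_RES[1])

print(f"RGB aspect ratio:   {rgb_ratio.numerator}:{rgb_ratio.denominator}")
print(f"Depth aspect ratio: {depth_ratio.numerator}:{depth_ratio.denominator}")
print(f"Scale factors (x, y): {float(scale_x)}, {float(scale_y)}")
```

Because the horizontal and vertical scale factors differ, a simple uniform resize cannot map one image onto the other.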

The Intel© RealSense™ SDK documentation describes many configuration options for working with the R200 module in different scenarios. The open-source tool cpp-config-ui ((ROS), 2019), provided with the Intel© development kit, was used to find suitable recording settings through a user interface (Figure 3.5) that allows real-time frame comparison.

The final camera settings used during the recording phase are listed in Table 3.2. Most of them are the camera defaults, but minor changes were required to work with indoor scenarios under low lighting conditions.
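One possible way to keep such a configuration reproducible is to store it next to the recordings. The sketch below uses the option names and values of Table 3.2. The dictionary name, the JSON layout and the choice to represent "Automatic" as a string are assumptions, not part of the original tooling.

```python
import json

# Option names and values as listed in Table 3.2.
R200_SETTINGS = {
    "color_brightness": 56,
    "color_backlight_compensation": 1,
    "color_contrast": 32,
    "color_exposure": "Automatic",
    "color_gain": 32,
    "color_gamma": 220,
    "color_hue": 0,
    "color_saturation": 128,
    "color_sharpness": 0,
    "color_white_balance": "Automatic",
    "lr_gain": 400,
    "lr_exposure": 164,
    "emitter_enabled": True,
    "dc_estimate_median_decrement": 5,
    "dc_estimate_median_increment": 5,
    "dc_median_threshold": 192,
    "dc_score_minimum_threshold": 1,
    "dc_score_maximum_threshold": 512,
    "dc_texture_count_threshold": 6,
    "dc_texture_difference_threshold": 24,
    "dc_second_peak_threshold": 27,
    "dc_neighbor_threshold": 7,
    "dc_lr_threshold": 24,
    "dc_preset": 5,
}

# Serialize to JSON text and read it back to check the round trip.
serialized = json.dumps(R200_SETTINGS, indent=2)
restored = json.loads(serialized)
print(f"{len(restored)} options serialized")
```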

The audio was stored in WAV format, with a sampling rate of 44100 Hz, chunks of 1024 samples, 2 channels and 2 bytes (16 bits) per sample. We used the PyAudio module (MIT; T-PARTY, 2019) to manage the recording through its callback mode during thread execution.
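To make these parameters concrete, the sketch below writes a WAV stream with the same sampling rate, channel count and sample width, using only the Python standard library. It writes silence into an in-memory buffer instead of recording from a microphone, so it does not reproduce the PyAudio capture itself. The function name is hypothetical, and reading "a chunk of 1024 samples" as 1024 frames per write is an assumption.

```python
import io
import wave

SAMPLE_RATE_HZ = 44100   # sampling rate
CHANNELS = 2             # stereo
SAMPLE_WIDTH_BYTES = 2   # 16 bits per sample
CHUNK_FRAMES = 1024      # frames written per chunk (assumed meaning of "chunk")


def write_silent_wav(target, duration_s):
    """Write duration_s seconds of silence to target, chunk by chunk."""
    n_frames = int(SAMPLE_RATE_HZ * duration_s)
    bytes_per_frame = CHANNELS * SAMPLE_WIDTH_BYTES
    silence_chunk = b"\x00" * (CHUNK_FRAMES * bytes_per_frame)
    with wave.open(target, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        written = 0
        while written < n_frames:
            frames = min(CHUNK_FRAMES, n_frames - written)
            wav.writeframes(silence_chunk[: frames * bytes_per_frame])
            written += frames
    return n_frames


buffer = io.BytesIO()
write_silent_wav(buffer, duration_s=1.0)

# Read the header back to confirm the stored format.
buffer.seek(0)
with wave.open(buffer, "rb") as wav:
    params = wav.getparams()
print(params.nchannels, params.sampwidth, params.framerate, params.nframes)
```

With these settings, each second of audio occupies 44100 x 2 x 2 = 176400 bytes of sample data.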

Regarding the participants, four people were invited to voluntarily act out interaction situations. Some scenes were composed of a group of people and others of a single individual. All participants signed a document authorizing the use of their image and of the conversations held in each recording session.

The scenes simulated common daily activities. They comprised situations in which the participants do not look at the robot (watching TV, browsing on a mobile phone or holding a group conversation), situations that show little interest in interacting with the


Source: Author

Figure 3.5 – Intel© RealSense™ R200: Use of the cpp-config-ui application to analyze camera settings.

| Item | Configured Value |
|---|---|
| color_brightness | 56 |
| color_backlight_compensation | 1 |
| color_contrast | 32 |
| color_exposure | Automatic |
| color_gain | 32 |
| color_gamma | 220 |
| color_hue | 0 |
| color_saturation | 128 |
| color_sharpness | 0 |
| color_white_balance | Automatic |
| lr_gain | 400 |
| lr_exposure | 164 |
| emitter_enabled | True |
| dc_estimate_median_decrement | 5 |
| dc_estimate_median_increment | 5 |
| dc_median_threshold | 192 |
| dc_score_minimum_threshold | 1 |
| dc_score_maximum_threshold | 512 |
| dc_texture_count_threshold | 6 |
| dc_texture_difference_threshold | 24 |
| dc_second_peak_threshold | 27 |
| dc_neighbor_threshold | 7 |
| dc_lr_threshold | 24 |
| dc_preset | 5 |


robot (participants briefly greet the robot) and situations in which they show a high interest in starting an interaction (staring at the robot or turning their attention to it). The samples were collected in different indoor environments, with variations in natural light, so that the system can become generic and robust. The recording sessions were carried out in the following simulated environments:

∙ Social: Indoor rooms such as gym and game room.
∙ Home: Indoor rooms such as living room and TV room.
∙ Workplace: Indoor rooms such as office and meeting room.

The Intel© RealSense™ R200 was designed to focus on medium distances and can capture RGB-D frames at up to 60 fps (INTEL REALSENSE TECHNOLOGY, 2017). Since our approach focuses on differentiating macro characteristics of the scene within a short time interval, the default value of 30 fps was used. Also, infrared frames showed no significant utility in our latest experiments, so only RGB and depth frames were considered.

This work models three levels of interaction, representing the ways in which a robot could approach humans during initial interactions. Therefore, 100 samples representing each level of interaction were collected. Moreover, the recording duration chosen was seven seconds, a period considered appropriate for recognizing individual and group features and their variations over time.

As a result of this data collection, approximately 63000 frames and 300 WAV audio files were stored on the laptop's SSD (Solid State Drive). We used these samples as the basis for constructing the final database, which contains the raw characteristics of the social context, including image and audio features of the individuals. Section 3.4 describes the steps taken and the structure of this database.
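These totals follow from the recording parameters stated earlier, as the short check below shows. It assumes that each RGB-D capture instant is counted as one frame, which is an interpretation of the reported figure rather than something stated explicitly. The variable names are illustrative.

```python
INTERACTION_LEVELS = 3     # levels of interaction modelled
SAMPLES_PER_LEVEL = 100    # samples collected per level
DURATION_S = 7             # recording duration of each sample, in seconds
FPS = 30                   # frame rate used for capture

total_samples = INTERACTION_LEVELS * SAMPLES_PER_LEVEL   # one WAV file per sample
frames_per_sample = DURATION_S * FPS
total_frames = total_samples * frames_per_sample

print(f"samples (and WAV files): {total_samples}")
print(f"frames per sample:       {frames_per_sample}")
print(f"total frames:            {total_frames}")
```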

Given that the database was created from scratch, the expected social robot reaction to each scene was labelled by three annotators to reduce bias. The labelling process was based on the raw collected samples, and the most frequent annotation was chosen as the class of the respective scene. The possible labels are described in Chapter 4.
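The "most frequent annotation" rule can be sketched as a simple majority vote. The label names below are placeholders, since the actual classes are only introduced in Chapter 4. Returning None on a tie is also an assumption, as the handling of ties is not described here.

```python
from collections import Counter


def majority_label(annotations):
    """Return the most frequent label, or None when the top labels are tied."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    counts = Counter(annotations).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: no single most frequent label (assumed behaviour)
    return counts[0][0]


# Placeholder labels standing in for the classes defined in Chapter 4.
print(majority_label(["level_1", "level_1", "level_3"]))
print(majority_label(["level_1", "level_2", "level_3"]))
```

With three annotators, a tie can only occur when all three choose different labels.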

3.4 Human-Robot Initial Interaction Multimodal Database

The samples containing the RGB-D frames and audio information described in Section 3.3 were processed to create the Human-Robot Initial Interaction Multimodal Database. The purpose of providing a public pre-processed database is to give social robotics researchers access to consolidated and organized information about the initial context of human-robot interactions, and to enable new studies or improvements to the system developed in this work.

For each sample, the information content was divided in two: the RGB and depth images were processed frame by frame to retrieve the individuals' features, and the audio file was processed as a whole to obtain the characteristics of the ambient sound over the entire time interval.

3.4.1 Image information

Concerning the RGB frames, some features were analyzed using OpenFace (TADAS et al., 2018), a toolkit capable of facial landmark detection, head pose estimation, facial action unit recognition and eye-gaze estimation. The application has built-in modules based on deep learning models that are responsible for recognizing facial characteristics. Figure 3.6 shows an overview of some of the features that can be analyzed with OpenFace.

Source: (TADAS et al., 2018)

Figure 3.6 – Examples of features extracted with OpenFace

The detection of each face region in a frame requires a high-quality image with good lighting conditions and the right focus. Moreover, different head poses are a common challenge in the computer vision field, especially when there are profile poses in the scene. Even with an image resolution of 1920x1080, the OpenFace results presented many false negatives when detecting faces in the scene and their characteristics.