Business relationship network model from social reactions data


PROGRAMA DE PÓS-GRADUAÇÃO EM COMPUTAÇÃO APLICADA

DIEGO PAOLO TSUTSUMI

BUSINESS RELATIONSHIP NETWORK MODEL FROM SOCIAL REACTIONS DATA

MASTER'S DISSERTATION

CURITIBA

2019


Master's dissertation presented to the examining committee of the Programa de Pós-Graduação em Computação Aplicada of the Universidade Tecnológica Federal do Paraná as a requirement for obtaining the degree of Master in Applied Computing.

Advisor: Prof. Dr. Thiago Henrique Silva

CURITIBA 2019


Tsutsumi, Diego Paolo

Business relationship network model from social reactions data [electronic resource] / Diego Paolo Tsutsumi. -- 2020.

1 text file (49 leaves): PDF; 6.22 MB. Access mode: World Wide Web.

Title taken from the title screen (viewed on 9 Mar. 2020). Text in English with abstract in Portuguese.

Dissertation (Master's) - Universidade Tecnológica Federal do Paraná. Programa de Pós-graduação em Computação Aplicada, Curitiba, 2020.

Bibliography: leaves 42-45.

1. Computing - Dissertations. 2. Online social networks. 3. Social networks - Applications. 4. Communities - Organization. 5. Public opinion. 6. Social forecasting. 7. Forecasting. 8. Institutional cooperation. I. Silva, Thiago Henrique. II. Universidade Tecnológica Federal do Paraná. Programa de Pós-graduação em Computação Aplicada. III. Title.

CDD: 23rd ed. – 621.30 Central Library of UTFPR, Curitiba Campus


The development of this research has been very important for my academic as well as my personal growth. I had the opportunity to get to know many knowledgeable people who taught me a lot, from the basics of academic research to advanced concepts in computer science. I would like to express my sincere gratitude to all of them and to everyone who helped me grow on this journey.

First of all, I would like to thank Thiago Henrique Silva, my advisor, for all the inspiration, support, research directions and strong guidance he gave me during my Master's degree. Thiago is an exceptional professor and researcher, not only because he sets very high standards, but also because he has the ability to raise the level of the people around him.

I thank my family members, especially my mother and father, for being my strongest references and for all they have done to guide me to the completion of this Master's thesis. I would also like to thank all my colleagues at work, who always motivated me to complete my studies at UTFPR.

Last but not least, I would like to thank my partner, Amanda Trojan Fenerich, who is a researcher, for being the person who inspired me to start my Master's degree and for her daily support, always helping me grow.


TSUTSUMI, Diego P. Business Relationship Network Model from Social Reactions Data. 52 f. Dissertação de Mestrado – Programa de Pós-Graduação em Computação Aplicada, Universidade Tecnológica Federal do Paraná. Curitiba, 2019.

One of the ways to expand a company's reach, or even to keep it stable during a crisis, is to establish strategic partnerships with other companies. Given that, this study presents results obtained with a new data analysis model, which explores user reactions on social media to indicate strategic partnerships between companies. This study makes three main contributions to the literature: (i) a model of a network of relationships between companies; (ii) a community detection algorithm; and (iii) an algorithm for detecting outlier companies. The models were evaluated in real scenarios by exploring approximately 280 million reactions from Facebook users. As a contribution to the literature, this master's dissertation produced three full papers: one in the Journal of Internet Services and Applications (TSUTSUMI et al., 2019) and two in conferences, Web Intelligence (TSUTSUMI; SILVA, 2018) and CoUrb (TSUTSUMI et al., 2018). The results indicate that recommending partnerships between companies is possible using information available on social media.

Keywords: Social Media, Community Detection, Business Partnerships, Link Prediction


One of the primary ways to expand a business or to keep it stable during a crisis is to create partnerships with other companies. Given that, this study presents results regarding a new data model, which explores user reactions on social media to indicate strategic business partnerships. This study makes three main contributions to the literature: (i) a business relationship network model; (ii) a business community detection algorithm; and (iii) a business outlier detection algorithm. The contributions were evaluated on real data comprising approximately 280 million user reactions on Facebook. Contributing to the literature, this Master's thesis produced three full papers: one in the Journal of Internet Services and Applications (TSUTSUMI et al., 2019) and two in proceedings, Web Intelligence (TSUTSUMI; SILVA, 2018) and CoUrb (TSUTSUMI et al., 2018). Results suggest that business partnership recommendation is possible using the information available in social media.

Figure 3.1 Illustration of the data collection process.

Figure 3.2 Data collection points considered for Curitiba, PR, Brazil.

Figure 3.3 Heatmap representing the number of businesses found in different regions by the business search process.

Figure 4.1 General view of the proposed analysis framework for identifying relations and affinity among businesses with social media data.

Figure 4.2 Users and Business Pages

Figure 4.3 Distribution of the number of users by number of reacted businesses.

Figure 4.4 Facebook subcategories renaming process.

Figure 4.5 Examples of non-normalized feature vector for each community.

Figure 4.6 Clusters, Communities and Businesses relationship illustration.

Figure 5.1 Partial view of the business graph on the map of Curitiba

Figure 5.2 Business Network, circles are businesses, colors are their categories

Figure 5.3 Communities detected by Algorithm 1

Figure 5.4 Word clouds for categories of similar community clusters. Each cluster legend contains the three biggest super categories of its signature according to Algorithm 2.

Figure 5.5 A chart presenting user reaction number per cluster

Figure 5.6 Framework output for Rubiane.

Figure A.1 Histogram of the top twenty companies in terms of user reaction number.

Figure A.2 Histogram of the top twenty categories in terms of user reaction number. The names are the original ones and do not reflect the renaming process shown in Figure 4.4.

Figure A.3 A map showing the amount of user reactions per region considered in the

TABLE 3.1 STRUCTURE OF THE CONSIDERED DATA

TABLE A.1 BUSINESS GRAPH NODES RANKED BY NUMBER OF CONNECTIONS

1 INTRODUCTION

1.1 MOTIVATION

1.2 OBJECTIVES

1.3 METHODOLOGY

1.4 OUTLINE

2 LITERATURE REVIEW

3 DATA COLLECTION AND PROCESSING

3.1 DATA CHOICE

3.2 DATA COLLECTION

3.3 DATA CLEANING

4 BUSINESS RELATIONSHIP NETWORK MODEL

4.1 OVERVIEW

4.2 BUSINESS RELATIONSHIP GRAPH

4.3 FILTERS AND GRAPH CONSISTENCY

4.3.1 Users Filter

4.3.2 Weak Edges Filter

4.4 DETECTION OF BUSINESS COMMUNITIES

4.5 CLUSTERING OF BUSINESS COMMUNITIES

4.6 BUSINESS OUTLIER DETECTION

5 RESULTS AND DISCUSSIONS

6 CONCLUSION

7 FUTURE WORK

7.1 NEW DIRECTIONS AND INSIGHTS

REFERENCES

Appendix A -- SUPPLEMENTARY INFORMATION ON BUSINESS REACTION DATABASE AND BUSINESS RELATIONSHIP GRAPH


1 INTRODUCTION

1.1 MOTIVATION

Businesses and companies are continuously exposed to uncertain and dynamic environments. For instance, markets go into crisis or change preferences, competitors' strategies evolve, and newly elected politicians may also affect business environments by changing regulations and lobbying. All those factors make it difficult for business owners to make strategic decisions. One of the ways to expand a business or to keep it stable during a crisis is to make strategic partnerships with other businesses.

A partnership with a true win-win intention could provide the edge a business needs to surpass its competitors. However, a poorly thought out partnership can hinder instead of help (BERGQUIST et al., 1995; ELMUTI; KATHAWALA, 2001). Thus, choosing which partnerships are best for a particular business is challenging, and finding appropriate data to support this choice is key.

Many known factors, such as customers' tastes and opinions, are reflections of their preferences for businesses (TRAINOR et al., 2014; AGNIHOTRI et al., 2016; HUDSON; THAL, 2013). Previous works have also assumed that users implicitly manifest preferences through their actions in social media (CRANSHAW et al., 2012; SILVA et al., 2014; MUELLER et al., 2017; BRITO et al., 2018). From the social media perspective, since customers and companies regularly feed their social media profiles, users' opinion data are widely available on platforms such as Facebook and Twitter. For instance, user reactions to content shared by a particular business can represent users' preferences. A proper analysis of these data could be used along with traditional business analysis to improve the quality of businesses' strategic decisions.

By making better strategic decisions, businesses could also benefit the wider environment around them, from their neighborhood to society as a whole. For instance, if a bakery goes bankrupt after a series of bad decisions, this affects not only its customers, who would probably have to shop in other regions, but also the resources in its region, which, if better invested, could bring more jobs and improve the local economy.

Some studies analyzed social media data for business intelligence (LIN et al., 2016; KARAMSHUK et al., 2013; HOFFMAN; FODOR, 2010; TUTEN; SOLOMON, 2017), and other studies were carried out, without social media data, to understand business partnerships (BERGQUIST et al., 1995; ELMUTI; KATHAWALA, 2001); however, the exploration of social media data analysis in the context of business partnership decision making is, to the best of our knowledge, a novelty in the literature. Contributing to the literature, this study produced three full papers: one in the Journal of Internet Services and Applications (TSUTSUMI et al., 2019) and two in proceedings, Web Intelligence (TSUTSUMI; SILVA, 2018) and CoUrb (TSUTSUMI et al., 2018).

1.2 OBJECTIVES

In this direction, the general goal of this study is to propose and evaluate a data analysis model that helps business owners to choose strategic partnerships. To accomplish this goal, there are three specific objectives:

1. To produce and evaluate a business relationship model by exploring network theory;

2. To propose and explore a community detection algorithm specifically designed for business networks;

3. To propose and explore a business outlier detection algorithm in the presented model.

1.3 METHODOLOGY

This study explores user reactions to business pages on social media to show that this information is valuable for indicating possible business partnerships. In particular, this study explored more than 260 million public user reactions collected from Facebook users about 1924 businesses in Curitiba, PR, Brazil. Facebook was chosen, among many other social media platforms, because it is popular among both customers and businesses (FERRARI, 2016).

There are two main hypotheses in this study. The first is that a user who reacted to a business page is a potential customer. This seems reasonable, mainly because business owners make the same assumption when they actively advertise their products and services on their social media pages. The second hypothesis is that common user reactions between two businesses represent a potential partnership between them; technically, this hypothesis converts common user preferences into business affinity.


Two problems can emerge when using a data analysis approach to identify potential business partnerships regarding the type of data considered:

1. The data is typically noisy and can mislead knowledge extraction;

2. A newly created business does not have enough data to be analyzed.

This study tackles the first problem by proposing and exploring techniques that minimize its effects, improving the overall data quality. Regarding the second problem, this study proposes, as future work, a new link prediction model that might infer possible missing links in the original data model, enabling businesses with a small amount of data to be part of the analysis.

1.4 OUTLINE

In line with the main objective, this study covers the three specific objectives: (i) the business network model, (ii) the business community detection algorithm, and (iii) the business outlier detection algorithm.

The remaining chapters of this document are structured as follows: Chapter 2 presents studies in the literature that are relevant and related to this study; Chapter 3 explains what data were collected, how they were collected, and the challenges of real data collection; Chapter 4 details the first part of this study, the Business Network Model, showing the metrics of business relationships, which are the core of this study; Chapter 5 presents and discusses the results obtained; Chapter 6 summarizes the results, with conclusions on how the study contributed to the literature and to society in general, and leaves open questions and insights for future work; and Chapter 7 gives insights and proposals for future work, especially a possible link prediction model to infer missing connections in the original network.


2 LITERATURE REVIEW

According to Mukhopadhyay (MUKHOPADHYAY, 2018), with the advent of online technologies and social media, individuals are increasingly sharing their views and opinions through the internet, which influences and affects business, sociopolitical, and personal contexts. The academic community has made a considerable effort to extract relevant information from social media, as evidenced by the constant growth of the literature. For instance, the average annual growth rate of the number of publications over the last five years is around 0.15 at Scopus and 0.37 at Web of Science Core Collection¹, and a similar trend is observed in other scientific databases.

The information from social media can be used for different purposes (SILVA et al., 2019), ranging from mobility understanding (HUDSON; THAL, 2013; FERREIRA et al., 2015) and well-being improvement (SCHWARTZ et al., 2013; CHOUDHURY et al., 2016) to city semantics understanding (AIELLO et al., 2016; SANTOS et al., 2018) and gender behavior study (MUELLER et al., 2017; MAGNO; WEBER, 2014). Some of the studies in this direction have implications for existing and new businesses.

For instance, Cheng et al. (CHENG et al., 2011) presented spatiotemporal analyses of users’ displacement, exploring for that a dataset from a Foursquare-like system, i.e., a social media for location sharing. Their results can provide support in decisions about where and when to invest resources in a new business. By observing what people eat and drink in location-based social networks, Silva et al. (SILVA et al., 2014) developed an approach to identify boundaries between different cultures, which could be useful, for example, to businesses from a particular country that desire to verify the compatibility of cultural preferences across different markets.

Cranshaw et al. (CRANSHAW et al., 2012) introduced the concept of Livehoods, which are regions of a city partitioned and grouped by similarity of users’ behaviors. This behavioral targeting study may also be necessary for strategic decisions in companies. The geographic interaction characteristics of online public opinion propagation were also studied by Ai et al. (AI et al., 2018). Barbier et al. (BARBIER et al., 2011) proposed a method to understand the behavior and dynamics of online groups, showing that their results could have practical business implications, such as a better understanding of customers and influence propagation.

¹ These values represent average growth using the term “Social Media”, and “Social Media” and “Business” in

Yang et al. (YANG et al., 2018) presented the concept of core-periphery structures, that is, a two-class partition of nodes (one is the core, and the other is the periphery), and provide empirical evidence that social communities always have this type of structure. This is an important study for the classic task of community detection in social network analysis, with implications for the different influences among users within the network. Thus, core-periphery structures enable the identification of an influential actor, a business in this case, in a community. Regarding community detection, Fortunato (FORTUNATO, 2010) formally defines communities and presents a comparison of some algorithms for detecting communities. Rossetti and Cazabet (ROSSETTI; CAZABET, 2018) also give a background on community discovery; however, in contrast to (FORTUNATO, 2010), they address community structures in dynamic networks.

Wu et al. (WU et al., 2016) were also concerned with studying a community detection task, presenting a new framework for detecting parallel communities, called SIMPLifying and Ensembling (SIMPLE). Similarly, Liu et al. (LIU et al., 2017) used a Markov network for discovering latent links, i.e., links which are not directly observable but rather inferred, among people in social networks. Data analysis of social media based on community detection was also used by Alamsyah et al. (ALAMSYAH et al., 2017), and some challenges of community detection in social media were presented by Tang and Liu (TANG; LIU, 2010). Social community detection studies are important because they can be applied to networks of different types. In fact, this study presents a business network demonstrating the utility of social community detection in the business context.

Grizane and Jurgelane (GRIZANE; JURGELANE, 2017) reinforce the importance of social media as a marketing tool for business, and present a model assessing the benefits of investing financial resources in social media. Mahony et al. (MAHONY et al., 2018) investigate the adoption of social media by small businesses, showing that the use of social media can bring benefits not only to large companies but also to small and medium businesses. Kafeza et al. (KAFEZA et al., 2017) also focused on assessing business process performance using social media analysis and community detection methods, with the difference that they examine how communities change over time.

An interactive community mapping and detection scheme to reveal the dynamics of communities' evolution around an event is proposed by Giatsoglou et al. (GIATSOGLOU et al., 2015). This approach to identifying static and evolutionary communities over time is of particular interest in investigating business dynamism. Dynamic evolution over time was also studied by Pepin et al. (PEPIN et al., 2017), who presented an approach based on a graph model easily adaptable to an interactive visualization. Visual analysis can be useful, for example, in the presentation of business communities and how they change over the years.

Considering that the contents of social media are dynamic, Palsetia et al. (PALSETIA et al., 2014) presented an approach for detecting social communities based on posts, comments, and tweets of users. Their approach is, in some aspects, similar to Algorithm 1 presented in this study: they iteratively remove communities from the main graph, so that the detection of the current community is not influenced by previously detected communities.

Closer to the proposal of this study, two studies were developed to help entrepreneurs and decision makers to find the best place in a city to open a new business, and the core decision process is based on social media data (LIN et al., 2016; KARAMSHUK et al., 2013). Regarding the study of Lin et al. (LIN et al., 2016), business information, such as business type, location, and check-ins, is collected from public pages of Facebook in order to recommend the best places in the city of Singapore to open a new business. Similarly, Karamshuk et al. (KARAMSHUK et al., 2013) collected information from Foursquare also aiming to recommend better locations to business, but in contrast to the static data explored in the study of Lin, Karamshuk also considered users’ mobility.

To the best of my knowledge, identifying similarities, affinities and relationships among businesses from spontaneous user reactions in social media is a novelty in the literature. To achieve this goal, the present study proposes a new way of modeling these relationships, as well as a strategy to extract relevant connections among businesses. It also presents a strategy for detecting business communities in the proposed model.

It is important to note that our study could be used in conjunction with some of the previous efforts. For instance, the model proposed by Grizane and Jurgelane (GRIZANE; JURGELANE, 2017) could be used in conjunction with the model proposed in this research (explained in Chapter 4), in order to enrich the information provided to entrepreneurs about the impacts of the use of social media in business.

Besides offering a possible way to understand the formation of networks (MA et al., 2017), the link prediction problem could be relevant to several interesting current applications, such as friend recommendation in social networks, product recommendation in e-commerce, knowledge graph completion, discovery of interactions between proteins, and recovering missing reactions in metabolic networks (ZHANG; CHEN, 2017). Also, Scellato et al. (SCELLATO et al., 2011) used link prediction to identify the places that people visit.

been proposed to extract missing information, identify spurious interactions, evaluate network evolving mechanisms, and so on (LÜ; ZHOU, 2011). It is true that this is not the first paper to apply supervised learning to the link prediction problem, but there are important differences versus past work.

According to Lichtenwalter et al. (LICHTENWALTER et al., 2010), previous research has typically approached link prediction as an unsupervised problem; they explore supervised learning, examining many significant factors that influence and guide the classification, for instance: network observational period, generality of existing methods, variance reduction, topological causes and degrees of imbalance, and sampling approaches.

A next-generation link prediction method, the Weisfeiler-Lehman Neural Machine (WLNM), has been proposed by Zhang and Chen (ZHANG; CHEN, 2017). It learns topological features in the form of graph patterns that promote the formation of links. As this algorithm learns graph patterns, it does not need any previous metrics and heuristics, such as common neighbors, to perform well in different network structures. The proposed model in the present work utilizes a similar approach to capture network patterns around the links using supervised learning algorithms.

Both Wang et al. (WANG et al., 2015) and Lü and Zhou (LÜ; ZHOU, 2011) presented important literature reviews, which form a basis for research on the subject; they include a systematic categorization of link prediction techniques, summarize recent progress, and describe typical applications.


3 DATA COLLECTION AND PROCESSING

3.1 DATA CHOICE

The decision of what data to collect is important to support further analysis, as well as to meet possible limitations imposed by the data collection process. For this study, data were collected from Facebook, because it is the most used social media platform in Brazil (FERRARI, 2016). According to Ferrari (FERRARI, 2016), in Brazil alone the number of Facebook users reached 74.8 million. Facebook is also highly relevant for businesses to create new relationships and maintain ongoing ones with their customers; therefore, data are widely available for analysis (FERRARI, 2016).

As the purpose of the model is to identify relationships, similarities and affinities among businesses by exploring user reaction data on Facebook, the data were chosen considering our objectives and what is publicly available on Facebook. Table 3.1 displays the structure of the considered data. We could choose similar information from other social media platforms; however, this assessment is outside the scope of the present study.

Table 3.1: Structure of the considered data.

| Business Data | Example | User Reaction Data | Example |
|---|---|---|---|
| Business ID | 166765230043005 | User ID | 154573625550 |
| Business Name | Rubiane Frutos do Mar | User Name | Fulano de Tal |
| Location | -25.516122, 49.231571 | Reacted Business ID | 166765230043005 |
| Category | Seafood Restaurant | Reaction Type | Like |
| Number of Check-ins | 38627 | | |
| Number of Fans | 15532 | | |
| Average Evaluation | 4.6 | | |

Source: Designed by the author.

The first column of Table 3.1, called Business Data, represents data referring to the businesses themselves, such as their geographical location, their category (i.e., the market sector in which the business operates) and so forth. Therefore, each business in the dataset has all the information described in the first column. The second column contains User Reaction Data for businesses in our dataset. Each reaction expressed on Facebook comes from a user and refers to a particular business.
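As a minimal sketch, the two record types in Table 3.1 could be represented as follows. The class and field names are assumptions chosen for illustration; they are not names used by the Facebook Graph API or by this study's implementation. The example values are the ones shown in Table 3.1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Business:
    """One row of the Business Data column (hypothetical field names)."""
    business_id: str
    name: str
    latitude: float
    longitude: float
    category: str
    checkins: int
    fans: int
    average_evaluation: Optional[float] = None


@dataclass(frozen=True)
class Reaction:
    """One row of the User Reaction Data column (hypothetical field names)."""
    user_id: str
    user_name: str
    business_id: str  # the business the user reacted to
    reaction_type: str


# Example values copied from Table 3.1.
rubiane = Business(
    business_id="166765230043005",
    name="Rubiane Frutos do Mar",
    latitude=-25.516122,
    longitude=49.231571,
    category="Seafood Restaurant",
    checkins=38627,
    fans=15532,
    average_evaluation=4.6,
)
like = Reaction(
    user_id="154573625550",
    user_name="Fulano de Tal",
    business_id="166765230043005",
    reaction_type="Like",
)
```

The shared `business_id` is what links a reaction to the business it refers to.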

User reaction data make it possible to create similarity connections among businesses, primarily by using the common reactions expressed by users regarding two or more businesses, as explained in Chapter 4. The business data is used mainly to extract names and categories of the businesses, which assists the reaction collection and evaluation of the results presented in Chapter 5.

All information collected is open and publicly available on the Facebook platform. More details can be found on the Facebook Graph API¹.

3.2 DATA COLLECTION

Figure 3.1: Illustration of the data collection process.

Source: The Author.

Figure 3.1 illustrates the main steps performed in the collection process. The data were collected using the Facebook Graph API. All businesses in the dataset are located in the city of Curitiba, Brazil, and all user reaction data are from November to December of 2017. First, we collected the data in the first column of Table 3.1, i.e., Business Data. A Facebook Graph API search requires a geographical coordinate and a radius in meters, and returns results within the circle centered at that coordinate with the given radius, up to 800 results per search, that is, up to 800 businesses in this case. Since several regions of the studied city may have at least this many businesses, we performed twenty-one geographic searches throughout the city to increase the chance of capturing most businesses in all regions of Curitiba. Each geographic search has a radius of 2000 meters and is centered in a different region of Curitiba, as shown in Figure 3.2.

¹ https://developers.facebook.com.
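The search strategy described above can be sketched as follows. This is an illustration under stated assumptions, not the collection code used in the study: `search` is a hypothetical callable standing in for the API request, `fake_search` simulates it over an in-memory list, and the coordinates are made up. The sketch shows how results from overlapping circular searches can be merged by business ID so that a business found by two searches is stored once.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def fake_search(pages, limit=800):
    """Build a stand-in for the API: returns pages inside a circle, capped at `limit`."""
    def search(lat, lon, radius_m):
        hits = [p for p in pages if haversine_m(lat, lon, p["lat"], p["lon"]) <= radius_m]
        return hits[:limit]
    return search


def collect_businesses(centers, search, radius_m=2000):
    """Run one search per center and merge the results by business ID."""
    merged = {}
    for lat, lon in centers:
        for record in search(lat, lon, radius_m):
            merged.setdefault(record["id"], record)  # keep the first copy only
    return merged


# Illustrative data: "a" and "b" fall inside both circles, "c" inside neither.
pages = [
    {"id": "a", "lat": -25.4300, "lon": -49.2700},
    {"id": "b", "lat": -25.4400, "lon": -49.2700},
    {"id": "c", "lat": -25.6000, "lon": -49.2700},
]
centers = [(-25.4300, -49.2700), (-25.4450, -49.2700)]
businesses = collect_businesses(centers, fake_search(pages))
```

Keying the merged results by ID means overlapping circles do not inflate the dataset, although the study also reports a separate duplicate-removal step in Section 3.3.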

Figure 3.3 shows a heatmap representing the number of businesses found in different regions by the collection process just described. The redder the color, the more businesses were found in that particular location. As we can see, the central region of the city has a reddish coloration, indicating that more businesses were collected in that region, as expected. Despite that, the resulting dataset includes businesses spread all around the city of Curitiba. Figure A.3 in Appendix A shows how user reactions are distributed across regions of the city; as expected, the number of reactions is higher in the city center.

Figure 3.2: Data collection points considered for Curitiba, PR, Brazil.

Figure 3.3: Heatmap representing the number of businesses found in different regions by the business search process.

After obtaining the results of the geographical searches (Business Data in Table 3.1), containing basic data of the businesses in Curitiba, the reactions of the users (User Reaction Data in Table 3.1) were collected from the business pages previously collected. For that, we obtained the reactions to the posts on the business pages. Because some business pages have hundreds of posts and others have millions, only the reactions to the one hundred most recent posts of each page were collected. There are five types of reactions available on Facebook, namely Like, Angry, Wow, Sad and Thankful; we included all types in the database.

We collected a total of 1,986 georeferenced pages and approximately 280 million user reactions related to those pages. Appendix A presents supplementary information about this dataset: Figure A.1 shows the top twenty businesses by number of user reactions, and Figure A.2 shows the top twenty business categories by number of user reactions.

3.3 DATA CLEANING

After obtaining all the data, an automatic and manual cleaning procedure was performed to increase the consistency of the dataset, consequently increasing the consistency of the final results. We performed three main steps:

• Duplicate records removal (automatic procedure);

• Inconsistent records removal, e.g., unnamed pages or pages without location (automatic procedure);

• Removal of pages that do not represent businesses (for example, a public square) and their reactions (automatic and manual procedure).

After the cleaning process, the dataset is left with 1,926 pages, all representing businesses, and approximately 260 million reactions.
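The three cleaning steps above can be sketched as a single pass over the page records. This is a simplified illustration, not the procedure's actual implementation: the record layout (`id`, `name`, `location`, `business_id`) is assumed, and the set `non_business_ids` stands in for the outcome of the manual review of pages that are not businesses.

```python
def clean_pages(pages, reactions, non_business_ids=frozenset()):
    """Return (kept_pages, kept_reactions) after the three cleaning steps."""
    seen = set()
    kept = []
    for page in pages:
        pid = page.get("id")
        if pid is None or pid in seen:
            continue  # step 1: duplicate (or unidentifiable) record
        if not page.get("name") or page.get("location") is None:
            continue  # step 2: inconsistent record (unnamed or without location)
        if pid in non_business_ids:
            continue  # step 3: page that does not represent a business
        seen.add(pid)
        kept.append(page)
    # Reactions to removed pages are dropped together with the pages.
    kept_reactions = [r for r in reactions if r["business_id"] in seen]
    return kept, kept_reactions


# Illustrative records (made up for this example).
raw_pages = [
    {"id": "1", "name": "Bakery", "location": (-25.43, -49.27)},
    {"id": "1", "name": "Bakery", "location": (-25.43, -49.27)},         # duplicate
    {"id": "2", "name": "", "location": (-25.44, -49.28)},               # unnamed
    {"id": "3", "name": "Cafe", "location": None},                       # no location
    {"id": "4", "name": "Public Square", "location": (-25.42, -49.26)},  # not a business
]
raw_reactions = [
    {"user_id": "u1", "business_id": "1", "type": "Like"},
    {"user_id": "u2", "business_id": "4", "type": "Wow"},
]
kept_pages, kept_reactions = clean_pages(raw_pages, raw_reactions, non_business_ids={"4"})
```

In this example only the bakery page and the single reaction to it survive the cleaning.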


4 BUSINESS RELATIONSHIP NETWORK MODEL

4.1 OVERVIEW

The main steps employed in this study to achieve the proposed objectives can be described as a framework, proposed by the authors of this study and illustrated in Figure 4.1. The framework inputs are the Clean Data, described in Chapter 3, and a Target Business, the business chosen to be analyzed. As outputs, we obtain the Egonet of the Target Business, a network of the most relevant direct connections of the target business, and the Business Tagged Communities, the tagged communities of businesses to which the target business belongs. All non-standard businesses inside communities, considered outliers, are tagged. We provide more details about those outputs next. A relevant feature of this framework is that it is designed to handle data from any data source. This study uses Facebook data; however, this is not a restriction of the framework.

Figure 4.1: General view of the proposed analysis framework for identifying relations and affinity among businesses with social media data.

4.2 BUSINESS RELATIONSHIP GRAPH

With the obtained data, we can then create a model to represent relations, similarities and affinities among businesses. The model is an undirected graph in which vertices represent businesses, and weighted edges represent relations between two businesses. Building on our first hypothesis, that users who reacted to a particular business page are potential customers, our second hypothesis states that the affinity between two businesses is proportional to the number of users who reacted to both businesses, i.e., their common users. The relation between two businesses gets stronger as their common users grow in proportion to the combined set of their users. Thus, technically, edges are weighted using the Jaccard Index of the sets of users of the two businesses, which serves as an index of affinity or similarity between the two sets.

In a more formal way, consider

$$B = \{b_1, b_2, \ldots, b_{n_b}\} \tag{4.1}$$

being the set of all businesses, where $n_b$ is the total number of businesses in the dataset. Now consider $U_i$ being the set of all users who reacted to the i-th business; Figure 4.2 illustrates the definition of the set $U_i$. Thus, the graph is defined as in (4.2):

Figure 4.2: Users and Business Pages

$$BusinessGraph = (V, E, W), \tag{4.2}$$

where vertices are businesses, $V = B$. As in Equation (4.3), edges exist if businesses have more than lowerBound users in common, with lowerBound defined later in the Weak Edges Filter (Section 4.3.2),

$$E = \{(i, j) : |U_i \cap U_j| > lowerBound\} \tag{4.3}$$

and the weights of the edges are given, as in Equation (4.4), by the Jaccard Index:

$$W(i, j) = \begin{cases} \dfrac{|U_i \cap U_j|}{|U_i \cup U_j|} & \text{if } (i, j) \in E \\ 0 & \text{if } (i, j) \notin E \end{cases} \tag{4.4}$$
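As an illustration of Equations (4.3) and (4.4), the sketch below builds the weighted edge set from the user sets. The function name `build_business_graph`, the dictionary-based representation, and the toy data are assumptions made for this example and are not part of the original implementation.

```python
from itertools import combinations


def build_business_graph(users_by_business, lower_bound):
    """Return {(i, j): Jaccard weight} for every pair of businesses that has
    more than lower_bound users in common."""
    edges = {}
    for i, j in combinations(sorted(users_by_business), 2):
        common = users_by_business[i] & users_by_business[j]
        if len(common) > lower_bound:                       # Equation (4.3)
            union = users_by_business[i] | users_by_business[j]
            edges[(i, j)] = len(common) / len(union)        # Equation (4.4)
    return edges


# Toy user sets U_i, invented for illustration only.
toy_users = {
    "b1": {"u1", "u2", "u3"},
    "b2": {"u2", "u3", "u4"},
    "b3": {"u5"},
}
toy_graph = build_business_graph(toy_users, lower_bound=1)
```

Here b1 and b2 share two of four distinct users, so the only edge is (b1, b2) with weight 0.5; pairs absent from the dictionary implicitly have weight 0.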

4.3 FILTERS AND GRAPH CONSISTENCY

In order to increase the consistency of the information about the graph structure, two essential filtering steps are considered. The first, called the Users Filter, is performed before building the business graph and aims to reduce the amount of unnecessary input information used to build the graph. The second, called the Weak Edges Filter, is performed after the graph has been created; its goal is to remove poor links that are unlikely to represent expressive relationships between two businesses.

4.3.1 USERS FILTER

Considering that user preferences represent potential customers, and also that potential common customers can form potential partnerships between businesses, the users filter first eliminates negative reactions (of the types Angry and Sad) when building the set of users $U_i$. Therefore, a user u who reacted negatively to a business $b_i$ is not part of the set $U_i$.

After eliminating users with negative reactions, the users filter eliminates users who do not express themselves frequently enough about businesses in B. Thus, users with two or fewer reacted businesses are eliminated from the model, leaving the users filter lower bound as 3 reacted businesses.

On the other hand, the filter also eliminates users with too many reacted businesses, for two reasons. (i) The number of edges e to which a user with m reacted businesses contributes in the graph is quadratic, $e = \frac{m(m-1)}{2}$; for instance, according to Equation (4.3), a user with 500 reacted businesses contributes to the generation of 124,750 edges, unbalancing their influence over users with few reacted businesses. (ii) Users with too many reacted businesses could be robots (bots), a problem that appears in several Web systems (TASSE et al., 2017).
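The quadratic growth mentioned in reason (i) can be checked with a few lines of code. The helper name `edge_contributions` is introduced here only for illustration.

```python
def edge_contributions(m):
    """Number of business pairs (candidate edges) touched by a user with m reacted businesses."""
    return m * (m - 1) // 2


# A user at the lower bound versus a very active user.
contributions = {m: edge_contributions(m) for m in (3, 10, 500)}
```

A user with 3 reacted businesses contributes to 3 pairs, while a user with 500 contributes to 124,750 pairs, which is the imbalance the upper bound of the filter is meant to limit.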

Keeping in mind that users with too many reacted businesses unbalance the edge generation, and in order to maintain a fair balance between representativeness and potential problems, 99.9% of the sum of all user reactions (for users with 3 or more reacted businesses) in the original database are kept. The filter upper bound x is calculated from the proportion of user reactions $R_{3-x}$ (number of reactions from users with 3 to x reacted businesses) over $R_{3-\infty}$ (number of reactions from all users with 3 or more reacted businesses), as in Equation (4.5). Therefore, users with more than x reacted businesses are considered either to be robots or to unbalance the edge generation.

$$\frac{R_{3-x}}{R_{3-\infty}} = 99.9\% \tag{4.5}$$

Figure 4.3: Distribution of the number of users by number of reacted businesses.

As depicted in Figure 4.3, which shows the distribution of total users per number of reacted businesses, the filter upper bound calculated for the studied dataset is x = 174; therefore, the users filter considers only users who reacted to between 3 and 174 businesses. After the users filtering process, the resulting dataset has 220 million user reactions.
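A minimal sketch of the users filter is given below. It rests on several assumptions that are not stated in the text: reactions arrive as (user, business, reaction type) triples, reaction types are encoded as upper-case strings, a user counts one reaction per distinct reacted business, and the upper bound x is taken as the smallest value whose cumulative share reaches the requested proportion of Equation (4.5).

```python
from collections import Counter

NEGATIVE = {"ANGRY", "SAD"}  # hypothetical encoding of the negative reaction types


def users_filter(reactions, lower=3, keep_share=0.999):
    """Return ({user: set of reacted businesses}, upper bound x)."""
    # Step 1: drop negative reactions and keep distinct (user, business) pairs.
    reacted = {}
    for user, business, kind in reactions:
        if kind not in NEGATIVE:
            reacted.setdefault(user, set()).add(business)
    # Step 2: lower bound, users need at least `lower` reacted businesses.
    counts = {u: len(b) for u, b in reacted.items() if len(b) >= lower}
    # Step 3: upper bound x from the cumulative share of reactions (Equation 4.5).
    per_size = Counter(counts.values())            # size -> number of users
    total = sum(size * n for size, n in per_size.items())
    running, upper = 0, lower
    for size in sorted(per_size):
        running += size * per_size[size]
        upper = size
        if running >= keep_share * total:
            break
    kept = {u: reacted[u] for u, c in counts.items() if c <= upper}
    return kept, upper


# Toy reactions, invented for illustration only.
toy_reactions = [
    ("ana", "b1", "LIKE"), ("ana", "b2", "LOVE"), ("ana", "b3", "LIKE"),
    ("bob", "b1", "LIKE"), ("bob", "b2", "LIKE"),
    ("cid", "b1", "LIKE"), ("cid", "b2", "ANGRY"), ("cid", "b3", "WOW"),
    ("dan", "b1", "LIKE"), ("dan", "b2", "LIKE"), ("dan", "b3", "LIKE"), ("dan", "b4", "HAHA"),
]
kept_users, upper_bound = users_filter(toy_reactions)
```

In the toy data, bob is removed by the lower bound and cid falls below it once the Angry reaction is discarded, leaving ana and dan with an upper bound of 4.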

4.3.2 WEAK EDGES FILTER

Once users are filtered and the business network is built, the weak edges filter removes possible noise in the graph structure, i.e., edges that are unlikely to represent expressive relationships among businesses; those edges are considered weak edges. An edge is considered weak by comparing it to an average edge in a randomly generated graph with properties similar to those of the original graph (number of vertices and reactions). In this random graph, the user reaction contributions to edge formation (as defined in Equation (4.3)) are randomly and uniformly distributed among all possible edges. After this simulation process, the edges of the original structure whose weight is similar to that of the edges in the random graph can be considered weak edges.

When user reactions are uniformly distributed among all possible edges, the weight of any edge in the random graph follows a binomial distribution. The expected value and the variance of the weight of any edge in the random graph are, respectively, given by Equations (4.6) and (4.7):

$$\mu = E[X] = \frac{n_r}{n_c} \tag{4.6}$$

$$\sigma^2 = Var[X] = \frac{n_r}{n_c}\left(1 - \frac{1}{n_c}\right) \tag{4.7}$$

where $n_r = \sum_{i,j : i \neq j} \frac{|U_i \cap U_j|}{2}$ is the sum of weights in the original graph, $n_c = \frac{n_b(n_b-1)}{2}$ is the number of all possible edges, and $n_b$ is the number of businesses in the dataset.

In this way, an edge between businesses i and j is weak if $|U_i \cap U_j| \leq lowerBound$, with lowerBound defined in Equation (4.8) following the 3σ statistics for the binomial distribution, which includes 99.73% of random edges.

$$lowerBound = \mu + 3\sigma \tag{4.8}$$

For the data collected in this study the calculated values were: µ = 120.289, σ = 10.96. Thus, lowerBound = 153.195. Considering the dataset of this study, 978,410 edges were eliminated, resulting in a total of 223,939 edges in the graph with less probability of being random noise.
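The computation of the lower bound in Equations (4.6) to (4.8) can be sketched as below. The function names and the dictionary of common-user counts are assumptions made for this illustration; each unordered pair of businesses is assumed to be stored once, so summing the stored values corresponds to $n_r$.

```python
import math


def weak_edge_lower_bound(common_counts, n_businesses):
    """common_counts: {(i, j): |U_i ∩ U_j|}, one entry per unordered pair."""
    n_r = sum(common_counts.values())                # total weight in the original graph
    n_c = n_businesses * (n_businesses - 1) // 2     # number of all possible edges
    mu = n_r / n_c                                   # Equation (4.6)
    sigma = math.sqrt(mu * (1 - 1 / n_c))            # square root of Equation (4.7)
    return mu + 3 * sigma                            # Equation (4.8)


def drop_weak_edges(common_counts, n_businesses):
    """Keep only the edges whose number of common users exceeds the lower bound."""
    bound = weak_edge_lower_bound(common_counts, n_businesses)
    return {edge: c for edge, c in common_counts.items() if c > bound}


# Toy counts for 4 businesses, invented for illustration only.
toy_counts = {(0, 1): 40, (0, 2): 2, (1, 2): 3, (2, 3): 3}
toy_bound = weak_edge_lower_bound(toy_counts, n_businesses=4)
strong_edges = drop_weak_edges(toy_counts, n_businesses=4)
```

With 48 common users spread over 6 possible edges, the expected weight is 8 and the bound is roughly 15.7, so only the edge (0, 1) survives in this toy example.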

4.4 DETECTION OF BUSINESS COMMUNITIES

Given a consistent network of business relationships, an essential step in achieving the study’s goal is to detect business communities. The business graph has diameter 4 (the longest shortest path between any two businesses) and is therefore not sparse considering its number of nodes and edges. Consequently, community detection algorithms that search for cliques or dense subgraphs with an optimal solution, such as the Clique Percolation Method, have very high time and space complexity and are not applicable in this study.

A community detection algorithm based on label propagation (LP) was proposed in (RAGHAVAN et al., 2007); it iteratively exchanges labels between adjacent vertices in a way that promotes convergence of labels. One significant advantage is that this algorithm operates in almost linear time, which makes it tractable for dense graphs. Also, since previous information about the graph and its communities is not available, another advantage is that this LP-based algorithm does not need prior information, such as heuristics about the communities present in the input network, which is required by other methods cited in (RAGHAVAN et al., 2007). LP-based algorithms can, however, give different solutions across runs due to the random steps performed. To enhance the stability of the final solution, this LP-based algorithm has aggregation steps that combine different solutions.

For the problem addressed here, communities should not be too large: if they contain too many businesses, they lose their purpose of offering a comprehensive result to entrepreneurs and business owners. Thus, this step has a maximum size parameter (maxSize) and also a minimum size parameter (minSize). These parameters are configurable, and for the results obtained in this study, they were set to maxSize = 30 and minSize = 4.

An iterative algorithm, Algorithm 1, is proposed for the detection of communities of businesses. The entries of this algorithm are the business graph (BusinessGraph), the minimum size (minSize) and maximum size (maxSize) of the communities, and the output is a set of business communities. The algorithm performs the following key steps:

• Detection of communities with the algorithm described in (RAGHAVAN et al., 2007);

• Business communities with size inside the range [minSize, maxSize] are saved for final return as final valid communities;

• The union of detected communities with size outside the range above composes a new graph, named G;

Algorithm 1: Business Communities Detection Algorithm.

Data: BusinessGraph, minSize, maxSize
Result: Set of Communities of BusinessGraph

G ← BusinessGraph
allCom ← ∅
minEdge ← min_{i,j ∈ G} W(i, j)    /* ∼ O(|B| + |E|) */
counter ← 1
while |G| > minSize do
    counter ← counter + 1
    /* Method of (RAGHAVAN et al., 2007) */
    detectedComm ← labelPropCommDetection(G)    /* ∼ O(|B| + |E|) */
    G ← empty graph
    for c ∈ detectedComm do
        /* inclusive bounds, matching the range [minSize, maxSize] */
        if |c| ≥ minSize and |c| ≤ maxSize then
            allCom ← allCom ∪ {c}
        else
            G ← G ∪ c
        end
    end
    remove all edges of G with W(i, j) < minEdge ∗ counter
    if no edges removed from G then
        break
    end
end
return allCom

The communities detected according to Algorithm 1 are subgraphs that tend to be dense (many edges), so businesses within the same community have a greater cohesion than businesses randomly chosen in the graph. This cohesion is formed by spontaneous user reactions, without additional information that could introduce a bias in the communities detected. The time complexity of Algorithm 1 is linear, O(|B| + |E|), where G = (B, E).
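The iterative procedure of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading, not the original implementation: the label propagation routine is a simplified, unweighted, asynchronous variant written for this example (it keeps the current label on ties and caps the number of sweeps), the graph is a plain dictionary of weighted edges, the new graph G is taken as the subgraph induced by the rejected communities, and the size test uses the inclusive range [minSize, maxSize].

```python
import random
from itertools import combinations


def label_propagation(nodes, edges, rng, max_sweeps=100):
    """Simplified asynchronous label propagation; returns a list of node sets."""
    neighbours = {n: set() for n in nodes}
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    label = {n: n for n in nodes}
    for _ in range(max_sweeps):
        changed = False
        order = sorted(nodes)
        rng.shuffle(order)
        for n in order:
            counts = {}
            for m in neighbours[n]:
                counts[label[m]] = counts.get(label[m], 0) + 1
            if not counts:
                continue
            top = max(counts.values())
            best = sorted(l for l, c in counts.items() if c == top)
            if label[n] not in best:        # keep the current label on ties
                label[n] = rng.choice(best)
                changed = True
        if not changed:
            break
    groups = {}
    for n, l in label.items():
        groups.setdefault(l, set()).add(n)
    return list(groups.values())


def detect_business_communities(edges, min_size, max_size, seed=0):
    """edges: non-empty {(i, j): weight}. Returns a list of sets of businesses."""
    rng = random.Random(seed)
    nodes = {n for edge in edges for n in edge}
    all_com = []
    min_edge = min(edges.values())
    counter = 1
    while len(nodes) > min_size:
        counter += 1
        rest = set()
        for c in label_propagation(nodes, edges, rng):
            if min_size <= len(c) <= max_size:
                all_com.append(c)           # valid community
            else:
                rest |= c                   # goes back into the new graph G
        nodes = rest
        induced = {e: w for e, w in edges.items() if e[0] in rest and e[1] in rest}
        edges = {e: w for e, w in induced.items() if w >= min_edge * counter}
        if len(edges) == len(induced):      # no edges removed
            break
    return all_com


def clique(members, weight):
    return {pair: weight for pair in combinations(members, 2)}


# Toy graph: two 4-cliques and one isolated pair (weights invented).
toy_edges = {**clique([0, 1, 2, 3], 0.4), **clique([4, 5, 6, 7], 0.3), (8, 9): 0.1}
communities = detect_business_communities(toy_edges, min_size=4, max_size=30)
```

In the toy graph the two cliques are returned as communities, while the pair {8, 9} is too small and is discarded once the remaining graph has no more than minSize vertices.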

The set allCom can be defined as:

$$allCom = \{com_1, com_2, \ldots, com_{n_{co}}\} \tag{4.9}$$

where $n_{co} = |allCom|$ and

$$\forall com_i, com_j \in allCom : com_i \neq com_j, \quad com_i \subset B, \; com_j \subset B \text{ and } com_i \cap com_j = \emptyset$$

4.5 CLUSTERING OF BUSINESS COMMUNITIES

This section describes the steps to cluster business communities by their similarity. As we are dealing with Facebook data, the clustering process is done considering the categories of businesses given by Facebook. However, similar processes could be carried out with data from any other social media platform.

Facebook classifies all business pages into one of seven categories: Interest; Community Organization; Media; Public Figure; Business; Non-Business Places; and Other. All of them have subcategories, but the largest number of subcategories (22), as well as the greatest diversity, falls under the Business category. Therefore, we mapped all subcategories within Interest, Community Organization, Media, Public Figure, Non-Business Places, and Other to their parent categories. For instance, all subcategories of Interest are transformed into Interest.

For the Business category, all subcategories have been considered. Under each of them, there are sub-subcategories. The sub-subcategories of Business were disregarded, as this level of specialization was not considered interesting for the analysis carried out. The Food & Beverage subcategory of Business has, for example, the sub-subcategories Bakery and Restaurant. In this case, all sub-subcategories within Food & Beverage are considered Food & Beverage. We performed the same procedure for all other subcategories within Business.

Figure 4.4: Facebook subcategories renaming process.

Figure 4.5: Examples of non-normalized feature vector for each community.

After this process, a total of 28 business category names (6 from the parent categories, and 22 subcategories from Business) were obtained, as illustrated in Figure 4.4. We then built a feature vector with these 28 categories and counted, for each community, the number of occurrences of businesses in each of the 28 categories, as in Figure 4.5. These values are then normalized by the maximum number of businesses found in a given category across communities (column-wise normalization in Figure 4.5). This normalization is defined in Equation 4.10, in which each position of the vector has its own normalizing factor.

It is relevant to point out that the normalizing factor could be, instead of the maximum number of businesses for each category, the total number of businesses inside that very community (row-wise normalization in Figure 4.5). With that normalization, however, big and small communities with similar category proportions would fall into the same cluster, increasing the potential bias for outlier detection.

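Considering that the category of a business $b_i$ can be represented by $cat(b_i)$, Equation 4.10 defines the feature vector.

$$\forall com_i, com_j \in allCom; \; \forall cat_k \in allCategories$$

$$B_{com_i, cat_k} = \{b \in com_i : cat(b) = cat_k\}$$

$$vector(com_i)_k = \frac{|B_{com_i, cat_k}|}{\max_{com_j} |B_{com_j, cat_k}|} \tag{4.10}$$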
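The construction of the normalized feature vectors can be sketched as follows. The sketch assumes the column-wise normalization described in the text, i.e., each category count is divided by the largest count of that category over all communities; the function name, the data layout, and the toy businesses are invented for this illustration.

```python
def community_vectors(communities, category_of, categories):
    """communities: list of sets of businesses; category_of: business -> category name."""
    counts = [
        [sum(1 for b in com if category_of[b] == cat) for cat in categories]
        for com in communities
    ]
    # Column-wise maxima: the largest count of each category over all communities.
    col_max = [max(row[k] for row in counts) for k in range(len(categories))]
    return [
        [row[k] / col_max[k] if col_max[k] else 0.0 for k in range(len(categories))]
        for row in counts
    ]


# Toy data with two categories (the full model uses 28), invented for illustration.
toy_categories = ["Food & Beverage", "Shopping & Retail"]
toy_category_of = {
    "a": "Food & Beverage", "b": "Food & Beverage", "c": "Shopping & Retail",
    "d": "Food & Beverage", "e": "Shopping & Retail",
    "f": "Shopping & Retail", "g": "Shopping & Retail",
}
toy_communities = [{"a", "b", "c"}, {"d", "e", "f", "g"}]
toy_vectors = community_vectors(toy_communities, toy_category_of, toy_categories)
```

The raw counts are (2, 1) and (1, 3); dividing by the column maxima (2, 3) gives (1, 1/3) for the first community and (0.5, 1) for the second.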

With this feature vector (represented formally in Equation 4.10 as $vector(com_i)$ for a given community $com_i$), it is possible to identify the similarity of different communities and perform a clustering process. The clustering algorithm used was k-means with the Euclidean distance (HARTIGAN, 1975). To choose the k parameter of the k-means algorithm, several different values were tested, and the value with the smallest sum of squared distances from the centroid was chosen as the best fit for the data. For the dataset presented in this study, the best fit was k = 8; thus this value is kept for this study.

Figure 4.6 illustrates how clusters, communities, and businesses are related from the set perspective. Formally, k-means receives the set of all communities (called allCom as in Algorithm 1) and returns a set of clusters, where clusters are disjoint non-empty subsets of allCom:

$$clusters = kmeans(allCom),$$

$$\forall cl_1, cl_2 \in clusters : cl_1 \neq cl_2, \quad cl_1 \cap cl_2 = \emptyset, \; cl_1 \neq \emptyset \text{ and } cl_2 \neq \emptyset.$$

Figure 4.6: Clusters, Communities and Businesses relationship illustration.

4.6 BUSINESS OUTLIER DETECTION

To detect possible outlier businesses inside business communities, a clustering of business communities, as described in Section 4.5, first has to be performed. Given the clusters of business communities, the detection is based on cluster centroids, which represent the common proportion of business categories for each cluster and are a central piece of the business outlier detection. At a high level of abstraction, if communities differ significantly from their cluster centroid, they have some outlier businesses inside them. Formally, the cluster centroid is the average vector ($\in \mathbb{R}^{28}$) among all the communities from that cluster, as in Equation 4.11:

$$\forall cl \in clusters, \quad centroid(cl) = \frac{\sum_{c \in cl} vector(c)}{|cl|} \tag{4.11}$$
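Equation 4.11 amounts to a component-wise average of the community vectors of a cluster. A minimal sketch, with a helper name and toy vectors chosen only for illustration:

```python
def centroid(vectors):
    """Component-wise mean of the feature vectors of one cluster (Equation 4.11)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Two toy community vectors with two categories each (the full model uses 28).
toy_centroid = centroid([[1.0, 0.0], [0.5, 1.0]])
```

For these two vectors the centroid is (0.75, 0.5).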

Cluster centroids are used to build cluster signatures. A cluster signature is the set of the most representative business categories of a particular cluster. The dimensions of the centroid vector might not sum to 100%; however, once the vector is normalized by the sum of its dimensions, each category (dimension) represents a percentage of the entire vector, and all of them sum to 100%. The cluster signature is built by selecting the top n categories with the greatest percentages, where the sum of the percentages of those n categories is higher than a threshold (> 50%). For example, a cluster could have as its signature the categories Food and Beverage, Shopping and Retail, and Entertainment, because together they meet a 70% threshold defined in a particular application.

Algorithm 2 takes two inputs: a vector $v \in \mathbb{R}^{n_{cat}}$, representing the centroid, where $n_{cat}$ is the total number of categories; and a threshold ∈ ]0.5, 1], a number representing the minimum percentage considered as majority. As a result, Algorithm 2 returns a set containing the greatest dimensions, i.e., categories, of the vector v whose accumulated share is the closest possible to threshold ∗ 100% of the sum of all dimensions. Its time complexity is $O(d^2)$, where d = dim(v) (dimension of vector v).

Algorithm 2: Function that returns the signature (greatest dimensions) of the vector.

Function getSignature(v, threshold):
    s ← ∑_i v_i    /* O(d); used to normalize values to be comparable with the threshold */
    accThrs ← 0
    signature ← ∅
    /* For loop runs in O(d²) */
    for i ∈ {1, 2, ..., d} do
        m ← max(v)    /* O(d); max value in vector */
        j ← argmax(v)    /* O(d); category (index) of the vector’s maximum value */
        if |(m + accThrs)/s − threshold| < |accThrs/s − threshold| then
            signature ← signature ∪ {j}
            accThrs ← accThrs + m
            v_j ← 0    /* in the next iteration, max(v) is the next greatest */
        else
            break
        end
    end
    return signature
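Algorithm 2 translates almost line by line into Python. The sketch below follows the pseudocode, with two small choices made for this illustration: it works on a copy of the vector so the caller's centroid is not modified, and it assumes the vector has a positive sum.

```python
def get_signature(v, threshold):
    """Return the indices of the greatest dimensions of v whose accumulated
    share of sum(v) gets as close as possible to the threshold."""
    values = list(v)            # copy, since entries are zeroed along the way
    s = sum(values)             # assumed to be positive
    acc = 0.0
    signature = set()
    for _ in range(len(values)):
        m = max(values)
        j = values.index(m)
        # Add the category only if it brings the accumulated share closer to the threshold.
        if abs((m + acc) / s - threshold) < abs(acc / s - threshold):
            signature.add(j)
            acc += m
            values[j] = 0.0     # the next max() call returns the next greatest value
        else:
            break
    return signature


# Toy centroid whose dimensions sum to 10 (shares 50%, 30%, 15%, and 5%).
toy_signature = get_signature([5.0, 3.0, 1.5, 0.5], threshold=0.7)
```

With the threshold 0.7, the first two categories are selected (accumulated share 80%); adding the third would move the accumulated share further away from 70%, so the loop stops.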

Finally, Algorithm 3 is responsible for tagging all businesses whose categories are not included in their corresponding cluster signature, which are therefore considered outliers. This algorithm takes as input a set of clusters and returns the same structure with additional outlier tags. It first calls Algorithm 2 (the function getSignature) to get each cluster signature, i.e., the most representative categories of the cluster, and then tags the businesses whose categories are not in that signature. Note that each community has a vector (as in Figure 4.5 and Equation 4.10) and each cluster has a centroid (as in Equation 4.11); both are captured in Algorithm 3 as vector(community) and centroid(cl), respectively. If the threshold passed to getSignature is high (e.g., 0.9), too many categories are included in the cluster signature; on the other hand, if the value is low (e.g., 0.5), there is a chance that no category is included in the cluster signature. According to our empirical analysis, the value 0.7 represents a good balance; therefore, we use it as the threshold.

Algorithm 3: Outlier Detection Algorithm.

Data: Clusters - set containing business community clusters
Result: Clusters with tagged businesses

taggedClusters ← ∅
for cl ∈ clusters do
    newCluster ← ∅
    /* For all clusters it runs in O(|clusters| ∗ n_cat²) */
    clSignature ← getSignature(centroid(cl), 0.7)
    for community ∈ cl do
        /* Defined in Equation 4.10; for all communities it runs in O(|B| + |allCom| ∗ n_cat) */
        vc ← vector(community)
        for i ∈ {1, 2, ..., |vc|} do
            if vc_i > 0 and i ∉ clSignature then
                /* tagBusinesses(community, cat) tags businesses of category cat inside community; for all businesses it runs in O(|B|) */
                newCluster ← newCluster ∪ {tagBusinesses(community, i)}
            end
        end
        taggedClusters ← taggedClusters ∪ {newCluster}
    end
end
return taggedClusters
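The tagging step can be illustrated with the simplified sketch below. It is not a transcription of Algorithm 3: the cluster signatures are assumed to be already computed and given as sets of category names, the output marks every business with a boolean instead of calling tagBusinesses, and the business names and categories are invented.

```python
def tag_outliers(clusters, signatures, category_of):
    """clusters: list of clusters, each a list of communities (sets of businesses).
    signatures: one set of category names per cluster.
    Returns the same nesting, with each community as {business: is_outlier}."""
    tagged_clusters = []
    for communities, signature in zip(clusters, signatures):
        tagged_cluster = []
        for community in communities:
            tagged_cluster.append({
                business: category_of[business] not in signature   # True marks an outlier
                for business in sorted(community)
            })
        tagged_clusters.append(tagged_cluster)
    return tagged_clusters


# Toy cluster with a single community, invented for illustration only.
toy_category_of = {
    "shop_a": "Shopping & Retail",
    "salon_b": "Beauty Cosmetic & Personal Care",
    "clinic_c": "Medical & Health",
}
toy_clusters = [[{"shop_a", "salon_b", "clinic_c"}]]
toy_signatures = [{"Shopping & Retail", "Beauty Cosmetic & Personal Care"}]
toy_tagged = tag_outliers(toy_clusters, toy_signatures, toy_category_of)
```

Only clinic_c is tagged, because its category is not part of the signature of its cluster.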

The total time complexity of Algorithm 3 is the sum of all commented complexities, i.e., $O(|clusters| \cdot n_{cat}^2 + |allCom| \cdot n_{cat} + |B|)$.

5 RESULTS AND DISCUSSIONS

The two outputs of the framework are the tagged communities that include the target business (informed by the user), in which businesses are tagged as outliers or not, and the egonet of the target business, consisting of a subgraph of all edges of the target business (see Figure 4.1). As there are considerably large egonets for individual businesses, the egonet size in this study was limited to a maximum of seven adjacent vertices (those with the strongest edges) plus the target business. Note that this parameter could be adjusted for each case under study.
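The size-limited egonet can be sketched as a selection of the strongest edges incident to the target business. The function name, the edge dictionary, and the toy weights are assumptions made for this example.

```python
def egonet(edges, target, max_neighbours=7):
    """edges: {(i, j): weight}. Returns {neighbour: weight} for the strongest
    edges of the target business, limited to max_neighbours entries."""
    incident = [
        (weight, j if i == target else i)
        for (i, j), weight in edges.items()
        if target in (i, j)
    ]
    incident.sort(key=lambda item: item[0], reverse=True)
    return {neighbour: weight for weight, neighbour in incident[:max_neighbours]}


# Toy graph around a target business "t" (weights invented).
toy_edges = {("t", "a"): 0.5, ("b", "t"): 0.2, ("t", "c"): 0.9, ("a", "b"): 0.7}
toy_egonet = egonet(toy_edges, "t", max_neighbours=2)
```

With a limit of two neighbours, the egonet of t keeps c (0.9) and a (0.5) and drops the weaker edge to b.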

Figure 5.1: Partial view of the business graph on the map of Curitiba

Figure 5.2: Business network. Circles are businesses, and colors are their categories.

(a) Circles in green represent businesses of the “Women Clothing” category

(b) Circles in pink represent businesses of the “Interior Design & Furniture” category

Source: Designed by the author.

Figure 5.1 presents a partial view of the business graph shown on a map of the city of Curitiba. To ease the visualization, we only show edges with more than 1,500 common reactions. The thicker the edge, the stronger the relationship between nodes. We show each node in the graph according to the location of the business it represents. Note that there are vertices far from the city center with a considerably high density of edges going towards the city center, indicating strong activity also in outlying neighborhoods.

Figure 5.2 shows the same network as in Figure 5.1, but not laid out geographically. Circles are businesses, their sizes show the number of connections they have with other businesses, and their colors represent the category each business belongs to. For instance, in Figure 5.2(a), the green color represents the “Women Clothing” category, and in Figure 5.2(b), the pink color represents the “Interior Design & Furniture” category.

An important detail was captured during the analysis. The nodes “Prefeitura de Curitiba” (N1) and “RPC” (N2) have a strong influence on the business graph (as shown in Appendix A by Tables A.1 and A.2). N1 represents the city hall of Curitiba, and it is a very popular Facebook page not only in the city of Curitiba but also nationally. N2 is the largest TV channel in Curitiba. Both strongly impact the analysis carried out by this study. N1 does not represent a business, and N2 is an obvious TV media partnership. The influence of their edges on the business relationship graph potentially hides interesting smaller business relationships. In order to favor more valuable relationships, the analysis was performed without both nodes. An improvement to this analysis model could include an additional filter to automatically remove such pages, which have too strong an influence on the graph and are obvious recommendations.

As this graph is quite large, extracting useful information from it by eye becomes complicated, which justifies the extraction of communities and egonets. After running Algorithm 1 with the parameters considered, 144 communities were detected, each ranging from 4 to 30 businesses located in the city of Curitiba. Figure 5.3(a) illustrates a community containing entertainment businesses (e.g., “Blood Rock Bar” and “SSCWB - Shinobi Spirit”) and food businesses (e.g., “Ca’dore Comida Descomplicada”), so they are businesses united by the “leisure” context. The two communities illustrated in Figures 5.3(b) and 5.3(c) both have businesses bound together by a context that can be called “fashion”, because they contain businesses from the beauty salon sector (e.g., “Studio Andressa Mega Hair” and “Cheias de Charme Costméticos”), modeling agencies (e.g., “Nk Agencia de Modelos” and “South Models Parana”), and fashion stores (e.g., “TONY JEANS” and “Zandra Bolsas”).

Figure 5.3: Communities detected by Algorithm 1

(a) Community of businesses related to leisure. (No businesses tagged as outliers)

(b) Community of businesses related to fashion. (No businesses tagged as outliers)

(c) Another Community of businesses related to Fashion. The business tagged in red is a health plan related business and it was considered an outlier by Algorithm 3

We can note that, even though neither the business network construction (see Section 4.2) nor Algorithm 1 used any information about the businesses themselves, all communities detected have similarly strong semantics that bind businesses together inside each community.

The business category clustering analysis can illustrate those contexts in a more general view, considering all communities detected. The clustering step, then, unites all similar communities, by business categories, into eight different clusters (for k = 8, as discussed in Section 4.5). To illustrate the clusters, eight word clouds of the business categories inside each cluster were generated and are shown in Figure 5.4. Besides, Figure 5.5 shows the number of user reactions in each cluster.

execute K-means, so we considered the original names. Note the surprising similarity between the categories in each group. For example, Cluster 1 is related to leisure, containing predominantly food, drink, and entertainment businesses; Cluster 2 contains mostly businesses related to beauty and style; and Cluster 3 is more related to establishments offering automotive products and services. This analysis shows the existence of a predominant context in each community. Taking into account the information in Figure 5.4 and Figure 5.5, it is possible to notice that the most popular contexts are leisure and food, represented by Cluster 1, followed by shopping malls, represented by Cluster 7.

Knowing that there is a tendency toward a predominant business context in communities, outliers, i.e., businesses outside the predominant type of business, can be useful for decision makers. Next, we present results in this direction following the procedures described in Section 4.6.

The community illustrated in Figure 5.3(c), a fashion-related community (its predominant context), has one outlier inside it, the business called “Grupo AllCross” (tagged in red). This business is a health plan consultant; it is not part of the “fashion” context and is thus correctly identified as an outlier by Algorithm 3. As an improvement over the results in (TSUTSUMI et al., 2018), outliers cannot be ignored in the results presented here, as they might represent non-trivial potential business partnerships. Although outliers are not part of the dominant context, they still have strong connections to businesses from that context.

Figure 5.6(a) shows the egonet of Rubiane, a seafood restaurant that was arbitrarily chosen for analysis, and Figure 5.6(b) shows a detected community in which Rubiane is included. On the one hand, with the egonet of the business, it is possible to visualize the direct connections that the target business has with other businesses. On the other hand, with communities, it is possible to notice connections that may not be direct to the target business. Since these non-direct connections are within a community (detected by Algorithm 1), they are cohesive (a dense subgraph) and may represent possible non-trivial partnerships for the business under evaluation. For example, the company named “Quintal do Monge” does not appear in the egonet of Rubiane shown in Figure 5.6(a), but it appears in a community where Rubiane is also included, shown in Figure 5.6(b). Also in Figure 5.6(b), notice that the business called “Cannes Turismo” is a tourism-related business and was tagged as an outlier by Algorithm 3.

Rubiane, for instance, could make use of this result to increase its sales by creating business partnerships, such as selling products and services along with the businesses found in the results, as well as marketing partnerships and joint marketing campaigns. For the restaurant analyzed here, we observe that competitors appeared in the same community, for example, “Braseirinho Frutos do Mar”. For restaurants, this could be explained by the fact that users tend to visit several restaurants, some of which may be of the same type. However, this is not a problem for the proposed approach, since it is the entrepreneur who decides the best strategy for exploring the results. Note that a partnership could be made with competing establishments; however, these cases deserve special attention.

Another interesting use case for Rubiane would be to expand its business activity by incorporating the products and services of other companies into its own portfolio, thus becoming more competitive.

Figure 5.4: Word clouds for categories of similar community clusters. Each cluster legend contains the three biggest super categories of its signature according to Algorithm 2.

(a) Cluster 1. Arts & Entertainment; Local Service; Media News Company.

(b) Cluster 2. Beauty Cosmetic & Personal Care; Shopping & Retail; Other.

(c) Cluster 3. Automotive, Aircraft & Boat; Commercial & Industrial; Non-profit Organization.

(d) Cluster 4. Media; Hotel & Lodging; Other.

(e) Cluster 5. Education; Public Figure; Other.

(f) Cluster 6. Sports & Recreation; Travel & Transportation; Advertising Marketing.

(g) Cluster 7. Science, Technology & Engineering; Hotel & Lodging; Advertising Marketing.

(h) Cluster 8. Local Service; Shopping & Retail; Medical & Health.

Figure 5.5: A chart presenting user reaction number per cluster

Figure 5.6: Framework output for Rubiane.

(a) Egonet for the company called Rubiane.

(b) Community of businesses related to Food (including Rubiane). In red, a tourism related business was tagged as outlier by Algorithm 3.

6 CONCLUSION

The approach introduced in this study aims to provide a new way of identifying significant and non-trivial relations among businesses, which could ease the laborious task of recommending strategic business partnerships. This study found, using large-scale data from Facebook, that businesses inside each detected community have surprising similarities to one another regarding categories. Even without considering any information about the businesses themselves, all communities have strong semantic similarities among the businesses that compose them, indicating that this approach has the potential to extract cohesive communities of businesses. It is also observed that non-obvious relationships between businesses could be extracted using the proposed outlier detection algorithm; those relations might deserve particular attention from business owners.

The proposed approach could be an important building block for the development of new applications and services, including a business partnership recommender for entrepreneurs and business owners. This service can contribute to sales improvement, as well as to keeping businesses more sustainable in the market. This contribution also shows that the data available in social media and other platforms are indeed helpful for understanding the dynamics of the world around us.

7 FUTURE WORK

7.1 NEW DIRECTIONS AND INSIGHTS

Considering the whole study as a basis for further research, there are many directions to follow. For instance, competing businesses were observed in the same community. As one of the possible implications of this study is to contribute to identifying new business partnerships, it may be interesting to determine a way to detect whether two businesses are competitors in order to improve the quality of possible recommendations. Besides, it is interesting to evaluate the results on a new dataset, especially one representing a different culture. In addition, it is essential to perform a qualitative evaluation with business owners or decision makers of the studied businesses, in order to understand how to better explore the results in practice. Another direction is to consider the temporality of the reactions, to evaluate, for example, the temporal correlation in the communities.

New directions could also explore the negative reactions collected in the database. One hypothesis could count negative reactions in the edge weights; another could consider only negative reactions as a similarity metric to form edges and then communities.

Besides the directions mentioned above, another interesting direction would be to have a system or algorithm to infer missing connections in the data. There are many link prediction techniques that could be applied for this matter (e.g., (WANG et al., 2015) and (LÜ; ZHOU, 2011)). One interesting approach is studied in (ZHANG; CHEN, 2017): a link prediction model that aims to enhance link prediction accuracy by combining additional information with the network structure information in a machine learning algorithm. By incorporating the link prediction model into the business relationship network, it could be possible to recommend potential business partnerships that might exist in the real world but were not present in the collected data.

REFERENCES

AGNIHOTRI, R. et al. Social media: Influencing customer satisfaction in b2b sales. Industrial Marketing Management, Elsevier, v. 53, p. 172–180, 2016.

AI, C. et al. The national geographic characteristics of online public opinion propagation in china based on wechat network. GeoInformatica, Springer, p. 1–24, 2018.

AIELLO, L. M. et al. Chatty maps: constructing sound maps of urban areas from social media data. Open Science, The Royal Society, v. 3, n. 3, p. 150690, 2016.

ALAMSYAH, A. et al. Social network data analytics for market segmentation in indonesian telecommunications industry. In: IEEE. Proc. of ICoICT’17. [S.l.], 2017. p. 1–5.

BARBIER, G.; TANG, L.; LIU, H. Understanding online groups through social media. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Wiley Online Library, v. 1, n. 4, p. 330–338, 2011.

BERGQUIST, W. H.; BETWEE, J.; MEUEL, D. Building strategic relationships: How to extend your organization’s reach through partnerships, alliances, and joint ventures. San Francisco, USA: Jossey-Bass Publishers, 1995.

BRITO, S. et al. Cheers to untappd! preferences for beer reflect cultural differences around the world. In: Proc. of AMCIS’18. New Orleans, USA: [s.n.], 2018.

CHENG, Z. et al. Exploring millions of footprints in location sharing services. In: Proc. of ICWSM’11. Barcelona, Spain: [s.n.], 2011. p. 81–88.

CHOUDHURY, M. D.; SHARMA, S.; KICIMAN, E. Characterizing dietary choices, nutrition, and language in food deserts via social media. In: Proc. of CSCW’16. San Francisco, USA: ACM, 2016. p. 1157–1170. ISBN 978-1-4503-3592-8.

CRANSHAW, J. et al. The livehoods project: Utilizing social media to understand the dynamics of a city. In: Proc. of ICWSM’12. Dublin, Ireland: [s.n.], 2012.

ELMUTI, D.; KATHAWALA, Y. An overview of strategic alliances. Management decision, MCB UP Ltd, v. 39, n. 3, p. 205–218, 2001.

FERRARI, V. C. Content marketing and brand engagement on social media: a study of Facebook's posts in the ecommerce industry in Brazil. Thesis (Doctorate) - FGV - Fundação Getúlio Vargas, 2016.

FERREIRA, A. P. G.; SILVA, T. H.; LOUREIRO, A. A. F. Beyond sights: Large scale study of tourists’ behavior using foursquare data. In: Proc. of IEEE ICDMW’15 Workshops. [S.l.: s.n.], 2015. p. 1117–1124.

FORTUNATO, S. Community detection in graphs. Physics reports, Elsevier, v. 486, n. 3-5, p. 75–174, 2010.

GIATSOGLOU, M.; CHATZAKOU, D.; VAKALI, A. User communities evolution in microblogs: A public awareness barometer for real world events. World Wide Web, Springer, v. 18, n. 5, p. 1269–1299, 2015.

GRIZANE, T.; JURGELANE, I. Social media impact on business evaluation. Procedia Computer Science, Elsevier, v. 104, p. 190–196, 2017.

HARTIGAN, J. A. Clustering algorithms. Wiley, 1975.

HOFFMAN, D. L.; FODOR, M. Can you measure the roi of your social media marketing? MIT Sloan Management Review, Massachusetts Institute of Technology, Cambridge, MA, v. 52, n. 1, p. 41, 2010.

HUDSON, S.; THAL, K. The impact of social media on the consumer decision process: Implications for tourism marketing. Journal of Travel & Tourism Marketing, Taylor & Francis, v. 30, n. 1-2, p. 156–160, 2013.

KAFEZA, E.; MAKRIS, C.; ROMPOLAS, G. Exploiting time series analysis in twitter to measure a campaign process performance. In: IEEE. Proc. of SCC’17. [S.l.], 2017. p. 68–75.

KARAMSHUK, D. et al. Geo-spotting: Mining online location-based services for optimal retail store placement. In: Proc. of ACM KDD’13. Chicago, Illinois, USA: [s.n.], 2013. p. 793–801. ISBN 978-1-4503-2174-7. Available at: <http://doi.acm.org/10.1145/2487575.2487616>.

LICHTENWALTER, R. N.; LUSSIER, J. T.; CHAWLA, N. V. New perspectives and methods in link prediction. In: ACM. Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. [S.l.], 2010. p. 243–252.

LIN, J. et al. Where is the goldmine?: Finding promising business locations through facebook data analytics. In: ACM. Proc. of Hypertext’16. Halifax, Canada, 2016. p. 93–102.

LIU, W. et al. Markov-network based latent link analysis for community detection in social behavioral interactions. Applied Intelligence, Springer, p. 1–16, 2017.

LÜ, L.; ZHOU, T. Link prediction in complex networks: A survey. Physica A: statistical mechanics and its applications, Elsevier, v. 390, n. 6, p. 1150–1170, 2011.

MA, C.; BAO, Z.-K.; ZHANG, H.-F. Improving link prediction in complex networks by adaptively exploiting multiple structural features of networks. Physics Letters A, Elsevier, v. 381, n. 39, p. 3369–3376, 2017.

MAGNO, G.; WEBER, I. International gender differences and gaps in online social networks. In: Proc. of SocInfo. Barcelona, Spain: Springer, 2014. p. 121–138.

MAHONY, T. et al. If we post it they will come: A small business perspective of social media marketing. In: ACM. Proc. of ACSW’18. Brisbane, Australia, 2018. p. 21.

MUELLER, W. et al. Gender matters! analyzing global cultural gender preferences for venues using social sensing. EPJ Data Science, v. 6, n. 1, p. 5, 2017. ISSN 2193-1127. Available at: <http://dx.doi.org/10.1140/epjds/s13688-017-0101-0>.

MUKHOPADHYAY, S. Opinion mining in management research: the state of the art and the way forward. OPSEARCH, Springer, v. 55, n. 2, p. 221–250, 2018.