Social media analytics and research testbed (SMART): Exploring spatiotemporal patterns of human dynamics with geo-targeted social media messages


Jiue-An Yang, Ming-Hsiang Tsou, Chin-Te Jung, Christopher Allen, Brian H Spitzberg, Jean Mark Gawron and Su-Yeon Han

Abstract

The multilevel model of meme diffusion conceptualizes how mediated messages diffuse over time and space. As a pilot application of the meme diffusion model, we developed the social media analytics and research testbed to monitor Twitter messages and track the diffusion of information in and across different cities and geographic regions. The social media analytics and research testbed is an online geo-targeted search and analytics tool, including an automatic data processing procedure at the backend and an interactive frontend user interface. It was initially designed to facilitate (1) searching and geo-locating tweet topics and terms in different cities and geographic regions; (2) filtering noise from raw data (such as removing redundant retweets and using machine learning methods to improve precision); (3) analyzing social media data from a spatiotemporal perspective; and (4) visualizing social media data in diagnostic ways (such as weekly and monthly trends, trend maps, top media, top retweets, top mentions, or top hashtags). The testbed provides researchers and domain experts with a tool that can efficiently facilitate the refinement, formalization, and testing of research hypotheses or questions. Three case studies (flu outbreaks, Ebola epidemic, and marijuana legalization) are introduced to illustrate how the predictions of meme diffusion can be examined and to demonstrate the potential and key functions of the social media analytics and research testbed.

Keywords

Big Data, social media, human dynamics, spatiotemporal analysis, data mining

San Diego State University, USA

Corresponding author:
Ming-Hsiang Tsou, Department of Geography, San Diego State University, 5500 Campanile Drive, San Diego, CA 92182-4493, USA.
Email: mtsou@mail.sdsu.edu

Big Data & Society, January–June 2016: 1–19
© The Author(s) 2016. Reprints and permissions: sagepub.com/journalsPermissions.nav. DOI: 10.1177/2053951716652914. bds.sagepub.com

Introduction

Careful mining of social media messages can reflect the trends of human dynamics, such as where seasonal flu is spreading (Aslam et al., 2014; Nagel et al., 2013), when social movement ideology is diffusing (Tsou et al., 2013), and how urban mobility patterns reveal geospatial and social functions (Hawelka et al., 2014; McKenzie et al., 2015). Despite the new and evolving nature of social media, various models and theories are emerging to account for the ways in which message content and exchange dynamics both reflect and drive human processes in realspace (Cheung and Thadani, 2012; Wang and Zhang, 2012). Integrating several traditional theories (e.g., framing theory, narrative theory, diffusion of innovations, information theory, communicative competence theory, evolutionary theory), the multilevel model of meme diffusion (M3D) conceptualizes how mediated messages diffuse over time and space (Spitzberg, 2014). Memes are any replicable unit of cultural transfer of information. Thus, all digital social media messages are potential memes, and their replicability is a key distinction between memes and everyday face-to-face communication. It also makes memes a unique source of information that can be monitored and modeled at varying levels of granularity. The degree to which the diffusion of memes through social media networks can be modeled becomes a major theoretical challenge for businesses, governments, celebrities, and all those seeking to understand the human dynamic of communicative influence (Kamel Boulos et al., 2011; Sasahara et al., 2013; Segev et al., 2015).

The M3D anticipates that certain features of (a) memes or social media messages (e.g., novelty, repetition, etc.); (b) communicators (e.g., source credibility, network centrality, etc.); (c) structural (e.g., network span, homophily, etc.) and subjective network structures (e.g., counter-memes, thresholds, etc.); (d) societal processes (e.g., publicity campaigns, stage of diffusion, etc.); and (e) geo-technical factors (e.g., geospatial proximity, population density, etc.) predict or moderate M3D dynamics. In any given context, such processes may reveal unique diffusion maps or patterns across time and space, varying by rapidity of diffusion, exhaustion and duration of diffusion, and evolution of message variation. To the extent such unique patterns reveal distinct correspondence to realspace activities, they provide an important window into developing surveillance and intervention programs to serve the public interest in diffusing time and space sensitive information (e.g., disease diffusion and/or treatment, drug abuse or diffusion, natural disaster or crisis response). In other contexts, such surveillance may map important applications for organizations concerned with mapping or stimulating institutional reputation or product diffusion (e.g., academic reputation or product reviews).

As a pilot application of implementing the M3D in studying human dynamics and social phenomena, we developed a web-based social media analytics and research testbed (SMART) integrating social media application programming interface (API), machine learning, and geographic information system (GIS) (Figure 1). SMART is designed with the goal of providing researchers with a platform to quickly test hypotheses and to refine research questions. The analytics and statistics presented on SMART may not directly provide answers to the research questions, but the trend visualization and aggregated analytics can suggest possible outlooks and potential spatiotemporal diffusion patterns for the selected topics and keywords in the region(s) of interest. Researchers can utilize the frontend user interface of SMART, the SMART dashboard, to visualize, characterize, and predict trends in these topics in different cities over time.

SMART can also provide broader applications beyond academia. Taking the public health and epidemiology domain as an example, local healthcare providers, hospital staff, government officials, first responders, and other stakeholders can access this online web-based dashboard without local desktop installation to keep track of trending information related to specific disease outbreaks. The following features of SMART underscore its significance and innovativeness:

1. SMART captures and updates the spatial nature of social media messages on a daily basis, evaluates patterns of messages in diverse cities, and geographically tracks patterns of diffusion.

2. The frontend of SMART, the SMART dashboard, displays the dynamic temporal trends of social media messages (daily, weekly, and monthly) with interactive selection tools.

3. The backend of SMART is built around geo-locating, data filtering, and machine learning components to remove noise and errors in different subjects. These procedures facilitate more accurate analysis and tracking of topics, such as disease outbreaks and drug abuse.

4. Users can select a target city to analyze the temporal trends and top messages from individual cities, or use the default view to display the aggregated trend analysis by combining messages from all cities in the monitoring list.

5. The visualization map of tweet messages displayed in the online interface helps users to understand the differences of social media messages between cities. This provides information for users to track the spatial and temporal evolution of a topic.


The development of SMART initially focuses on five topics (Influenza, Whooping Cough, Wildfire, Drugs, and Aztecs) because they represent a diverse range of patterns of use, targeted users, required analysis functions, and social values. Communication in general, and social media in particular, have been identified as key elements in each type of context: flu (Culotta, 2013; Nagar et al., 2014), whooping cough (Olsen, 2013), wildfire (Helsloot and Groenendaal, 2013; Slavkovikj et al., 2014; Sutton et al., 2013), drugs (Hanson et al., 2013; Young and Shoptaw, 2013), and institutional affiliation and reputation (Miao et al., 2014; Snoeijers et al., 2014). As anticipated by the M3D and numerous other theories, there is likely to be some degree of reciprocal influence and representation between social media communication and the activities that involve such communication. Mapping such media content and dynamics is the first step in revealing the degree of correspondence in socially relevant contexts of human activity, and that is the fundamental concept behind the development of SMART.

Web and social media analytics platforms

The high daily volume of user-generated content posted to social media and the Internet has brought great opportunity to social media and Big Data analytics (Tsou et al., 2013b). To facilitate such analytics and visual presentations, many research groups have devoted their efforts to analyzing information from social media and the Internet, and developing web-based interactive platforms to display analytics (Brownstein et al., 2008; Diakopoulos et al., 2010; Ginsberg et al., 2009; Kamel Boulos et al., 2010; MacEachren et al., 2011; McKelvey and Menczer, 2013). Applications in this domain include, but are not limited to, epidemiology surveillance, crisis management, and trend detection and monitoring. The design of SMART emphasizes the geospatial characteristics of social media data by giving users the ability to target topics in specific geographical regions and analyze the spatiotemporal patterns of social media messages. Here, we introduce and compare several applications that inspired the development of SMART.

With the release of public APIs from social media companies, there are many commercial platforms that tap into the data stream and provide social media analytics. Topsy¹ was a social media analytics system and a certified partner of Twitter that indexed all the tweets since 2006. With direct access to the Twitter firehose database, typing a search term in the Social Search interface looks up tweets from 2006 onward that are associated with the term and presents the most popular content in different temporal resolutions. The Topsy Social Analytic function allows users to select multiple keywords and compare their trends over the past 30 days. The function of displaying the most popular content and the interactive temporal trend visualization informed the design of SMART. Unlike Topsy, SMART not only shows the overall top content in each module but also implements a map view that can present analytics at regional scales. Also, instead of analyzing single keywords or finding the trend of a URL being mentioned, SMART focuses more on topic-level analytics.

Electronic event-based biosurveillance systems are examples where near real-time information from the Internet is an important source of epidemiologic intelligence (Gajewski et al., 2014). These systems robustly validate data sources from the Internet using epidemiological approaches against an official comparison dataset to generate statistics and predictions (Barboza et al., 2014). Google Flu Trends² (GFT) (Ginsberg et al., 2009) attempted to make accurate predictions about flu activity across regions in the U.S. based on aggregated Google Search query data. Its prediction model indicates that the frequency of certain Google Search queries is highly correlated with the percentage of physician visits in which a patient presents with influenza-like illness symptoms. Users of the GFT application can view the overall flu activity in line charts or in a map view that shows the intensity of flu activity using a color gradient. GFT implemented an automated query term selection process to generate a list of the highest scoring search queries to represent flu activity. SMART and GFT differ in two ways. First, GFT automatically aggregates multiple search terms for a topic and focuses on the search queries of Google Search users, while SMART composes each topic from keywords defined with domain expertise and tracks the conversation on social media related to the topic. Second, GFT focuses on a prediction model for influenza, whereas SMART, besides monitoring influenza outbreaks, can also be used to track the trends of different topics.


data sources into disease-related alerts and nonrelated information. While SMART also supports disease outbreak surveillance if a topic is defined with disease-related keywords, it differs from HealthMap in that SMART users can monitor an outbreak in any region of interest.

Many other applications apply visual analytics and mapping to support analysis tasks including identifying structure and semantics about media events (Diakopoulos et al., 2010), revealing city dynamics about citizens (Andrienko et al., 2013), and detecting abnormal events (Chae et al., 2012; Thom et al., 2012). SensePlace2 was developed to support the understanding of spatial and temporal patterns of activities, events, and attitudes (MacEachren et al., 2011). With a user-selected keyword, the application extracts the top 500 related tweets and displays them in the list view and the map view. The map feature of SensePlace2 displays tweets both as gridded density surfaces and as geo-located points to support analysis of tweets. The time plot and control function allow users to examine related tweets posted in a certain time period. Unlike SensePlace2, SMART (a) normalizes conversations about each topic by the population of user-defined regions of interest, and (b) implements a machine learning procedure to filter out nonrelevant conversations for the monitored topics.

Technological framework and key components of SMART

SMART is a web application utilizing GIS, social media APIs, and machine learning tools for Twitter data collection, normalization, and analytics. Figure 2 illustrates the two-tier system framework design of the SMART system. The server side of SMART is a data processing and analysis engine mainly developed with Node.js³ and Python. Twitter data is collected with official Twitter Search APIs, processed with our geo-locating engine, and then stored into the open-source NoSQL MongoDB⁴ database. To filter out noise in social media conversations and retain only related messages, a machine learning procedure was developed in Python with the scikit-learn library that allows the SMART system to better differentiate tweets representing real-world instances of diseases from false positives. Analytics about the monitored topics are also generated from the automatic SMART backend engine. On the client side, analysis results and spatiotemporal patterns are presented to the users in an interactive dashboard-like interface that contains several visualization modules. This SMART dashboard is a web application built around Bootstrap,⁵ a popular open-source frontend framework with HTML, CSS, and optional JavaScript extensions. For the visualization modules, we use dygraph.js⁶ to construct the temporal graphs for its interactive functions and D3.js⁷ to build word clouds for visualizing top keywords. The official Twitter JavaScript Widget is implemented on the dashboard to display media content embedded in the tweets. The trend maps module is developed with the open-source Leaflet.js⁸ and basemap layers from Esri ArcGIS Online. The SMART client side communicates with the server side by making RESTful API calls. The server returns the response to the client side API call in the JavaScript Object Notation format. Details about the SMART server side engine designs and client side user interface designs are introduced as follows.

Social media retrieval (APIs)


unexpected events, such as earthquakes. In addition, the design of SMART takes advantage of the spatial filtering methods provided by Twitter Search APIs. With the geocode parameter setup, the geo-search first identifies tweets with latitude/longitude coordinates that fall within the query areas, and then also finds tweets created by users whose profile location can be geo-coded to latitude/longitude coordinates that fall within the same area. By polling the Search API once per day, the system can access large datasets for specific geographic areas, which allows for detailed analysis of the data at municipal, regional, and national scales. Both geo-tagged tweets and non-geo-tagged tweets are retrieved with the Search API.⁹ In terms of data management, the SMART system stores all retrieved social media attributes in MongoDB, an open-source and NoSQL database, considering its scalability and read/write performance (van der Veen et al., 2012). MongoDB is one of the most popular databases for Big Data with its capability of high performance queries, full-text search, and support of spatial query from very large datasets.
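The geo-targeted query described above can be sketched as a small helper that assembles the request parameters for one city. This is a minimal illustration, not the SMART implementation: the function name, the OR-joined keyword query, and the example coordinates are assumptions, while `q`, `geocode`, and `count` follow the parameter names of the standard Twitter Search API, whose `geocode` value has the form "latitude,longitude,radius" with a `mi` or `km` unit suffix. No network request is made.

```python
def build_geo_search_params(keywords, lat, lon, radius_miles, count=100):
    """Assemble query parameters for a geo-targeted keyword search.

    Hypothetical helper for illustration; it only builds the parameter
    dictionary and does not contact any API.
    """
    if not keywords:
        raise ValueError("at least one keyword is required")
    # Join keywords with OR so that a tweet matching any keyword is returned.
    query = " OR ".join(keywords)
    # geocode takes "latitude,longitude,radius" with a unit suffix (mi or km).
    geocode = f"{lat:.6f},{lon:.6f},{radius_miles:g}mi"
    return {"q": query, "geocode": geocode, "count": str(count)}


# Example: a 17 mile search circle around an assumed downtown San Diego point.
params = build_geo_search_params(["flu", "influenza"], 32.715736, -117.161087, 17)
print(params)
```

Keeping the parameter construction separate from the HTTP call makes it straightforward to run the same topic definition against every city in a monitoring list.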

Geo-locating social media conversations and normalization

When searching for tweets with keywords, the Twitter Search API can potentially return tweets with two types of spatial information: (1) geo-tagged locations provided by GPS-enabled devices, and (2) self-reported locations specified in user profiles. Geo-tagged locations are latitude and longitude pairs created by mobile devices with built-in GPS receivers or by users engaging with the geo-location features when posting a tweet. Self-reported locations are specified by users in their profile and can be changed at any time. Using Twitter messages with geo-tagged locations can potentially decrease the location uncertainty in analysis. However, such tweets remain a small fraction of the data harvested with the Twitter Search API, ranging from only 0.7% (Graham et al., 2014) to 6% (Takhteyev et al., 2012) depending on the topic. Many efforts have been made to infer the location of social media users using user profiles (Hecht et al., 2011; Sakaki et al., 2010), text content and timestamps (Li et al., 2011), and models that combine multiple indicators (Ghahremanlou et al., 2014; Mahmud et al., 2012; Rodrigues et al., 2013).

For tweets that do not have attached latitude/longitude coordinates, SMART implements a placename geo-coding procedure using the GeoNames database (GeoNames, 2015). GeoNames is a community-driven geographical database that contains more than 10 million geographic names including alternative names. The backend system of SMART retrieves text from the user profiles and tries to find matches from the GeoNames database. In the cases when GeoNames gazetteers return multiple matching locations, the SMART geo-locating engine goes further and assigns the most populated location as a default to the tweet.
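The disambiguation rule above (when several gazetteer entries match, default to the most populated one) can be sketched as follows. This is an illustration under stated assumptions: the `Place` record, the `geolocate_profile` function, the crude substring matching, and the tiny in-memory gazetteer are all hypothetical stand-ins, and the population values are round placeholders rather than census or GeoNames figures.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Place:
    name: str
    admin: str
    lat: float
    lon: float
    population: int


# Hypothetical miniature gazetteer; populations are illustrative placeholders.
GAZETTEER: List[Place] = [
    Place("San Diego", "California, US", 32.7157, -117.1611, 1_300_000),
    Place("San Diego", "Texas, US", 27.7639, -98.2389, 4_000),
    Place("Springfield", "Missouri, US", 37.2090, -93.2923, 160_000),
    Place("Springfield", "Illinois, US", 39.7817, -89.6501, 115_000),
]


def geolocate_profile(profile_location: str,
                      gazetteer: List[Place] = GAZETTEER) -> Optional[Place]:
    """Return the most populated gazetteer entry named in a profile string."""
    text = profile_location.casefold()
    # Simplistic matching: a place matches if its name occurs in the text.
    matches = [p for p in gazetteer if p.name.casefold() in text]
    if not matches:
        return None  # leave the tweet without a location
    # Ambiguous names resolve to the entry with the largest population.
    return max(matches, key=lambda p: p.population)


print(geolocate_profile("Sunny San Diego!"))
```

The population default trades some accuracy for simplicity: a user in the smaller place is assigned to the larger one, which is the kind of location uncertainty that self-reported profiles carry.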

Normalization by population is another important task in the design of the SMART backend system. SMART analyzes social media messages at the city level, and it implements the block group centroid demographic retrieval approach to normalize the number of tweets in each city based on its population size. Traditional city population size is based on administrative boundaries and census tracts, which do not match well with the geo-search circles of the Twitter Search APIs. We used GIS software to calculate the estimated population for each circle based on 2010 census tracts from the center of downtown to the radius defined by the Twitter Search APIs. Each of these city point buffers was joined with census tract centroids to determine which tracts should be included in our population calculations (Figure 3). Using the fine-grained census data allows us to gain a more accurate estimation of population, which greatly improves the ability to accurately normalize tweet counts for individual cities. Most of our city search methods use a 17 mile radius buffer to cover major metropolitan areas of U.S. cities without overlapping each other (Tsou et al., 2013). Some cities use a larger or smaller radius, such as Phoenix (40 mile) or Anaheim/Irvine (10 mile), in order to adjust to the size and closeness of surrounding cities. The normalization function is limited to the U.S. in the current implementation of SMART, and adding census data of other countries can expand this function to more regions.
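The buffer-based normalization can be illustrated with a short sketch: sum the population of every census unit whose centroid falls inside the search circle, then express the tweet count per one million residents. The paper performed this step in GIS software; the great-circle distance test, the function names, and the sample centroids and populations below are assumptions made only for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def buffer_population(center, radius_miles, tracts):
    """Sum the population of tracts whose centroid lies inside the circle.

    tracts is an iterable of (centroid_lat, centroid_lon, population).
    """
    lat0, lon0 = center
    return sum(pop for lat, lon, pop in tracts
               if haversine_miles(lat0, lon0, lat, lon) <= radius_miles)


def tweets_per_million(tweet_count, population):
    """Normalize a raw tweet count to a rate per one million residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return tweet_count * 1_000_000 / population


# Made-up centroids and populations: two near the center, one far away.
sample_tracts = [(32.72, -117.16, 5000), (32.80, -117.10, 3000), (34.05, -118.24, 9000)]
population = buffer_population((32.7157, -117.1611), 17, sample_tracts)
print(population, tweets_per_million(40, population))
```

Selecting whole units by centroid avoids splitting a tract across the circle boundary, at the cost of slightly over- or under-counting residents near the edge.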

Data filtering and machine learning classification


tasks involving text. Twitter messages are transformed to numerical values using a term frequency-inverse document frequency model (Salton and Buckley, 1988). For example, for the influenza topic, 1500 randomly sampled tweets from the 2012 to 2013 flu season containing the keyword “flu” were used to train the SVM classifier. Each of these tweets was manually inspected and classified as valid or invalid according to the likelihood that the message indicated an actual case of influenza, and this labeled data was used to train the SVM. The text classification model was validated using standard machine learning measures: recall, precision, and the F1 score. The test set was created by sampling 1000 influenza-related tweets at random. The validation procedure resulted in a precision score of 0.671, a recall score of 0.949, and an F1 score of 0.786. The high recall score indicates that the model correctly classifies the majority of valid tweets in the test set; however, the lower precision score indicates that invalid tweets in the test set are occasionally incorrectly categorized as valid.
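The three validation measures are related by simple arithmetic, which the sketch below makes explicit. The functions are generic textbook definitions rather than code from SMART, and the confusion counts in the usage lines are hypothetical. The final line checks that the reported precision (0.671) and recall (0.949) are consistent with the reported F1 score, since F1 is their harmonic mean: 2 × 0.671 × 0.949 / (0.671 + 0.949) ≈ 0.786.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall, and F1 from confusion counts.

    precision = TP / (TP + FP): share of tweets labeled valid that are valid.
    recall    = TP / (TP + FN): share of valid tweets that were found.
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall, f1_from(precision, recall)


# Hypothetical counts, used only to exercise the function.
print(precision_recall_f1(true_pos=8, false_pos=2, false_neg=2))

# Consistency check on the scores reported in the text.
print(round(f1_from(0.671, 0.949), 3))  # 0.786
```

A classifier tuned this way favors recall: few genuine flu tweets are discarded, while roughly one in three retained tweets is still noise.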

Statistics and analytics for topics

After data filtering, SMART uses descriptive statistical methods to summarize the analytics such as the most popular retweets, the most popular URL, the most popular hashtags, and the most popular mentions. As the pseudo code (Figure 4) demonstrates, each new tweet is validated with the pretrained SVM text classification model mentioned in the previous section. Valid tweets are then inspected and entities (i.e., hashtags, user mentions, media, or retweets) are extracted from the tweet text. Total counts of all unique entity values are maintained in order to derive a list of the top 10 most frequent values for each entity.
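The filter-then-count procedure described above can be sketched in a few lines. This is not the pseudo code of Figure 4: the function name, the regular expressions, and the `is_valid` callback (a stand-in for the pretrained SVM classifier) are assumptions, and only hashtags and mentions are handled to keep the example short.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def top_entities(tweets, is_valid, n=10):
    """Count hashtags and mentions in valid tweets and return the top n of each.

    is_valid is any callable that accepts tweet text and returns True for
    tweets that should be kept (a placeholder for the trained classifier).
    """
    hashtags, mentions = Counter(), Counter()
    for text in tweets:
        if not is_valid(text):
            continue  # skip tweets classified as noise
        # Lowercase so that #Flu and #flu are counted as the same entity.
        hashtags.update(tag.lower() for tag in HASHTAG_RE.findall(text))
        mentions.update(name.lower() for name in MENTION_RE.findall(text))
    return {"hashtags": hashtags.most_common(n), "mentions": mentions.most_common(n)}


sample = [
    "Got the #flu, thanks @cdcgov",
    "#Flu shot today #health",
    "bieber fever #flu",
]
print(top_entities(sample, is_valid=lambda t: "bieber" not in t))
```

Maintaining running counters per entity type means the top 10 lists can be refreshed after each daily collection without rescanning older tweets.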

The backend of SMART utilizes Node.js and


to receive requests and send responses. When the SMART frontend client connects to the SMART server, a request is sent to the RESTful API to ask for the latest analytics for a specific topic. In addition to top content, the SMART server also returns overall spatiotemporal trends when the frontend queries a specific time period of interest. Table 1 presents the statistics and analytics modules that are implemented in the SMART dashboard.
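To make the request/response exchange concrete, the sketch below builds the kind of JSON payload a server might return for one topic and time period. The SMART backend is described as Node.js based, and its actual endpoint and field names are not given in the text, so the function name and every key in the payload (`topic`, `start`, `end`, `trend`, `top_hashtags`) are hypothetical; Python is used here only for consistency with the other examples.

```python
import json
from datetime import date


def analytics_response(topic, daily_counts, start, end, top_hashtags):
    """Serialize trend data and top content for one topic as a JSON string.

    daily_counts maps ISO date strings to tweet counts; only days inside
    the requested [start, end] period are included in the trend.
    """
    trend = [
        {"date": day, "tweets": count}
        for day, count in sorted(daily_counts.items())
        if start <= date.fromisoformat(day) <= end
    ]
    payload = {
        "topic": topic,
        "start": start.isoformat(),
        "end": end.isoformat(),
        "trend": trend,
        "top_hashtags": top_hashtags[:10],  # cap at the top 10
    }
    return json.dumps(payload)


counts = {"2015-01-01": 5, "2015-01-02": 7, "2015-02-01": 3}
print(analytics_response("flu", counts, date(2015, 1, 1), date(2015, 1, 31), [["flu", 9]]))
```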

Geo-targeted regional selection and multilevel spatial analysis

SMART can implement multiscale spatial analysis functions depending on the spatial resolution of the regions of interest and on the spatial granularity of the social media content. For geo-tagged social media data that come with latitude/longitude coordinates, SMART can aggregate them at different levels of map scale. For social media data that are not geo-tagged in the first place, SMART can geo-locate these messages using user profiles and then analyze them at aggregated spatial resolutions including city level or state level. SMART currently provides several aggregation modes for its spatial analysis results. In a city-level trends map, analysis results are aggregated for each city and a circle on a map represents each city (Figure 5, right side). The graduated color in each circle shows the tweeting intensity in each city. In a region-level trends map, social media messages are aggregated by administrative boundaries (e.g., state boundaries, county boundaries) or bounding boxes (Figure 5, left side). Analysis results displayed on the trends map can also be overlaid with external data sources such as land use or census data to provide users with additional information to preliminarily explain the distribution of topics.
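Region-level aggregation by bounding box can be sketched as a point-in-rectangle count. The function name, the region names, and the box coordinates below are hypothetical; aggregation by administrative boundaries would require polygon tests that are not shown, and boxes that cross the antimeridian are not handled.

```python
def aggregate_by_bbox(points, regions):
    """Count geo-located messages per rectangular region.

    points  : iterable of (lat, lon) pairs
    regions : dict mapping a region name to
              (min_lat, min_lon, max_lat, max_lon)
    A point is assigned to the first region that contains it.
    """
    counts = {name: 0 for name in regions}
    for lat, lon in points:
        for name, (min_lat, min_lon, max_lat, max_lon) in regions.items():
            if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
                counts[name] += 1
                break  # do not double count points in overlapping boxes
    return counts


# Illustrative boxes and points, not taken from the SMART configuration.
boxes = {
    "box_a": (32.5, -117.6, 33.5, -116.0),
    "box_b": (33.6, -118.7, 34.4, -117.6),
}
pts = [(32.7, -117.2), (32.9, -117.1), (34.0, -118.2), (40.7, -74.0)]
print(aggregate_by_bbox(pts, boxes))
```

The resulting counts per region are what a graduated color or graduated symbol map would then encode, ideally after normalizing by population.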

Interactive user interface design

Different from traditional web map applications, the design of the SMART dashboard provides interactive display maps with multimedia content, such as pictures, videos, text messages, and charts (Figure 6). The web-based user interface is built with open-source programming libraries, jQuery and Leaflet APIs, to visualize and query tweets from a server-side database.

The following key features are included in the SMART dashboard to provide interactive query and visualization functions:

1. Top index numbers show the number of tweets collected from one day, one week, or one month.

2. The left panel provides the list of targeted cities and the shortcuts to different functions (Word Cloud (Kuo et al., 2007), Trend, Top URL, Top Media, Top Cities, etc.).

3. The Trend function allows interactive queries of actual tweet texts by clicking on a point on the line chart. Users can switch the view to Daily, Weekly, and Monthly mode.

4. The Word Cloud function shows the most prominent conversation keywords in tweets within one day, the past seven days, the past 30 days, or all periods combined.

5. The Tweets in Cities function shows the normalized tweeting rates in each city using graduated cartographic symbols.

6. The dashboard also provides the top 10 lists of URLs (web pages), Hashtags (subjects), Retweets (forwarded messages), Mentions (opinion leaders), and Media (pictures).

Figure 4. Pseudo code of the filter and classification procedures.

Table 1. Descriptions of statistics and analytics modules in SMART dashboard.

Module              Description
Trend               The total number of daily, weekly, and monthly tweets
Top URL             The top 10 most referred Web pages
Top retweet         The top 10 most popular retweets
Top media           The top 10 pictures or videos included in tweets
Top keyword         The most frequent words used in tweets
Top hashtag         The top 10 hashtags people used in tweets
Top mention         The top 10 Twitter users mentioned in tweets
Tweets in cities

Figure 7 illustrates an example of web maps for the combined “flu” and “influenza” tweeting rates among the top 31 U.S. cities. This web mapping tool was created by using open-source GIS mapping functions (Leaflet APIs) and basemap layers from Esri ArcGIS Online. The size of the red circles indicates the tweeting rate per one million people, normalized by the method described earlier. Users can select each city interactively to view detailed information about the tweeting rate calculation. Analysis results are also shown in different types of web maps depending on the spatial granularity of interest, as mentioned earlier in the “Geo-targeted regional selection and multilevel spatial analysis” section (Figure 5).

The ultimate value of such dashboards is their ability to elucidate human dynamics in realspace. In order to evaluate the value of this dashboard, its preliminary utility in revealing a variety of social dynamics is illustrated by three independent case study exemplars.

Three case studies of SMART

When users access the SMART dashboard, at present they can select one topic from the topic list (Figure 8) to analyze the trends and popular messages. This topic list is potentially scalable. Here we present three case studies to illustrate the practical uses of the SMART system and the SMART dashboard in monitoring influenza outbreaks, public responses to the Ebola outbreak, and online discussions of marijuana legalization. These analytic functions of the SMART dashboard provide researchers with methods for formalizing and testing their research hypotheses and questions. However, these analytics results should not be used to give a direct answer or action for complicated research issues. Since the current analytic capability of SMART is still at a preliminary stage, the analysis results are simplified and generalized.

Case #1: Monitoring influenza outbreaks in U.S. cities


The influenza monitoring case study is built upon our previous work (Aslam et al., 2014; Nagel et al., 2013) by using the two keywords “flu” and “influenza” to collect related tweets from 31 major cities in the United States. The dashboard can display multilevel temporal trend analysis at the daily, weekly, and monthly scales for the 2014/2015 flu season (Figure 9). The system also includes historical data as supporting background information (Figure 9c). We compared the SMART graph with the manually collected flu records curated by the Centers for Disease Control (CDC). The weekly SMART flu tweeting trend is highly correlated with the CDC weekly FluView Influenza Positive Tests Reports (Figure 10).
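One common way to quantify how closely two weekly series track each other is the Pearson correlation coefficient; the text does not state which statistic was used, so its choice here is an assumption. The sketch below implements the coefficient directly, and the two short series in the usage lines are synthetic placeholders, not SMART or CDC data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Synthetic weekly values for illustration only.
weekly_tweet_rate = [120, 180, 260, 310, 240, 150]
weekly_positive_tests = [900, 1500, 2300, 2600, 2100, 1200]
print(pearson_r(weekly_tweet_rate, weekly_positive_tests))
```

Both series must be aligned to the same reporting weeks before the coefficient is computed, since daily tweet counts and weekly surveillance reports are published on different schedules.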

Case #2: Understanding public responses to the Ebola outbreak in the U.S. and West Africa

Flu outbreaks represent a situation expected to stimulate routine social media expressions. In contrast, Ebola represents a potential moral and existential panic that could reveal self-expressions of apprehension about the disease. Although flu kills thousands of people per year, it is also a routine experience. In contrast, Ebola is widely viewed as an exotic, novel, and univocally feared disease with an extremely high mortality rate. Media exposure of the few domestic Ebola cases, and the prolonged outbreak abroad, provided an episodic context in which to observe a moral panic in its dynamic social media presence. There are likely to be few counter-campaigns or memes competing for the attention of the public on such topics in this context.


population understands how to respond to an infection. Some diseases may require immediate treatment, whereas others might involve sheltering and isolation in place, or contact chaining (Brennan, 2014; SteelFisher et al., 2015). Although our current health system and the CDC have good tools for reporting, monitoring, and measuring the spread of disease infection cases, we do not have effective tools to measure public perception, fear, and response during significant public health crises. Such panics can be counterproductive, diverting needed resources and attention away from more effective interventions (Towers et al., 2015). Thus, it is within the interest of both local and federal public health officials to understand the nature of public perceptions of risks regarding Ebola, and perhaps mine the myths that are diffusing regarding public health response to such outbreaks and moral panics.

One potential use of SMART is to monitor public opinions and responses during or after disease outbreaks or disaster events. During the Ebola outbreak in 2014, we used SMART to track the public responses in West Africa and in the U.S. Twelve cities in West Africa, three U.S. cities with 20 mile buffer zones (New York, Dallas, and Cleveland), and five U.S. airports with 5 mile buffer zones (JFK, Newark Liberty, Washington Dulles, Atlanta Hartsfield Jackson, and Chicago O’Hare) were monitored (Figure 11). Figure 12 illustrates the most popular media (pictures) shared on Twitter in these regions. The temporal trends of Ebola-related tweets in U.S. cities are very different from those in the 12 West African cities. The most popular pictures shared on Twitter in U.S. cities are very negative or sarcastic (such as “Enjoy Ebola”). On the other hand, the most popular pictures and messages shared in West African cities are mainly disease prevention-oriented information or medical aid from international organizations. This knowledge can be used to gauge public opinions related to the topic and devise response strategies accordingly. This case study demonstrates that SMART can be used to listen to public opinions during crisis events or disease outbreaks.

Case #3: Online discussion about marijuana legalization


Working with domain expertise from the public health department at San Diego State University, 12 keywords related to the marijuana legalization movement in the U.S. were selected (legalization, pot, legalizing, anti-pot, anti-marijuana, medicalmarijuana,


Figure 11. SMART dashboard for public responses to Ebola in selected regions.

topic, especially around 20 April 2015 (“420”), which has become a countercultural observance day in North America on which celebrations and events related to cannabis are hosted (O’Sullivan, 2005). The SMART dashboard provides the opportunity to extract text content and embedded media, enabling inspection and interpretation of major activities of the Twitter user community on that specific day. In addition to the temporal trends, SMART also highlighted the most active Twitter users in this topic community. This information could be useful for researchers, political consultants, community organizers, NGOs, and government agencies to explore the opinion leaders and influential conversations in the topic-related social network, which can lead to further strategic and policy planning. For example, SMART was able to identify the Twitter user @THEMMEXCHANGE (The Medical Marijuana Exchange) in the conversation related to marijuana legalization. The handler of the @THEMMEXCHANGE account is a medical marijuana services and advertising company that is active and well recognized in the marijuana legalization discussions on Twitter.
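Ranking the most active users in a topic community, as described above, can be sketched in Python as a simple frequency count. This is only an illustration under stated assumptions: the tweet record layout (`user` and `created` fields), the made-up sample handles, and the `top_users` function are hypothetical and do not reflect how SMART stores or ranks its data.

```python
from collections import Counter

# Made-up sample records; the field names are assumptions for this sketch.
SAMPLE_TWEETS = [
    {"user": "user_a", "created": "2015-04-20"},
    {"user": "user_a", "created": "2015-04-20"},
    {"user": "user_a", "created": "2015-04-20"},
    {"user": "user_b", "created": "2015-04-20"},
    {"user": "user_b", "created": "2015-04-20"},
    {"user": "user_c", "created": "2015-04-19"},
]


def top_users(tweets, day=None, n=3):
    """Return the n most active users as (user, tweet_count) pairs.

    If day is given (an ISO date string), only tweets created on that
    day are counted; otherwise all tweets are counted.
    """
    selected = [t for t in tweets if day is None or t["created"] == day]
    return Counter(t["user"] for t in selected).most_common(n)
```

Restricting the count to a single day mirrors the inspection of activity on a specific date such as 20 April 2015. Raw tweet volume is only one possible proxy for influence; retweet or mention counts could be ranked in the same way.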

Evaluation of user interface and the system

Shneiderman (1997) suggests that a successful user interface design depends on three aspects: (a) guidelines, documents, and processes, which provide a clear set of principles for developers to follow and ensure harmony in design across products; (b) user interface software tools, which are used for prototyping and implementing the actual design; and (c) expert reviews and usability testing, which ensure a successful product for users. To evaluate the design and usability of SMART, we invited nine graduate students from San Diego State University, with domain expertise including geography, computer science, public health, communication, and sociology, as test users to conduct expert reviews of the user interface design and functionalities. After a brief introduction to SMART, each reviewer was given seven days to test SMART, from 9 February to 15 February. The review and testing focused on identifying usability issues and pointing out the positive and negative user experiences that should be maintained or improved.

Figure 12. Temporal trends of public responses to the Ebola outbreaks in (a) West Africa and (b) U.S. cities.

Based on the aggregated expert review reports, the identified high priority usability issues included:

1. Loading of the page can be slower for topics with a large number of messages.

2. Users cannot directly add new keywords inside the SMART dashboard.

3. The updating frequencies of analysis results differ across topics.

4. It would be nice to have a quick tutorial page before entering the system.

5. Users cannot customize the display of modules in the SMART dashboard.

In addition to collecting suggestions on significant usability issues, we also asked the reviewers to narrow their comments down to three categories of the system and user interface design: efficient navigation, module organization and consistency, and matching of user expectations. Based on the aggregated suggestions and comments, key items of each category are described as follows.

Efficient navigation

• Not very intuitive to trigger the popup windows for detailed tweets;

• Feedback submission form exceeds the screen extent at low viewing resolutions;

• Several modules contain an “option” button, but its purpose is unclear without clicking it.

Module organization and consistency

• There are many analysis modules on the same page, making it somewhat visually cluttered;

• The time selection option is consistent across many analysis modules;

• The analysis modules are not aligned perfectly in the two columns;

• The title text for the Top URL module is not aligned with the time selection options.

Matching user expectations

• The Trend (timeline) module shows the changes of topic-related tweets in different temporal resolutions;

• Clicking on any tag of the Word Cloud module lists all the tweets containing the tag, which matches the expectation;

• The order of the module items in the left panel does not match the order of the actual module windows.

The expert reviews provide several important suggestions on how we could improve the functionalities and user experiences in future developments. Overall, SMART received positive feedback on its design, as recognized in the review summary: (a) the graphic user interface design is friendly and intuitive; (b) the system functions as an easy-to-use tool to capture tweets; and (c) the variety of analysis modules supports exploration of different topics.

Discussion and conclusion

The multilevel M3D model anticipates that digital messages such as tweets serve a significant function as a transmitter of cultural change in society. As such, the model conceptualizes multiple levels of memetic influence, each of which is reflected to varying degrees in SMART, from (a) recovering the content of memes in tweets; (b) to the communicator level identified by users; (c) to the social network level recoverable from follower links, retweets, and URL links; (d) to the societal level of identifying the trends of other topics competing for attention in cyberspace; and finally (e) to the geotechnical level of urban and population differences.

There are four major challenges in developing and implementing SMART based on our own experiences: (1) control and access of social media data, (2) concerns over user privacy, (3) spatial uncertainty, and (4) biased user populations. First, similar to many web observatory applications, SMART extracts user-generated content through public APIs from a commercial company, Twitter. Researchers do not have any control over, or service guarantee for, these public APIs administered by private companies. Situations such as the breakdown of the Twitter geo-search function are inevitable and can cause significant loss of data for researchers. Adding multiple social media data sources can be a possible solution for this challenge. SMART was originally designed to consume social media messages from Twitter. It is possible to extend the system design and data framework to collect other social media sources (such as Flickr, Instagram, Weibo, Foursquare, and YouTube). Adding more social media data sources might reduce the risk of missing key data due to API failure or access denial. However, the integration of multiple social media sources under a unified database framework will be another challenge of system implementation.


taken several steps to protect the privacy of social media users. For example, the data used in the analysis and the analytics presented on the SMART dashboard are restricted to the public information that users agreed to share with the social media services. In the analysis stages, geo-tagged messages are analyzed at aggregated spatial resolutions, such as cities or regions, to minimize the exposure of precise locations of social media users. At the visualization stage, to further enhance the protection, our privacy policy also implements a feedback procedure through which users can request to have specific conversations removed from the SMART dashboard. Another possible privacy protection method is to convert all users to anonymous IDs. However, the user anonymization process may prevent future analysis of social networks from these messages. Future work needs to strike a viable balance between the protection of user privacy and the usefulness of social media messages.
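The two privacy measures discussed above, spatial aggregation and conversion of users to anonymous IDs, can be sketched together in Python. This is a hedged illustration, not the procedure used in SMART: the keyed-hash pseudonym (HMAC-SHA256 truncated to 16 hex characters), the record fields `user` and `city`, and both function names are assumptions introduced here.

```python
import hashlib
import hmac
from collections import Counter


def pseudonymize(user_id, secret_key):
    """Map a user ID to a stable pseudonym using a keyed hash (HMAC-SHA256).

    The same user always maps to the same pseudonym under one key, but the
    original handle cannot be read back from the pseudonym.
    """
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability; an assumption of this sketch


def aggregate_by_city(tweets, secret_key):
    """Summarize tweets per city, keeping no coordinates and no raw user IDs."""
    tweet_counts = Counter()
    users_by_city = {}
    for tweet in tweets:
        city = tweet["city"]
        tweet_counts[city] += 1
        users_by_city.setdefault(city, set()).add(pseudonymize(tweet["user"], secret_key))
    return {
        city: {"tweets": tweet_counts[city], "unique_users": len(users_by_city[city])}
        for city in tweet_counts
    }
```

Because the pseudonyms are stable under a fixed key, repeated activity by the same account can still be linked, which preserves part of the network structure. Any analysis that depends on the public identity of an account is lost, which is the trade-off between privacy and usefulness noted above.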

Third, the spatial uncertainty in social media data affects the spatial analysis results of SMART. Social media messages were collected with different levels of location precision. Location information in the messages can vary from the most precise latitude/longitude coordinate points, to street addresses, to city names, to regions or place names that might not have clear boundaries at all. The procedure of aggregating geo-location information to different spatial granularities can introduce uncertainty to the final analysis results. Even though the geo-locating engine can geo-locate many incoming messages, the validation of actual user locations remains a difficult task in SMART.

Fourth, according to Pew Research Center (Duggan, 2015), young adults (age 15–24) are the majority of users of most social media services. This could introduce demographic bias for the analysis of social media messages. To adjust for this demographic bias problem, one potential direction is to use a variety of keywords as sampling tools to collect opinions from different age groups of a population. For example, the use of the keyword #healthcare may collect a majority of messages from senior citizens rather than teenage users. However, more validation studies are needed to verify the association between different keywords and different user groups.

Capturing human activities in real time using social media is now at the forefront of behavioral measurement and understanding human dynamics (Alshamsi et al., 2015). In this regard, the development of SMART is intended to provide a broader impact for both theory testing as well as future social response and monitoring. By following how the general population responds to trending topics, the role of public messaging in affecting population behaviors can be ascertained (Zubiaga et al., 2015). Such access will ultimately enable enhanced modeling of information propagation that can account for population-level behavioral responses and allow policymakers to tailor their communications to the specific needs of a given event or topic. This is especially relevant in public health cases such as influenza or disease outbreaks, where population responses can have a direct impact on the course of the outbreak, such as increased rates of vaccination, social isolation, and hand-washing. Further, knowledge of geographic disease diffusion and the communicative contents that correspond to such diffusion can be vital to designing and guiding public health campaigns.

To summarize, SMART can provide real-time spatiotemporal analytic functions for monitoring trending topics and online discussions by adding geo-targeting capability for cities and regions. Different cities and regions may reveal different patterns of conversations and trends. By linking space, time, and the content of social media messages, the SMART system is designed to discover meaningful patterns in, and insights into, events such as disease outbreaks or social activities. The three case studies presented demonstrate the value of SMART in tracking disease outbreaks, monitoring public discussion, and measuring public perception during significant public health crises. With the capability of adding new topics and keywords to the monitoring system in the future, SMART is not limited to the public health domain and can be extended to other domains such as disaster responses, social movement studies, marketing, and political campaigns.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This material is based upon work supported by the National Science Foundation under Grant No. 1416509, project titled “Spatiotemporal Modeling of Human Dynamics Across Social Media and Social Networks.” Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. The authors thank the other HDMA team members for their contributions to the development of SMART.

Notes

1. http://topsy.com/. Topsy was acquired by Apple Inc. in 2013, and the system was shut down in December 2015.

2. https://www.google.org/flutrends/about/. GFT had

3. https://nodejs.org/

4. https://www.mongodb.org/

5. http://getbootstrap.com/

6. http://dygraphs.com/

7. http://d3js.org/

8. http://leafletjs.com/

9. Unfortunately, an open issue of the geo-search returning sparse results was identified on 20 November 2014 and has not been resolved as of September 2015 (https://twittercommunity.com/t/search-api-returning-very-sparse-geocode-results/27998/90). To adjust to the situation, since 20 November 2014 SMART has queried only with topic keywords, without limiting results to any geographic area. A new geo-locating procedure was developed and implemented in SMART for geo-locating the incoming tweets since December 2014.

10. http://expressjs.com/

References

Alshamsi A, Pianesi F, Lepri B, et al. (2015) Beyond contagion: Reality mining reveals complex patterns of social influence. PLoS One 10(8): e0135740.

Andrienko G, Andrienko N, Bosch H, et al. (2013) Thematic patterns in georeferenced tweets through space-time visual analytics. Computing in Science and Engineering 15(3): 72–82.

Aslam AA, Tsou M-H, Spitzberg BH, et al. (2014) The reliability of tweets as a supplementary method of seasonal influenza surveillance. Journal of Medical Internet Research 16(11): e250.

Bahk CY, Scales DA, Mekaru SR, et al. (2015) Comparing timeliness, content, and disease severity of formal and informal source outbreak reporting. BMC Infectious Diseases 15: 135.

Barboza P, Vaillant L, Le Strat Y, et al. (2014) Factors influencing performance of internet-based biosurveillance systems used in epidemic intelligence for early detection of infectious diseases outbreaks. PLoS One 9(3): e90536.

Brennan R (2014) Reflections on the Ebola outbreak: Perspectives from the frontline. Journal of Communication in Healthcare 7(4): 248–249.

Brownstein JS, Freifeld CC, Reis BY, et al. (2008) Surveillance sans frontières: Internet-based emerging infectious disease intelligence and the HealthMap project. PLoS Med 5: 1–6.

Chae J, Thom D, Bosch H, et al. (2012) Spatiotemporal social media analytics for abnormal event detection and examination using seasonal-trend decomposition. In: IEEE conference on visual analytics science and technology 2012, VAST 2012–proceedings, New York, NY: IEEE Computer Society Press, pp.143–152.

Cheung CMK and Thadani DR (2012) The impact of electronic word-of-mouth communication: A literature analysis and integrative model. Decision Support Systems 54(1): 461–470.

Culotta A (2013) Lightweight methods to estimate influenza rates and alcohol sales volume from Twitter messages. Language Resources and Evaluation 47(1): 217–238.

Diakopoulos N, Naaman M and Kivran-Swaine F (2010) Diamonds in the rough: Social media visual analytics for journalistic inquiry. In: IEEE Conference on Visual Analytics Science and Technology 2010, VAST 2010–Proceedings, New York, NY: IEEE Computer Society Press, pp.115–122.

Duggan M (2015) Mobile Messaging and Social Media – 2015. Pew Research Center. Available at: http://www.pewinternet.org/2015/08/19/mobile-messaging-and-social-media-2015/.

Gajewski KN, Peterson AE, Chitale RA, et al. (2014) A review of evaluations of electronic event-based biosurveillance systems. PLoS One 9(10): e111222.

GeoNames (2015) GeoNames (online). Available at: http://geonames.org/ (accessed 27 September 2015).

Ghahremanlou L, Sherchan W and Thom JA (2014) Geotagging twitter messages in crisis management. The Computer Journal 58(9): 1937–1954.

Ginsberg J, Mohebbi MH, Patel RS, et al. (2009) Detecting influenza epidemics using search engine query data. Nature 457(7232): 1012–1014.

Graham M, Hale SA and Gaffney D (2014) Where in the world are you? Geolocation and language identification in Twitter. The Professional Geographer 66(4): 568–578.

Hanson CL, Burton SH, Giraud-Carrier C, et al. (2013) Tweaking and tweeting: Exploring Twitter for nonmedical use of a psychostimulant drug (Adderall) among college students. Journal of Medical Internet Research 15(4): e62.

Hawelka B, Sitko I, Beinat E, et al. (2014) Geo-located Twitter as proxy for global mobility patterns. Cartography and Geographic Information Science 41(3): 260–271.

Hecht B, Hong L, Suh B, et al. (2011) Tweets from Justin Bieber’s heart: The dynamics of the location field in user profiles. In: Proceedings of the SIGCHI conference on human factors in computing systems. New York, NY: ACM, pp.237–246.

Helsloot I and Groenendaal J (2013) Twitter: An underutilized potential during sudden crises? Journal of Contingencies and Crisis Management 21(3): 178–183.

Huston JE, Mekaru SR, Kluberg S, et al. (2015) Searching the Web for influenza vaccines: HealthMap Vaccine Finder. American Journal of Public Health 105(8): e134–e139.

Joachims T (1998) Text categorization with support vector machines: Learning with many relevant features. Machine Learning: ECML-98 1398: 137–142.

Kamel Boulos MN, Resch B, Crowley DN, et al. (2011) Crowdsourcing, citizen sensing and sensor web technologies for public and environmental health surveillance and crisis management: Trends, OGC standards and application examples. International Journal of Health Geographics 10(1): 67.


Kuo BY-L, Hentrich T, Good BM, et al. (2007) Tag clouds for summarizing web search results. In: Proceedings of the 16th international conference on World Wide Web – WWW ’07. New York, NY: ACM, pp.1203–1204.

Li W, Serdyukov P, de Vries AP, et al. (2011) The where in the tweet. In: Proceedings of the 20th ACM international conference on Information and knowledge management – CIKM ’11. New York, NY: ACM, p.2473.

MacEachren AM, Robinson AC, Jaiswal A, et al. (2011) SensePlace2: GeoTwitter analytics support for situational awareness. In: IEEE Conference on Visual Analytics Science and Technology 2011, VAST 2011–Proceedings. New York, NY: IEEE Computer Society Press, pp.181–190.

McKelvey K and Menczer F (2013) Design and prototyping of a social media observatory. In: Proceedings of the 22nd international conference on World Wide Web. Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee. New York: ACM, pp.1351–1358.

McKenzie G, Janowicz K, Gao S, et al. (2015) POI pulse: A multi-granular, semantic signature–based information observatory for the interactive visualization of big geosocial data. Cartographica: The International Journal for Geographic Information and Geovisualization 50(2): 71–85.

Mahmud J, Nichols J and Drews C (2014) Home location identification of Twitter users. ACM Trans Intell Syst Technol 5(3): 47:1–47:21.

Miao J, Feng J, Zhou L, et al. (2014) The construction and application research of crisis early warning mechanism of reputation of colleges and universities in the new media environment. Journal of Chemical and Pharmaceutical Research 6(6): 202.

Nagar R, Yuan Q, Freifeld CC, et al. (2014) A case study of the New York City 2012/2013 influenza season with daily geocoded Twitter data from temporal and spatiotemporal perspectives. Journal of Medical Internet Research 16(10): e236.

Nagel AC, Tsou M-H, Spitzberg BH, et al. (2013) The complex relationship of realspace events and messages in cyberspace: Case study of influenza and pertussis using tweets. Journal of Medical Internet Research 15(10): e237.

Olsen J (2013) Infodemiology to improve public health situational awareness: An investigation of 2010 pertussis outbreaks in California, Michigan and Ohio. PhD dissertation. The University of North Carolina at Chapel Hill, ProQuest, UMI Dissertations Publishing, 2013, p.3562785.

O’Sullivan J (2005) It’s 420: Do you know where your children are? Adolescent substance use and misuse. Clinical Excellence for Nurse Practitioners 9(3): 127–129.

Rodrigues E, Assuncao R, Pappa GL, et al. (2013) Uncovering the location of Twitter users. In: 2013 Brazilian Conference on Intelligent Systems (BRACIS). Washington, DC: IEEE Computer Society.

Sakaki T, Okazaki M and Matsuo Y (2010) Earthquake shakes Twitter users: Real-time event detection by social sensors. In: WWW’10: Proceedings of the 19th international conference on World wide web. New York, NY: ACM, p.851.

Salton G and Buckley C (1988) Term-weighting approaches in automatic text retrieval. Information Processing and Management 24(5): 513–523.

Sasahara K, Hirata Y, Toyoda M, et al. (2013) Quantifying collective attention from tweet stream. PLoS ONE 8(4): 1–10.

Schwind JS, Wolking DJ, Brownstein JS, et al. (2014) Evaluation of local media surveillance for improved disease recognition and monitoring in global hotspot regions. PLoS ONE 9(10): e110236.

Segev E, Nissenbaum A, Stolero N, et al. (2015) Families and networks of internet memes: The relationship between cohesiveness, uniqueness, and quiddity concreteness. Journal of Computer-Mediated Communication 20: n/a–n/a.

Seltzer EK, Jean NS, Kramer-Golinkoff E, et al. (2015) The content of social media’s shared images about Ebola: A retrospective study. Public Health 129(9): 1273–1277.

Shneiderman B (1997) Designing the User Interface: Strategies for Effective Human-Computer Interaction, 3rd ed. Boston, MA: Addison-Wesley Longman Publishing Co., Inc.

Slavkovikj V, Verstockt S, Van Hoecke S, et al. (2014) Review of wildfire detection using social media. Fire Safety Journal 68: 109–118.

Snoeijers EM, Poels K and Nicolay C (2014) #universitycrisis: The impact of social media type, source, and information on student responses toward a university crisis. Social Science Computer Review 32(5): 647–661.

Spitzberg BH (2014) Toward a model of meme diffusion (M3D). Communication Theory 24(3): 311–339.

SteelFisher GK, Blendon RJ and Lasala-Blanco N (2015) Ebola in the United States – Public reactions and implications. New England Journal of Medicine 373(9): 789–791.

Sutton J, Spiro ES, Johnson B, et al. (2013) Warning tweets: Serial transmission of messages during the warning phase of a disaster event. Information, Communication and Society 17(6): 765–787.

Takhteyev Y, Gruzd A and Wellman B (2012) Geography of Twitter networks. Social Networks 34(1): 73–81.

Thom D, Bosch H, Koch S, et al. (2012) Spatiotemporal anomaly detection through visual analysis of geolocated Twitter messages. In: IEEE Pacific Visualization Symposium 2012, PacificVis 2012–Proceedings. Washington, DC: IEEE Computer Society, pp.41–48.

Towers S, Afzal S, Bernal G, et al. (2015) Mass media and the contagion of fear: The case of Ebola in America. PLoS ONE 10(6): e0129179.

Tsou M-H (2015) Research challenges and opportunities in mapping social media and Big Data. Cartography and Geographic Information Science 42(supp1): 70–74.

Tsou M-H, Kim I-H, Wandersee S, et al. (2014) Mapping ideas from cyberspace to realspace: Visualizing the spatial context of keywords from web page search results. International Journal of Digital Earth 7(4): 316–335.

Tsou M-H and Leitner M (2013) Visualization of social

Tsou M-H, Yang J-A, Lusher D, et al. (2013) Mapping social activities and concepts with social media (Twitter) and web search engines (Yahoo and Bing): A case study in 2012 US Presidential Election. Cartography and Geographic Information Science 40(4): 337–348.

van der Veen JS, van der Waaij B and Meijer RJ (2012) Sensor data storage performance: SQL or NoSQL, physical or virtual. In: 2012 IEEE fifth international conference on cloud computing, pp.431–438.

Wang C and Zhang P (2012) The evolution of social commerce: The people, management, technology, and information dimensions. Communications of the Association for Information Systems 31. Article 5.

Young SD and Shoptaw S (2013) Stimulant use among African American and Latino MSM social networking users. Journal of Addictive Diseases 32(1): 39–45.

Zubiaga A, Spina D, Martínez R, et al. (2015) Real-time classification of Twitter trends. Journal of the Association for Information Science and Technology 66(3): 462–473.