Influence in social networks

Based on the action of a user following another user in the network, we can construct a directed graph. We focus mainly on the phenomenon of influence, which has been a very active field of research over the last 10 years.

Introduction

Information diffusion models

Independent Cascade model

The set X_u^I can be understood as the influence set of node u under the random activation of edges I. We consider the influence set of the seed set S to be the union of the influence sets of all nodes included in S.
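A minimal sketch of this definition, assuming the edges are stored in a dictionary that maps (u, v) to an activation probability. The function names and the data layout are our own and purely illustrative. One random activation of the edges is sampled, the influence set of a node is the set of nodes reachable over activated edges, and the influence set of S is the union over its members.

```python
import random
from collections import deque


def sample_live_edges(edge_probabilities, rng):
    """Flip one coin per edge; return the set of edges that came up 'live'."""
    return {edge for edge, p in edge_probabilities.items() if rng.random() < p}


def influence_set(node, live_edges):
    """Nodes reachable from `node` along live edges (the node itself included)."""
    adjacency = {}
    for u, v in live_edges:
        adjacency.setdefault(u, []).append(v)
    reached = {node}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency.get(current, []):
            if neighbor not in reached:
                reached.add(neighbor)
                queue.append(neighbor)
    return reached


def seed_influence_set(seeds, live_edges):
    """Influence set of a seed set: union of the influence sets of its nodes."""
    result = set()
    for seed in seeds:
        result |= influence_set(seed, live_edges)
    return result


# Hypothetical toy graph; probabilities 1.0 and 0.0 make the outcome deterministic.
toy_edges = {("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "d"): 0.0, ("e", "d"): 1.0}
live = sample_live_edges(toy_edges, random.Random(0))
```

Averaging the size of seed_influence_set over many sampled activations gives a Monte Carlo estimate of the expected spread.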

Linear Threshold model

The meaning of the above expression is that the total influence is the sum, over all subsets, of the probability of obtaining a given subset multiplied by the corresponding number of activated nodes. It can be shown that the influence function is submodular for any instance of the Triggering Model, and that the Linear Threshold Model reduces to an instance of it.
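The expectation described above is usually estimated by simulation rather than by enumerating every outcome. The following is a rough sketch for the Linear Threshold model, assuming each node draws a uniform random threshold and that the incoming weights of a node sum to at most 1. The names simulate_linear_threshold and estimate_spread, and the dictionary layout, are illustrative assumptions.

```python
import random


def simulate_linear_threshold(in_weights, seeds, rng):
    """Run one Linear Threshold diffusion and return the final active set.

    in_weights maps each node to a dict {in_neighbor: weight}.
    """
    thresholds = {node: rng.random() for node in in_weights}
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for node, weights in in_weights.items():
            if node in active:
                continue
            total = sum(w for u, w in weights.items() if u in active)
            # A node activates once the weight of its active in-neighbors
            # reaches its threshold; nodes with no active in-neighbor stay idle.
            if total > 0 and total >= thresholds[node]:
                active.add(node)
                changed = True
    return active


def estimate_spread(in_weights, seeds, runs=1000, seed=0):
    """Monte Carlo estimate of the expected number of activated nodes."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        total += len(simulate_linear_threshold(in_weights, seeds, rng))
    return total / runs
```

The number of runs controls the variance of the estimate; the default of 1000 is an arbitrary choice for this sketch.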

Number of simulations

Hill-climbing algorithm

Efficiency of the Hill-climbing algorithm

In this section we show that the hill-climbing algorithm guarantees that the returned seed set achieves an influence of at least (1−1/e) times the influence of the optimal seed set. It is important to note that the (1−1/e) term arises in the limit as the seed set size goes to infinity.
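A minimal sketch of the hill-climbing procedure, assuming the spread function is supplied as a callable on seed sets. The toy coverage function used here is a stand-in for the simulated influence function and is our own illustration. The helper greedy_guarantee evaluates the standard finite-size bound 1 − (1 − 1/k)^k for monotone submodular functions, which decreases towards 1 − 1/e as k grows.

```python
def greedy_hill_climbing(candidates, spread, k):
    """Pick k seeds, each time adding the node with the largest marginal gain."""
    seeds = set()
    for _ in range(k):
        base = spread(seeds)
        best_node, best_gain = None, -1.0
        # sorted() only makes tie-breaking deterministic in this sketch
        for node in sorted(candidates - seeds):
            gain = spread(seeds | {node}) - base
            if gain > best_gain:
                best_node, best_gain = node, gain
        if best_node is None:
            break
        seeds.add(best_node)
    return seeds


def greedy_guarantee(k):
    """Approximation factor of the greedy algorithm for a seed set of size k."""
    return 1.0 - (1.0 - 1.0 / k) ** k


# Hypothetical toy instance: each node "covers" a fixed set of items.
reach = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}


def coverage(seeds):
    """Number of distinct items covered; monotone and submodular."""
    covered = set()
    for node in seeds:
        covered |= reach[node]
    return len(covered)


chosen = greedy_hill_climbing(set(reach), coverage, k=2)
```

In practice the spread callable would be a Monte Carlo estimate, so each marginal gain is itself noisy and expensive to evaluate.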

Lazy Evaluation

In this section, we describe algorithms that can be used to learn the influence probabilities among social connections. These algorithms rely on knowledge of the network at a given time and of the actions taken by each user.

Problem description and general approach

Formal problem definition

We assume that all users in the action_log are present in the social graph and that each user performs an action only once. A propagation graph is a directed graph, with edges connecting users in the propagation direction as indicated by time constraints.
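As an illustration, the propagation graph of a single action could be built as follows. This is a sketch under our own assumptions: the action log is a list of (user, action, time) tuples, and tie_created maps a directed pair (v, u), with information flowing from v to u, to the time the social tie was created. Both names and layouts are hypothetical.

```python
def propagation_graph(action, action_log, tie_created):
    """Return the set of edges (v, u) along which `action` propagated."""
    # Each user performs an action at most once, so one timestamp per user.
    times = {user: t for user, a, t in action_log if a == action}
    edges = set()
    for (v, u), created in tie_created.items():
        if v not in times or u not in times:
            continue
        # v acted first, and the tie already existed when v acted.
        if times[v] < times[u] and created <= times[v]:
            edges.add((v, u))
    return edges


# Hypothetical toy data.
log = [("v", "a", 1), ("w", "a", 2), ("u", "a", 3), ("u", "b", 1)]
ties = {("v", "u"): 0, ("w", "u"): 2.5, ("u", "v"): 0}
graph_a = propagation_graph("a", log, ties)
```

Here the tie (w, u) is excluded because it was created after w had already performed the action.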

Mathematical background

As we can see in the previous definition, for an action to be considered propagated, the social tie between the two users performing it must exist before either of them performs the action. It also follows that the effect of adding a new node to the active set is always greater than or equal to zero.

Models

Static models

In order to distinguish between users who are influenced and users who are driven by external factors, we define an influence score. This score is the ratio of the number of actions for which we have evidence that the user was influenced by their local network to the total number of actions taken by the user. We should note here that the previous definition, as well as the following definitions, do not account for the fact that not all user actions should be included in the calculations.

Formally, we should only include actions taken by the user after the social connection v → u was created. The Jaccard index is defined as the size of the intersection divided by the size of the union of the sample sets.

Partial credits (PC)

When a user in the network performs an action, it may mean that they have been influenced by several of their friends/neighbors.

With partial credits, we could use Bernoulli or Jaccard as a base model.

Continuous Time Models (CT)

It would therefore make sense to attribute this event evenly to the previously activated neighbors. If the set of previously activated neighbors has size |S| = d, we assign equal credit 1/d to each of those neighbors.

Because this function depends on both time and the activated neighbors, we cannot calculate its value incrementally when a new neighbor is activated, as the influence probabilities of each previously activated neighbor will be different.
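The equal-credit rule (1/d for each of the d previously activated neighbors) can be sketched as follows. The function name and the dictionary of activation times are illustrative assumptions, not part of the original algorithm listing.

```python
def partial_credits(user, neighbors, activation_times):
    """Split one unit of credit evenly among neighbors that acted before `user`."""
    t_user = activation_times[user]
    earlier = [
        v for v in neighbors
        if v in activation_times and activation_times[v] < t_user
    ]
    if not earlier:
        return {}  # nobody to credit: the user initiated the action
    credit = 1.0 / len(earlier)
    return {v: credit for v in earlier}


# Hypothetical toy data: "c" acted after "u" and "z" never acted.
times = {"u": 5, "a": 1, "b": 2, "c": 7}
credits = partial_credits("u", ["a", "b", "c", "z"], times)
```

Summing these credits per edge over all actions gives the partial-credit counterpart of the propagated-action count.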

Since each influence probability is maximum when the neighbor first performs the action, we expect the joint influence probability function to have as many local maxima as the number of active neighbors.
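To make the shape of the joint probability concrete, here is a small sketch. It assumes that each edge probability decays exponentially from its maximum at the neighbor's activation time and that neighbors influence the user independently, so the joint probability is 1 minus the product of the individual failure probabilities. Both the decay form and the parameter names are assumptions made for illustration.

```python
import math


def edge_probability(p_max, t_activated, mean_delay, t):
    """Influence probability of one neighbor at time t (zero before it acts)."""
    if t < t_activated:
        return 0.0
    return p_max * math.exp(-(t - t_activated) / mean_delay)


def joint_probability(neighbors, t):
    """Probability that at least one active neighbor influences the user at t.

    neighbors is a list of (p_max, t_activated, mean_delay) triples.
    """
    not_influenced = 1.0
    for p_max, t_activated, mean_delay in neighbors:
        not_influenced *= 1.0 - edge_probability(p_max, t_activated, mean_delay, t)
    return 1.0 - not_influenced


# Two hypothetical neighbors activated at times 0 and 5.
pair = [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)]
```

Evaluating joint_probability(pair, t) on a grid of t values shows one peak at each activation time, which is why the value cannot be maintained with a simple incremental update.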

Discrete Time Models (DT)

Learning algorithms

The essential difference in the second run of the algorithm is that the condition for considering an action as propagated from user v to user u is not based only on the fact that t_u > t_v; we also require that the delay t_u − t_v be smaller than the average delay for the connection (v, u). In the previous algorithms, E_{t_v} denotes the edges (the outgoing edges, if we have a directed graph) present at time t_v.
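The two runs can be sketched as follows, assuming the candidate propagation events have already been extracted as (v, u, t_v, t_u) tuples. The function names are hypothetical, and the strict comparison against the average delay follows the wording above; a non-strict comparison may be what is intended in the original algorithm.

```python
from collections import defaultdict


def mean_delays(events):
    """First run: average propagation delay for every edge (v, u)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for v, u, t_v, t_u in events:
        if t_u > t_v:
            totals[(v, u)] += t_u - t_v
            counts[(v, u)] += 1
    return {edge: totals[edge] / counts[edge] for edge in counts}


def count_propagations(events, delays=None):
    """Count propagated actions per edge.

    With delays=None only t_u > t_v is required (first run); otherwise the
    delay must also be smaller than the edge's average delay (second run).
    """
    counts = defaultdict(int)
    for v, u, t_v, t_u in events:
        if t_u <= t_v:
            continue
        if delays is not None and (t_u - t_v) >= delays[(v, u)]:
            continue
        counts[(v, u)] += 1
    return dict(counts)


# Hypothetical events on a single edge with delays 1, 3 and 8 (average 4).
events = [("v", "u", 0, 1), ("v", "u", 0, 3), ("v", "u", 0, 8), ("u", "v", 5, 2)]
delays = mean_delays(events)
```

With these events the first run counts three propagations on (v, u), while the second run keeps only the two whose delay is below the average.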

Evaluation algorithms

If node v is not present in the results_table, it means that it performed the action without being influenced by any of its neighbors (see footnote 1), so we consider it the initiator of the action for its neighborhood and set perform_v = 2. Also, some neighbors of v may already be present in the table; for those, we gradually update their probability of performing the action. When all action tuples have been read, the results_table must include all users who performed the action and all users who are neighbors of at least one user who performed the action.

First, we ignore the cases where none of the user's friends are active. We count as TP the cases where a user performs an action, at least one of their neighbors performs the action before them, and the model predicts that the user performs the action; TN, FP and FN are defined analogously.

Footnote 1: Such contradictions are possible when we run the algorithm on GitHub data, because we only have partial knowledge of the graph.
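The TP, FP, TN and FN counts described above can be computed from the rows of the results table once a decision threshold is fixed. The sketch below assumes each row has already been reduced to a pair (predicted probability, performed flag) and that users with no active friends were filtered out beforehand; the function name and row layout are our own.

```python
def confusion_counts(rows, threshold):
    """Count TP, FP, TN and FN for a given probability threshold."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for probability, performed in rows:
        predicted = probability >= threshold
        if predicted and performed:
            counts["TP"] += 1
        elif predicted and not performed:
            counts["FP"] += 1
        elif not predicted and performed:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


# Hypothetical rows: (predicted probability, did the user perform the action?)
rows = [(0.9, True), (0.2, True), (0.7, False), (0.1, False)]
matrix = confusion_counts(rows, threshold=0.5)
```

Sweeping the threshold from 0 to 1 and recomputing the counts yields the points of a ROC curve.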

A subtle thing to keep in mind in our discussion of directed social graphs is the directionality of edges.

Digg

Statistics regarding learning probability

One of the features we learn in phase 1 is the number of propagated actions for each edge of the graph. If the connection between two users was present before either user performed the action, and if user v performed the action before user u (we remind the reader that the flow of information is from the user being followed to the follower), then we assume that the action propagated from user v to user u. In phase 1, we also keep track of the time it takes for an action to propagate from one user to another.

This way we can calculate the average time it takes for an action to propagate through each edge. The last quantity learned in phase 1 of the learning algorithm is the partial credit for each edge. Based on the data learned in phase 1 of the algorithm on the Digg dataset, we run the evaluation algorithm, which produces the confusion matrix.

The following plot shows the ROC curves of the static Bernoulli model for different values of influence.

Utilizing the different types of actions

A note we should make here is that ROC curves may not be ideal when evaluating unbalanced data sets. The reason for this is that, for each user who performs an action, we scan all of their neighbors to see if they performed the action themselves. In the following plot, we display the precision-recall curve for the same evaluation.

We can see that there is a clear improvement when distinguishing between influence probabilities for the different types of actions. Another conclusion we can draw from this plot is that there is quite a lot of room for improvement, something that is not entirely clear from the ROC curves.
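A small numerical illustration of why ROC curves can look optimistic on unbalanced data. The confusion-matrix counts below are invented for the example and are not taken from our experiments; the function name is likewise illustrative.

```python
def rates(tp, fp, tn, fn):
    """True-positive rate, false-positive rate and precision from raw counts.

    Assumes every denominator is non-zero, which holds for the example below.
    """
    return {
        "tpr": tp / (tp + fn),        # also called recall
        "fpr": fp / (fp + tn),
        "precision": tp / (tp + fp),
    }


# Invented counts: 100 positives against 99,900 negatives.
example = rates(tp=80, fp=900, tn=99000, fn=20)
```

With these counts the point (FPR below 0.01, TPR of 0.8) sits close to the ideal corner of the ROC plot, yet fewer than one in ten predicted positives is correct, which only the precision-recall view reveals.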

GitHub archive

In the following figure, we can see the number of votes divided by the number of repositories. As we can see in the following histogram, the GitHub social graph exhibits the very clear power-law degree distribution found in social networks and other "scale-free" [11] networks, as these are commonly referred to throughout the literature. In the following histogram depicting the propagated actions found in the first phase of the learning algorithm, we can see that we have a somewhat worse action log compared to the Digg network.

Although we have a rather large action log, it is much sparser when we look at the number of propagated actions. The reason for this could either be something we overlooked in the pre-processing of our data set, or simply the fact that the technical and scientific nature of this network's users makes them less prone to peer influence. As one would expect from the histogram of propagated actions, the influence probabilities in the GitHub network are somewhat lower.

In the graphs below we can see the ROC curves produced by the evaluation algorithm for different values of influence.

Flickr API

In order to run the influence maximization algorithm, we need to obtain the social graph and the influence probabilities for each of its edges. This probability is proportional to the degree of the node, as expressed by the following equation in terms of the node degree. Based on the learning process described in Chapter 4, we were able to calculate influence probabilities for a large part of the edges of the social graph.

In order to run the influence maximization algorithm in a reasonable amount of time, we need to further reduce the size of the graph, as we saw through the different runs on the artificial graphs in the previous section. Under both of these models, the influence function remains submodular, which ensures the approximation guarantee of the greedy algorithm. We proceed with the implementation of the algorithms proposed in the literature, which we use to calculate the influence probabilities and to evaluate the accuracy of the various proposed models.

D. Kempe, J. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.