Structure and Dynamics of Information Pathways in Online Media

Finding persistent topics and the underlying network


 

ABSTRACT

Diffusion of information, spread of rumors and infectious diseases are all instances of stochastic processes that occur over the edges of an underlying network.

Many times networks over which contagions spread are unobserved, and such networks are often dynamic and change over time. In this paper, we investigate the problem of inferring dynamic networks based on information diffusion data. We assume there is an unobserved dynamic network that changes over time, while we observe the results of a dynamic process spreading over the edges of the network. The task then is to infer the edges and the dynamics of the underlying network.

 

Summary: from the observable results of a dynamic diffusion process, recover the hidden dynamic network over which information spreads, and infer the edges and dynamics of that underlying network.

We develop an on-line algorithm that relies on stochastic convex optimization to efficiently solve the dynamic network inference problem.

We apply our algorithm to information diffusion among 3.3 million mainstream media and blog sites and experiment with more than 179 million different pieces of information spreading over the network in a one year period. We study the evolution of information pathways in the online media space and find interesting insights. Information pathways for general recurrent topics are more stable across time than for on-going news events.

Clusters of news media sites and blogs often emerge and vanish in a matter of days for on-going news events. Major social movements and events involving civil population, such as the Libyan civil war or Syria's uprising, lead to an increased number of information pathways among blogs as well as an overall increase in the network centrality of blogs and social media sites.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database applications—Data mining

General Terms: Algorithms; Experimentation.

Keywords: Networks of diffusion, Information cascades, Blogs, News media, Meme-tracking, Social networks.

1. INTRODUCTION

  Networks represent a fundamental medium for spreading and diffusion of various types of behavior, information, rumors and diseases [27]. A contagion appears at some node of a network and then spreads like an epidemic from node to node over the edges of the underlying network. For example, in case of information diffusion, the contagion represents a piece of information [16, 18] and infection events correspond to times when nodes mention or copy the information from one of their neighbors in the network. Similarly, we can think about the spread of a new type of behavior or an action, e.g., purchasing a new cellphone [15], or the propagation of a contagious disease over the edges of the underlying social network [6].
   
  In the context of network diffusion, we often observe the temporal traces of diffusion while the pathways over which contagion spreads remain hidden. In other words, we observe the times when each node gets infected by the contagion, but the edges of the network that gave rise to the diffusion remain unobservable. For example, we can often measure and observe the time when people decide to adopt a new behavior while we do not explicitly observe which neighbor in the social network influenced them to do so. In case of information diffusion, we often observe people (or media sites) talking about a new piece of information without explicitly observing the path it took in the information diffusion network to reach the particular node of interest. And, epidemiologists often observe when a person gets sick but usually cannot tell who infected her. In all these examples, one can observe the infection events themselves while not knowing over which edges of the network the contagions spread. Therefore, one of the fundamental research problems in the context of network diffusion is inferring the structure of networks over which various types of contagions spread [10]. Moreover, many times networks over which contagions diffuse are not static but change over time. Depending on the type of contagion, the time of the day, or death of the existing and birth of new nodes, the underlying network may dynamically change and shift over time.
   
 

In recent years, several network inference algorithms have been developed [9, 10, 12, 20, 24, 30]. Some approaches infer only the network structure [10, 30], while others infer not only the network structure but also the strength or the average latency of every edge in the network [9, 20].

However, to the best of our knowledge, previous work has always assumed networks to be static and contagion pathways to be constant over time.

In most cases, however, networks are dynamic, and contagion pathways change over time, depending upon the contagions that propagate through them [22, 28].

For example, a blog can increase its popularity abruptly after one of its posts turns viral; this may create new edges in the information transmission network, and so the content the blog produces in the future will likely spread to larger parts of the network.

Similarly, at any given time a particular unexpected event may occur and a topic or piece of news may become very popular for a limited period of time.

This again will lead to different emerging and vanishing information pathways, and thus to a time-varying underlying network.

In order to better understand these temporal changes, one needs to reconstruct the time-varying structure and underlying temporal dynamics of these networks and then study the information pathways of real-world events, topics or content.

   
   
 

Our approach to time-varying network inference.

In this paper we investigate the problem of inferring dynamic networks based on information diffusion data.

We assume there is an unobserved dynamic network that changes over time, while we observe the node infection times of many different contagions spreading over the edges of the network.

The task then is to infer the edges and the dynamics of the underlying network.

We develop an efficient on-line dynamic network inference algorithm, INFOPATH, that allows us to infer daily networks of information diffusion between online media sites over a one year period using more than 179 million different contagions diffusing over the underlying media network.

   
 

We model diffusion processes as discrete networks of fully continuous temporal processes occurring at different rates, building on our previous work [9, 11]. Our model allows information to propagate at different rates across different edges by adopting a data-driven approach, where only the recorded temporal diffusion events are used.

The model considers the information which propagates through the network due only to diffusion, while ignoring any external sources [22]. However, our original diffusion model considered only static networks [9].

Here, we generalize the model and develop a new inference method to support dynamic networks. Our time-varying network inference algorithm, INFOPATH, uses stochastic gradient [26] to provide estimates of the time-varying structure and temporal dynamics of the inferred network.

 

The framework enables us to study the temporal evolution of information pathways in the online media space.

We apply the INFOPATH algorithm to synthetic as well as real Web information propagation data. We study 179 million different information cascades spreading among 3.3 million blog and news media sites over a one year period, from March 2011 till February 2012. Results on synthetic data show INFOPATH is able to track changes in the topology of dynamic networks and provides accurate on-line estimates of the time-varying transmission rates of the edges of the network.

INFOPATH is also robust across network topologies and temporal trends of edge transmission dynamics.

   
  Experiments on large-scale real news and social media data lead to interesting insights and findings. For example, we find that the information pathways over which general recurrent topics propagate remain more stable over time, while unexpected events lead to dramatically changing information pathways. Clusters of mainstream news and blogs often emerge and vanish in a matter of days, and our on-line algorithm is able to uncover such structures. News events that involve large-scale social movements, such as the Libyan civil war, Egypt's revolution or Syria's uprising, result in a greater increase in information transfer among blogs than among mainstream media. Perhaps surprisingly, the numbers of mainstream media sites and blogs among the most influential nodes for most topics or news events are comparable. However, we find that growing numbers of influential blogs on some topics or news events are often temporally correlated with large-scale social movements (e.g., the Occupy Wall Street movement in Sept-Nov 2011).
   
  Further related work. Previous methods for inferring diffusion networks [9, 10, 12, 20] also use a generative probabilistic model for modeling cascading processes over networks. NETINF [10] and MULTITREE [12] infer the network connectivity using submodular optimization. NETRATE [9] and CONNIE [20] infer not only the network connectivity but also transmission rates of infection or prior probabilities of infection using convex optimization. Moreover, there have also been attempts to model information diffusion without assuming the existence of an underlying network [33, 32].
   
 

However, to the best of our knowledge, all previous approaches to network inference assume the network and the underlying dynamics of the edges to be constant, i.e., the network structure and the transmission rates of each edge do not change over time.

Therefore, they consider the pathways over which information propagates to be time-invariant.

The main contribution of this paper is to combine stochastic gradient and the diffusion model introduced in [9] to develop an efficient on-line network inference algorithm that provides time-varying estimates of the edges of a network and the transmission rates of each edge.

This allows us to detect how information pathways emerge and vanish over time, and identify when nodes produce highly viral content.

   
  The remainder of the paper is organized as follows: in Sec. 2, we revisit the model of diffusion and state the dynamic network inference problem. Section 3 describes the proposed time-varying network inference method, called INFOPATH. Section 4 evaluates INFOPATH quantitatively and qualitatively using synthetic and real diffusion data. We conclude with a discussion of results in Section 5.
   

 

2. PROBLEM FORMULATION

  In this section, we build on our fully continuous time model of diffusion [9, 11]. We start by briefly describing the generative model for the observed data. We then revisit how to compute the likelihood of a cascade using the model and state the continuous time network inference problem for both static and dynamic networks. Across the section, we explicitly point out which assumptions of the original model need to be extended in order to support dynamic networks.
   
   Given a set of node infection times of many different contagions, our goal is to infer the underlying dynamic network over which contagions propagated. We apply the Maximum Likelihood principle in order to infer the network that most likely generated the observed data. We proceed by assuming a static network and describe the generative model of information diffusion. We then generalize the model to dynamic networks.
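The likelihood itself is not written out in this text. As a sketch only, and assuming the exponential transmission model named later in Section 4.1 together with the survival-analysis form of the static model in [9], the log-likelihood of a single cascade observed up to time T could be written as follows. The symbols S (survival function), H (hazard function) and T (end of the observation window) are assumptions introduced here for illustration; for the first infected node the inner sum over j is empty and that term is dropped.

```latex
\begin{align}
f(t_i \mid t_j; \alpha_{j,i}) &= \alpha_{j,i}\, e^{-\alpha_{j,i}(t_i - t_j)}, \qquad t_j < t_i \\
S(t_i \mid t_j; \alpha_{j,i}) &= e^{-\alpha_{j,i}(t_i - t_j)}, \qquad
H(t_i \mid t_j; \alpha_{j,i}) = \alpha_{j,i} \\
\log f(\mathbf{t}; A) &= \sum_{i:\, t_i \le T} \Big[
  \sum_{m:\, t_m > T} \log S(T \mid t_i; \alpha_{i,m})
  + \sum_{k:\, t_k < t_i} \log S(t_i \mid t_k; \alpha_{k,i})
  + \log \sum_{j:\, t_j < t_i} H(t_i \mid t_j; \alpha_{j,i}) \Big]
\end{align}
```

Under this form, maximizing the likelihood over nonnegative rates α_{j,i} is a convex problem for the static network; the dynamic case replaces each α_{j,i} by a time-varying α_{j,i}(t).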

4. EXPERIMENTAL EVALUATION

  We evaluate the performance of INFOPATH on time-varying synthetic networks that mimic the structure of real networks as well as on a dataset of more than 179 million information cascades extracted from 300 million blogs and news articles from 3.3 million media sites over a period of one year, from March 2011 till February 2012. All the data, code and additional results are available at the supporting website [1].
   
  4.1 Experiments on synthetic data
  The goal of the experiments with synthetic data is to understand how temporal changes in a network affect the performance of our algorithm. We aim to detect not only when an edge appears (i.e., its transmission rate becomes > 0) or disappears (i.e., its transmission rate becomes 0) but also provide instantaneous transmission rate estimates that track the true edge transmission rates over time.
   
  Experimental setup. First, we generate synthetic networks using Kronecker graph models of directed real-world networks [17]. For all our experiments, we consider two different Kronecker networks, both with 1,024 nodes and 2,048 edges: a core-periphery Kronecker network with parameter matrix [0.9,0.5; 0.5,0.3] and a hierarchical Kronecker network with parameter matrix [0.9,0.1; 0.1,0.9].
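To make the network generation step concrete, here is a minimal sketch of sampling a directed stochastic Kronecker graph from a 2x2 parameter matrix. It is not the authors' generator: the function name `kronecker_edges`, the edge-by-edge recursive placement, and the choice to drop self-loops and duplicate edges are all assumptions made for illustration.

```python
import random

def kronecker_edges(initiator, k, n_edges, seed=0):
    """Sample n_edges distinct directed edges on 2**k nodes.

    Each edge is placed by descending k levels; at every level one cell
    of the 2x2 initiator is picked with probability proportional to its
    entry (assumed sampling scheme, not taken from the paper).
    """
    rng = random.Random(seed)
    cells = [(r, c) for r in range(2) for c in range(2)]
    weights = [initiator[r][c] for r, c in cells]
    edges = set()
    while len(edges) < n_edges:
        src = dst = 0
        for _ in range(k):
            r, c = rng.choices(cells, weights=weights)[0]
            src, dst = 2 * src + r, 2 * dst + c
        if src != dst:  # assumption: self-loops are not wanted
            edges.add((src, dst))
    return edges

# Parameter matrices quoted in the text; k = 10 gives 2**10 = 1,024 nodes.
core_periphery = kronecker_edges([[0.9, 0.5], [0.5, 0.3]], k=10, n_edges=2048)
hierarchical = kronecker_edges([[0.9, 0.1], [0.1, 0.9]], k=10, n_edges=2048)
```

Sampling a fixed number of edges, rather than flipping a coin for every node pair, keeps the edge count at exactly 2,048 as stated above.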
   
 

The next step is to make each edge follow a particular edge transmission rate evolution pattern. Our goal later will be to recover the network as well as the evolution of the transmission rate of each individual edge.

 
 

Figure 1: True and inferred edge transmission rates for edges with four different transmission rate evolution patterns: (a) Slab, (b) Square, (c) Chainsaw, (d) Hump. Results are for the Kronecker core-periphery network with exponential edge transmission model for 200 time units with 1,000 cascades per time unit. Our INFOPATH method is able to track the evolving edge transmission rates over time. INFOPATH works better for continuously evolving edge transmission rates (c, d).

 

We consider five edge evolution patterns: Slab, Square, Chainsaw, Hump and Constant (see Figure 1). Slab and Hump patterns model outgoing connections of sites that become popular for a short period of time. Square and Chainsaw patterns model incoming connections to sites that perform updates periodically at specific times of the day or days of the week. The Constant pattern represents connections between sites that interact at any time and over a long period of time, usually large media sites. We consider Chainsaw, Hump and Constant to be examples of the Type I pattern, without discontinuities, and Slab and Square to be examples of the Type II pattern, with discontinuities.

   
 

We assign to each edge in the network an evolution pattern chosen uniformly at random from the set of the above five patterns. Then, we generate transmission rate values α_{j,i}(t) for each edge according to its chosen evolution pattern.
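The five patterns can be sketched as simple functions of time. The text gives only their names and whether they are continuous, so the exact shapes below (a triangle wave for Chainsaw, a Gaussian bump for Hump) and every parameter name and default (`peak`, `start`, `end`, `period`, `center`, `width`) are assumptions chosen for illustration.

```python
import math
import random

def constant(t, peak=1.0):
    return peak

def slab(t, peak=1.0, start=50.0, end=100.0):
    # On during one window only (Type II, discontinuous).
    return peak if start <= t < end else 0.0

def square(t, peak=1.0, period=40.0):
    # Periodic on/off (Type II, discontinuous).
    return peak if (t % period) < period / 2 else 0.0

def chainsaw(t, peak=1.0, period=40.0):
    # Triangle wave, so the rate stays continuous (Type I).
    phase = (t % period) / period
    return peak * (1.0 - abs(2.0 * phase - 1.0))

def hump(t, peak=1.0, center=100.0, width=20.0):
    # Smooth single bump (Type I).
    return peak * math.exp(-0.5 * ((t - center) / width) ** 2)

PATTERNS = [constant, slab, square, chainsaw, hump]

def assign_patterns(edges, rng):
    """Pick one pattern per edge uniformly at random."""
    return {edge: rng.choice(PATTERNS) for edge in edges}

rng = random.Random(0)
patterns = assign_patterns([(0, 1), (1, 2)], rng)
# alpha_{j,i}(t) at t = 10 for every edge (j, i)
rates_at_10 = {edge: fn(10.0) for edge, fn in patterns.items()}
```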

 

The evolving edge transmission rate α_{j,i}(t) models how quickly information spreads from one node to another. Finally, we generate 1,000 information cascades per time step. For each cascade we randomly pick the cascade initiator node.
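As an illustration of cascade generation, the sketch below simulates one cascade under an exponential transmission model, where the delay over edge (j, i) is drawn with rate α_{j,i}. The function name `simulate_cascade`, the `horizon` cutoff, and the simplification that each rate is held fixed at its value at the cascade start time are assumptions, not details taken from the text.

```python
import heapq
import random

def simulate_cascade(out_edges, rate, source, start, horizon, rng):
    """Return {node: infection time} for one cascade.

    out_edges: dict mapping node j to a list of successors i
    rate:      callable rate(j, i, t) giving alpha_{j,i}(t)
    Rates are evaluated once, at the cascade start time (assumption).
    """
    infected = {}
    heap = [(start, source)]
    while heap:
        t, node = heapq.heappop(heap)
        if node in infected or t > start + horizon:
            continue
        infected[node] = t  # earliest arrival wins
        for nbr in out_edges.get(node, ()):
            a = rate(node, nbr, start)
            if a > 0 and nbr not in infected:
                # Exponential delay with rate a over edge (node, nbr)
                heapq.heappush(heap, (t + rng.expovariate(a), nbr))
    return infected

rng = random.Random(1)
chain = {0: [1], 1: [2]}
times = simulate_cascade(chain, lambda j, i, t: 1.0, source=0,
                         start=0.0, horizon=1e9, rng=rng)
```

Processing candidate infections in time order with a heap means each node keeps only its earliest infection time, which matches the observation model where only node infection times are recorded.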

   
 

Given the node infection times from the recorded cascades, our goal then is to find the true edges of the network and for each edge discover its transmission rate evolution pattern.

In other words, we infer how each edge transmission rate α_{j,i}(t) evolves over time. Figure 1 shows the true and inferred edge transmission rates for four different edges, each with a different evolution pattern: Slab, Square, Chainsaw and Hump. Observe that INFOPATH is able to track the evolving edge transmission rates over time for all evolution patterns. INFOPATH gives near perfect performance when the edge transmission rate evolves continuously (Chainsaw, Hump). Interestingly, even when the edge transmission rate evolves discontinuously (Slab, Square), INFOPATH manages to track it.

 
   
 

Figure 2: Precision and Recall (P-R), Accuracy and Mean Squared Error (MSE) of our INFOPATH method against time. (a, c, e): Core-periphery (C-P) Kronecker network with exponential edge transmission model. (b, d, f): Hierarchical (HI) Kronecker network with Rayleigh edge transmission model. Performance on Type I (Chainsaw, Hump) and Type II (Slab, Square) edge transmission rate evolution patterns is plotted.

 

Figure 2 shows Precision, Recall, Accuracy, and MSE over time for the time-varying core-periphery Kronecker network with exponential edge transmission model, and hierarchical Kronecker network with Rayleigh edge transmission model. Observe that the performance of our method is stable across time, and as mentioned before, continuous evolution patterns are easier to track and estimate than discontinuous ones.
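The four measures can be computed from the true and estimated rates at a given time step roughly as follows. The text does not define them, so this is a sketch under assumptions: an edge counts as present when its rate is positive, accuracy is taken as one minus the number of mismatched edges divided by the total number of true plus inferred edges, and MSE is averaged over the union of true and inferred edges. The function name `evaluate` is hypothetical.

```python
def evaluate(true_rates, est_rates):
    """Return (precision, recall, accuracy, mse) for one time step.

    Both arguments map an edge (j, i) to its transmission rate.
    """
    true_edges = {e for e, a in true_rates.items() if a > 0}
    est_edges = {e for e, a in est_rates.items() if a > 0}
    hits = len(true_edges & est_edges)
    precision = hits / len(est_edges) if est_edges else 0.0
    recall = hits / len(true_edges) if true_edges else 0.0
    # Assumed accuracy definition: penalize both missed and spurious edges.
    total = len(true_edges) + len(est_edges)
    accuracy = 1.0 - len(true_edges ^ est_edges) / total if total else 1.0
    # Assumed MSE domain: every edge that is true or inferred.
    union = true_edges | est_edges
    mse = (sum((true_rates.get(e, 0.0) - est_rates.get(e, 0.0)) ** 2
               for e in union) / len(union)) if union else 0.0
    return precision, recall, accuracy, mse

truth = {(0, 1): 1.0, (1, 2): 2.0}
estimate = {(0, 1): 0.5, (2, 3): 1.0}
precision, recall, accuracy, mse = evaluate(truth, estimate)
```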


   
 

Accuracy vs. running time in static networks. Our stochastic gradient descent based method, INFOPATH, can also be used to speed up inference of static networks.

In such a scenario, stochastic gradient descent processes cascades in a random round-robin fashion. Here, we compare INFOPATH to the state-of-the-art methods for inference of static networks: NETINF [10] and NETRATE [9].
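The random round-robin idea can be sketched as projected stochastic gradient descent over single cascades. This is not the INFOPATH implementation, which is described in Section 3 and is not reproduced here; it assumes an exponential transmission model, and the names `sgd_step` and `fit_static` as well as the step size, initial rate and hazard floor are hypothetical.

```python
import random

def sgd_step(alpha, cascade, nodes, horizon, step=0.01, init=0.5, floor=1e-2):
    """One projected gradient step on the negative log-likelihood of a
    single cascade (exponential transmission model assumed)."""
    order = sorted(cascade.items(), key=lambda kv: kv[1])
    grad = {}
    for i, t_i in order:
        parents = [(j, t_j) for j, t_j in order if t_j < t_i]
        if parents:
            hazard = sum(alpha.get((j, i), init) for j, _ in parents)
            for j, t_j in parents:
                # survival term minus hazard term for edge (j, i)
                grad[(j, i)] = (t_i - t_j) - 1.0 / max(hazard, floor)
        for m in nodes:
            if m not in cascade:
                # i was infected but never infected m within the window
                grad[(i, m)] = horizon - t_i
    for edge, g in grad.items():
        # projection keeps every rate nonnegative
        alpha[edge] = max(0.0, alpha.get(edge, init) - step * g)

def fit_static(cascades, nodes, horizon, epochs=5, seed=0):
    """Visit every cascade once per epoch, in a freshly shuffled order."""
    rng = random.Random(seed)
    alpha = {}
    for _ in range(epochs):
        order = list(cascades)
        rng.shuffle(order)
        for cascade in order:
            sgd_step(alpha, cascade, nodes, horizon)
    return alpha

# Toy check: a single edge 0 -> 1 whose true rate is 2.0
data_rng = random.Random(0)
cascades = [{0: 0.0, 1: data_rng.expovariate(2.0)} for _ in range(2000)]
alpha = fit_static(cascades, nodes=[0, 1], horizon=50.0)
```

Each step touches only the edges that appear in one cascade, which is why this style of update can be cheaper per iteration than solving the full convex program over all cascades at once.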

First, we compare the methods by computing the accuracy against running time. Second, we compare INFOPATH to NETRATE in terms of mean squared error of the estimated transmission rates against the running time.

We omit NETINF from this last comparison since it only infers the network structure (and no edge transmission rates). For the sake of fairness in the running time comparison we implemented all methods in C++. Our C++ implementation of NETRATE is much faster than the public Matlab implementation.

   