[Reading Notes] 2010 CEAS Detecting Spammers on Twitter

Abstract

This paper describes a machine learning method for distinguishing spam users from non-spam users on Twitter, a popular microblogging service, based on user attributes covering both tweet content and user behavior. The authors first collected a large Twitter dataset comprising more than 54 million users, 1.9 billion social links, and almost 1.8 billion tweets (Twitter posts). Using tweets related to three famous trending topics from 2009, they constructed a large labeled collection of users, manually classified into spammers and non-spammers. They then identified a number of characteristics related to tweet content and user social behavior that could potentially be used to detect spammers, and used these characteristics as attributes in a machine learning process for classifying users as either spammers or non-spammers.

Background

Microblogs are popular social systems where users share and discuss almost everything, including news, jokes, their takes on events, and even their moods. They are popular all over the world: Twitter is the most popular microblogging system globally, while in China there are Sina Weibo, Tencent Weibo, and others. With a simple interface where only 140-character messages can be posted, Twitter is increasingly becoming a system for obtaining real-time information.

Because of Twitter's popularity, it attracts many spammers, and some features, e.g. trending topics, become targets that spammers exploit to mislead users toward completely unrelated websites. Since tweets usually contain shortened URLs, it is difficult for users to judge a URL's content without loading the webpage. It is therefore important to detect spammers and keep them from polluting the Twitter system.

Dataset

Dataset collection

Twitter assigns each user a numeric ID that uniquely identifies the user's profile. The authors launched their crawler in August 2009 to collect all user IDs ranging from 0 to 80 million. This work was done while one of the authors was visiting MPI-SWS. In order not to be blocked by Twitter, they asked Twitter for permission to collect the data, and Twitter white-listed 58 servers located at MPI-SWS. In total, the following data was collected:

  • 54,981,152 users with 1,963,263,821 social links, 8% of which are private;
  • 1,755,925,520 tweets.

The dataset is said to be shared on request by email. I tried to send an email, but it always fails; I am still trying.

The labeling process

Requirements:

  1. A significant number of spammers and non-spammers;
  2. The labeled collection needs to include the spammers that mostly affect the system;
  3. Users are chosen randomly and not based on their characteristics.

Three trending topics largely discussed in 2009 were chosen: (1) Michael Jackson's death (keywords: "Michael Jackson", #michaeljackson, #mj); (2) Susan Boyle's emergence (keywords: "Susan Boyle", #susanboyle); and (3) the hashtag "#musicmonday" (keyword: #musicmonday). Since spammers aim to mislead users toward unrelated URLs, the authors randomly selected users among those who posted at least one tweet containing a URL together with at least one of the keywords described above. Then a careful manual classification was performed. In order to minimize the impact of human error, two volunteers independently analyzed each user to label them. In case of a tie, a third independent volunteer was involved. In total, 8,207 users were labeled: 355 spammers and 7,852 non-spammers. Since the number of non-spammers is much higher, the authors randomly selected only 710 of the legitimate users to include in their collection. In the end, the labeled collection contained 1,065 users.
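The two-volunteer labeling with a third-vote tiebreaker described above can be sketched as follows. The function name `resolve_label` and the label strings are my own illustrative choices, not names from the paper:

```python
# Minimal sketch of majority-vote labeling: two volunteers label each user;
# a third independent vote breaks ties. Names here are hypothetical.

def resolve_label(vote_a, vote_b, tiebreak=None):
    """Return the agreed label; on disagreement, fall back to a third vote."""
    if vote_a == vote_b:
        return vote_a
    if tiebreak is None:
        raise ValueError("annotators disagree; a third vote is required")
    return tiebreak

# Two volunteers agree: no tiebreaker needed.
agreed = resolve_label("spammer", "spammer")
# Disagreement: the third independent vote decides.
tied = resolve_label("spammer", "non-spammer", tiebreak="non-spammer")
```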

User Attributes

  • Content attributes:
    • Basic attributes:
      • #hashtags / #words (maximum, minimum, average)
      • #URLs / #words (maximum, minimum, average)
      • #words (maximum, minimum, average)
      • #characters (maximum, minimum, average)
      • #users mentioned (maximum, minimum, average)
      • #URLs (maximum, minimum, average)
      • #times retweeted (maximum, minimum, average)
      • The fraction of spam words used
      • The fraction of reply messages
      • The fraction of tweets containing URLs
      • ...
    • User behavior attributes:
      • #followers
      • #followees
      • #followers / #followees
      • #tweets
      • account age
      • #times mentioned
      • #times replied
      • ...
Analysis method: compare the CDFs (cumulative distribution functions) of each attribute between spammers and non-spammers.
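A minimal sketch of how a few of the per-tweet content attributes above (maximum, minimum, and average over a user's tweets) could be computed, assuming tweets are plain strings. The whitespace tokenization and prefix-based hashtag/URL detection are my simplifying assumptions, not the paper's exact procedure:

```python
# Compute (max, min, average) content attributes over one user's tweets.
def tweet_stats(tweets):
    def per_tweet(tweet):
        words = tweet.split()                              # naive tokenization
        hashtags = sum(1 for w in words if w.startswith("#"))
        urls = sum(1 for w in words if w.startswith("http"))
        return hashtags / len(words), urls / len(words), len(words)

    rows = [per_tweet(t) for t in tweets]
    summarize = lambda xs: (max(xs), min(xs), sum(xs) / len(xs))
    return {
        "hashtags_per_word": summarize([h for h, _, _ in rows]),
        "urls_per_word": summarize([u for _, u, _ in rows]),
        "words": summarize([n for _, _, n in rows]),
    }

# Hypothetical example tweets, not taken from the dataset.
stats = tweet_stats(["check #mj http://example.com/x now",
                     "#musicmonday great song"])
```

Each entry holds a (maximum, minimum, average) triple, matching the summary statistics listed for the basic attributes above.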

Methods

Evaluation metrics

  • recall
  • precision
  • Micro-F1 and Macro-F1
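The metrics above can be sketched in a few lines for the binary spammer/non-spammer task. The labels and predictions below are hypothetical, not the paper's results:

```python
# Per-class precision/recall/F1, plus Micro- and Macro-F1 across classes.
def prf(y_true, y_pred, positive):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp, fp, fn

def micro_macro_f1(y_true, y_pred, classes):
    stats = [prf(y_true, y_pred, c) for c in classes]
    macro = sum(s[2] for s in stats) / len(classes)   # average of per-class F1
    tp = sum(s[3] for s in stats)                     # pool counts, then F1
    fp = sum(s[4] for s in stats)
    fn = sum(s[5] for s in stats)
    micro_p, micro_r = tp / (tp + fp), tp / (tp + fn)
    micro = 2 * micro_p * micro_r / (micro_p + micro_r)
    return micro, macro

# Hypothetical toy example: one spammer misclassified as legitimate.
micro, macro = micro_macro_f1(["spam", "spam", "ham", "ham"],
                              ["spam", "ham", "ham", "ham"],
                              ["spam", "ham"])
```

Note that when every example is assigned exactly one of the classes, Micro-F1 reduces to overall accuracy, which is why the paper reports both Micro- and Macro-F1.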

Classifier

SVM with Radial Basis Function (RBF) kernel
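The paper uses an SVM with the RBF kernel; below is just the kernel function itself, K(x, z) = exp(-gamma * ||x - z||^2), as a sketch. The `gamma` default is an illustrative value, not a parameter reported by the paper (the paper tunes the SVM hyperparameters):

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """RBF (Gaussian) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)
```

The kernel is 1 for identical feature vectors, symmetric, and decays toward 0 as two users' attribute vectors grow apart, which lets the SVM learn a non-linear decision boundary over the content and behavior attributes.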

Experiments

5-fold cross-validation
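A sketch of 5-fold cross-validation splitting over the 1,065 labeled users: shuffle the indices once, partition them into 5 folds, and use each fold in turn as the test set. The shuffling seed and helper name are my own choices:

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Return k (train_indices, test_indices) pairs covering range(n)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # shuffle once, deterministically
    folds = [idx[i::k] for i in range(k)]     # round-robin partition into k folds
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]

splits = k_fold_indices(1065, k=5)            # five (train, test) index pairs
```

With 1,065 users and k=5, each test fold holds exactly 213 users and each training set 852; every user appears in exactly one test fold.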

Feature selection

  • Information gain
  • chi^2 (Chi-Squared)
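As a sketch of the first ranking criterion, information gain of a feature X with respect to the class label Y is IG(Y; X) = H(Y) - Σ_x P(X=x) H(Y | X=x). A minimal pure-Python version for discrete features (the paper's attributes are continuous, so this assumes some prior discretization):

```python
import math

def entropy(labels):
    """Shannon entropy H(Y) in bits over observed label frequencies."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(feature, labels):
    """IG(Y; X) = H(Y) - sum over x of P(X=x) * H(Y | X=x)."""
    n = len(labels)
    conditional = 0.0
    for v in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == v]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional
```

A feature that perfectly separates spammers from non-spammers attains IG equal to H(Y) (1 bit for balanced classes), while a feature independent of the label scores 0; ranking attributes by this score yields lists like the top-10 below.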

Top 10 attributes:

  1. fraction of tweets with URLs
  2. age of the user account
  3. average number of URLs per tweet
  4. ratio of followers to followees
  5. fraction of tweets the user replied to
  6. number of tweets the user replied to
  7. number of tweets in which the user received a reply
  8. number of followees
  9. number of followers
  10. average number of hashtags per tweet

 

[1] F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida. Detecting Spammers on Twitter. In Seventh Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference (CEAS 2010), Redmond, Washington, US, 2010.

posted on 2010-10-09 06:33 by 小橋流水