Projects_Tweets Analysis

This is a course project for real-time big data analysis. My idea is to use tweets to extract trending topics and analyze political popularity using Hadoop map/reduce. About 120,000 tweets were collected in JSON format using the Twitter Streaming API. The data was loaded into HDFS on the NYU HPC clusters to take advantage of their fast computing ability. In the map task, each tweet is parsed into a map with displayURL, hashtagEntities, and text as keys and their corresponding contents as values. The mapper then marks each occurrence of an entity with a 1 and sends it to the reducer. The reducer sums these to get the total count for each displayURL or hashtag. A second map/reduce job then sorts the entities by count. This gives the most popular links during that period (the ones that appear in the most tweets), and likewise the top hashtags. The map/reduce computation over the 4GB of data takes about 18 seconds in total, including the JVM launch for each map task.

For the political popularity analysis of the presidential election, I use text matching to see how many people mentioned each politician. This gives a general sense of how popular they are on social media.
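The text matching described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the names in POLITICIANS are hypothetical placeholders (the actual list is not given here), and it counts tweets rather than distinct users, matching whole words case-insensitively.

```python
import re
from collections import Counter

# Hypothetical placeholder names; the real list of politicians is not given.
POLITICIANS = ["candidate_a", "candidate_b"]

def count_mentions(tweet_texts, names=POLITICIANS):
    """Count how many tweets mention each name (case-insensitive, whole word)."""
    patterns = {
        name: re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for name in names
    }
    counts = Counter({name: 0 for name in names})
    for text in tweet_texts:
        for name, pattern in patterns.items():
            if pattern.search(text):
                counts[name] += 1  # a tweet counts once per name it mentions
    return counts
```

A tweet that mentions the same name twice still counts once, which keeps a single repetitive tweet from inflating the total.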

Detailed procedure:

1) Data collection

 The Twitter Streaming API is used. First, create a StatusListener that listens to the Twitter stream and receives each tweet as soon as one arrives.

2) Parser

 This parser is used in the map task, which automatically receives one line of data at a time. The parser turns a tweet into a HashMap that uses the expanded URL, hashtag entities, and text as keys and their corresponding contents as values.
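A minimal sketch of such a parser, written in plain Python rather than as Hadoop job code. The JSON field layout is an assumption: it follows Twitter's v1.1 tweet object (entities.urls[].display_url, entities.hashtags[].text, text), and the collected files may be laid out differently. The function name parse_tweet is made up for illustration.

```python
import json

def parse_tweet(line):
    """Parse one JSON line into a dict with displayURL, hashtagEntities, text.

    Assumption: the line follows Twitter's v1.1 tweet object layout.
    """
    tweet = json.loads(line)
    entities = tweet.get("entities") or {}
    urls = [u["display_url"] for u in entities.get("urls", []) if u.get("display_url")]
    tags = ["#" + h["text"] for h in entities.get("hashtags", []) if h.get("text")]
    return {
        "displayURL": ", ".join(urls),       # comma-separated links
        "hashtagEntities": ", ".join(tags),  # comma-separated hashtags
        "text": tweet.get("text", ""),
    }
```

Tweets without links or hashtags simply get empty strings for those keys, so the map task can skip them.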

3) Map task

The map task takes the parsed tweet and marks each occurrence of an entity as 1 in the context.
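As a rough illustration, the map step could look like this in plain Python. It assumes the parsed tweet is a dict whose displayURL and hashtagEntities values are comma-separated strings; map_tweet is a hypothetical name, and the yielded pairs stand in for writes to the Hadoop context.

```python
def map_tweet(parsed):
    """Yield (entity, 1) for every link and hashtag in one parsed tweet."""
    for key in ("displayURL", "hashtagEntities"):
        for entity in parsed.get(key, "").split(","):
            entity = entity.strip()
            if entity:  # skip tweets that have no link or hashtag
                yield (entity, 1)
```

The tweet text is not emitted here, since only links and hashtags are counted in this step.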

4) Reduce task

The mapped output goes through shuffle and sort and is then sent to the reducer as input. The input contains each entity and the list of its occurrences (each of which is 1) in tweets. The reducer adds up all occurrences to get the total occurrence count of each entity.
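The shuffle, sort, and reduce steps can be imitated in plain Python as follows. This is a sketch of the logic only, not the Hadoop API: sorting by key stands in for shuffle and sort, and shuffle_and_reduce is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_reduce(pairs):
    """Sort pairs by entity (the shuffle/sort step), then sum the 1s per entity."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (entity, sum(count for _, count in group))
        for entity, group in groupby(ordered, key=itemgetter(0))
    ]
```

Because the pairs are sorted by key first, the result comes out ordered by entity, just like the reducer output.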

5) Second map/reduce

Note that in the reducer output, the left column holds all entities in alphabetical order and the right column holds their occurrence counts. The goal here is to get the most popular URLs and hashtags, so we need to swap these two columns and sort again.
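The swap-and-sort idea can be sketched in plain Python like this. The function name top_entities and the cutoff n are made up for illustration. Note that Hadoop sorts keys in ascending order by default, so a real job would need a descending comparator or would read the top entries from the end of the output.

```python
def top_entities(counts, n=10):
    """Swap (entity, count) to (count, entity) and sort by count, highest first."""
    swapped = [(count, entity) for entity, count in counts]
    # Ties on count are broken alphabetically by entity.
    swapped.sort(key=lambda pair: (-pair[0], pair[1]))
    return swapped[:n]
```

For example, top_entities(counts, n=2) keeps only the two most frequent entities.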

 

Obstacle: Advertisements, 

HashMap = ["displayURL" : "some_url_here",
           "hashtagEntities" : "#tag1, #tag2, …",
           "text" : "something_typed_here"]

posted on 2016-02-04 22:32 by touchdown