【337】Text Mining Using Twitter Streaming API and Python
Reference: An Introduction to Text Mining using Twitter Streaming API and Python
Reference: How to Register a Twitter App in 8 Easy Steps
- Getting Data from Twitter Streaming API
- Reading and Understanding the data
- Mining the tweets
Key Methods:
- map()
- lambda
- set()
- pandas.DataFrame()
- matplotlib
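As a quick preview of how these methods fit together, here is a minimal sketch on a made-up sample (the tweet dicts below are invented for illustration; the real data comes from the streaming API later):

```python
import pandas as pd

# Invented stand-in for parsed tweets
sample = [
    {"text": "learning python", "lang": "en"},
    {"text": "hola mundo", "lang": "es"},
    {"text": "python tips", "lang": "en"},
]

# map() + lambda extract one field from every tweet dict
langs = list(map(lambda tweet: tweet["lang"], sample))

# set() collapses the list to its unique values
unique_langs = set(langs)

# pandas.DataFrame() turns the extracted column into a table we can count
df = pd.DataFrame({"lang": langs})
counts = df["lang"].value_counts()
```

The rest of the tutorial applies exactly this pattern to the real tweet stream.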
1. Getting Data from Twitter Streaming API
twitter_streaming.py is the script we use to extract information from Twitter:
```python
# Import the necessary methods from tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

# Variables that contain the user credentials to access Twitter API
access_token = "ENTER YOUR ACCESS TOKEN"
access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"
consumer_key = "ENTER YOUR API KEY"
consumer_secret = "ENTER YOUR API SECRET"

# This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):

    def on_data(self, data):
        print(data)
        return True

    def on_error(self, status):
        print(status)

if __name__ == '__main__':
    # This handles Twitter authentication and the connection to Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)

    # This line filters Twitter Streams to capture data by the keywords: 'python', 'javascript', 'ruby'
    stream.filter(track=['python', 'javascript', 'ruby'])
```
You can run the script from the command line and redirect its output to a file:
```shell
python twitter_streaming.py > twitter_data.txt
```
Then we will read the tweets back from the text file, parsing each line as JSON.
```python
import json

tweets_data_path = r"..\twitter_data.txt"

tweets_data = []
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        try:
            tweet = json.loads(line)
            tweets_data.append(tweet)
        except ValueError:
            # skip lines that are not valid JSON (e.g. keep-alive newlines)
            continue
```
The tweets are now stored in tweets_data, and we can extract specific fields with the following script.
Reference: python JSON only get keys in first level
```python
# get the text content and language of the first few tweets
num = 0
for tweet in tweets_data:
    num += 1
    if num == 10:
        break
    else:
        tweet_text = tweet["text"]
        tweet_lang = tweet["lang"]
        print(str(num))
        print(tweet_lang)
        print(tweet_text)
        print()

# get all the keys from the first tweet's JSON
tweets_data[0].keys()
```
2. Reading and Understanding the data
Now we can extract a specific key from every tweet with list(), map(), and a lambda expression:
Reference: Combining map and lambda in Python
```python
>>> a = list(map(lambda tweet: tweet['text'], tweets_data))
>>> len(a)
1633
>>> a[0]
'RT @neet_se: 案件数って点だけならJavaがダントツ、つまり仕事に繋がりやすい。https://t.co/rqxp…'
```
We can also use the set() built-in to get the unique values of the list.
Reference: The Python set() function
Reference: Counting occurrences of duplicate items in a Python list
```python
>>> langs = list(map(lambda tweet: tweet['lang'], tweets_data))
>>> len(langs)
1633
>>> set(langs)
{'zh', 'de', 'es', 'et', 'th', 'cy', 'ru', 'in', 'lt', 'pt', 'tl', 'en', 'it', 'ja', 'ro', 'fa', 'pl', 'fr', 'ht', 'ar', 'tr', 'ca', 'cs', 'und', 'da'}
```
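If we want the number of occurrences of each language rather than just the unique values, the standard library's collections.Counter does the counting directly — a small sketch with an invented stand-in list:

```python
from collections import Counter

# Invented stand-in for the real langs list built above
langs = ["en", "en", "ja", "es", "en", "ja"]

# Counter maps each value to its number of occurrences
lang_counts = Counter(langs)
print(lang_counts.most_common(2))  # the two most frequent languages
```

The pandas value_counts() used below gives the same information once the data is in a DataFrame.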
Next, we will structure the tweets data into a pandas DataFrame to simplify the data manipulation.
```python
>>> import pandas as pd
>>> tweets = pd.DataFrame()
>>> tweets['text'] = list(map(lambda tweet: tweet['text'], tweets_data))
>>> tweets['lang'] = list(map(lambda tweet: tweet['lang'], tweets_data))
>>> tweets['country'] = list(map(lambda tweet: tweet['place']['country'] if tweet['place'] is not None else None, tweets_data))
>>> tweets['lang'].value_counts()
en     1119
ja      278
es      113
pt       36
und      26
...
```
Next, we will use matplotlib to create a chart describing the Top 5 languages in which the tweets were written.
```python
>>> tweets_by_lang = tweets['lang'].value_counts()
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
>>> ax.tick_params(axis='x', labelsize=15)
>>> ax.tick_params(axis='y', labelsize=10)
>>> ax.set_xlabel('Languages', fontsize=15)
Text(0.5, 0, 'Languages')
>>> ax.set_ylabel('Number of tweets', fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Top 5 languages', fontsize=15, fontweight='bold')
Text(0.5, 1.0, 'Top 5 languages')
>>> tweets_by_lang[:5].plot(ax=ax, kind='bar', color='red')
<matplotlib.axes._subplots.AxesSubplot object at 0x00000189B635D630>
>>> plt.show()
```
Next, we will create a chart describing the Top 5 countries from which the tweets were sent.
```python
>>> tweets_by_country = tweets['country'].value_counts()
>>> fig, ax = plt.subplots()
>>> ax.tick_params(axis='x', labelsize=15)
>>> ax.tick_params(axis='y', labelsize=10)
>>> ax.set_xlabel('Countries', fontsize=15)
Text(0.5, 0, 'Countries')
>>> ax.set_ylabel('Number of tweets', fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Top 5 countries', fontsize=15, fontweight='bold')
Text(0.5, 1.0, 'Top 5 countries')
>>> tweets_by_country[:5].plot(ax=ax, kind='bar', color='blue')
<matplotlib.axes._subplots.AxesSubplot object at 0x00000189BA6038D0>
>>> plt.show()
```
3. Mining the tweets
Our main goals in this text mining task are to compare the popularity of the Python, Ruby, and Javascript programming languages and to retrieve programming tutorial links. We will do this in 3 steps:
- We will add tags to our tweets DataFrame in order to be able to manipulate the data easily.
- Target tweets that contain the "programming" or "tutorial" keywords.
- Extract links from the relevant tweets.
Adding Python, Ruby, and Javascript tags
First, we will create a function that checks whether a specific keyword is present in a text. We will do this using regular expressions.
Python provides a library for regular expressions called re. We will start by importing this library.
Next, we will create a function called word_in_text(word, text). This function returns True if a word is found in text; otherwise it returns False.
```python
>>> import re
>>> def word_in_text(word, text):
...     word = word.lower()
...     text = text.lower()
...     match = re.search(word, text)
...     if match:
...         return True
...     return False
```
Next, we will add 3 columns to our tweets DataFrame by pandas.DataFrame.apply().
```python
>>> tweets['python'] = tweets['text'].apply(lambda tweet: word_in_text('python', tweet))
>>> tweets['ruby'] = tweets['text'].apply(lambda tweet: word_in_text('ruby', tweet))
>>> tweets['javascript'] = tweets['text'].apply(lambda tweet: word_in_text('javascript', tweet))
```
We can calculate the number of tweets for each programming language by pandas.Series.value_counts as follows:
```python
>>> print(tweets['python'].value_counts()[True])
447
>>> print(tweets['ruby'].value_counts()[True])
529
>>> print(tweets['javascript'].value_counts()[True])
275
```
We can make a simple comparison chart by executing the following:
```python
>>> prg_langs = ['python', 'ruby', 'javascript']
>>> tweets_by_prg_lang = [tweets['python'].value_counts()[True], tweets['ruby'].value_counts()[True], tweets['javascript'].value_counts()[True]]
>>> x_pos = list(range(len(prg_langs)))
>>> width = 0.8
>>> fig, ax = plt.subplots()
>>> plt.bar(x_pos, tweets_by_prg_lang, width, alpha=1, color='g')
<BarContainer object of 3 artists>
>>> # Setting axis labels and ticks
>>> ax.set_ylabel('Number of tweets', fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Ranking: python vs. javascript vs. ruby (Raw data)', fontsize=10, fontweight='bold')
Text(0.5, 1.0, 'Ranking: python vs. javascript vs. ruby (Raw data)')
>>> ax.set_xticks([p + 0.4 * width for p in x_pos])
[<matplotlib.axis.XTick object at 0x00000189BA5D1F28>, <matplotlib.axis.XTick object at 0x00000189BA603D30>, <matplotlib.axis.XTick object at 0x00000189BA5D15F8>]
>>> ax.set_xticklabels(prg_langs)
[Text(0, 0, 'python'), Text(0, 0, 'ruby'), Text(0, 0, 'javascript')]
>>> plt.grid()
>>> plt.show()
```
This shows that the keyword ruby is the most popular, followed by python and then javascript. However, the tweets DataFrame contains all tweets that include one of the 3 keywords and is not restricted to the programming languages. For example, many tweets containing the keyword ruby are related to the political scandal Rubygate. In the next section, we will filter the tweets and re-run the analysis to make a more accurate comparison.
Targeting relevant tweets
We are interested in targeting tweets that are related to programming languages. Such tweets often contain one of the 2 keywords "programming" or "tutorial". We will create 2 additional columns in our tweets DataFrame to hold this information.
```python
>>> tweets['programming'] = tweets['text'].apply(lambda tweet: word_in_text('programming', tweet))
>>> tweets['tutorial'] = tweets['text'].apply(lambda tweet: word_in_text('tutorial', tweet))
```
We will add an additional column called relevant that takes the value True if the tweet contains either the "programming" or the "tutorial" keyword; otherwise it takes the value False.
```python
>>> tweets['relevant'] = tweets['text'].apply(lambda tweet: word_in_text('programming', tweet) or word_in_text('tutorial', tweet))
```
We can print the counts of relevant tweets by executing the commands below.
```python
>>> print(tweets['programming'].value_counts()[True])
55
>>> print(tweets['tutorial'].value_counts()[True])
22
>>> print(tweets['relevant'].value_counts()[True])
74
```
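As an aside, the same keyword columns can be built without word_in_text by using pandas' vectorized string methods — a minimal sketch on an invented three-row frame (the sample texts are made up for illustration):

```python
import pandas as pd

# Invented sample standing in for the real tweets DataFrame
tweets = pd.DataFrame({"text": [
    "A Python programming tutorial",
    "ruby news",
    "JavaScript tips and tricks",
]})

# Series.str.contains with case=False mirrors word_in_text's
# case-insensitive substring matching
tweets["python"] = tweets["text"].str.contains("python", case=False)
tweets["javascript"] = tweets["text"].str.contains("javascript", case=False)
# An alternation pattern covers both relevance keywords in one pass
tweets["relevant"] = tweets["text"].str.contains("programming|tutorial", case=False, regex=True)
```

Vectorized string methods are usually faster than apply() on large frames, though both give the same result here.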
We can now compare the popularity of the programming languages by executing the commands below.
```python
tweets[tweets['relevant'] == True]['python']  # select the 'python' column for the rows where relevant is True
```
```python
>>> print(tweets[tweets['relevant'] == True]['python'].value_counts()[True])
31
>>> print(tweets[tweets['relevant'] == True]['ruby'].value_counts()[True])
8
>>> print(tweets[tweets['relevant'] == True]['javascript'].value_counts()[True])
11
```
Python is the most popular with a count of 31, followed by javascript with a count of 11, and ruby with a count of 8. We can make a comparison chart by executing the following:
```python
>>> tweets_by_prg_lang = [tweets[tweets['relevant'] == True]['python'].value_counts()[True], tweets[tweets['relevant'] == True]['ruby'].value_counts()[True], tweets[tweets['relevant'] == True]['javascript'].value_counts()[True]]
>>> x_pos = list(range(len(prg_langs)))
>>> width = 0.8
>>> fig, ax = plt.subplots()
>>> plt.bar(x_pos, tweets_by_prg_lang, width, alpha=1, color='g')
<BarContainer object of 3 artists>
>>> ax.set_ylabel('Number of tweets', fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Ranking: python vs. javascript vs. ruby (Relevant data)', fontsize=10, fontweight='bold')
Text(0.5, 1.0, 'Ranking: python vs. javascript vs. ruby (Relevant data)')
>>> ax.set_xticks([p + 0.4 * width for p in x_pos])
[<matplotlib.axis.XTick object at 0x00000189B6E9E128>, <matplotlib.axis.XTick object at 0x00000189B430F9E8>, <matplotlib.axis.XTick object at 0x00000189B430F5C0>]
>>> ax.set_xticklabels(prg_langs)
[Text(0, 0, 'python'), Text(0, 0, 'ruby'), Text(0, 0, 'javascript')]
>>> plt.grid()
>>> plt.show()
```
Extracting links from the relevant tweets
Now that we have extracted the relevant tweets, we want to retrieve links to programming tutorials. We will start by creating a function that uses regular expressions to retrieve links starting with "http://", "https://", or "www." from a text. This function will return the URL if one is found; otherwise it returns an empty string.
```python
>>> def extract_link(text):
...     regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
...     match = re.search(regex, text)
...     if match:
...         return match.group()
...     return ''
```
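Note that extract_link returns only the first URL. If a tweet may contain several links, re.findall collects them all — a small variant using the same pattern (the sample text is invented for illustration):

```python
import re

def extract_all_links(text):
    # Same pattern as extract_link, but findall returns every match
    regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
    return re.findall(regex, text)

links = extract_all_links('see https://t.co/abc and www.example.com for more')
```

For this tutorial the first link per tweet is enough, since Twitter's t.co shortener usually leaves one link per tweet.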
Next, we will add a column called link to our tweets DataFrame. This column will contain the URL information.
```python
>>> tweets['link'] = tweets['text'].apply(lambda tweet: extract_link(tweet))
```
Next, we will create a new DataFrame called tweets_relevant_with_link. This DataFrame is a subset of tweets DataFrame and contains all relevant tweets that have a link.
This slices the original DataFrame.
```python
>>> tweets_relevant = tweets[tweets['relevant'] == True]
>>> tweets_relevant_with_link = tweets_relevant[tweets_relevant['link'] != '']
```
We can now print out all links for python, ruby, and javascript by executing the commands below:
```python
>>> print(tweets_relevant_with_link[tweets_relevant_with_link['python'] == True]['link'])
40      https://t.co/zoAgyQuMAZ
105     https://t.co/ogaPbuIbEW
274     https://t.co/y4sUmovFOn
329     https://t.co/A030fqWeWA
339     https://t.co/LaaVc5T2rQ
391     https://t.co/8bYvlziCZb
413     https://t.co/8bYvlziCZb
436     https://t.co/EByqxT1qyN
444     https://t.co/8bYvlziCZb
445     https://t.co/5Jujg6h31B
462     https://t.co/UrFHlOaJYf
476     https://t.co/5Jujg6h31B
477     https://t.co/EByqxT1qyN
589     https://t.co/UrFHlOaJYf
603     https://t.co/5Jujg6h31B
822     https://t.co/Oc21FrzQc5
1060    https://t.co/qOAIuKfyD0
1097    https://t.co/qOAIuKfyD0
1248    https://t.co/V3ZNKuYsK7
1278    https://t.co/qOAIuKfyD0
1411    https://t.co/szHRHavQKy
1594    https://t.co/X6KWMlzlv6
Name: link, dtype: object
>>> print(tweets_relevant_with_link[tweets_relevant_with_link['ruby'] == True]['link'])
782     https://t.co/JgY40r2NSo
833     https://t.co/JgY40r2NSo
1177    https://t.co/xycOG3ndi9
1254    https://t.co/xycOG3ndi9
1293    https://t.co/LMHW050TGs
1328    https://t.co/SS4DzEnSBZ
1393    https://t.co/NZlUce5Ne8
1619    https://t.co/e4nwrn3N2j
Name: link, dtype: object
>>> print(tweets_relevant_with_link[tweets_relevant_with_link['javascript'] == True]['link'])
130     https://t.co/AbJFaSI0B8
286     https://t.co/7dNBIsQ5Gq
467     https://t.co/3YIK588j8t
471     https://t.co/vjBJWWzvfv
830     https://t.co/T4mUjwUcgL
1093    https://t.co/wvLZLjuVKF
1180    https://t.co/luxL2qbxte
1526    https://t.co/G3ZTFL0RKv
Name: link, dtype: object
```