【473】Twitter Data Processing Summary
1. Data Collection
Data was collected through the Twitter API: all tweets within the US were gathered and stored, in JSON format, in txt files.
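The original collection script is not shown here, but as a hedged illustration only, a filtered stream over a rough continental-US bounding box could look like the sketch below, assuming tweepy 3.x. The API keys, output filename, and the exact bounding box are placeholders.

import tweepy

# placeholder credentials -- substitute your own Twitter API keys
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

class FileListener(tweepy.StreamListener):
    """Append each raw tweet as one JSON line of a txt file."""
    def on_data(self, data):
        with open("us_tweets.txt", "a") as f:   # illustrative output file
            f.write(data.strip() + "\n")        # one JSON object per line
        return True

    def on_error(self, status_code):
        return False  # stop on errors such as 420 rate limiting

# rough bounding box of the contiguous US: [min_lon, min_lat, max_lon, max_lat]
stream = tweepy.Stream(auth, FileListener())
stream.filter(locations=[-125.0, 24.0, -66.0, 50.0])

Writing one JSON object per line is what the conversion code in the next section expects.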
2. Data Reading
Each tweet is read from the txt files as one JSON object per line and then written into a csv file. The encoding chosen when reading the csv is gbk.
The code is as follows:
from math import radians, sin
import json, os

# exact area of a lon/lat rectangle on a sphere, in km^2
def area(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    r = 6372
    return abs(r ** 2 * (lon2 - lon1) * (sin(lat2) - sin(lat1)))

# convert tweets stored as JSON lines in txt files into one csv file
def txt2csv(foldername, filename):
    files = os.listdir(foldername)
    os.chdir(foldername)
    fo = open(filename, "w")
    # fo.write("\ufeff")
    fo.write("id,created_at,coordinates,co_lon,co_lat,geo,geo_lat,geo_lon,"
             + "user_location,place_type,place_name,"
             + "place_full_name,place_country,place_bounding_box,pb_avg_lon,pb_avg_lat,"
             + "min_lon,min_lat,max_lon,max_lat,bb_area,lang,source,text")
    count = 0
    for file in files:
        # skip directories
        if os.path.isdir(file):
            continue
        count += 1
        print(count, ":", file)
        with open(file, "r") as tweets_file:
            for line in tweets_file:
                try:
                    tweet = json.loads(line)
                    csv_text = "\n"
                    # id
                    csv_text += tweet["id_str"] + ","
                    # created_at
                    csv_text += str(tweet["created_at"]) + ","
                    # coordinates (GeoJSON order: [lon, lat])
                    if tweet["coordinates"]:
                        csv_text += "Yes,"
                        csv_text += str(tweet["coordinates"]["coordinates"][0]) + ","
                        csv_text += str(tweet["coordinates"]["coordinates"][1])
                    else:
                        csv_text += "None,None,None"
                    csv_text += ","
                    # geo (deprecated field, order: [lat, lon])
                    if tweet["geo"]:
                        csv_text += "Yes,"
                        csv_text += str(tweet["geo"]["coordinates"][0]) + ","
                        csv_text += str(tweet["geo"]["coordinates"][1])
                    else:
                        csv_text += "None,None,None"
                    csv_text += ","
                    # user->location: strip line breaks and quotes
                    ul = str(tweet["user"]["location"])
                    ul = ul.replace("\n", " ").replace("\"", "").replace("\'", "")
                    csv_text += "\"" + ul + "\","
                    # place->place_type
                    csv_text += str(tweet["place"]["place_type"]) + ","
                    # place->name
                    csv_text += "\"" + str(tweet["place"]["name"]) + "\","
                    # place->full_name
                    csv_text += "\"" + str(tweet["place"]["full_name"]) + "\","
                    # place->country
                    csv_text += "\"" + str(tweet["place"]["country"]) + "\","
                    # place->bounding_box: corners 0 and 2 are the SW and NE corners
                    if tweet["place"]["bounding_box"]["coordinates"]:
                        bb = tweet["place"]["bounding_box"]["coordinates"][0]
                        min_lon, min_lat = bb[0][0], bb[0][1]
                        max_lon, max_lat = bb[2][0], bb[2][1]
                        # centre of the bounding box
                        lon = (min_lon + max_lon) / 2
                        lat = (min_lat + max_lat) / 2
                        # area of the bounding box
                        area_bb = area(min_lon, min_lat, max_lon, max_lat)
                        csv_text += "Yes,"
                        csv_text += ",".join(str(v) for v in
                                             [lon, lat, min_lon, min_lat,
                                              max_lon, max_lat, area_bb])
                    else:
                        # all eight bounding-box columns must stay aligned with the header
                        csv_text += "None,None,None,None,None,None,None,None"
                    csv_text += ","
                    # lang
                    csv_text += str(tweet["lang"]) + ","
                    # source
                    csv_text += "\"" + str(tweet["source"]) + "\","
                    # text: replace line breaks and quotation marks with a space or nothing
                    text = str(tweet["text"])
                    text = text.replace("\r", " ").replace("\n", " ")
                    text = text.replace("\"", "").replace("\'", "")
                    csv_text += "\"" + text + "\""
                    fo.write(csv_text)
                except Exception:
                    # skip malformed or incomplete lines
                    continue
    fo.close()

txt2csv(r"E:\USA\test",
        r"D:\OneDrive - UNSW\01-UNSW\02-Papers_Plan\02-CCIS\04-US_Tweets\tt.csv")

import pandas as pd
df = pd.read_csv(r"D:\OneDrive - UNSW\01-UNSW\02-Papers_Plan\02-CCIS\04-US_Tweets\tt.csv",
                 encoding='gbk')
df.head()
The resulting table has 24 columns in total, storing time- and location-related information, including the creation time, longitude/latitude, the tweet text, and so on.
3. Data Processing
3.1 Get the total number of tweets
This is simple to implement: just count how many rows (records) the DataFrame has.
The code is as follows:
import pandas as pd

df = pd.read_csv(r"D:\OneDrive - UNSW\01-UNSW\02-Papers_Plan\02-CCIS\04-US_Tweets\tt.csv",
                 encoding='gbk')

# number of records and columns
df.shape
The result is a tuple like (715, 24), meaning there are 715 records.
3.2 Get the number of unique tweets
Since the same tweet may be fetched more than once during collection, duplicates need to be removed.
The code is as follows:
# delete duplicate tweets
df = df.drop_duplicates(['id'])

# number of records after deduplication
df.shape
The result is displayed in the same (rows, columns) form as above.
3.3 Change the data types of some columns
By default most columns are of type object; to compute on them they need to be converted, e.g. the time column to datetime and the longitude/latitude columns to float.
The code is as follows:
# change created_at to datetime
# (co_lon and co_lat sometimes hold the string "None", so they cannot be cast directly)
df = df.astype({"created_at": "datetime64[ns]"})
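Since co_lon and co_lat contain the literal string "None" for tweets without point coordinates, a direct float cast would fail. A minimal sketch using pd.to_numeric with errors='coerce', which turns unparseable entries into NaN:

# coerce the coordinate columns to float; "None" strings become NaN
df["co_lon"] = pd.to_numeric(df["co_lon"], errors="coerce")
df["co_lat"] = pd.to_numeric(df["co_lat"], errors="coerce")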
After the conversion, the year and day can be extracted from the datetime column, as sketched below.
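A minimal sketch using the .dt accessor (the derived column names year and day are illustrative):

# pull date parts out of the converted datetime column
df["year"] = df["created_at"].dt.year
df["day"] = df["created_at"].dt.day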
3.4 Get the sources of tweets
This mainly checks whether a tweet was posted from the web, an iPhone, Android, Instagram, etc.
The code is as follows:
# get total number of tweets per source
print("\nsource counts:\n\n", df.source.value_counts())
This prints the counts of the different sources in descending order.
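Note that the raw source field is an HTML anchor tag (e.g. <a href="...">Twitter for iPhone</a>), so the counts above are keyed by the full tag. A hedged sketch that strips the markup first (the source_name column is illustrative):

# strip the <a ...>...</a> markup, keeping only the client name
df["source_name"] = df["source"].str.replace(r"<[^>]+>", "", regex=True)
print(df.source_name.value_counts())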
3.5 Get the number of geo-tagged tweets
Count the tweets that carry geographic information.
The code is as follows:
# get total number of tweets with geo-tags
print("\nGeotagged tweets counts:\n\n", df.coordinates.value_counts())
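To turn these counts into a share, a minimal sketch, relying on the coordinates column holding "Yes" or "None" as written by the converter above:

# fraction of tweets that carry exact point coordinates
geotagged = (df["coordinates"] == "Yes").sum()
print("Geotagged share:", geotagged / len(df))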
3.6 Get the number of English tweets located within the US
The code is as follows:
# get tweets from the US
df = df[df['place_country'] == 'United States']

# get English tweets
df = df[df['lang'] == 'en']

df.shape
print("\nUS English tweets count: ", df.shape[0])
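If the filtered subset is needed for later steps, it can be saved back to disk; a minimal sketch (the output filename is illustrative):

# persist the US English subset for downstream analysis
df.to_csv("us_english_tweets.csv", index=False)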