【443】Tweets Analysis Q&A
【Question 01】
When converting tweet data to a CSV file, commas inside a field (e.g. location: Sydney, NSW) break the file by creating extra columns.
The solution is to add double quotation marks on both sides of the content, like this:
    fo.write("\"" + str(tweet["user"]["location"]) + "\"")
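As an alternative to adding the quotation marks by hand, Python's standard csv module can do the quoting. The following is only a minimal sketch: the helper name write_rows and the sample row are made up for illustration, and an in-memory buffer stands in for the real output file.

```python
import csv
import io


def write_rows(rows, out):
    """Write rows to a file-like object, quoting fields only when needed."""
    # QUOTE_MINIMAL (the default) wraps a field in double quotes when it
    # contains a comma, a quote character, or a line break.
    writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL)
    writer.writerows(rows)


# Hypothetical sample row: a user name and a location that contains a comma.
buffer = io.StringIO()
write_rows([["user_1", "Sydney, NSW"]], buffer)
csv_text = buffer.getvalue()

# Reading the text back yields the original two fields, not three.
parsed_rows = list(csv.reader(io.StringIO(csv_text)))
```

When writing to a real file with csv.writer, the csv documentation recommends opening it with newline="" so that line endings are not translated twice.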
【Question 02】
When a CSV file is opened in Excel, the text sometimes appears garbled, even though it displays correctly in Notepad.
One solution is to open the file with Notepad++.
Another solution is to write a byte order mark (BOM) at the beginning of the file, like this:
    fo = open(r"D:\Twitter Data\Data\test\tweets.csv", "w", encoding="utf-8")
    fo.write("\ufeff")  # byte order mark, written first so Excel detects UTF-8
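Another way to get the same marker is to let Python write it: the built-in "utf-8-sig" codec emits the BOM once at the start of the file. This is a sketch only. The function name write_csv_with_bom, the temporary path, and the sample lines are assumptions for the demo and do not come from the original script.

```python
import os
import tempfile


def write_csv_with_bom(path, lines):
    """Write text lines to path as UTF-8 with a leading byte order mark."""
    # The "utf-8-sig" codec writes the bytes EF BB BF before the first character.
    with open(path, "w", encoding="utf-8-sig", newline="") as fo:
        for line in lines:
            fo.write(line + "\n")


# Demo on a temporary file (hypothetical path and content).
demo_path = os.path.join(tempfile.mkdtemp(), "tweets_demo.csv")
write_csv_with_bom(demo_path, ["id,location", '1,"Sydney, NSW"'])

# Read the raw bytes back to confirm the file starts with the BOM.
with open(demo_path, "rb") as f:
    first_bytes = f.read(3)
```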
【Question 03】
Tweet text can contain carriage returns (line breaks), double quotation marks, and single quotation marks. These characters cause errors when the CSV file is created.
So we should replace them with a space or remove them, like this:
    text = str(tweet["text"])
    text = text.replace("\n", " ")   # line breaks become spaces
    text = text.replace("\"", "")    # remove double quotation marks
    text = text.replace("\'", "")    # remove single quotation marks
    fo.write("\"" + text + "\"")
This applies to both tweet["user"]["location"] and tweet["text"]: users can type whatever they want in these two fields, so they are the most likely to cause errors.
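Since both fields need the same treatment, the replacements can be collected in one helper. The sketch below assumes a function name (clean_field) and a sample tweet dictionary that are not part of the original script. It also replaces "\r", which the snippet above does not handle.

```python
def clean_field(value):
    """Return value as a quoted CSV field with line breaks and quotes removed."""
    text = str(value)
    # Replace Windows, Unix, and old Mac line endings with a single space.
    for line_break in ("\r\n", "\n", "\r"):
        text = text.replace(line_break, " ")
    # Drop double and single quotation marks entirely.
    text = text.replace('"', "").replace("'", "")
    return '"' + text + '"'


# Hypothetical tweet with a line break, quotes, and a comma in the location.
tweet = {
    "text": 'He said "hi"\nit\'s fine',
    "user": {"location": "Sydney, NSW"},
}
row = ",".join([clean_field(tweet["user"]["location"]), clean_field(tweet["text"])])
```

Removing the quotation marks changes the tweet text. If the original text matters, escaping through the csv module may be a better fit.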
【Question 04】
After converting the tweets to a CSV file, I can't open the file with pandas.read_csv(), which means something in the data is malformed. Since the file has more than 100,000 rows, how can I locate the offending line?
The solution is to convert only the first 10,000 rows and, if there are no errors, convert the next 10,000 rows. When an error occurs, narrow the range: for example, if the error occurs between rows 20,000 and 30,000, change the range to 20,000 to 25,000. Repeating this several times locates the offending line and reveals the real problem. In this specific case, most problems come from content that includes carriage returns, double quotation marks, etc.
The code looks like this:
    ...
    count = 0
    for line in tweets_file:
        try:
            count += 1
            if (count < 10000):   # skip rows before the range being tested
                continue
            ...
            if (count > 20000):   # stop once past the range being tested
                break
        except:
            continue
    ...
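The manual narrowing described above is a binary search, so it can also be automated on the finished CSV text. The sketch below rests on two assumptions that may not hold for every file: each record occupies exactly one line, and a bad line shows up as a wrong number of fields. The function names, the expected_cols parameter, and the sample lines are hypothetical.

```python
import csv
import io


def chunk_is_clean(lines, expected_cols):
    """Return True if every row parsed from lines has expected_cols fields."""
    reader = csv.reader(io.StringIO("".join(lines)))
    return all(len(row) == expected_cols for row in reader)


def first_bad_line(lines, expected_cols):
    """Return the 0-based index of the first offending line, or None."""
    if chunk_is_clean(lines, expected_cols):
        return None
    # Invariant: lines[:lo] parses cleanly, lines[:hi] does not.
    lo, hi = 0, len(lines)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if chunk_is_clean(lines[:mid], expected_cols):
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical sample: the third line has an unquoted comma in the location.
sample_lines = ["1,Sydney\n", "2,Melbourne\n", "3,Sydney, NSW\n", "4,Perth\n"]
bad_index = first_bad_line(sample_lines, expected_cols=2)
```

Each step halves the range, so a file of about 100,000 lines needs roughly 17 checks instead of many manual reruns.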