我的微博关键字

Posted on 2013-12-26 15:22 Jonee 阅读(1477) 评论(2) 编辑收藏举报

分析过程中需要用到的R包：Rweibo、 Rwordseg（或者rsmartcn）、wordcloud。

根据个人情况，下载了以下搜狗细胞词库：

“统计学名词”、“数学词汇大全”、“机器学习”、“财经金融词汇大全”、“互联网词库（2006版）”、“哈工大停用词表扩展”

1、加载Rweibo包，进行授权申请。其中"key"与“Secret”需要通过开通新浪开发者账号获得，具体过程和细节可以看Rweibo包中的help文档。

library(Rweibo)
registerApp(app_name = "Jonee", "key", "Secret")
roauth <- createOAuth(app_name ="Jonee", access_name = "Rweibo")

2、抓取个人微博的全部信息

Allwb <- analysis.getUserTimeline(roauth, screen_name = "Jonee-SH")
nrow(Allwb)

输出结果：327，说明一共发了327条微博。

3、查看自己原创微博数量占总微博数的百分比：

sum(is.na(Allwb$retweeted_text))/nrow(Allwb)*100

结果： 3.058104%

可以看出自己基本上都是转发的别人的微博，基本上不自己写微博，这也跟自己把微博当成信息媒介而非社交网络有关。有鉴于此，后面的关键词分析将包含自己所转发的内容，以此来探索自己主要关注什么内容。

4、数据预处理

在转发别人微博的时候，往往已经有人做了评论或者转发，因此会包括“//@”、“@”、“：”等字符以及别人的微博名，所以需要做一下预处理：

text_clean=unlist(lapply(Allwb$text,strsplit,split="//@",perl=T))
clean=lapply(text_clean,gregexpr,pattern=":",perl=T)
omit=unlist(lapply(clean,function(x) x[[1]][1]))
locate=which(omit>0)
temp.text=text_clean[locate]
text_clean[locate]=substr(temp.text,omit[locate]+1,max(nchar(temp.text))+1)

5、利用Rwordseg包来做分词。

因为Rwordseg包支持导入搜狗细胞词库，结合个人情况，我在这里导入“统计学名词”、“数学词汇大全”、“机器学习”、“财经金融词汇大全”、“互联网词库（2006版）”五个词库。

library(Rwordseg)
installDict("SogouLabDic.dic", dictname = "SogouLabDic.dic", dicttype = c("text"))
installDict("统计学名词.scel", dictname = "stat.scel", dicttype = c("scel"))
installDict("数学词汇大全.scel", dictname = "math.scel", dicttype = c("scel"))
installDict("机器学习.scel", dictname = "machine_learning.scel", dicttype = c("scel"))
installDict("财经金融词汇大全.scel", dictname = "financial.scel", dicttype = c("scel"))

输出结果：

157202 words were loaded! ... New dictionary 'SogouLabDic.dic' was installed!

625 words were loaded! ... New dictionary 'stat.scel' was installed!

15764 words were loaded! ... New dictionary 'math.scel' was installed!

9 words were loaded! ... New dictionary 'machine_learning.scel' was installed!

11319 words were loaded! ... New dictionary 'financial.scel' was installed!

可以看出，“互联网词库”的词组最多，将近16万，“机器学习”的最少，才9个。

进行分词：

retweet_text=unlist(Allwb$retweeted_text)
text.final=c(text_clean,retweet_text)
segment.options(isNameRecognition = TRUE)
text.seg=unlist(segmentCN(text.final))
text.seg=text.seg[text.seg!=""]

6、去噪音。

这里只是简单的去掉词长仅为1的单个字，并且删掉一些没有实际含义但出现频率较高的词组（需要用到“哈工大停用词表扩展”）。

处理方式如下：

text.seg=text.seg[nchar(text.seg)>1]
text.table=table(text.seg)
word=names(text.table)
stopwords=readLines("哈工大停用词表扩展.txt")
stopwords=c(stopwords,"http","cn","www",“net”,"com")
word=setdiff(word,stopwords)
word <- word[word != '转发']
word <- word[word != '微博']
word <- word[word != '的人']
text.table=text.table[word]

7、结果展示（最小频数设为5，将调色板中第四种颜色（亮黄色）去掉，过于刺眼）：

library(wordcloud)
pal <- brewer.pal(7,"Accent")
pal <- pal[-(4)]
wordcloud(word, text.table,min.freq=5,random.order=FALSE, colors=pal)

结果如下：

刷新页面返回顶部

逆水行舟，不进则退

公告

我的微博关键字