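Using Python to segment Sina Weibo user tags and recommend related users
Developers on the Sina Weibo open platform are increasingly active. Beyond commercial interests, there is also a sizable community of independent engineers: many researchers and engineers who are keen on collective-behavior research, natural language processing, machine learning, and data mining have begun using Sina's real data and platform to build better applications for users or to uncover patterns in group behavior, including some statistics. This post uses the API provided by the Sina open platform to run word segmentation on Weibo user tags, then uses the resulting keywords to recommend people the user may be interested in. I am recording it here for future reference.
Requirements:
Python + Sina Weibo Python SDK + ICTCLAS
Note: ICTCLAS is a Chinese word segmentation package provided by the Institute of Computing Technology, Chinese Academy of Sciences.
On to the code:
1. First register as a Sina developer to obtain an APP_KEY and an APP_SECRET.
2. Following the Python SDK's how-to, obtain authorization through the OAuth2 mechanism (get a code, then exchange it for access_token and expires_in). The code is as follows: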
    #-*-coding:UTF-8-*-
    '''
    Created on 2012-12-10

    @author: jixianwu
    '''
    from weibo import APIClient, APIError
    import urllib, httplib

    class AppClient(object):
        ''' initialize an app client '''
        def __init__(self, *aTuple):
            self._appKey = aTuple[0]       # your app key
            self._appSecret = aTuple[1]    # your app secret
            self._callbackUrl = aTuple[2]  # your callback url
            self._account = aTuple[3]      # your weibo user name (e.g. email)
            self._password = aTuple[4]     # your weibo password
            self.AppCli = APIClient(app_key=self._appKey, app_secret=self._appSecret, redirect_uri=self._callbackUrl)
            self._author_url = self.AppCli.get_authorize_url()
            self._getAuthorization()

        def __str__(self):
            return 'your app client is created with callback %s' % (self._callbackUrl)

        def _get_code(self):
            # Simulates the user granting authorization, so the code does not
            # have to be typed in by hand.
            conn = httplib.HTTPSConnection('api.weibo.com')
            postdict = {"client_id": self._appKey,
                        "redirect_uri": self._callbackUrl,
                        "userId": self._account,
                        "passwd": self._password,
                        "isLoginSina": "0",
                        "action": "submit",
                        "response_type": "code",
                        }
            postdata = urllib.urlencode(postdict)
            conn.request('POST', '/oauth2/authorize', postdata, {'Referer': self._author_url, 'Content-Type': 'application/x-www-form-urlencoded'})
            res = conn.getresponse()
            location = res.getheader('location')
            # Assumes the redirect URL carries `code` as its only query parameter.
            code = location.split('=')[1]
            conn.close()
            return code

        def _getAuthorization(self):
            ''' get the authorization from sinaAPI with oauth2 authentication method '''
            # Send the code obtained above to the Sina authorization server, which
            # returns access_token and expires_in; with these two the API can be called.
            code = self._get_code()
            r = self.AppCli.request_access_token(code)
            access_token = r.access_token  # the token returned by sina
            expires_in = r.expires_in
            self.AppCli.set_access_token(access_token, expires_in)
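The expression `location.split('=')[1]` only works when `code` is the first and only query parameter in the redirect URL. As a hedged alternative, the sketch below parses the query string instead. The helper name `extract_code` and the `example.com` callback URL are hypothetical and are not part of the original source or the SDK.

```python
try:
    from urllib.parse import urlparse, parse_qs   # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs       # Python 2


def extract_code(location):
    """Return the OAuth2 authorization code from a redirect URL, or None.

    `location` is the value of the Location header; it may be None when
    the server did not redirect (for example after a failed login).
    """
    if not location:
        return None
    query = parse_qs(urlparse(location).query)
    values = query.get('code')
    return values[0] if values else None


# Hypothetical callback URL with an extra parameter in front of `code`.
print(extract_code('https://example.com/callback?state=abc&code=XYZ'))
```

With the URL above, `location.split('=')[1]` would yield `abc&code`, whereas parsing the query string picks out the `code` value regardless of parameter order.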
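3. Get the user's tags through the API:
    import operator  # required by attrgetter below; belongs at the top of the module

    def getTags(self, userid):
        ''' get three of this user's tags after sorting by weight '''
        try:
            tags = self.AppCli.tags.get(uid=userid)
        except Exception:
            print 'get tags failed'
            return []
        userTags = []
        # reverse=True sorts by descending weight, so [-3:] keeps the three
        # lowest-weight tags; use [:3] instead for the three highest.
        sortedT = sorted(tags, key=operator.attrgetter('weight'), reverse=True)
        if len(sortedT) > 3:
            sortedT = sortedT[-3:]
        for tag in sortedT:
            # every key other than 'weight' maps to a tag name
            for item in tag:
                if item != 'weight':
                    userTags.append(tag[item])
        return userTags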
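The tag selection can be tried without network access. Below is a minimal sketch that assumes each tag is a dict with a `weight` entry plus one other entry whose value is the tag name, which is the shape the loop in `getTags` implies. The function name `top_tags` and the sample data are made up for illustration, and this variant deliberately keeps the highest-weight tags.

```python
def top_tags(tags, n=3):
    """Return the names of the n highest-weight tags.

    Each tag is assumed to look like {'<tag id>': '<tag name>', 'weight': ...};
    the weight is converted with int() in case it arrives as a string.
    """
    ranked = sorted(tags, key=lambda t: int(t['weight']), reverse=True)
    names = []
    for tag in ranked[:n]:
        # Every key other than 'weight' maps to a tag name.
        names.extend(value for key, value in tag.items() if key != 'weight')
    return names


# Made-up sample data, not real API output.
sample = [
    {'101': 'python', 'weight': '5'},
    {'102': 'nlp', 'weight': '50'},
    {'103': 'music', 'weight': '20'},
    {'104': 'travel', 'weight': '1'},
]
print(top_tags(sample))
```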
4. Get the users that the current user already follows:
    def getFocus(self, userid):
        ''' get the list of users followed by the current user '''
        try:
            # the API call sits inside the try block so that a failed request is caught
            focus = self.AppCli.friendships.friends.ids.get(uid=userid)
            return focus.get('ids')
        except Exception:
            print 'get focus failed'
            return []
5. Run word segmentation on the user tags obtained in step 3. (This requires a segmentation class written beforehand; the full source is linked at the end of this post.)
    from wordSegmentation import tokenizer

    tkr = tokenizer()
    # concatenate all of the user's tags into one string, then segment that string
    lstrwords = ''
    for tag in userTags:
        utf8_tag = tag.encode('utf-8')
        #print utf8_tag
        lstrwords += utf8_tag
    words = tkr.parse(lstrwords)
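The `tokenizer` class wraps ICTCLAS and is not shown in this post. To experiment with the rest of the pipeline without ICTCLAS, a stand-in with the same `parse` interface can be sketched. The version below is not ICTCLAS and not the author's class: it uses plain forward maximum matching against a small word list, a much simpler technique. The class name `DictTokenizer` and the vocabulary are hypothetical, and the code assumes Python 3, where `str` is already Unicode.

```python
class DictTokenizer(object):
    """Forward-maximum-matching segmenter (a stand-in, not ICTCLAS)."""

    def __init__(self, vocabulary):
        self.vocab = set(vocabulary)
        # Longest dictionary word bounds the window that has to be tried.
        self.max_len = max((len(w) for w in self.vocab), default=1)

    def parse(self, text):
        """Split text into dictionary words, longest match first."""
        words = []
        i = 0
        while i < len(text):
            longest = min(self.max_len, len(text) - i)
            for size in range(longest, 0, -1):
                piece = text[i:i + size]
                # A single character is always accepted as a fallback.
                if size == 1 or piece in self.vocab:
                    words.append(piece)
                    i += size
                    break
        return words


# Hypothetical vocabulary, chosen only for this illustration.
tkr = DictTokenizer(['数据挖掘', '机器学习', '数据'])
print(tkr.parse('数据挖掘机器学习'))
```

A dictionary-based matcher like this cannot resolve ambiguity or recognize unknown words, so it is only suitable for testing the surrounding code.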
6. Using the keywords from step 5 together with the search endpoint of the Sina API, produce the users that the current user does not yet follow but may be interested in:
    recommendUsers = {}  # uid -> screen name; initialized here so the fragment stands alone
    for keyword in words:
        print keyword.decode('utf-8').encode('gbk')  # re-encode as GBK for console output
        searchUsers = self.AppCli.search.suggestions.users.get(q=keyword.decode('utf-8'), count=10)

        # recommend the top ten users
        '''
        if len(searchUsers) >6:
            searchUsers = searchUsers[-6:]
        '''
        for se_user in searchUsers:
            #print se_user
            uid = se_user['uid']
            # filter out users the current user already follows
            if uid not in userFocus:
                recommendUsers[uid] = se_user['screen_name'].encode('utf-8')
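The filtering logic in this step can be exercised offline by passing the search call in as a function. The following is a minimal sketch under that assumption: `recommend` and `fake_search` are hypothetical names, and the stubbed results only mimic the `uid` and `screen_name` fields used above.

```python
def recommend(keywords, search, followed_ids, per_keyword=10):
    """Collect users found by keyword search that are not already followed.

    `search(keyword, count)` is any callable returning a list of dicts
    with 'uid' and 'screen_name' keys; here it stands in for the real API.
    """
    followed = set(followed_ids)   # set membership is O(1) per lookup
    recommended = {}
    for keyword in keywords:
        for user in search(keyword, per_keyword):
            uid = user['uid']
            if uid not in followed:
                recommended[uid] = user['screen_name']
    return recommended


def fake_search(keyword, count):
    """Stub that returns made-up results instead of calling the API."""
    data = {
        'nlp': [{'uid': 1, 'screen_name': 'alice'},
                {'uid': 2, 'screen_name': 'bob'}],
        'python': [{'uid': 2, 'screen_name': 'bob'},
                   {'uid': 3, 'screen_name': 'carol'}],
    }
    return data.get(keyword, [])[:count]


print(recommend(['nlp', 'python'], fake_search, followed_ids=[1]))
```

Keying the result by `uid` means a user returned for several keywords appears only once, and converting the followed list to a set keeps the membership check cheap when that list is long.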
------
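Running it:
Below is an example using my own Weibo account. My tags are:
After running the recommendation program, the result is:
The recommendations are the ones inside the red box. These Weibo users all share tags with the user receiving the recommendations and have relatively high influence, so they are also the users most likely to pass along high-value information. (Only some of the users are marked in the figure.)
That completes the whole recommendation task. The full source code is on my GitHub; corrections from interested readers are welcome.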