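Text clustering with pre-trained word2vec vectors

I tried k-means on term-frequency representations and the results were poor, so I wanted to see what difference word2vec representations would make.

1. Loading word2vec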

import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('word2vector.bigram-char')

The file was downloaded from the web: 300-dimensional word vectors trained on a Baidu Baike corpus. A quick check of the quality:

model.most_similar(['男人'])

[('女人', 0.874478816986084),
('老男人', 0.7225901484489441),
('大男人', 0.7179129123687744),
('女孩', 0.6780898571014404),
('臭男人', 0.6778838038444519),
('中年男人', 0.6763597726821899),
('男孩', 0.6762259006500244),
('真男人', 0.6674383878707886),
('好男人', 0.6661351919174194),
('单身男人', 0.6624549031257629)]
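
The scores returned by most_similar are cosine similarities between word vectors. A minimal sketch of that measure follows; the helper name cosine_similarity and the toy 3-dimensional vectors are made up for illustration and are not taken from the model file.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors, invented for this example (not real embeddings).
v1 = np.array([1.0, 2.0, 0.0])
v2 = np.array([2.0, 4.0, 0.0])  # same direction as v1
v3 = np.array([0.0, 0.0, 5.0])  # orthogonal to v1

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

With the loaded model, cosine_similarity(model['男人'], model['女人']) should come out close to the first score in the list above.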

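len(model.vocab): 635974 (the size of the vocabulary; this is the gensim 3.x attribute, and in gensim 4.0+ the equivalent is len(model.key_to_index))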

2. Word embedding

Embed our own corpus (about 30,000 news articles, with keywords already extracted) using the word2vec vectors:

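# Embed each document's keywords with the word2vec vectors
from datetime import datetime
import numpy as np

KEYWORDS_PER_DOC = 20  # each document was reduced to 20 keywords
start = datetime.now()
embedding = []

for idx, line in enumerate(keywords):
    vector = np.zeros(300)
    for word in line:
        # Out-of-vocabulary words contribute nothing to the sum
        # (gensim 3.x API; gensim 4.0+ uses model.key_to_index)
        if word in model.vocab:
            vector += model[word]
    # The mean over the 20 keywords becomes the document vector
    embedding.append(vector / KEYWORDS_PER_DOC)
    if idx % 100 == 0:
        print(idx)  # progress indicator

end = datetime.now()
end - start  # elapsed time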

Because I take 20 keywords per document, I divide the summed vector by 20 and use that mean as the document vector.
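
Dividing by a fixed 20 means that keywords missing from the vocabulary pull the average toward zero. One alternative, shown here only as a sketch and not what the code above does, is to divide by the number of keywords actually found. The names toy_vectors and mean_vector are hypothetical, and a plain dict stands in for the word2vec model.

```python
import numpy as np

DIM = 3  # toy dimensionality; the real vectors have 300 dimensions

# Hypothetical stand-in for the word2vec model: word -> vector.
toy_vectors = {
    "football": np.array([1.0, 0.0, 0.0]),
    "match": np.array([0.0, 1.0, 0.0]),
}

def mean_vector(words, vectors, dim=DIM):
    """Average the vectors of the words that are in the vocabulary."""
    found = [vectors[w] for w in words if w in vectors]
    if not found:
        return np.zeros(dim)  # no known word: fall back to the zero vector
    return np.mean(found, axis=0)

doc = ["football", "match", "unknownword"]
print(mean_vector(doc, toy_vectors))
```

Whether this beats the fixed divisor depends on how many keywords fall outside the vocabulary; if nearly all of them are covered, the two versions give almost the same vectors.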

3. Clustering with sklearn's KMeans

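from collections import Counter
from sklearn.cluster import KMeans

# Fit k-means with 100 clusters on the document vectors
kmeans = KMeans(n_clusters=100, random_state=1).fit(embedding)
y_pred = kmeans.labels_                    # cluster label of each document
cluster_center = kmeans.cluster_centers_   # one 300-dimensional centroid per cluster

# Count how many documents fall into each cluster
center_dict = Counter(y_pred)
center_dict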

The number of documents in each cluster:

Counter({16: 314,
         27: 384,
         21: 160,
         30: 370,
         99: 223,
         15: 158,
         36: 882,
         48: 180,
         14: 184,
         43: 447,
         98: 726,
         88: 601,
         52: 195,
         53: 351,
         13: 565,
         5: 523,
         22: 417,
         23: 365,
         71: 604,
         37: 740,
         63: 355,
         29: 492,
         25: 554,
         82: 335,
         50: 727,
         41: 676,
         47: 344,
         4: 141,
         70: 274,
         12: 559,
         78: 481,
         84: 820,
         40: 237,
         75: 340,
         3: 394,
         10: 574,
         56: 564,
         59: 414,
         51: 301,
         73: 503,
         6: 560,
         60: 268,
         86: 405,
         2: 611,
         28: 485,
         66: 489,
         76: 334,
         77: 296,
         33: 226,
         65: 464,
         97: 501,
         18: 188,
         7: 218,
         54: 251,
         35: 511,
         92: 404,
         19: 454,
         74: 228,
         67: 325,
         49: 591,
         24: 306,
         69: 547,
         72: 330,
         11: 280,
         95: 374,
         81: 464,
         58: 636,
         32: 274,
         79: 115,
         87: 205,
         62: 425,
         34: 281,
         38: 330,
         96: 269,
         64: 445,
         68: 416,
         9: 382,
         91: 113,
         80: 251,
         20: 517,
         44: 264,
         93: 276,
         26: 240,
         17: 381,
         55: 129,
         57: 470,
         0: 501,
         83: 167,
         8: 261,
         89: 134,
         85: 69,
         31: 200,
         90: 147,
         46: 188,
         94: 492,
         1: 91,
         42: 401,
         45: 124,
         61: 189,
         39: 91})

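The sizes look fairly even. Pulling out one cluster at random and reading the article titles, some clusters are good and some are poor, but the overall result is acceptable.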
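
Inspecting a cluster by its titles can be sketched as follows. The list titles is an assumption (the post does not show how titles are stored), titles_by_cluster is a hypothetical helper, and the toy labels stand in for y_pred.

```python
from collections import defaultdict

def titles_by_cluster(labels, titles):
    """Group titles by the cluster label assigned to each document."""
    clusters = defaultdict(list)
    for label, title in zip(labels, titles):
        clusters[int(label)].append(title)
    return dict(clusters)

# Toy data standing in for y_pred and the real article titles.
labels = [0, 1, 0, 2, 1]
titles = ["title A", "title B", "title C", "title D", "title E"]

grouped = titles_by_cluster(labels, titles)
for title in grouped[0]:  # print every title in cluster 0
    print(title)
```

With the real data this would be titles_by_cluster(y_pred, titles), assuming titles is in the same order as keywords.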

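I had also wanted to try DBSCAN, but it took too long; running it would require reducing the dimensionality of the vectors first, so I dropped the idea.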
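
For reference, reducing the dimensionality before DBSCAN might look like the sketch below. PCA is my choice of reduction method, not something the experiment above used, and the data is synthetic: two artificial blobs in 300 dimensions. The values of n_components, eps, and min_samples are illustrative guesses that would need tuning on the real document vectors.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for the document vectors: two well-separated blobs.
blob_a = rng.normal(loc=0.0, scale=0.1, size=(50, 300))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(50, 300))
X = np.vstack([blob_a, blob_b])

# Project the 300-dimensional vectors onto 10 principal components.
X_reduced = PCA(n_components=10, random_state=0).fit_transform(X)

# Density-based clustering on the reduced vectors; label -1 marks noise.
labels = DBSCAN(eps=3.0, min_samples=5).fit_predict(X_reduced)
print(sorted(set(labels)))
```

On the real corpus, embedding would replace X.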

posted @ 2019-05-16 18:26  xiaonezha