Text clustering with pre-trained word2vec vectors
I tried k-means on word-frequency (bag-of-words) representations and the results were poor, so I wanted to see whether word2vec representations would do better.
1. Loading word2vec
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('word2vector.bigram-char')
The file was downloaded from the web: 300-dimensional word vectors trained on a Baidu Baike corpus. A quick check that it behaves sensibly:
model.most_similar(['男人'])
[('女人', 0.874478816986084),
('老男人', 0.7225901484489441),
('大男人', 0.7179129123687744),
('女孩', 0.6780898571014404),
('臭男人', 0.6778838038444519),
('中年男人', 0.6763597726821899),
('男孩', 0.6762259006500244),
('真男人', 0.6674383878707886),
('好男人', 0.6661351919174194),
('单身男人', 0.6624549031257629)]
len(model.vocab)  # 635974
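As another quick sanity check (my addition, using the same gensim < 4.0 KeyedVectors API as above), each word maps to a 300-dimensional numpy vector, which is exactly what the embedding step below sums over:

vec = model['男人']   # raw vector lookup via KeyedVectors.__getitem__
print(vec.shape)      # (300,)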
2. Embedding the documents
Embed our own corpus (about 30,000 news articles, with keywords already extracted) using the word2vec vectors:
# Embed each document: sum the word2vec vectors of its keywords
from datetime import datetime
import numpy as np

start = datetime.now()
embedding = []
for idx, line in enumerate(keywords):
    vector = np.zeros(300)
    for word in line:
        if word in model.vocab:   # skip words missing from the pretrained vocab
            vector += model[word]
    embedding.append(vector / 20)  # 20 keywords per document
    if idx % 100 == 0:
        print(idx)
end = datetime.now()
print(end - start)
Since I extract 20 keywords per document, I divide the summed vectors by 20, taking the mean word vector as the document vector.
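One caveat: when some of the 20 keywords are out of vocabulary, dividing by a fixed 20 pulls the document vector toward zero. A variant of the embedding step (my sketch, not the original code) averages over the words actually found instead:

import numpy as np

def embed_document(words, model, dim=300):
    # average only over keywords present in the pretrained vocabulary
    # (gensim < 4.0 API, matching the model.vocab usage above)
    vecs = [model[w] for w in words if w in model.vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# drop-in replacement for the loop above
embedding = [embed_document(line, model) for line in keywords]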
3. Clustering with sklearn's KMeans
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=100, random_state=1).fit(embedding)
y_pred = kmeans.labels_
cluster_center = kmeans.cluster_centers_
from collections import Counter

center_dict = Counter(y_pred)
center_dict
Check how many documents land in each cluster:
Counter({16: 314, 27: 384, 21: 160, 30: 370, 99: 223, 15: 158, 36: 882, 48: 180, 14: 184, 43: 447, 98: 726, 88: 601, 52: 195, 53: 351, 13: 565, 5: 523, 22: 417, 23: 365, 71: 604, 37: 740, 63: 355, 29: 492, 25: 554, 82: 335, 50: 727, 41: 676, 47: 344, 4: 141, 70: 274, 12: 559, 78: 481, 84: 820, 40: 237, 75: 340, 3: 394, 10: 574, 56: 564, 59: 414, 51: 301, 73: 503, 6: 560, 60: 268, 86: 405, 2: 611, 28: 485, 66: 489, 76: 334, 77: 296, 33: 226, 65: 464, 97: 501, 18: 188, 7: 218, 54: 251, 35: 511, 92: 404, 19: 454, 74: 228, 67: 325, 49: 591, 24: 306, 69: 547, 72: 330, 11: 280, 95: 374, 81: 464, 58: 636, 32: 274, 79: 115, 87: 205, 62: 425, 34: 281, 38: 330, 96: 269, 64: 445, 68: 416, 9: 382, 91: 113, 80: 251, 20: 517, 44: 264, 93: 276, 26: 240, 17: 381, 55: 129, 57: 470, 0: 501, 83: 167, 8: 261, 89: 134, 85: 69, 31: 200, 90: 147, 46: 188, 94: 492, 1: 91, 42: 401, 45: 124, 61: 189, 39: 91})
The cluster sizes look fairly even. I pulled out a cluster at random and skimmed its article titles: some clusters are coherent, some are not, but overall the result is passable.
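For reference, such a spot check can be done as follows; a minimal sketch assuming a list titles aligned index-for-index with keywords and embedding (this variable is not shown in the post):

import numpy as np

cluster_id = 36                             # any cluster label from the Counter above
member_idx = np.where(y_pred == cluster_id)[0]
for i in member_idx[:10]:                   # print the first ten titles in this cluster
    print(titles[i])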
I also wanted to try DBSCAN, but it ran far too long; making it feasible would require reducing the vector dimensionality first, so I dropped the idea.
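For completeness, here is a rough sketch of that plan. The reduction method is my assumption (the post names none; PCA is just one option), and eps/min_samples are placeholder values that would need tuning:

from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
from collections import Counter
import numpy as np

# reduce 300-dim document vectors to 50 dims, then run DBSCAN
X = PCA(n_components=50, random_state=1).fit_transform(np.array(embedding))
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
Counter(labels)  # label -1 marks noise points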