Word Frequency Counting: Two Implementations

Method 1: vocab = dict(Counter(text).most_common(MAX_VOCAB_SIZE-1))

Example:

from collections import Counter

colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']

c = Counter(colors)

print(dict(c))  # {'red': 2, 'blue': 3, 'green': 1}

most_common(k): returns the k most frequent elements together with their counts, ordered from most to least frequent.
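To make the top-k behavior concrete, here is a minimal sketch that applies most_common to the same colors list and then builds a capped vocabulary in the style of Method 1. The value MAX_VOCAB_SIZE = 3 is an assumption chosen only so the truncation is visible; the original one-liner does not say what the slot left free by the "-1" is used for.

```python
from collections import Counter

colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']
counts = Counter(colors)

# most_common(k) returns a list of (element, count) pairs, most frequent first.
top2 = counts.most_common(2)
print(top2)  # [('blue', 3), ('red', 2)]

# Assumed value for illustration only; a real vocabulary would be much larger.
MAX_VOCAB_SIZE = 3

# Keep the MAX_VOCAB_SIZE - 1 most frequent items, as in Method 1.
vocab = dict(counts.most_common(MAX_VOCAB_SIZE - 1))
print(vocab)  # {'blue': 3, 'red': 2}
```

Here 'green' is dropped because only the two most frequent items fit under the cap.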

Method 2:

def generate_vocab_file(input_seg_file, output_vocab_file):
    # Read the segmented input; each line is "label<TAB>space-separated words".
    with open(input_seg_file, 'r', encoding='UTF-8') as f:
        lines = f.readlines()
    word_dict = {}
    for line in lines:
        label, content = line.strip('\r\n').split('\t')
        for word in content.split():
            # Start each new word at 0, then count this occurrence.
            word_dict.setdefault(word, 0)
            word_dict[word] += 1
    # [(word, frequency), ..., ()]
    sorted_word_dict = sorted(
        word_dict.items(), key=lambda d: d[1], reverse=True)
    with open(output_vocab_file, 'w', encoding='UTF-8') as f:
        # <UNK> is written first with a very large count so it stays at the top.
        f.write('<UNK>\t10000000\n')
        for item in sorted_word_dict:
            f.write('%s\t%d\n' % (item[0], item[1]))
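The two methods can also be combined. The sketch below is a hypothetical variant (the name generate_vocab_file_with_counter is not from the original post) that produces the same file layout but lets Counter do the counting and sorting. The sample input lines and temporary file names are made up for the demonstration, and the input is assumed to contain exactly one tab per line.

```python
import os
import tempfile
from collections import Counter


def generate_vocab_file_with_counter(input_seg_file, output_vocab_file):
    # Hypothetical variant of generate_vocab_file that counts with Counter.
    counts = Counter()
    with open(input_seg_file, 'r', encoding='UTF-8') as f:
        for line in f:
            # Each line is assumed to be "label<TAB>space-separated words".
            label, content = line.strip('\r\n').split('\t')
            counts.update(content.split())
    with open(output_vocab_file, 'w', encoding='UTF-8') as f:
        f.write('<UNK>\t10000000\n')
        # most_common() with no argument returns every item, most frequent first.
        for word, freq in counts.most_common():
            f.write('%s\t%d\n' % (word, freq))


# Small demonstration on a temporary file with made-up sample data.
tmp_dir = tempfile.mkdtemp()
seg_path = os.path.join(tmp_dir, 'seg.txt')
vocab_path = os.path.join(tmp_dir, 'vocab.txt')
with open(seg_path, 'w', encoding='UTF-8') as f:
    f.write('sports\tthe game the team\n')
    f.write('tech\tthe chip\n')

generate_vocab_file_with_counter(seg_path, vocab_path)
with open(vocab_path, 'r', encoding='UTF-8') as f:
    vocab_lines = f.read().splitlines()
print(vocab_lines)
```

The first line of the output file is the <UNK> entry, followed by 'the' with a count of 3 and then the words that occur once.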

A similar implementation:

colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']

result = {}

for color in colors:
    if result.get(color) is None:
        result[color] = 1
    else:
        result[color] += 1

print(result)  # {'red': 2, 'blue': 3, 'green': 1}
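The if/else branch can be folded into a single line by giving dict.get a default value. This is only a sketch of one common alternative, not the post's own code; it also checks that the manual count agrees with Counter from Method 1.

```python
from collections import Counter

colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']

result = {}
for color in colors:
    # get(color, 0) returns 0 for a color not seen yet, so no if/else is needed.
    result[color] = result.get(color, 0) + 1

print(result)  # {'red': 2, 'blue': 3, 'green': 1}

# The manual count matches what Counter produces.
print(result == dict(Counter(colors)))  # True
```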

posted @ 2020-08-25 17:17 kpwong
