Word frequency counting: two implementations

Method 1: vocab = dict(Counter(text).most_common(MAX_VOCAB_SIZE-1))

Example:

from collections import Counter 

colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']

c = Counter(colors)

print(dict(c))  # {'red': 2, 'blue': 3, 'green': 1}

most_common(k): returns the k most frequent elements together with their counts, ordered from most to least frequent.
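
As a minimal sketch of how the Method 1 one-liner might be used, the snippet below builds a vocabulary from a token list and keeps only the most frequent entries. The sample tokens and MAX_VOCAB_SIZE = 3 are made up for illustration, and reserving the last slot for a '<UNK>' entry is only an assumption about why the original passes MAX_VOCAB_SIZE-1.

```python
from collections import Counter

# Hypothetical sample data and size limit, chosen only for illustration.
text = ['red', 'blue', 'red', 'green', 'blue', 'blue', 'yellow']
MAX_VOCAB_SIZE = 3

# Keep the (MAX_VOCAB_SIZE - 1) most frequent tokens with their counts.
vocab = dict(Counter(text).most_common(MAX_VOCAB_SIZE - 1))
print(vocab)  # {'blue': 3, 'red': 2}

# Assumption: the remaining slot is reserved for an unknown-token entry.
vocab['<UNK>'] = 0
print(len(vocab))  # 3
```

Because most_common returns (element, count) pairs already sorted by count, wrapping it in dict() yields a mapping whose insertion order runs from most to least frequent.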

Method 2:

def generate_vocab_file(input_seg_file, output_vocab_file):
    # Each input line is expected to look like "label<TAB>space-separated words".
    with open(input_seg_file, 'r', encoding='UTF-8') as f:
        lines = f.readlines()
    # Count how many times each word occurs across all lines.
    word_dict = {}
    for line in lines:
        label, content = line.strip('\r\n').split('\t')
        for word in content.split():
            word_dict.setdefault(word, 0)
            word_dict[word] += 1
    # [(word, frequency), ..., ()], sorted by frequency in descending order
    sorted_word_dict = sorted(
        word_dict.items(), key=lambda d: d[1], reverse=True)
    with open(output_vocab_file, 'w', encoding='UTF-8') as f:
        # The <UNK> entry is written first, with a large placeholder count.
        f.write('<UNK>\t10000000\n')
        for item in sorted_word_dict:
            f.write('%s\t%d\n' % (item[0], item[1]))
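
For comparison, the same vocabulary file could probably be produced with Counter from Method 1. The sketch below is a hypothetical variant, not the original function: the name generate_vocab_file_with_counter, the skipping of blank lines, and splitting on the first tab only are all assumptions. It is followed by a small demo on made-up input written to a temporary directory.

```python
import os
import tempfile
from collections import Counter


def generate_vocab_file_with_counter(input_seg_file, output_vocab_file):
    """Hypothetical Counter-based variant of generate_vocab_file."""
    counts = Counter()
    with open(input_seg_file, 'r', encoding='UTF-8') as f:
        for line in f:
            line = line.strip('\r\n')
            if not line:
                continue  # assumption: blank lines are skipped
            # Assumption: split on the first tab only, so the content
            # may itself contain tabs without raising an error.
            label, content = line.split('\t', 1)
            counts.update(content.split())
    with open(output_vocab_file, 'w', encoding='UTF-8') as f:
        f.write('<UNK>\t10000000\n')
        # most_common() without an argument returns every item,
        # ordered from most to least frequent.
        for word, freq in counts.most_common():
            f.write('%s\t%d\n' % (word, freq))


# Demo on made-up data.
tmp_dir = tempfile.mkdtemp()
seg_path = os.path.join(tmp_dir, 'seg.txt')
vocab_path = os.path.join(tmp_dir, 'vocab.txt')
with open(seg_path, 'w', encoding='UTF-8') as f:
    f.write('sports\tred blue red\n')
    f.write('news\tblue green blue\n')

generate_vocab_file_with_counter(seg_path, vocab_path)
with open(vocab_path, 'r', encoding='UTF-8') as f:
    vocab_lines = f.read().splitlines()
print(vocab_lines)
# ['<UNK>\t10000000', 'blue\t3', 'red\t2', 'green\t1']
```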

A similar implementation:

colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']

result = {}

for color in colors:
    # First occurrence: start the count at 1; otherwise increment it.
    if result.get(color) is None:
        result[color] = 1
    else:
        result[color] += 1

print(result)  # {'red': 2, 'blue': 3, 'green': 1}
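
The if/else branch in the manual count can be avoided in a couple of ways. The sketch below shows two common alternatives, dict.get with a default value and collections.defaultdict; the variable name counts is introduced here only for illustration.

```python
from collections import defaultdict

colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']

# Variant 1: dict.get with a default of 0 replaces the if/else branch.
result = {}
for color in colors:
    result[color] = result.get(color, 0) + 1
print(result)  # {'red': 2, 'blue': 3, 'green': 1}

# Variant 2: defaultdict(int) supplies 0 for missing keys automatically.
counts = defaultdict(int)
for color in colors:
    counts[color] += 1
print(dict(counts))  # {'red': 2, 'blue': 3, 'green': 1}
```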

posted @ 2020-08-25 17:17  kpwong