Post archive "July 2016" - 奋斗中的菲比

Code for computing the similarity between two articles

Summary: # -*- coding:gb2312 -*- from gensim import corpora, models, similarities from nltk.tokenize import word_tokenize from nltk.corpus import brown courses=[] temp="" for line in file('aaa'): if(line!...

posted @ 2016-07-29 12:26 奋斗中的菲比

Topic models

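Summary: First, the English text needs some preprocessing: removing stop words, tokenizing, stemming, and filtering out low-frequency words. Then a "bag of words" is extracted from the documents by mapping each token to an id, after which each string can be converted into a document vector expressed in those ids {‘a’: 0, ‘damaged’: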
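The steps in this summary can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the post's own code: the stop-word list, the sample sentences, and the helper names `preprocess`, `build_vocabulary`, and `to_bow` are all hypothetical, and stemming is left out to keep the sketch dependency-free. In practice, gensim's `corpora.Dictionary` and its `doc2bow` method cover the token-to-id mapping and the vector conversion.

```python
from collections import Counter

# Hypothetical toy stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "the", "of", "and", "to", "in", "is"}


def preprocess(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]


def build_vocabulary(tokenized_docs, min_count=2):
    """Map every token seen at least min_count times to an integer id."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    kept = sorted(tok for tok, n in counts.items() if n >= min_count)
    return {tok: idx for idx, tok in enumerate(kept)}


def to_bow(tokens, token2id):
    """Convert a token list into a sorted list of (token_id, count) pairs."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())


# Made-up sample documents, used only for illustration.
docs = [
    "The topic model is trained on a corpus of documents.",
    "A topic is a distribution over words.",
    "Documents in the corpus mix several topics.",
]

tokenized = [preprocess(d) for d in docs]
token2id = build_vocabulary(tokenized, min_count=2)
vectors = [to_bow(doc, token2id) for doc in tokenized]

print(token2id)
print(vectors)
```

Because no stemming is applied here, "topic" and "topics" count as different tokens, which is exactly the kind of split that the stemming step in the summary is meant to avoid.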

posted @ 2016-07-11 14:23 奋斗中的菲比

123

posted @ 2016-07-05 19:15 奋斗中的菲比
