Summary:
    import jieba
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud
    from scipy.misc import imread
    with open('lagou.txt',encoding='utf-8') as f:
        tmp_line=f.read()
    jieba_cut=jieba.cut...
posted @ 2017-12-06 18:46 不可叽叽歪歪 Views (333) Comments (0) Recommendations (0)
Summary:
    import jieba
    import sys
    import jieba.analyse
    from optparse import OptionParser
    textrank = jieba.analyse.textrank
    with open('lagoujobdatails.txt',encoding='utf-8') as f:
        tmp_line=f.read()
    jie...
posted @ 2017-12-06 16:58 不可叽叽歪歪 Views (455) Comments (0) Recommendations (0)
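Summary: Among the results, "爬虫" (crawler), "spanclass", and "岗位职责" (job responsibilities) are junk data that the cleaning step did not fully remove.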
posted @ 2017-12-06 16:54 不可叽叽歪歪 Views (215) Comments (0) Recommendations (0)
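One way to catch the leftover junk mentioned above is a second filtering pass over the segmented tokens. This is only a minimal sketch: the `JUNK_TOKENS` set and the `drop_junk` helper are hypothetical names, and the sample token list is made up for illustration rather than taken from the scraped file.

```python
# Hypothetical junk list built from the three examples named in the post.
JUNK_TOKENS = {"爬虫", "spanclass", "岗位职责"}


def drop_junk(tokens, junk=JUNK_TOKENS):
    """Return the tokens with junk entries and blank strings removed."""
    return [t for t in tokens if t.strip() and t not in junk]


# Made-up sample standing in for the output of jieba.cut.
sample = ["python", "爬虫", "数据", "spanclass", " ", "岗位职责", "分析"]
cleaned = drop_junk(sample)
print(cleaned)  # ['python', '数据', '分析']
```

Keeping the junk words in a set makes it easy to extend the list as more leftovers show up in the keyword results.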
Summary:
    from nltk.corpus import PlaintextCorpusReader
    import nltk
    corpus_root=r"C:\Users\sun\AppData\Roaming\nltk_data\corpora\jieba"
    file_pattern=r".*/.*\.txt"
    ptb=PlaintextCorpusReader(corpus_root,file_pa...
posted @ 2017-12-06 15:20 不可叽叽歪歪 Views (171) Comments (0) Recommendations (0)
Summary:
    import jieba
    with open('lagoujobdatails.txt',encoding='utf-8') as f:
        tmp_line=f.read()
    jieba_cut=jieba.cut(tmp_line)
    ans=' '.join(jieba_cut)
    with open('jieba5.txt','w',enc...
posted @ 2017-12-06 15:18 不可叽叽歪歪 Views (269) Comments (0) Recommendations (0)
Summary:
    words2=re.sub("[\s+\.\!\/_,$%^*(+\"\'\n]+|[+——;!,”。《》,。:“?、~@#¥%……&*()1234567①②③④)]+", "", words)
posted @ 2017-12-06 14:55 不可叽叽歪歪 Views (117) Comments (0) Recommendations (0)
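The punctuation-stripping substitution above can also be written with a precompiled pattern, which is easier to read and reuse. This is a rough sketch, not the post's exact pattern: `CJK_PUNCT` is an assumed, shortened list of full-width marks, `strip_punct` is a hypothetical helper name, and unlike the original it does not remove digits or circled numbers.

```python
import re
import string

# Assumed subset of full-width punctuation; extend it to match the real data.
CJK_PUNCT = ",。!?、;:“”‘’《》()…—¥"

# re.escape keeps characters such as ], ^ and - literal inside the class.
PUNCT = re.compile("[" + r"\s" + re.escape(string.punctuation + CJK_PUNCT) + "]+")


def strip_punct(text):
    """Remove whitespace plus ASCII and full-width punctuation from text."""
    return PUNCT.sub("", text)


cleaned_text = strip_punct("岗位职责:熟悉 Python,了解爬虫!")
print(cleaned_text)  # 岗位职责熟悉Python了解爬虫
```

Building the character class from `string.punctuation` avoids hand-escaping each ASCII symbol, which is where patterns like this usually go wrong.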
Summary: A script must not share its name with a module it imports (for example, a file named re.py that contains import re).
posted @ 2017-12-06 14:43 不可叽叽歪歪 Views (512) Comments (0) Recommendations (0)
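The naming clash described above can be checked before a script is run. The sketch below is one possible approach, not something from the post: `shadows_module` is a hypothetical helper, and it relies on `importlib.util.find_spec`, so its answer reflects whatever is importable on the current `sys.path`.

```python
import importlib.util
from pathlib import Path


def shadows_module(script_path):
    """Return True if the script's name matches another importable module."""
    stem = Path(script_path).stem
    try:
        spec = importlib.util.find_spec(stem)
    except (ImportError, ValueError):
        # Names that cannot be resolved as a module are treated as safe.
        return False
    if spec is None:
        return False
    origin = spec.origin
    if origin in (None, "built-in", "frozen"):
        return True
    # A match that is the script itself is not a clash.
    return Path(origin).resolve() != Path(script_path).resolve()


print(shadows_module("re.py"))
print(shadows_module("lagou_demo_script_xyz.py"))
```

The first call checks a name that collides with the standard library's `re` module; the second uses a made-up filename that is assumed not to exist as a module. Renaming the script is the simplest fix when the check reports a clash.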