Common Tools for Text Cleaning
Original article: https://zhuanlan.zhihu.com/p/53286270 (study notes on an article by Li Wenzhe (李文哲) of 贪心科技)
1. Remove punctuation
import string
s = ''.join(c for c in word if c not in string.punctuation)
2. Convert English text to lowercase
s = s.lower()
3. Normalize numbers
s = '#number' if s.isdigit() else s
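Steps 1 to 3 can be combined into a single per-token helper. The following is a minimal sketch, not the original author's code: the names `normalize_token` and `normalize_sentence` are hypothetical, and splitting on whitespace is an assumed, simplistic tokenization.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_token(word):
    """Apply steps 1-3 to a single token (hypothetical helper)."""
    s = word.translate(_PUNCT_TABLE)  # 1. remove punctuation
    s = s.lower()                     # 2. lowercase
    return "#number" if s.isdigit() else s  # 3. map pure-digit tokens to one symbol


def normalize_sentence(text):
    """Split on whitespace, normalize each token, and drop empty results."""
    tokens = (normalize_token(w) for w in text.split())
    return [t for t in tokens if t]


print(normalize_sentence("Hello, World! I have 42 apples."))
# ['hello', 'world', 'i', 'have', '#number', 'apples']
```

Note that `string.punctuation` covers ASCII punctuation only, so full-width or other Unicode punctuation would need to be handled separately.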
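4. Stop-word list / low-frequency word list
Stop-word list: searching for “停用词库” (stop-word list) or “english stop words list” in a search engine turns up many ready-made lists. For example: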
stop_words = ["a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your","ain't","aren't","can't","could've","couldn't","didn't","doesn't","don't","hasn't","he'd","he'll","he's","how'd","how'll","how's","i'd","i'll","i'm","i've","isn't","it's","might've","mightn't","must've","mustn't","shan't","she'd","she'll","she's","should've","shouldn't","that'll","that's","there's","they'd","they'll","they're","they've","wasn't","we'd","we'll","we're","weren't","what'd","what's","when'd","when'll","when's","where'd","where'll","where's","who'd","who'll","who's","why'd","why'll","why's","won't","would've","wouldn't","you'd","you'll","you're","you've"]
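Low-frequency word list: we can use a tool such as Counter to count how often each word occurs across all sentences, then filter by count to obtain the low-frequency words. For example:
from collections import Counter
# Count occurrences (sentence_list is assumed here to be a flat list of words)
word_dict = Counter(sentence_list)
# Build the low-frequency word list: words that occur fewer than 2 times
low_frequency_words = [k for k, v in word_dict.items() if v < 2]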
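When the corpus is stored as one token list per sentence rather than a single flat list, the counts can be accumulated across sentences. A small sketch under that assumption; the toy corpus `tokenized_sentences` and the threshold name `min_count` are made up for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: each sentence is already a list of tokens.
tokenized_sentences = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]

# Count every token across all sentences.
word_dict = Counter(tok for sent in tokenized_sentences for tok in sent)

# Assumed threshold: words seen fewer than min_count times are low-frequency.
min_count = 2
low_frequency_words = {w for w, n in word_dict.items() if n < min_count}

print(sorted(low_frequency_words))
# ['bird', 'cat', 'dog', 'flew']
```

A set is used instead of a list so that later membership checks are fast.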
Once the stop-word list and the low-frequency word list are ready, drop every word that appears in either of them:
if s not in stop_words and s not in low_frequency_words:
    sentence += s
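The filtering step can also be written as a function over a whole token list. This is only a sketch: `remove_words` is a hypothetical helper, and joining the surviving tokens with spaces is an assumption that suits English text.

```python
def remove_words(tokens, stop_words, low_frequency_words):
    """Return the tokens that appear in neither word list (hypothetical helper)."""
    # Merge both lists into one set so each lookup is a constant-time check.
    blocked = set(stop_words) | set(low_frequency_words)
    return [t for t in tokens if t not in blocked]


tokens = ["the", "cat", "sat", "on", "the", "mat"]
cleaned = remove_words(tokens, stop_words=["the", "on"], low_frequency_words=["mat"])

# Join with spaces so the remaining English words stay separated.
sentence = " ".join(cleaned)
print(sentence)
# cat sat
```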
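5. Remove unnecessary tags
This step calls for flexibility in practice: for example, using the re library to delete or replace text with regular expressions, using the json library to parse JSON data, or applying hand-written rules to process the text.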
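As one possible illustration of combining json and re, the sketch below parses a JSON record and strips tag-like spans from one field. The record layout (`title`, `body`) and the helper name `strip_tags` are assumptions, and the regular expression is deliberately naive: it is not a full HTML parser and may misfire on text containing a literal `<` or `>`.

```python
import json
import re

TAG_RE = re.compile(r"<[^>]+>")  # naive: anything between '<' and '>'
SPACE_RE = re.compile(r"\s+")    # any run of whitespace


def strip_tags(text):
    """Replace tag-like spans with a space, then collapse whitespace."""
    no_tags = TAG_RE.sub(" ", text)
    return SPACE_RE.sub(" ", no_tags).strip()


# Hypothetical input: a JSON record whose "body" field contains markup.
raw = '{"title": "Demo", "body": "<p>Hello, <b>world</b></p><p>Bye</p>"}'
record = json.loads(raw)

print(strip_tags(record["body"]))
# Hello, world Bye
```

Replacing each tag with a space rather than an empty string keeps words from adjacent blocks from running together.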