1. KeyBERT (works better on English than on Chinese)
Link: https://hidadeng.github.io/blog/keybert_tutorial/
Usage:
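!pip3 install gensim==4.0.0
!pip3 install keybert==0.5.1
!pip3 install gensim==3.8.3
Note: the two gensim pins conflict. The later install replaces the earlier one, so gensim 3.8.3 is the version that remains installed.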
from keybert import KeyBERT
import jieba

bertModel = KeyBERT('distiluse-base-multilingual-cased')
# bertModel = KeyBERT('distilbert-base-nli-mean-tokens')
doc = "Primovist 10ml pre-filled glass syringes"
doc = " ".join(jieba.cut(doc))
kw = bertModel.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None, top_n=10)
print(kw)
from keybert import KeyBERT
import spacy
import jieba

zh_model = spacy.load("zh_core_web_sm")
bertModel = KeyBERT(model=zh_model)
# Chinese test data
doc = "各有关单位: 为提高体外诊断试剂临床试验的科学性与合理性,优化临床试验技术要求和管理要求,按照中共中央办公厅、国务院办公厅印发《关于深化审评审批制度改革鼓励药品医疗器械创新的意见》(厅字〔2017〕42号)总体要求和国家药品监督管理局的统一部署,结合《医疗器械监督管理条例》及配套规章的修改情况,我中心对《体外诊断试剂临床试验技术指导原则》(国家食品药品监督管理总局通告 2014年第16号)进行修订,形成了《体外诊断试剂临床试验指导原则(征求意见稿)》。"
# Segment the text so that words are separated by spaces, as in English
doc = ' '.join(jieba.lcut(doc))
# Keyword extraction
keywords = bertModel.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None, top_n=10)
print(keywords)
from keybert import KeyBERT
import spacy

doc = "Primovist 10ml pre-filled glass syringes: Possible issue with plunger resistance A Dear Healthcare Professional Letter has been issued by Bayer to inform healthcare professionals of possible increased plunger resistance with single Primovist 10 mL pre-filled glass syringes. "
en_model = spacy.load("en_core_web_sm")
# en_model is loaded but not used below; pass KeyBERT(model=en_model) to use the spaCy backend instead
kw_model = KeyBERT()  # uses KeyBERT's default embedding model
keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None, top_n=20)
print(keywords)
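Common extract_keywords parameters
bertModel.extract_keywords(docs, keyphrase_ngram_range, stop_words, top_n)
- docs: the document string (words separated by spaces)
- keyphrase_ngram_range: the n-gram range for candidate keyphrases, default (1, 1)
- stop_words: list of stop words to filter out
- top_n: number of top keywords to return, default 5
- highlight: whether to print the document with the keywords highlighted, default False
- use_maxsum: default False; whether to use Max Sum Similarity to select keywords
- use_mmr: default False; whether to use Maximal Marginal Relevance (MMR) to select keywords
- diversity: can be set when use_mmr=True; ranges from 0 to 1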
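To make use_mmr and diversity more concrete, here is a minimal sketch of Maximal Marginal Relevance on toy data. It is not KeyBERT's own code: the function name mmr_select and the hand-written 2-D vectors are hypothetical stand-ins for real embeddings. Each step picks the candidate that maximizes (1 - diversity) * similarity to the document minus diversity * similarity to the keywords already chosen.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(doc_vec, candidates, top_n=2, diversity=0.5):
    """Pick top_n words from candidates (word -> vector) using MMR."""
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < top_n:
        best_word, best_score = None, -math.inf
        for word, vec in remaining.items():
            relevance = cosine(vec, doc_vec)
            # Highest similarity to anything already selected (0 if none yet)
            redundancy = max(
                (cosine(vec, candidates[s]) for s in selected), default=0.0
            )
            score = (1 - diversity) * relevance - diversity * redundancy
            if score > best_score:
                best_word, best_score = word, score
        selected.append(best_word)
        del remaining[best_word]
    return selected


# Toy embeddings (assumed values, for illustration only)
doc_vec = [1.0, 0.0]
candidates = {
    "syringe": [1.0, 0.1],
    "syringes": [1.0, 0.12],  # near-duplicate of "syringe"
    "plunger": [0.6, 0.8],
}

low = mmr_select(doc_vec, candidates, top_n=2, diversity=0.0)
high = mmr_select(doc_vec, candidates, top_n=2, diversity=0.9)
print(low)   # pure relevance ranking
print(high)  # near-duplicate is pushed out in favour of a different word
```

With diversity close to 0 the result is a plain relevance ranking; with diversity close to 1 near-duplicates are penalized heavily.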
For the keyphrase_ngram_range parameter:
- (1, 1) single words only
- (2, 2) two-word phrases only
- (1, 2) both of the above
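The three settings above can be illustrated with a small pure-Python sketch that enumerates candidate phrases from a space-separated string. The helper ngram_candidates is hypothetical and simplified: KeyBERT builds its candidates with scikit-learn's CountVectorizer, which also lowercases and filters tokens, so real output may differ.

```python
def ngram_candidates(text, ngram_range=(1, 1)):
    """Return unique word n-grams of each length in ngram_range, in order."""
    tokens = text.split()
    low, high = ngram_range
    seen = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase not in seen:
                seen.append(phrase)
    return seen


doc = "pre-filled glass syringes"
unigrams = ngram_candidates(doc, (1, 1))
bigrams = ngram_candidates(doc, (2, 2))
both = ngram_candidates(doc, (1, 2))
print(unigrams)
print(bigrams)
print(both)
```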
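The installed spaCy version may not match the version of "zh_core_web_sm" or "en_core_web_sm".
If a model fails to install, download it first and then install it locally: pip install /path/to/en_core_web_sm-2.2.5.tar.gz
Download location:
https://github.com/explosion/spacy-models/tree/master/meta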
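A quick way to reason about such a mismatch is to compare version numbers before downloading. The sketch below assumes the convention that the first two components of a model version track the spaCy major.minor release it was built for (so model 2.2.5 would target spaCy 2.2.x). The helper names are hypothetical, and the metadata files at the link above remain the authoritative source.

```python
def major_minor(version):
    """Return (major, minor) from a dotted version string such as '2.2.5'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def looks_compatible(spacy_version, model_version):
    """Heuristic check: model and spaCy share the same major.minor (assumption)."""
    return major_minor(spacy_version) == major_minor(model_version)


same_series = looks_compatible("2.2.4", "2.2.5")
different_series = looks_compatible("3.1.0", "2.2.5")
print(same_series, different_series)
```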
YAKE! (performs poorly on Chinese; supports 20+ other languages)
Link: https://github.com/LIAAD/yake
Usage: install directly
pip install git+https://github.com/LIAAD/yake
import yake

text = "近日,国家药品监督管理局经审查,批准了腾讯医疗健康(深圳)有限公司生产的^慢性青光眼样视神经病变眼底图像辅助诊断软件^创新产品注册申请。"
language = "zh"
# language = "en"
# language = "el"
# language = "pt"
# language = "ar"
max_ngram_size = 1
deduplication_threshold = 0.9  # deduplication threshold
deduplication_algo = 'seqm'  # deduplication algorithm
windowSize = 1
numOfKeywords = 20
# custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold,
#     dedupFunc=deduplication_algo, windowsSize=windowSize, top=numOfKeywords,

with open(r'D:\app\Python310\Lib\site-packages\yake\StopwordsList\stopwords_zh.txt', 'r', encoding='utf-8') as f:
    # One stop word per line; pass a collection of words rather than the raw file string
    stop_words = f.read().lower().split("\n")
custom_kw_extractor = yake.KeywordExtractor(lan=language, stopwords=stop_words)
keywords = custom_kw_extractor.extract_keywords(text)
for kw in keywords:
    print(kw)
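YAKE! supports multiple languages.
Change the language with, for example, language = "ar". The value is a two-letter lowercase language code (such as "en", "el", "pt", "ar"), matching the suffix of the corresponding stop word file name. After installing yake, the stop word list for each language can be found in the StopwordsList directory under the package's install path.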
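As a rough illustration of how the language codes relate to the stop word files, the sketch below lists codes from file names of the form stopwords_<code>.txt and loads one file into a set. The helpers available_languages and load_stopwords are hypothetical, and a temporary directory with two tiny files stands in for the real StopwordsList folder.

```python
import os
import tempfile


def available_languages(stopword_dir):
    """Return language codes found in names like 'stopwords_zh.txt'."""
    prefix, suffix = "stopwords_", ".txt"
    codes = []
    for name in sorted(os.listdir(stopword_dir)):
        if name.startswith(prefix) and name.endswith(suffix):
            codes.append(name[len(prefix):-len(suffix)])
    return codes


def load_stopwords(path):
    """Read one stop word per line into a set, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


# Stand-in directory with toy content (not the real yake lists)
demo_dir = tempfile.mkdtemp()
for code, words in {"en": "the\nof\n", "zh": "的\n了\n"}.items():
    file_path = os.path.join(demo_dir, f"stopwords_{code}.txt")
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(words)

codes = available_languages(demo_dir)
zh_stopwords = load_stopwords(os.path.join(demo_dir, "stopwords_zh.txt"))
print(codes)
print(zh_stopwords)
```

Pointing available_languages at the installed StopwordsList directory should list the codes that can be passed as the language value.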
TF-IDF (for Chinese)
Reference: https://blog.csdn.net/asialee_bird/article/details/81486700
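import jieba.analyse

text = "Affects the quality of early treatment of patients with blood poisoning Supervision of health services must contribute to safer services with a higher quality. But little is known about the extent to which supervision succeeds in achieving these goals. "
# allowPOS keeps only the listed parts of speech: place names (ns), nouns (n), verbal nouns (vn), verbs (v)
# These are jieba's Chinese POS tags; English tokens are typically tagged 'eng' and may be filtered out, so use Chinese text here
kw = jieba.analyse.extract_tags(text, topK=20, withWeight=True, allowPOS=('ns', 'n', 'vn', 'v'))
print(kw)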
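For readers who want to see what the TF-IDF weight means, here is a minimal sketch that scores the words of one document against a toy corpus. It is not jieba's implementation: jieba ships a precomputed IDF table, while this sketch computes a smoothed IDF from three made-up token lists. The function name tf_idf and the corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(tokens, corpus):
    """Rank the words in tokens by term frequency times smoothed IDF."""
    counts = Counter(tokens)
    total = len(tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        # Number of documents in the corpus that contain the word
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[word] = (count / total) * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy pre-tokenized corpus (assumed data, for illustration only)
corpus = [
    ["supervision", "of", "health", "services"],
    ["quality", "of", "early", "treatment"],
    ["supervision", "of", "supervision", "goals"],
]

ranked = tf_idf(corpus[2], corpus)
for word, weight in ranked:
    print(word, round(weight, 4))
```

Words that are frequent in the document but rare across the corpus rank highest, which is why a word present in every document ("of") ends up last.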
Rake (works well on English)
Link:
https://github.com/laserwave/keywords_extraction_rake
See the link for usage.