Tokenization
Tokenization splits a sentence into its constituent units, such as words, short phrases, and proper nouns.
text = "我爱自然语言处理"
After segmentation: "我/爱/自然语言/处理"
NLTK (Natural Language Toolkit)
import nltk
nltk.download('punkt')  # data download required on first use
from nltk.tokenize import word_tokenize

text = "This is a sample sentence."
tokens = word_tokenize(text)
print(tokens)
spaCy
import spacy

nlp = spacy.load("en_core_web_sm")  # load the English model
text = "This is a sample sentence."
doc = nlp(text)
tokens = [token.text for token in doc]
print(tokens)
Jieba (for Chinese)
import jieba

text = "我爱自然语言处理"
# precise mode
tokens = jieba.cut(text, cut_all=False)
print("Precise mode: " + "/ ".join(tokens))
# full mode
tokens = jieba.cut(text, cut_all=True)
print("Full mode: " + "/ ".join(tokens))
Dictionary-based tokenization
import xml.etree.ElementTree as ET
import nltk
from nltk.tokenize import word_tokenize
# Download the required data on first use
# nltk.download('punkt')
# nltk.download('stopwords')
tree = ET.parse('./data/MeSH/desc2023.xml')
root = tree.getroot()
# Extract MeSH terms from the XML file
mesh_terms = set()
for descriptor in root.findall('DescriptorRecord'):
    term = descriptor.find('DescriptorName').find('String').text
    mesh_terms.add(term)
# Sample text
text = "The present invention discloses the construction of magnetically driven tumor antigen capturing system and its application in tumor in-situ vaccine. The magnetically driven tumor antigen capturing system includes magnetized tumor antigen capturing nanometer particle (Fe3O4@Ca/MnCO3/PROTAC) and magnetized DC cell; Wherein, the magnetized tumor antigen capture nanoparticles are used for specifically killing tumor cells, releasing tumor antigens, and constructing magnetized tumor antigens in situ; Under the drive of magnetic field, magnetized DC cells actively take in magnetized tumor antigens, so as to raise the capture and presentation efficiency of antigens and enhance vaccine effect. Therefore, the magnetically driven tumor antigen capturing system of the present invention may be used in constructing in-situ tumor vaccine for tumor Tinea Versicolor immunotherapy."
word_tokens = word_tokenize(text)
# "Tinea Versicolor" is a single term in MeSH, so this text makes a good test case

# Store the final tokenization result
final_tokens = []
# Walk the token list, checking whether tokens form a MeSH term
i = 0
while i < len(word_tokens):
    matched = False
    # Try to match the longest possible term first
    for j in range(len(word_tokens), i, -1):
        temp_phrase = " ".join(word_tokens[i:j])
        if temp_phrase in mesh_terms:
            final_tokens.append(temp_phrase)
            i = j - 1  # advance the index past the matched term
            matched = True
            break
    if not matched:
        final_tokens.append(word_tokens[i])
    i += 1
print(final_tokens)
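The longest-match loop above can be exercised without the MeSH XML file by packaging it as a function and substituting a small hand-made term set (the function name and the toy vocabulary below are illustrative, not part of the original script):

```python
from typing import List, Set

def longest_match_tokenize(word_tokens: List[str], terms: Set[str]) -> List[str]:
    # Greedy longest-match merge: at each position, try the longest
    # span of tokens that forms a known multi-word term.
    final_tokens = []
    i = 0
    while i < len(word_tokens):
        matched = False
        for j in range(len(word_tokens), i, -1):
            phrase = " ".join(word_tokens[i:j])
            if phrase in terms:
                final_tokens.append(phrase)
                i = j  # skip past the matched term
                matched = True
                break
        if not matched:
            final_tokens.append(word_tokens[i])
            i += 1
    return final_tokens

terms = {"Tinea Versicolor", "natural language processing"}  # toy vocabulary
tokens = "tumor Tinea Versicolor immunotherapy".split()
print(longest_match_tokenize(tokens, terms))
# ['tumor', 'Tinea Versicolor', 'immunotherapy']
```

Because the inner loop counts `j` downward from the end, a longer term like "Tinea Versicolor" always wins over its single-word prefix.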
Stopword removal
import string
from nltk.corpus import stopwords

# Load the English stopword list
stop_words = set(stopwords.words('english'))
# Filter out stopwords and punctuation
filtered_tokens = [word for word in final_tokens
                   if word.lower() not in stop_words and word not in string.punctuation]
print(filtered_tokens)
Lemmatization
- Verbs: reduce tense and voice variants to the base form. For example, "running", "ran", and "runs" all become "run".
- Nouns: convert plural forms to the singular. For example, "mice" becomes "mouse" and "geese" becomes "goose".
- Adjectives and adverbs: convert comparative and superlative forms back to the base form. For example, "better" becomes "good".