Tokenization (Word Segmentation)

Tokenization means taking a sentence and splitting it apart into nouns, short phrases, proper nouns, and so on.

text = "我爱自然语言处理"
After segmentation: "我/爱/自然语言/处理"

NLTK (Natural Language Toolkit)

import nltk
nltk.download('punkt')  # the required data must be downloaded before first use
from nltk.tokenize import word_tokenize
 
text = "This is a sample sentence."
tokens = word_tokenize(text)
print(tokens)

  

spaCy

import spacy
 
nlp = spacy.load("en_core_web_sm")  # load the English model
text = "This is a sample sentence."
doc = nlp(text)
tokens = [token.text for token in doc]
print(tokens)

  

Jieba (for Chinese)

import jieba
 
text = "我爱自然语言处理"
# Accurate mode
tokens = jieba.cut(text, cut_all=False)
print("Accurate mode: " + "/ ".join(tokens))
# Full mode
tokens = jieba.cut(text, cut_all=True)
print("Full mode: " + "/ ".join(tokens))

  

Vocabulary-based (matching against a term list)

import xml.etree.ElementTree as ET
import nltk
from nltk.tokenize import word_tokenize

# The required data must be downloaded before first use
# nltk.download('punkt')
# nltk.download('stopwords')
tree = ET.parse('./data/MeSH/desc2023.xml')
root = tree.getroot()

# Extract the MeSH terms from the XML file
mesh_terms = set()
for descriptor in root.findall('DescriptorRecord'):
    term = descriptor.find('DescriptorName').find('String').text
    mesh_terms.add(term)

# Sample text
text = "The present invention discloses the construction of magnetically driven tumor antigen capturing system and its application in tumor in-situ vaccine. The magnetically driven tumor antigen capturing system includes magnetized tumor antigen capturing nanometer particle (Fe3O4@Ca/MnCO3/PROTAC) and magnetized DC cell; Wherein, the magnetized tumor antigen capture nanoparticles are used for specifically killing tumor cells, releasing tumor antigens, and constructing magnetized tumor antigens in situ; Under the drive of magnetic field, magnetized DC cells actively take in magnetized tumor antigens, so as to raise the capture and presentation efficiency of antigens and enhance vaccine effect. Therefore, the magnetically driven tumor antigen capturing system of the present invention may be used in constructing in-situ tumor vaccine for tumor Tinea Versicolor immunotherapy."
word_tokens = word_tokenize(text)
# "Tinea Versicolor" is a single term in MeSH, so this text is a good test case for multi-word matching
# Store the final tokenization result
final_tokens = []

# Walk through the tokens and check whether they start a MeSH term
i = 0
while i < len(word_tokens):
    matched = False
    # Try to match the longest possible term first
    for j in range(len(word_tokens), i, -1):
        temp_phrase = " ".join(word_tokens[i:j])
        if temp_phrase in mesh_terms:
            final_tokens.append(temp_phrase)
            i = j - 1  # update the index so the matched span is skipped
            matched = True
            break

    if not matched:
        final_tokens.append(word_tokens[i])

    i += 1
print(final_tokens)

  

Stop-word removal

import string
from nltk.corpus import stopwords

# Load the English stop-word list
stop_words = set(stopwords.words('english'))

# Filter out stop words and punctuation
filtered_tokens = [word for word in final_tokens
                   if word.lower() not in stop_words and word not in string.punctuation]
filtered_tokens

  

Lemmatization

  • Verbs: reduce different tenses or voices of a verb to its base form. For example, "running", "ran", and "runs" are all reduced to "run".
  • Nouns: usually convert plural nouns to their singular form. For example, "mice" is reduced to "mouse" and "geese" to "goose".
  • Adjectives and adverbs: convert comparative or superlative forms back to the base form, for example "better" to "good" (a short code sketch follows below).
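
A minimal sketch of this step using NLTK's WordNetLemmatizer (an assumption here, since the post does not name a library), reusing the filtered_tokens produced by the stop-word step:

import nltk
from nltk.stem import WordNetLemmatizer

# nltk.download('wordnet')  # the WordNet data must be downloaded before first use
lemmatizer = WordNetLemmatizer()

# pos tells the lemmatizer which part of speech to assume:
# 'v' = verb, 'n' = noun, 'a' = adjective, 'r' = adverb
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("mice", pos="n"))     # mouse
print(lemmatizer.lemmatize("better", pos="a"))   # good

# Applied to the tokens kept after stop-word removal (nouns assumed by default)
lemmas = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(lemmas)

Note that the pos argument matters: without it the lemmatizer treats every word as a noun, so "running" would be left unchanged.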

 
