Tokenization

Tokenization means splitting a sentence apart into its words, short phrases, and proper nouns:

text = "我爱自然语言处理"
# after tokenization: "我/爱/自然语言/处理"

NLTK (Natural Language Toolkit)

import nltk
nltk.download('punkt')  # download the tokenizer data on first use
from nltk.tokenize import word_tokenize

text = "This is a sample sentence."
tokens = word_tokenize(text)
print(tokens)  # ['This', 'is', 'a', 'sample', 'sentence', '.']

  

spaCy

import spacy

nlp = spacy.load("en_core_web_sm")  # load the small English model
text = "This is a sample sentence."
doc = nlp(text)
tokens = [token.text for token in doc]
print(tokens)  # ['This', 'is', 'a', 'sample', 'sentence', '.']

  

Jieba (for Chinese)

import jieba

text = "我爱自然语言处理"
# Accurate mode: cuts the sentence into the most likely segmentation
tokens = jieba.cut(text, cut_all=False)
print("Accurate mode: " + "/ ".join(tokens))
# Full mode: outputs every word the dictionary can find in the text
tokens = jieba.cut(text, cut_all=True)
print("Full mode: " + "/ ".join(tokens))

  

Dictionary-based tokenization

import xml.etree.ElementTree as ET
import nltk
from nltk.tokenize import word_tokenize

# Download the required data on first use
# nltk.download('punkt')
# nltk.download('stopwords')

tree = ET.parse('./data/MeSH/desc2023.xml')
root = tree.getroot()

# Extract the MeSH terms from the XML file
mesh_terms = set()
for descriptor in root.findall('DescriptorRecord'):
    term = descriptor.find('DescriptorName').find('String').text
    mesh_terms.add(term)

# Sample text
text = "The present invention discloses the construction of magnetically driven tumor antigen capturing system and its application in tumor in-situ vaccine. The magnetically driven tumor antigen capturing system includes magnetized tumor antigen capturing nanometer particle (Fe3O4@Ca/MnCO3/PROTAC) and magnetized DC cell; Wherein, the magnetized tumor antigen capture nanoparticles are used for specifically killing tumor cells, releasing tumor antigens, and constructing magnetized tumor antigens in situ; Under the drive of magnetic field, magnetized DC cells actively take in magnetized tumor antigens, so as to raise the capture and presentation efficiency of antigens and enhance vaccine effect. Therefore, the magnetically driven tumor antigen capturing system of the present invention may be used in constructing in-situ tumor vaccine for tumor Tinea Versicolor immunotherapy."
word_tokens = word_tokenize(text)
# "Tinea Versicolor" is a single term in MeSH, so this serves as a test case

# Stores the final tokenization result
final_tokens = []

# Walk through the tokens, checking whether each starts a MeSH term
i = 0
while i < len(word_tokens):
    matched = False
    # Try the longest possible term first
    for j in range(len(word_tokens), i, -1):
        temp_phrase = " ".join(word_tokens[i:j])
        if temp_phrase in mesh_terms:
            final_tokens.append(temp_phrase)
            i = j - 1  # advance the index past the matched term
            matched = True
            break

    if not matched:
        final_tokens.append(word_tokens[i])

    i += 1
print(final_tokens)
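The longest-match loop can be checked without the MeSH XML file by running the same logic against a small made-up vocabulary (the terms below are invented purely for illustration):

```python
# Longest-match tokenization against a tiny, made-up vocabulary
vocab = {"natural language processing", "machine learning"}

word_tokens = "I love natural language processing and machine learning".split()

final_tokens = []
i = 0
while i < len(word_tokens):
    matched = False
    # Try the longest span starting at position i first
    for j in range(len(word_tokens), i, -1):
        phrase = " ".join(word_tokens[i:j])
        if phrase in vocab:
            final_tokens.append(phrase)
            i = j  # skip past the matched term
            matched = True
            break
    if not matched:
        final_tokens.append(word_tokens[i])
        i += 1

print(final_tokens)
# ['I', 'love', 'natural language processing', 'and', 'machine learning']
```

Because the inner loop scans from the longest span down, "natural language processing" wins over any shorter match starting at the same position.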

  

Stop-word removal

import string
from nltk.corpus import stopwords

# Load the English stop-word list
stop_words = set(stopwords.words('english'))

# Filter out stop words and punctuation
filtered_tokens = [word for word in final_tokens
                   if word.lower() not in stop_words and word not in string.punctuation]
print(filtered_tokens)

  

Lemmatization

  • Verbs: reduce different tenses or voices to the base form. For example, "running", "ran", and "runs" all reduce to "run".
  • Nouns: usually convert plural forms to the singular. For example, "mice" becomes "mouse" and "geese" becomes "goose".
  • Adjectives and adverbs: convert comparative or superlative forms back to the base form.

 

posted on 2024-01-17 16:34 by 黑逍逍