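NLP Study Notes 1

Book used: 《PyTorch自然语言处理入门与实践》 (PyTorch Natural Language Processing: Introduction and Practice)

These notes are only a first introduction for a machine-learning beginner with no hands-on experience; for a deeper treatment, see 《动手学深度学习》 (Dive into Deep Learning).

Natural Language Processing in Practice: online resources (es2q.com)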

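1. Common libraries

numpy: scientific computing

matplotlib: charts and visualization

scikit-learn: data mining and data analysis

nltk: ships with about 50 corpora and common algorithms

spacy: named entity recognition and pretrained word vectors; the model for the target language must be installed first

jieba: Chinese word segmentation

pkuseg: Chinese word segmentation from a Peking University paper

wn: package for loading and using WordNet

pandas: data processing

torchtext: makes it easier to process text with PyTorch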

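2. String processing in Python

1. The str type

An immutable object.

ord() returns the code point of a character

chr() converts a code point back to a character

split turns a string into a list; join turns a list back into a string

Common methods

find: index of the first occurrence (-1 if not found)
rfind: index of the last occurrence
count: number of occurrences
startswith: whether the string starts with a given prefix
endswith: whether the string ends with a given suffix
isdigit: whether all characters are digits
isalpha: whether all characters are letters
isupper: whether all cased characters are uppercase
lstrip: remove the given characters from the start
rstrip: remove the given characters from the end
strip: remove the given characters from both ends
replace: substring replacement
center: center the string within a given width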
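The methods listed above can be tried on a small sample. The following is only an illustrative sketch; the sample string and the variable names are made up for this example.

```python
text = "--hello world--"

core = text.strip("-")            # remove '-' from both ends -> "hello world"
first_o = core.find("o")          # index of the first "o"
last_o = core.rfind("o")          # index of the last "o"
o_count = core.count("o")         # how many times "o" occurs
words = core.split(" ")           # str -> list
rejoined = "_".join(words)        # list -> str
centered = core.center(15, "*")   # pad to width 15 with "*"
code = ord("A")                   # character -> code point
char = chr(code)                  # code point -> character
```

Because str is immutable, every call returns a new string and leaves text unchanged.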

2. The bytes type

>>> byte1 = b"hello"

Converting to and from str

>>> print(str(byte1))
b'hello'
>>> print((byte1.decode()))
hello

A str can be encoded into bytes with encode, which takes the name of the encoding to use.
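A short sketch of the encode/decode round trip, assuming a Chinese sample string chosen for this example. It shows that the same text produces different byte sequences under different encodings, and that decoding must use the encoding that was used to encode.

```python
text = "自然语言"

utf8_bytes = text.encode("utf-8")   # 3 bytes per character for these characters
gbk_bytes = text.encode("gbk")      # 2 bytes per character for these characters

# Decoding with the matching codec restores the original string
restored_utf8 = utf8_bytes.decode("utf-8")
restored_gbk = gbk_bytes.decode("gbk")
```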

3. The StringIO class

A mutable in-memory text buffer; write returns the number of characters written.

>>> import io
>>> sio = io.StringIO()
>>> sio.write('hello')
5
>>> sio.write(' ')
1
>>> sio.write('world')
5
>>> print(sio.getvalue())
hello world
>>> sio.close()

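3. Processing corpora in Python

1. Reading a corpus

Plain text (txt)

f = open('text.txt', encoding='utf8')  # open the file as UTF-8
words = []  # collects the tokens of every line
for l in f:
    word = l.strip().split(' ')  # strip the trailing newline, then split on spaces
    words.append(word)  # one list of tokens per line
f.close()  # close the file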

csv

import csv
f = open('file.csv', encoding='utf8', newline='')  # open the file as UTF-8
reader = csv.reader(f)
lines = []
for l in reader:
    lines.append(l)  # each row is a list of strings
f.close()

json

import json
f = open('file.json', 'r', encoding='utf8')  # open the file as UTF-8 in read mode
data = json.load(f)  # parse the whole JSON file
f.close()

2. Deduplication

Use a set to deduplicate (add inserts an element, in tests membership); for very large data sets use a BitMap or a Bloom filter.
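Both ideas can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production implementation: the function and class names, the bit-array size and the number of hash functions are all invented here, and SHA-256 stands in for whatever hash family a real Bloom filter library would use.

```python
import hashlib

def dedup_keep_order(items):
    """Deduplicate with a set while keeping first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

class BloomFilter:
    """Fixed-size bit array: no false negatives, but false positives are possible."""
    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting the item with a seed
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

A set stores every element, so its memory grows with the data. The Bloom filter keeps a constant-size bit array at the cost of occasionally reporting an unseen item as seen.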

3. Stop words

Stop-word lists can be found on GitHub by searching for stopwords.
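Once a stop-word list is available, filtering is a membership test against a set. In the sketch below the three stop words and the token list are hypothetical placeholders; a real list would be loaded from one of the files found on GitHub.

```python
# Hypothetical tiny stop-word list; real lists contain hundreds of entries
stopwords = {"的", "了", "是"}

tokens = ["今天", "的", "天气", "是", "晴天"]

# Keep only the tokens that are not stop words
filtered = [tok for tok in tokens if tok not in stopwords]
```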

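4. Edit distance

Measures the difference between two strings. Three operations are defined: inserting a character, deleting a character, and substituting a character. The edit distance is the minimum number of such operations needed to turn one string into the other, and it can be computed with dynamic programming (DP).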

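def minDistance(word1: str, word2: str) -> int:
    n = len(word1)
    m = len(word2)
    # dp[i][j] is the edit distance between word1[:i] and word2[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        dp[0][j] = j  # build word2[:j] from the empty string with j insertions
    for i in range(n + 1):
        dp[i][0] = i  # reduce word1[:i] to the empty string with i deletions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if word1[i - 1] == word2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                # cheapest of insertion, deletion, substitution
                dp[i][j] = min(dp[i][j - 1], dp[i - 1][j], dp[i - 1][j - 1]) + 1
    return dp[-1][-1]  # bottom-right cell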
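Since each row of the DP table depends only on the previous row, the same recurrence can be computed with two rows instead of the full table. The variant below is a sketch of that idea; the function name edit_distance is chosen here and is not from the book.

```python
def edit_distance(a: str, b: str) -> int:
    # prev[j] holds the distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(curr[j - 1] + 1,      # insertion
                            prev[j] + 1,          # deletion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]
```

For example, edit_distance("kitten", "sitting") is 3: substitute k with s, substitute e with i, and insert g.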

5. Text normalization

6. Word segmentation

7. TF-IDF (term frequency-inverse document frequency)
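The notes leave this heading empty, so the following is only a minimal sketch of one common TF-IDF variant, assuming tf = count / document length and idf = log(N / df) without smoothing. Libraries such as scikit-learn use slightly different formulas; the function name and the sample documents are invented for this example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

docs = [["nlp", "is", "fun"], ["nlp", "is", "hard"]]
result = tf_idf(docs)
```

Terms that occur in every document ("nlp", "is") get a score of 0, while terms unique to one document ("fun", "hard") get a positive score.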

8. One-hot encoding

4. Installing PyTorch & Transformers

PyTorch

PyTorch Chinese translation by 布客 (apachecn.org)

NVIDIA GPU

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

AMD GPU

Unfortunately, Windows is not supported:

ROCm is not available on Windows

CPU

pip3 install torch torchvision torchaudio

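Check (torch.__version__ is an attribute and torch.cuda.is_available must be called; without the parentheses the REPL only shows the function object)

>>> import torch
>>> torch.__version__  # version string of the installed build
>>> torch.cuda.is_available()  # True only when a CUDA build and a usable GPU are present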

Transformers

pip install transformers

Check

>>> from transformers import pipeline
>>> print(pipeline('sentiment-analysis')('I love you'))
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
[{'label': 'POSITIVE', 'score': 0.9998656511306763}]

If an error like the following is raised, the cause is probably the network connection to huggingface.co

requests.exceptions.SSLError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /distilbert/distilbert-base-uncased-finetuned-sst-2-english/resolve/main/config.json (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)')))"), '(Request ID: cf626477-ad07-40c9-b4ce-dcf8371fe213)')

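5. Basic PyTorch usage

The basic data type is the tensor.

A tensor is a multi-dimensional array that can be defined and operated on on the GPU. (For background on GPUs, see 龚大's video series 上帝视角看GPU（1）：图形流水线基础.)

Running on the GPU lets computations exploit GPGPU parallelism for fast arithmetic.

1. Creating tensors

From a list or a numpy.array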

>>> import torch
>>> t = torch.tensor([[1,2,3],[4,5,6]],dtype=torch.float32)
>>> print(t,t.shape,t.dtype)
tensor([[1., 2., 3.],
        [4., 5., 6.]]) torch.Size([2, 3]) torch.float32

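Creating all-zero, all-one, or random tensors

>>> import torch
>>> rand_t = torch.rand((3,3)) # uniform on [0, 1); see also randint (integers in a range), randn (standard normal), normal (Gaussian)
>>> ones_t = torch.ones((2,2)) # arange(x) creates the 1-D tensor 0..x-1
>>> zeros_t = torch.zeros((1,8))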
>>> print(rand_t)
tensor([[0.5173, 0.6960, 0.7608],
        [0.6487, 0.5882, 0.0938],
        [0.7563, 0.0548, 0.2958]])
>>> print(ones_t)
tensor([[1., 1.],
        [1., 1.]])
>>> print(zeros_t)
tensor([[0., 0., 0., 0., 0., 0., 0., 0.]])

Tensors filled with a constant

>>> import torch
>>> t = torch.full((4,5),9)
>>> print(t)
tensor([[9, 9, 9, 9, 9],
        [9, 9, 9, 9, 9],
        [9, 9, 9, 9, 9],
        [9, 9, 9, 9, 9]])

2. Transforming tensors

Concatenating (cat) and stacking (stack)

>>> import torch
>>> t1 = torch.tensor([1,2,3]) # 1-D tensor, so its only dim is 0
>>> t2 = torch.tensor([4,5,6])
>>> t3 = torch.cat([t1,t2]) # dim defaults to 0
>>> print(t3)
tensor([1, 2, 3, 4, 5, 6])
>>> t4 = torch.tensor([[1,2,3],[4,5,6]])
>>> t5 = torch.tensor([[4,5,6],[7,8,9]])
>>> t6 = torch.cat([t4,t5])
>>> t7 = torch.cat([t4,t5],dim = 1) # concatenate along dim 1 (columns)
>>> print(t6)
tensor([[1, 2, 3],
        [4, 5, 6],
        [4, 5, 6],
        [7, 8, 9]])
>>> print(t7)
tensor([[1, 2, 3, 4, 5, 6],
        [4, 5, 6, 7, 8, 9]])
>>> t8 = torch.stack([t1,t2])
>>> print(t8)
tensor([[1, 2, 3],
        [4, 5, 6]])

Splitting (chunk/split)

>>> import torch
>>> t1 = torch.tensor([1,2,3,4,5])
>>> print(torch.chunk(t1,1))
(tensor([1, 2, 3, 4, 5]),)
>>> print(torch.chunk(t1,2))
(tensor([1, 2, 3]), tensor([4, 5]))
>>> print(torch.chunk(t1,3))
(tensor([1, 2]), tensor([3, 4]), tensor([5]))
>>> print(torch.chunk(t1,4))
(tensor([1, 2]), tensor([3, 4]), tensor([5]))
>>> print(torch.chunk(t1,5))
(tensor([1]), tensor([2]), tensor([3]), tensor([4]), tensor([5]))
>>> t2 = torch.tensor([[1,2,3],[4,5,6],[7,8,9]])
>>> print(torch.split(t2,2,0))
(tensor([[1, 2, 3],
        [4, 5, 6]]), tensor([[7, 8, 9]]))
>>> print(torch.split(t2,2,1))
(tensor([[1, 2],
        [4, 5],
        [7, 8]]), tensor([[3],
        [6],
        [9]]))

Changing the shape (reshape)

>>> import torch
>>> t = torch.tensor([1,2,3,4,5,6])
>>> print(torch.reshape(t,(2,3)))
tensor([[1, 2, 3],
        [4, 5, 6]])

Swapping dimensions (transpose)

>>> import torch
>>> t = torch.tensor([[1,2,3],[4,5,6]])
>>> print(t)
tensor([[1, 2, 3],
        [4, 5, 6]])
>>> print(torch.transpose(t,0,1))
tensor([[1, 4],
        [2, 5],
        [3, 6]])

Inserting/removing dimensions (unsqueeze/squeeze)

>>> import torch
>>> t = torch.tensor([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
>>> print(t,t.shape)
tensor([[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9],
        [10, 11, 12]]) torch.Size([4, 3])
>>> t1 = torch.unsqueeze(t,0)
>>> print(t1,t1.shape)
tensor([[[ 1,  2,  3],
         [ 4,  5,  6],
         [ 7,  8,  9],
         [10, 11, 12]]]) torch.Size([1, 4, 3])
>>> t2 = torch.unsqueeze(t,1)
>>> print(t2,t2.shape)
tensor([[[ 1,  2,  3]],

        [[ 4,  5,  6]],

        [[ 7,  8,  9]],

        [[10, 11, 12]]]) torch.Size([4, 1, 3])
>>> t3 = t2.squeeze()
>>> print(t3,t3.shape)
tensor([[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9],
        [10, 11, 12]]) torch.Size([4, 3])

Expanding dimensions (expand)

>>> import torch
>>> t = torch.tensor([[1,2,3],[4,5,6]])
>>> print(t,t.shape)
tensor([[1, 2, 3],
        [4, 5, 6]]) torch.Size([2, 3])
>>> s = t.expand(1,2,2,3) # the arguments are the target shape
>>> print(s,s.shape)
tensor([[[[1, 2, 3],
          [4, 5, 6]],

         [[1, 2, 3],
          [4, 5, 6]]]]) torch.Size([1, 2, 2, 3])

Repeating (repeat)

>>> import torch
>>> t = torch.tensor([[1,2,3],[4,5,6]])
>>> print(t,t.shape)
tensor([[1, 2, 3],
        [4, 5, 6]]) torch.Size([2, 3])
>>> s = t.repeat(1,1,2,2) # the arguments are the number of copies along each dimension
>>> print(s,s.shape)
tensor([[[[1, 2, 3, 1, 2, 3],
          [4, 5, 6, 4, 5, 6],
          [1, 2, 3, 1, 2, 3],
          [4, 5, 6, 4, 5, 6]]]]) torch.Size([1, 1, 4, 6])

3. Indexing tensors

item (extracts the Python number from a one-element tensor)

import torch
t = torch.tensor([[1,2,3],[4,5,6]])
print(t[1])
print(t[1][2])
print(t[1][2].item())

Output

tensor([4, 5, 6])
tensor(6)
6

[:,1]

import torch
t = torch.tensor([[1,2,3],[4,5,6],[9,8,7]])
print(t)
print(t[:,1])
t[:,0] = t[:,2]
print(t)

Output

tensor([[1, 2, 3],
        [4, 5, 6],
        [9, 8, 7]])
tensor([2, 5, 8])
tensor([[3, 2, 3],
        [6, 5, 6],
        [7, 8, 7]])

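4. Tensor arithmetic

add, sub, mul, div correspond to the operators + - * /

import torch
t1 = torch.tensor([[1,2,3],[4,5,6]])
t2 = torch.tensor([[1,2,3],[4,5,6]])
print(t1+t2)
print(torch.add(t1,t2))
print(t1+1)
print(t1+torch.tensor([1,2,3])) # shapes differ, so the smaller tensor is broadcast automatically

Output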

tensor([[ 2,  4,  6],
        [ 8, 10, 12]])
tensor([[ 2,  4,  6],
        [ 8, 10, 12]])
tensor([[2, 3, 4],
        [5, 6, 7]])
tensor([[2, 4, 6],
        [5, 7, 9]])
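The broadcasting in the last line follows a shape rule: the two shapes are aligned from the trailing dimension, and each pair of sizes must be equal or one of them must be 1. The pure-Python helper below sketches that rule on shape tuples only; broadcast_shape is a name invented for this example and is not a PyTorch function.

```python
def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError if incompatible."""
    result = []
    # Compare sizes from the last dimension backwards; missing dims count as 1
    for k in range(1, max(len(shape_a), len(shape_b)) + 1):
        a = shape_a[-k] if k <= len(shape_a) else 1
        b = shape_b[-k] if k <= len(shape_b) else 1
        if a != b and a != 1 and b != 1:
            raise ValueError(f"shapes {shape_a} and {shape_b} cannot be broadcast")
        result.append(b if a == 1 else a)
    return tuple(reversed(result))
```

Here broadcast_shape((2, 3), (3,)) gives (2, 3), which matches the example above where a length-3 vector is added to every row of a 2x3 tensor.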

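5. Neural networks in torch (nn)

Module

The base class for neural networks; subclass it to implement your own network in PyTorch.

Module (PyTorch 2.4 documentation)

RNN

Recurrent neural network

Parameter	Description
input_size	number of features in each input element
hidden_size	size of the hidden state
num_layers	number of recurrent layers
nonlinearity	nonlinearity to use, 'tanh' or 'relu'; default 'tanh'
bias	whether to use bias weights; default True
batch_first	by default batch is the second dimension; set to True to make batch the first dimension
dropout	if non-zero, adds a dropout layer on the output of every layer except the last
bidirectional	if True, becomes a bidirectional RNN; default False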

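LSTM

Long short-term memory network; its parameters are similar to those of RNN

GRU

Gated recurrent unit; its parameters are similar to those of LSTM

Transformer

Parameter	Description
d_model	feature dimension in the encoder/decoder; default 512
nhead	number of heads in multi-head attention; default 8
num_encoder_layers	number of encoder layers; default 6
num_decoder_layers	number of decoder layers; default 6
dim_feedforward	dimension of the feed-forward network; default 2048
dropout	dropout probability; default 0.1
custom_encoder	custom encoder; default None
custom_decoder	custom decoder; default None

Linear

Linear layer

\(y=xA^{T}+b\)

Bilinear

Bilinear layer

\(y=x_1^{T}Ax_2+b\)

Dropout

Randomly sets elements to 0 with probability p (default 0.5)

inplace: perform the operation in place; default False

Embedding

Maps IDs to vectors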

6. Activation functions

Nonlinear functions, differentiable almost everywhere, that add nonlinearity to the network.

Sigmoid

\[f(x) = \frac{1}{1+e^{-x}} \]

PyTorch provides torch.nn.Sigmoid, torch.nn.functional.sigmoid, torch.sigmoid and torch.Tensor.sigmoid

Tanh

\[f(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}} \]

PyTorch provides torch.nn.Tanh, torch.nn.functional.tanh, torch.tanh and torch.Tensor.tanh

ReLU

\[f(x)=\max(x,0) \]

PyTorch provides torch.nn.ReLU and torch.nn.functional.relu

Softmax

\[\mathrm{Softmax}(x_i)=\frac{e^{x_i}}{\sum_j{e^{x_j}}} \]

PyTorch provides torch.nn.Softmax and torch.nn.functional.softmax

Softmin

\[\mathrm{Softmin}(x_i)=\frac{e^{-x_i}}{\sum_j{e^{-x_j}}} \]

PyTorch provides torch.nn.Softmin

LogSoftmax

\[\mathrm{LogSoftmax}(x_i)=\log\left(\frac{e^{x_i}}{\sum_j{e^{x_j}}}\right) \]

PyTorch provides torch.nn.LogSoftmax and torch.nn.functional.log_softmax
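The three softmax-family formulas above can be written directly in plain Python. This is a sketch for illustration, not how PyTorch implements them. Subtracting the maximum before exponentiating is a standard trick that leaves the result unchanged but avoids overflow; the function names mirror the formulas and are chosen here.

```python
import math

def softmax(xs):
    m = max(xs)                              # shift by the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmin(xs):
    # Softmin(x) is Softmax(-x)
    return softmax([-x for x in xs])

def log_softmax(xs):
    m = max(xs)
    log_total = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_total for x in xs]
```

The outputs of softmax sum to 1, and log_softmax(xs) equals the element-wise logarithm of softmax(xs) up to rounding.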

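7. Loss functions

Measure the gap between the model output and the ground truth.

0-1 loss

\[L(y,\hat y)=\begin{cases} 0, & y=\hat y \\ 1, & y \neq \hat y \end{cases} \]

Squared loss

\[L(y,\hat y)=(y- \hat y)^2 \]

PyTorch provides torch.nn.MSELoss for the squared loss

Absolute loss

\[L(y,\hat y)=|y-\hat y| \]

PyTorch provides torch.nn.L1Loss for the absolute loss

Log loss

Also called the log-likelihood loss; used for classification. The ground truth is a class and the model outputs probabilities.

\[L(y,P(y|x))=-\log P(y|x) \]

PyTorch provides torch.nn.NLLLoss and torch.nn.CrossEntropyLoss for the log loss. NLLLoss expects log-probabilities (for example the output of LogSoftmax), while CrossEntropyLoss expects raw, unnormalized scores.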
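A plain-Python sketch of these losses for a single example, to make the formulas concrete. The helper names are chosen here; the mean reduction in mse_loss and l1_loss is an assumption that mirrors the default of the PyTorch classes.

```python
import math

def mse_loss(pred, target):
    # Mean of squared differences
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def l1_loss(pred, target):
    # Mean of absolute differences
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def nll_loss(log_probs, target_index):
    # Negative log-probability assigned to the true class
    return -log_probs[target_index]

def cross_entropy(logits, target_index):
    # Log-softmax followed by negative log-likelihood, computed stably
    m = max(logits)
    log_total = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_total - logits[target_index]
```

With two equal logits the model assigns probability 0.5 to each class, so cross_entropy([0.0, 0.0], 0) equals log 2.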

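8. Optimizers

Update the network weights according to the loss, using its gradients.

SGD optimizer

SGD stands for Stochastic Gradient Descent. Each step uses one mini-batch rather than all samples, and updates the model parameters by gradient descent. PyTorch provides the torch.optim.SGD class; its main parameters are

Parameter	Meaning
params	model parameters
lr	learning rate; a required argument in older PyTorch releases, default 0.001 in recent ones
momentum	momentum factor, used to speed up training; default 0
weight_decay	weight decay (L2 penalty); default 0
dampening	dampening applied to the momentum term; default 0
nesterov	enables Nesterov momentum; default False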
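How lr, momentum and dampening interact can be seen in a scalar sketch of the update rule. The helper below is illustrative only and not PyTorch's implementation: it omits weight_decay and nesterov, and it starts the momentum buffer at 0, which gives the same result as PyTorch's first step only when dampening is 0.

```python
def sgd_momentum_step(w, grad, buf, lr=0.1, momentum=0.9, dampening=0.0):
    # The buffer accumulates an exponentially decaying sum of past gradients
    buf = momentum * buf + (1 - dampening) * grad
    # The parameter moves against the buffered gradient
    w = w - lr * buf
    return w, buf

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, buf = 0.0, 0.0
for _ in range(500):
    grad = 2 * (w - 3.0)
    w, buf = sgd_momentum_step(w, grad, buf)
```

After the loop w is very close to the minimizer 3.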

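Adam optimizer

Adam stands for Adaptive Moment Estimation. It combines the advantages of the Momentum and RMSprop optimizers and computes an adaptive learning rate for each parameter. Adam is widely used in deep learning; PyTorch provides the torch.optim.Adam class

Parameter	Meaning
params	model parameters
lr	learning rate; default 0.001
betas	coefficients for the running averages of the gradient (first moment) and its square (second moment); default (0.9, 0.999)
eps	term added to the denominator for numerical stability; default 1e-8
weight_decay	weight decay; default 0
amsgrad	whether to use the AMSGrad variant; default False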
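The roles of betas and eps are easier to see in a scalar version of one Adam step. This is a simplified sketch without weight_decay or amsgrad; adam_step is a name chosen for this example, and t is the 1-based step count used for bias correction.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, betas=(0.9, 0.999), eps=1e-8):
    beta1, beta2 = betas
    m = beta1 * m + (1 - beta1) * grad           # first-moment running average
    v = beta2 * v + (1 - beta2) * grad * grad    # second-moment running average
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# First step from param = 1.0 with gradient 2.0
p, m, v = adam_step(1.0, 2.0, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by about lr regardless of the gradient's magnitude.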

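AdamW optimizer

AdamW is a variant of Adam that applies weight decay directly to the weights (decoupled weight decay) instead of adding it to the gradient as an L2 penalty. On many tasks AdamW performs better than the original Adam. PyTorch provides the torch.optim.AdamW class; its parameters are the same as Adam's, except that weight_decay defaults to 0.01.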

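9. Data loading

GPU memory is limited, so the training data usually cannot be loaded all at once. PyTorch provides the Dataset class to hold data and the DataLoader class to load it.

Dataset

torch.utils.data.Dataset is the base class. To use it, define your own data class that inherits from it and implement __getitem__ and __len__, which return the item at a given index and the size of the data set, respectively.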

import torch

class MyDataSet(torch.utils.data.Dataset):
    def __init__(self, examples):
        self.examples = examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, index):
        example = self.examples[index]
        s1 = example[0]  # assumed to be the first sentence of this example
        s2 = example[1]  # assumed to be the second sentence of this example
        l1 = len(s1)
        l2 = len(s2)
        return s1, l1, s2, l2, index

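DataLoader

torch.utils.data.DataLoader helps load the data. It is typically used to convert raw data into tensors and can process and load data with multiple worker processes. Its main parameters are

Parameter	Meaning
dataset	the data set; any subclass of `torch.utils.data.Dataset`
batch_size	number of samples per batch; default 1
shuffle	whether to reshuffle the data at the start of every epoch; default False
sampler	defines the sampling strategy; if specified, shuffle must be False
batch_sampler	like sampler, but returns a batch of indices at a time instead of a single index; mutually exclusive with batch_size, shuffle, sampler and drop_last
num_workers	number of subprocesses used for loading; 0 means the data is loaded in the main process; default 0
collate_fn	function that merges a list of samples into one batch; default None
pin_memory	if True, the loader copies tensors into CUDA pinned memory (if available) before returning them; default False
drop_last	if the data set size is not divisible by batch_size, True drops the last incomplete batch; default False
timeout	if positive, the time limit for collecting a batch from the workers; default 0
worker_init_fn	initialization function for each worker subprocess; default None

Usually several items from the Dataset have to be combined into one batch and converted into tensors; this is the job of collate_fn. A typical collate_fn for a DataLoader looks like this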

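def the_collate_fn(batch):
    # Each item of batch is (s1, l1, s2, l2, index) as returned by MyDataSet.__getitem__.
    # en2id and zh2id (token -> id dicts) and pad_id are assumed to be defined elsewhere.
    batch_size = len(batch)  # the last batch may be smaller than the configured size
    src = [[0] * batch_size]  # start marker
    tar = [[0] * batch_size]
    # Longest first sentence (source) in the batch
    src_max_l = 0
    for b in batch:
        src_max_l = max(src_max_l, b[1])
    # Longest second sentence (target) in the batch
    tar_max_l = 0
    for b in batch:
        tar_max_l = max(tar_max_l, b[3])
    for i in range(src_max_l):
        l = []
        for x in batch:
            if i < x[1]:
                l.append(en2id[x[0][i]])
            else:
                # this sentence has already ended: fill with the padding id
                l.append(pad_id)
        src.append(l)
    for i in range(tar_max_l):
        l = []
        for x in batch:
            if i < x[3]:
                l.append(zh2id[x[2][i]])  # x[2] is the target sentence
            else:
                # this sentence has already ended: fill with the padding id
                l.append(pad_id)
        tar.append(l)
    indexs = [b[4] for b in batch]
    src.append([1] * batch_size)  # end marker
    tar.append([1] * batch_size)
    s1 = torch.LongTensor(src)  # shape (src_max_l + 2, batch_size)
    s2 = torch.LongTensor(tar)  # shape (tar_max_l + 2, batch_size)
    return s1, s2, indexs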

Complete code that uses Dataset and DataLoader together

train_dataset = MyDataSet(train_set)
train_data_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size = batch_size,
    shuffle = True, # whether to shuffle the order
    num_workers = data_workers, # number of worker processes
    collate_fn = the_collate_fn,
)

dev_dataset = MyDataSet(dev_set)
dev_data_loader = torch.utils.data.DataLoader(
    dev_dataset,
    batch_size = batch_size,
    shuffle = True, # whether to shuffle the order
    num_workers = data_workers, # number of worker processes
    collate_fn = the_collate_fn,
)

10. TorchText

Data class

Dataset class

Vocab

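6. A first character-level RNN: classifying posts

The post URL used in the original book no longer works, so I downloaded the titles of my own blog posts instead; the data can be downloaded from data.7z - 蓝奏云 (lanzn.com).

The posts are divided into two classes, "problems" and "others".

1. A simple crawler

Use the requests library to send HTTPS requests, then parse the HTML content with BeautifulSoup.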

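import requests
import time
from bs4 import BeautifulSoup
from tqdm import tqdm  # progress bar

# n (upper bound of the page range) and url must be supplied by the reader
for pid in tqdm(range(1, n)):
    r = requests.get(url)

    with open('target.html', 'wb') as f:
        f.write(r.content)
    b = BeautifulSoup(r.text, 'html.parser')
    table = b.find('table')  # find the <table> tag
    # from here on, parse the table as needed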

2. Reading the data from files

problems = []
others = []
with open('problems.txt',encoding='utf8') as f:
    for l in f:
        problems.append(l.strip()) # strip removes surrounding whitespace, including the newline
with open('others.txt',encoding='utf8') as f:
    for l in f:
        others.append(l.strip())

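3. Input and output

A character-level RNN is used here, so no word segmentation is needed; each character is represented either as a one-hot vector or as an embedding.

One-Hot

In digital circuits and machine learning, a one-hot is a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0). A similar implementation in which all bits are '1' except one '0' is sometimes called one-cold. In statistics, dummy variables represent a similar technique for representing categorical data.

Basic steps

Determine the number of categories: first, count how many distinct categories the data set contains.

Create a binary vector: for each category, create a vector whose length equals the total number of categories. In this vector the position that corresponds to the category is 1 and every other position is 0.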
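The two steps above can be sketched with plain lists. The helper name one_hot and the category list are placeholders chosen for this example.

```python
def one_hot(label, categories):
    """Return a list with a single 1 at the position of label in categories."""
    vector = [0] * len(categories)       # step 1: length = number of categories
    vector[categories.index(label)] = 1  # step 2: set the matching position to 1
    return vector

categories = ["problems", "others"]
encoded = [one_hot(label, categories) for label in ["others", "problems", "others"]]
```

Every encoded vector has length len(categories) and sums to 1.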

Word embedding

In natural language processing (NLP), a word embedding is a representation of a word. The embedding is used in text analysis. Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that the words that are closer in the vector space are expected to be similar in meaning. Word embeddings can be obtained using language modeling and feature learning techniques, where words or phrases from the vocabulary are mapped to vectors of real numbers.

Methods to generate this mapping include neural networks, dimensionality reduction on the word co-occurrence matrix, probabilistic models, explainable knowledge base method, and explicit representation in terms of the context in which words appear.

Word and phrase embeddings, when used as the underlying input representation, have been shown to boost the performance in NLP tasks such as syntactic parsing and sentiment analysis.

For a vivid explanation of word embeddings, see the video 【官方双语】GPT是什么？直观解释Transformer | 深度学习第5章.

1. Counting the distinct characters in the data set

char_set = set() # a set removes duplicate elements automatically
for problem in problems:
    for ch in problem:
        char_set.add(ch)
for other in others:
    for ch in other:
        char_set.add(ch)
print(len(char_set))

2. Using one-hot encoding

import torch
char_list = list(char_set)
n_chars = len(char_list) + 1 # one extra slot, UNK, for unknown characters

def title_to_tensor(title):
    tensor = torch.zeros(len(title),1,n_chars)
    for li,ch in enumerate(title): # enumerate yields the index and the character
        try:
            ind = char_list.index(ch)
        except ValueError:
            ind = n_chars - 1
        tensor[li][0][ind] = 1
    return tensor

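3. Using an embedding representation

import torch
char_list = list(char_set)
n_chars = len(char_list) + 1 # one extra slot, UNK, for unknown characters

def title_to_tensor(title):
    tensor = torch.zeros(len(title),dtype = torch.long)
    for li,ch in enumerate(title): # enumerate yields the index and the character
        try:
            ind = char_list.index(ch)
        except ValueError:
            ind = n_chars - 1 # unknown character maps to UNK
        tensor[li] = ind
    return tensor
embedding = torch.nn.Embedding(n_chars,100) # the embedding dimension is usually much smaller than the vocabulary size
# in practice the embedding should be defined inside the model so that its parameters are updated during training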

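4. Output

The goal is to decide whether a blog title belongs to "算法题" (algorithm problems) or "其它" (others), with 0 for the former and 1 for the latter. One option is to output a single value and split at a threshold such as 0.5. The other is to output two values, the first being the probability of the former class and the second that of the latter, and pick the larger one. The tensor method topk returns the largest element of a tensor together with its index.

t = torch.tensor([0.3,0.7])
topn,topi = t.topk(1)
print(topn,topi)

Output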

tensor([0.7000]) tensor([1])

4. Using an RNN

1. Defining the model

class RNN(torch.nn.Module):
    def __init__(self,word_count,embedding_size,hidden_size,output_size):
        super(RNN,self).__init__() # call the parent constructor to initialize the module
        self.hidden_size = hidden_size # size of the hidden state
        self.embedding = torch.nn.Embedding(word_count,embedding_size) # embedding table
        self.i2h = torch.nn.Linear(embedding_size + hidden_size,hidden_size) # input to hidden
        self.i2o = torch.nn.Linear(embedding_size + hidden_size,output_size) # input to output
        self.softmax = torch.nn.LogSoftmax(dim=1) # LogSoftmax layer
    
    def forward(self,input_tensor,hidden): # runs automatically when the model is called
        word_vector = self.embedding(input_tensor) # look up the embedding of the character id
        combined = torch.cat((word_vector,hidden),1)  # concatenate embedding and hidden state
        hidden = self.i2h(combined) # new hidden state
        output = self.i2o(combined) # raw output scores
        output = self.softmax(output) # convert to log-probabilities
        return output,hidden
        
    def initHidden(self): # initialize the hidden state to all zeros
        return torch.zeros(1,self.hidden_size)

The parameters are the vocabulary size (word_count), the embedding dimension (embedding_size), the hidden dimension (hidden_size) and the output dimension (output_size).

2. Preprocessing the data

Merge the data and attach labels

all_data = [] # new list that holds all examples
categories = ['算法题','其它'] # 'algorithm problems', 'others'
for l in problems:
    all_data.append((title_to_tensor(l),torch.tensor([0],dtype=torch.long))) # label 0
for l in others:
    all_data.append((title_to_tensor(l),torch.tensor([1],dtype=torch.long)))  # label 1

Split the data into a training set and a test set

import random
random.shuffle(all_data) # shuffle the list
data_len = len(all_data)
split_ratio = 0.7 # fraction used for training
train_data = all_data[:int(data_len*split_ratio)]
test_data = all_data[int(data_len*split_ratio):]
print('Train data size:',len(train_data))
print('Test data size:',len(test_data))

Output

Train data size: 74
Test data size: 32

3. Training and evaluation

def run_rnn(rnn,input_tensor):
    hidden = rnn.initHidden()
    for i in range(input_tensor.size()[0]):
        # feed the i-th character; unsqueeze adds the batch dimension
        output,hidden = rnn(input_tensor[i].unsqueeze(dim=0),hidden)
    return output

def train(rnn,criterion,input_tensor,category_tensor):
    rnn.zero_grad() # reset the gradients
    output = run_rnn(rnn,input_tensor) # run the model to get the output
    loss =criterion(output,category_tensor) # compute the loss
    loss.backward() # backpropagate

    # update the parameters from their gradients (plain SGD; learning_rate is a global)
    for p in rnn.parameters():
        p.data.add_(p.grad.data,alpha = -learning_rate)
    return output, loss.item()

def evaluate(rnn,input_tensor):
    with torch.no_grad(): # no gradients are needed for evaluation
        output = run_rnn(rnn,input_tensor)
        return output

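4. Start training

from tqdm import tqdm
# rnn must be created before this point, e.g. rnn = RNN(n_chars, embedding_size, hidden_size, len(categories));
# the sizes used for the results below are not stated in this post
epoch = 1 # number of training epochs
learning_rate = 0.005 # learning rate
criterion = torch.nn.NLLLoss() # loss function; matches the LogSoftmax output
loss_sum = 0 # running sum of the loss
all_losses = [] # loss history, used later to plot the loss curve
plot_every = 10 # record the average loss every 10 samples
for e in range(epoch):
    for ind,(title_tensor,label) in enumerate(tqdm(train_data)):
        output,loss = train(rnn,criterion,title_tensor,label)
        loss_sum += loss
        if ind % plot_every == 0:
            all_losses.append(loss_sum/plot_every)
            loss_sum = 0
    c = 0 # number of correct predictions on the test set
    for title,category in tqdm(test_data):
        output = evaluate(rnn,title)
        topn,topi = output.topk(1)
        if topi.item() == category[0].item():
            c += 1
    print('accuracy',c / len(test_data))

Output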

100%|█████████████████████████████████████████████████████████████████████████████████| 74/74 [00:00<00:00, 284.91it/s]
100%|█████████████████████████████████████████████████████████████████████████████████| 32/32 [00:00<00:00, 850.39it/s]
accuracy 0.625

There is too little data, so the result is not great. Output after raising the training-set ratio to 0.8:

100%|█████████████████████████████████████████████████████████████████████████████████| 84/84 [00:00<00:00, 317.70it/s]
100%|█████████████████████████████████████████████████████████████████████████████████| 22/22 [00:00<00:00, 785.69it/s]
accuracy 0.9090909090909091

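5. Plotting the loss curve during training

import matplotlib.pyplot as plt

plt.figure(figsize = (10,7))
plt.ylabel('Average Loss')
plt.plot(all_losses[1:]) # skip the first entry, which is computed from a single sample
plt.show()

6. Saving and loading the model

# save the model
torch.save(rnn,'rnn_model.pkl')
# load the model
rnn = torch.load('rnn_model.pkl')

Loading may print a warning because the weights_only argument is not set to True: unpickling can execute arbitrary code contained in the pickle data, which is a security risk.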

7. Using the network

def get_category(title):
    title = title_to_tensor(title)
    output = evaluate(rnn,title) # reuse the evaluation function
    topn,topi = output.topk(1)
    #print(categories[topi.item()])
    return categories[topi.item()]

def print_test(title):
    print('%s\t%s'% (title,get_category(title)))
print_test('题解')
print_test('训练赛')
print_test('图形学')
print_test('opengl')
print_test('动态规划')
print_test('Codeforces1234')
print_test('挑战赛')
print_test('炉石传说')
print_test('Markdown')
print_test('黑神话悟空')
print_test('你这猴子真令我欢喜')
print_test('中国石油大学训练赛')

Output (算法题 = algorithm problems, 其它 = others)

题解	算法题
训练赛	算法题
图形学	其它
opengl	其它
动态规划	算法题
Codeforces1234	算法题
挑战赛	算法题
炉石传说	算法题
Markdown	其它
黑神话悟空	算法题
你这猴子真令我欢喜	算法题
中国石油大学训练赛	其它

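Many of the predictions for unseen titles are clearly wrong. The data set may simply be too small, or a plain RNN may not be a good fit for this problem.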

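7. Word segmentation

1. Classical methods

1. Dictionary-based matching

1. Forward maximum matching

Scan from the start and always take the longest dictionary match; for example, if '我们' can be matched, '我' alone is not produced.

2. Backward maximum matching

Scan from the end and always take the longest dictionary match; for example, '品如果汁' may be segmented as '品如/果汁' rather than '品/如果/汁'.

3. Bidirectional matching

Combines the two methods above: run both and choose between the results with a custom rule, for example the one with fewer segments.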
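The three matching strategies can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary is a small hypothetical set, unmatched single characters are emitted as their own tokens, and the tie-breaking rule (fewer tokens, then fewer single-character tokens, otherwise the backward result) is one common heuristic rather than the only possible one.

```python
def forward_max_match(text, vocab, max_len):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

def backward_max_match(text, vocab, max_len):
    tokens, j = [], len(text)
    while j > 0:
        # Same idea, but candidates end at position j and grow to the left
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    return tokens[::-1]

def bidirectional_match(text, vocab, max_len):
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    # Tie on length: prefer the result with fewer single-character tokens
    singles_fwd = sum(1 for tok in fwd if len(tok) == 1)
    singles_bwd = sum(1 for tok in bwd if len(tok) == 1)
    return fwd if singles_fwd < singles_bwd else bwd

vocab = {"研究", "研究生", "生命", "的", "起源"}  # hypothetical dictionary
```

With this dictionary, forward matching splits '研究生命的起源' into 研究生/命/的/起源, while backward matching gives 研究/生命/的/起源; the bidirectional rule picks the latter because it has fewer single-character tokens.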

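2. Frequency-based segmentation

Compare how frequently the words of each candidate segmentation occur in a given corpus and choose the segmentation with the highest probability.

3. Machine-learning-based segmentation

Train a segmentation model on annotated corpora; the model decides for each character whether it starts a new word. For example, jieba's optional paddle mode uses a bidirectional GRU model, while its default mode relies on a prefix dictionary plus an HMM for unknown words.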

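2. Third-party segmentation tools

1. S-MSRSeg

A Chinese word segmentation tool released by Microsoft Research Asia in 2004; it may be too dated by now.

Download S-MSRSeg from Official Microsoft Download Center

2. ICTCLAS

A Chinese word segmentation system developed by the Chinese Academy of Sciences

NLPIR-team/NLPIR (github.com)

3. jieba

Can be installed directly with pip. It supports several modes, including accurate mode, full mode and search-engine mode, and custom dictionaries can be added.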

import jieba
print('/'.join(list(jieba.cut('只要心中还有放不下的偶像，终有一天，它将化为修行路上的无解业障'))))

Output

Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\test\AppData\Local\Temp\jieba.cache
Loading model cost 0.778 seconds.
Prefix dict has been built successfully.
只要/心中/还有/放不下/的/偶像/，/终/有/一天/，/它/将/化为/修行/路上/的/无解/业障

fxsjy/jieba: 结巴中文分词 (github.com)

4. PKUSeg

Can be installed directly with pip

import pkuseg
seg = pkuseg.pkuseg()
text = seg.cut('只要心中还有放不下的偶像，终有一天，它将化为修行路上的无解业障')
print(text)
seg = pkuseg.pkuseg(postag=True) # enable part-of-speech tagging; this downloads postag.zip from GitHub
text = seg.cut('只要心中还有放不下的偶像，终有一天，它将化为修行路上的无解业障')
print(text)

Output

['只要', '心中', '还有', '放', '不', '下', '的', '偶像', '，', '终', '有', '一', '天', '，', '它', '将', '化为', '修行路', '上', '的', '无解', '业障']
Downloading: "https://github.com/lancopku/pkuseg-python/releases/download/v0.0.16/postag.zip" to C:\Users\xxxxxx/.pkuseg\postag.zip
100%|███████████████████████████████████████████████████████████████████| 41424981/41424981 [10:45<00:00, 64139.35it/s]
[('只要', 'c'), ('心中', 's'), ('还有', 'v'), ('放', 'v'), ('不', 'd'), ('下', 'v'), ('的', 'u'), ('偶像', 'n'), ('，', 'w'), ('终', 'd'), ('有', 'v'), ('一', 'm'), ('天', 'q'), ('，', 'w'), ('它', 'r'), ('将', 'd'), ('化为', 'v'), ('修行路', 'v'), ('上', 'v'), ('的', 'u'), ('无解', 'vn'), ('业障', 'n')]

lancopku/pkuseg-python: pkuseg多领域中文分词工具; The pkuseg toolkit for multi-domain Chinese word segmentation (github.com)

posted @ 2024-09-16 20:48 qbning
