Hand-writing a GPT, Step by Step
This post documents the process of hand-writing a GPT step by step, in the spirit of the nanoGPT project and using a top-down programming style. It assumes basic familiarity with the Transformer, GPT, and PyTorch.
Here are all the Python libraries we will use:
import math # will use math.sqrt
from dataclasses import dataclass # for configuration
import torch
import torch.nn as nn
import torch.nn.functional as F # F.softmax
from torch import FloatTensor, LongTensor # for type annotation
The overall framework
First we define the overall GPT skeleton, containing only the most basic methods: `__init__`, `forward`, and `generate`.
- We bring `generate` in from the very start because, after all, GPT is a language model and should natively support generating text from text. Note that `forward` and `generate` have different return types: `forward` returns the logits of the next token, while `generate` goes all the way and produces a sequence of token ids. Other methods, such as `from_pretrained`, can be left for later.
- In the `__init__` method we declare the main components of a GPT, namely
  - Token Embedding;
  - Position Embedding;
  - Transformer Blocks; and
  - the Language Modelling Head
class MyGPT(nn.Module):
def __init__(self, conf) -> None:
super().__init__()
self.conf = conf
self.tok_embd = ...
self.pos_embd = ...
self.tfm_blocks = ...
self.lm_head = ...
def forward(self, x_id: LongTensor) -> FloatTensor:
'''(padded) sequence of token ids -> logits of shape [batch_size, 1, vocab_size]'''
pass
def generate(self, x_id: LongTensor, max_new_tokens: int) -> LongTensor:
'''(padded) sequence of token ids -> sequence of token ids'''
pass
With the skeleton in place, we can start filling in the details. But before implementing these three methods, let us first think about which hyper-parameters are needed to define a GPT model. We only consider the most essential ones here:
- vocabulary size
- embedding dimension
- number of transformer layers (blocks)
- context window length
- and, if multi-head attention is used, the number of heads
@dataclass
class MyGPTConfig:
'''minimum hyper-parameters'''
vocab_size: int = 50257
ctx_win: int = 1024
dim_embd: int = 768
n_attn_head: int = 12
n_tfm_block: int = 12
# Dropout probability
dropout: float = 0.0
Implementing forward and generate
Of all the GPT components, the transformer blocks are the most important and the most complex, so we implement them last and take care of the simpler components and methods first; for now, just assume we already have a class called `MyTransformerBlock`. The token embedding and position embedding are both plain `nn.Embedding` layers, and the language modelling head is just an ordinary linear transformation that maps the contextual embeddings of the preceding tokens to the logits (unnormalized scores, turned into probabilities by a softmax) of the next token.
class MyGPT(nn.Module):
def __init__(self, conf: MyGPTConfig) -> None:
super().__init__()
self.conf = conf
self.tok_embd = nn.Embedding(conf.vocab_size, conf.dim_embd)
self.pos_embd = nn.Embedding(conf.ctx_win, conf.dim_embd)
self.tfm_blocks = nn.ModuleList([MyTransformerBlock(conf) for _ in range(conf.n_tfm_block)])
self.lm_head = nn.Linear(conf.dim_embd, conf.vocab_size)
def forward(self, x_id: LongTensor) -> FloatTensor:
'''(padded) sequence of token ids -> logits of shape [batch_size, 1, vocab_size]'''
pass
def generate(self, x_id: LongTensor, max_new_tokens: int) -> LongTensor:
'''(padded) sequence of token ids -> sequence of token ids'''
pass
The `forward` method embeds the input token ids into word vectors, builds a matching batch of position-encoding vectors, feeds their sum through the transformer blocks to obtain the contextual embeddings of the input, and finally predicts the logits of the next token. Empirically, Dropout and LayerNorm tend to improve neural networks across the board, so we add these two components in `__init__` and remember to apply them in `forward`.
class MyGPT(nn.Module):
def __init__(self, conf: MyGPTConfig) -> None:
super().__init__()
self.conf = conf
self.tok_embd = nn.Embedding(conf.vocab_size, conf.dim_embd)
self.pos_embd = nn.Embedding(conf.ctx_win, conf.dim_embd)
self.tfm_blocks = nn.ModuleList([MyTransformerBlock(conf) for _ in range(conf.n_tfm_block)])
self.lm_head = nn.Linear(conf.dim_embd, conf.vocab_size)
self.dropout = nn.Dropout(conf.dropout)
self.layer_norm = nn.LayerNorm(conf.dim_embd)
def forward(self, x_id: LongTensor) -> FloatTensor:
'''(padded) sequence of token ids -> logits of shape [batch_size, 1, vocab_size]'''
pos = torch.arange(x_id.shape[1], device=x_id.device)
tok_embd = self.tok_embd(x_id)
pos_embd = self.pos_embd(pos)
x_embd = tok_embd + pos_embd
x_embd = self.dropout(x_embd)
for tfm_block in self.tfm_blocks:
x_embd = tfm_block(x_embd)
x_embd = self.layer_norm(x_embd)
# note: using list [-1] to preserve the time dimension
logits = self.lm_head(x_embd[:, [-1], :])
return logits
def generate(self, x_id: LongTensor, max_new_tokens: int) -> LongTensor:
'''(padded) sequence of token ids -> sequence of token ids'''
pass
The `generate` method uses `forward` to grow the input sequence step by step: once `forward` has predicted the logits of the next token, we pick the most probable token, append it to the sequence, then feed this new sequence back in to generate yet another token, until the maximum number of new tokens is reached. One thing to watch out for: if the sequence grows longer than the GPT's context window, we have to crop off the oldest part and keep only the most recent (rightmost) tokens.
class MyGPT(nn.Module):
...
def generate(self, x_id: LongTensor, max_new_tokens: int) -> LongTensor:
'''(padded) sequence of token ids -> sequence of token ids'''
for _ in range(max_new_tokens):
# if the sequence context is growing too long we must crop it at ctx_win
x_id_cond = x_id if x_id.size(1) <= self.conf.ctx_win else x_id[:, -self.conf.ctx_win:]
logits = self.forward(x_id_cond)
new_tok_id = logits.argmax(dim=-1)
x_id = torch.cat([x_id, new_tok_id], dim=1)
return x_id
However, this greedy way of generating text often works poorly; in practice we want to inject some randomness:
- After obtaining the logits, instead of always taking the most probable token, we sample a token from the distribution defined by the logits and append it to the sequence;
- Before sampling, we can rescale the logits to control just how random this sampling is. In the implementation we use a temperature parameter when converting logits to probabilities: `probs = softmax(logits/temperature)`. The larger the temperature, the more uniform the resulting probabilities, i.e. the more random the sampling (a small numerical illustration follows this list).
- We can also restrict sampling to the k most likely tokens (top-k sampling, as in the code below), zeroing out the probabilities of everything else so that very unlikely tokens are never chosen.
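The following minimal sketch (purely illustrative, with made-up logit values, not part of MyGPT) shows how the temperature reshapes the sampling distribution:
import torch
import torch.nn.functional as F

fake_logits = torch.tensor([2.0, 1.0, 0.1])  # made-up logits for a 3-token vocabulary
for temperature in (0.5, 1.0, 2.0):
    probs = F.softmax(fake_logits / temperature, dim=-1)
    print(temperature, probs)
# a low temperature (0.5) sharpens the distribution towards the most likely token,
# a high temperature (2.0) flattens it, making sampling more random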
So our final `generate` method takes a couple of extra parameters and is a bit more involved than the first version:
class MyGPT(nn.Module):
def __init__(self, conf: MyGPTConfig) -> None:
super().__init__()
self.conf = conf
self.tok_embd = nn.Embedding(conf.vocab_size, conf.dim_embd)
self.pos_embd = nn.Embedding(conf.ctx_win, conf.dim_embd)
self.tfm_blocks = nn.ModuleList([MyTransformerBlock(conf) for _ in range(conf.n_tfm_block)])
self.lm_head = nn.Linear(conf.dim_embd, conf.vocab_size)
self.dropout = nn.Dropout(conf.dropout)
self.layer_norm = nn.LayerNorm(conf.dim_embd)
def forward(self, x_id: LongTensor) -> FloatTensor:
'''(padded) sequence of token ids -> logits of shape [batch_size, 1, vocab_size]'''
pos = torch.arange(x_id.shape[1], device=x_id.device)
tok_embd = self.tok_embd(x_id)
pos_embd = self.pos_embd(pos)
x_embd = tok_embd + pos_embd
x_embd = self.dropout(x_embd)
for tfm_block in self.tfm_blocks:
x_embd = tfm_block(x_embd)
x_embd = self.layer_norm(x_embd)
# note: using list [-1] to preserve the time dimension
logits = self.lm_head(x_embd[:, [-1], :])
return logits
def generate(self, x_id: LongTensor, max_new_tokens: int, temperature=1.0, top_k:int=1) -> LongTensor:
'''(padded) sequence of token ids -> sequence of token ids'''
for _ in range(max_new_tokens):
# if the sequence context is growing too long we must crop it at ctx_win
x_id_cond = x_id if x_id.size(1) <= self.conf.ctx_win else x_id[:, -self.conf.ctx_win:]
logits = self.forward(x_id_cond)
logits = logits[:, -1, :] / temperature
if top_k > 1:
v, _ = torch.topk(logits, min(top_k, logits.shape[-1])) # top_k cannot exceed vocab_size
# logits below the k-th largest are set to -Inf, so those tokens get probability 0
logits[logits < v[:, [-1]]] = -float('Inf')
probs = F.softmax(logits, dim=-1)
new_tok_id = torch.multinomial(probs, num_samples=1)
x_id = torch.cat([x_id, new_tok_id], dim=1)
return x_id
Implementing MyTransformerBlock
Now for the main event: implementing the central component, the transformer block, i.e. `MyTransformerBlock` in the code. It can itself be split into two sub-components: the core is of course the self-attention mechanism, followed by a non-linear transformation, a simple MLP. Data flows through these two sub-components in turn and then on to the next block. Naturally, the data can (and should) be layer-normalized before entering each sub-component.
class MyTransformerBlock(nn.Module):
def __init__(self, conf: MyGPTConfig) -> None:
super().__init__()
self.ln1 = nn.LayerNorm(conf.dim_embd)
self.attn = MyMultiHeadAttention(conf)
self.ln2 = nn.LayerNorm(conf.dim_embd)
self.mlp = MyMLP(conf)
def forward(self, x: FloatTensor) -> FloatTensor:
'''[batch_size, seq_len, dim_embd] -> [batch_size, seq_len, dim_embd]'''
x = x + self.attn(self.ln1(x)) # layer norm + attention + residual
x = x + self.mlp(self.ln2(x)) # layer norm + MLP + residual
return x
The latter, the MLP module, can be implemented as a two-layer fully connected feed-forward network:
class MyMLP(nn.Module):
def __init__(self, conf: MyGPTConfig) -> None:
super().__init__()
self.fc1 = nn.Linear(conf.dim_embd, conf.dim_embd * 4)
# the hidden dimension (dim_embd * 4) follows the usual GPT-2 convention
self.gelu = nn.GELU()
self.fc2 = nn.Linear(conf.dim_embd * 4, conf.dim_embd)
self.dropout = nn.Dropout(conf.dropout)
def forward(self, x: FloatTensor) -> FloatTensor:
x = self.fc1(x)
x = self.gelu(x)
x = self.dropout(x)
x = self.fc2(x)
return x
The attention mechanism
For the most central piece, the attention mechanism, we again start with a rough design:
class MyMultiHeadAttention(nn.Module):
def __init__(self, conf: MyGPTConfig) -> None:
super().__init__()
self.conf = conf
def forward(self, x: FloatTensor) -> FloatTensor:
pass
But what does "attention" actually mean? In everyday language, paying attention means spending most of your resources (time, mental effort) on the small fraction of information that is important or relevant, and only a little on the large fraction that is not. Using document retrieval as an analogy: given a query, we find the documents K that are highly relevant to that query, and then concentrate our resources on processing the content V of those documents.
Mathematically, we can express the attention mechanism described above as the following function:
\[\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V\]
Here \(Q\) is a matrix of queries, \(K\) is a matrix of documents, and \(V\) is the content of those documents. \(QK^{\top}/\sqrt{d_k}\) can be read as computing the similarity between each document and the given query; the \(softmax\) then normalizes these similarities into weights between 0 and 1, telling us how much resource to allocate to each document; finally, multiplying by \(V\) processes the document content according to that attention. When a GPT processes a sequence, every token in the sequence acts at once as a query, as a document k, and as document content v. When we process a given token, every token in the sequence, including that token itself, is content we need to attend to; this is where the name self-attention comes from. Translated into code:
class MyMultiHeadAttention(nn.Module):
def __init__(self, conf: MyGPTConfig) -> None:
super().__init__()
self.conf = conf
def attention(self, q, k, v: FloatTensor) -> FloatTensor:
d_k = k.shape[-1]
return F.softmax(q @ k.mT / math.sqrt(d_k), dim=-1) @ v
def forward(self, x: FloatTensor) -> FloatTensor:
q, k, v = self.make_QKV(x)
return self.attention(q, k, v)
When modelling/generating in an auto-regressive fashion, we must make sure every token can only attend to the tokens before it (to its left) and never to the tokens after it (to its right). In code this can be done with a mask of the form `mask=[True,True,...,False,False]`, which forces the attention weights of the not-yet-generated suffix to 0 (by setting the scores to negative infinity before the softmax). We can also apply a bit of dropout in practice. The upgraded attention function then becomes:
class MyMultiHeadAttention(nn.Module):
def __init__(self, conf: MyGPTConfig) -> None:
super().__init__()
self.conf = conf
# causal mask to ensure that attention is only applied to the left in the input sequence
bias = torch.tril(torch.ones(conf.ctx_win, conf.ctx_win)).view(1, 1, conf.ctx_win, conf.ctx_win)
self.register_buffer("bias", bias)
self.attn_dropout = nn.Dropout(conf.dropout)
self.resid_dropout = nn.Dropout(conf.dropout)
def attention(self, q, k, v, mask) -> FloatTensor:
scores = q @ k.mT / math.sqrt(k.shape[-1])
# ensure that attention is only applied to the left
scores = scores.masked_fill(mask==False, float('-inf'))
attn = torch.softmax(scores, dim=-1)
attn = self.attn_dropout(attn)
attn = attn @ v
return attn
def forward(self, x: FloatTensor) -> FloatTensor:
q, k, v = self.make_QKV(x)
seq_len = x.shape[1]
mask = self.bias[:, :, :seq_len, :seq_len]
y = self.attention(q, k, v, mask)
y = self.resid_dropout(y)
return y
So where do the matrices \(Q, K, V\) in self-attention come from? In GPT they are all obtained by linearly projecting the input vectors \(x\) into three different spaces: we learn three matrices \(W_Q, W_K, W_V\) and multiply the input \(x\) by each of them. For convenience in the code (and following common practice), we give \(W_Q, W_K, W_V\) output dimension `dim_embd` as well.
class MyMultiHeadAttention(nn.Module):
def __init__(self, conf: MyGPTConfig) -> None:
super().__init__()
self.conf = conf
self.W_q = nn.Linear(conf.dim_embd, conf.dim_embd)
self.W_k = nn.Linear(conf.dim_embd, conf.dim_embd)
self.W_v = nn.Linear(conf.dim_embd, conf.dim_embd)
# causal mask to ensure that attention is only applied to the left in the input sequence
bias = torch.tril(torch.ones(conf.ctx_win, conf.ctx_win)).view(1, 1, conf.ctx_win, conf.ctx_win)
self.register_buffer("bias", bias)
self.attn_dropout = nn.Dropout(conf.dropout)
self.resid_dropout = nn.Dropout(conf.dropout)
def attention(self, q, k, v, mask) -> FloatTensor:
scores = q @ k.mT / math.sqrt(k.shape[-1])
# ensure that attention is only applied to the left
scores = scores.masked_fill(mask==0, float('-inf'))
attn = torch.softmax(scores, dim=-1)
attn = self.attn_dropout(attn)
attn = attn @ v
return attn
def make_QKV(self, x):
q = self.W_q(x)
k = self.W_k(x)
v = self.W_v(x)
return q, k, v
def forward(self, x: FloatTensor) -> FloatTensor:
q, k, v = self.make_QKV(x)
seq_len = x.shape[1]
mask = self.bias[:, :, :seq_len, :seq_len]
y = self.attention(q, k, v, mask)
y = self.resid_dropout(y)
return y
In the Transformer we use not just one attention but several. Intuitively, this means attending from several angles, or to different aspects of the sequence; the outputs of the individual attention heads are then concatenated and put through one more linear transformation (multiplication by a matrix \(W^O\)):
\[\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, \ldots, head_h)\,W^O\]
where \(head_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)\).
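Before the full code, here is a tiny standalone sketch (illustrative batch and sequence sizes only) of the tensor bookkeeping it uses: with dim_embd=768 and n_attn_head=12, each head works in a 768/12 = 64-dimensional subspace, and a view/transpose pair moves between [batch, seq, dim_embd] and [batch, head, seq, head_dim]:
import torch

batch_size, seq_len, dim_embd, n_attn_head = 2, 5, 768, 12
head_dim = dim_embd // n_attn_head  # 64
x = torch.randn(batch_size, seq_len, dim_embd)
# split the embedding dimension into heads: [batch, head, seq, head_dim]
q = x.view(batch_size, seq_len, n_attn_head, head_dim).transpose(1, 2)
print(q.shape)  # torch.Size([2, 12, 5, 64])
# merge the heads back together: [batch, seq, dim_embd]
y = q.transpose(1, 2).contiguous().view(batch_size, seq_len, dim_embd)
print(y.shape)  # torch.Size([2, 5, 768])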
Putting these ideas together, the final multi-head attention code is:
class MyMultiHeadAttention(nn.Module):
def __init__(self, conf: MyGPTConfig) -> None:
super().__init__()
self.conf = conf
self.W_q = nn.Linear(conf.dim_embd, conf.dim_embd)
self.W_k = nn.Linear(conf.dim_embd, conf.dim_embd)
self.W_v = nn.Linear(conf.dim_embd, conf.dim_embd)
self.W_out = nn.Linear(conf.dim_embd, conf.dim_embd)
# causal mask to ensure that attention is only applied to the left in the input sequence
bias = torch.tril(torch.ones(conf.ctx_win, conf.ctx_win)).view(1, 1, conf.ctx_win, conf.ctx_win)
self.register_buffer("bias", bias)
self.attn_dropout = nn.Dropout(conf.dropout)
self.resid_dropout = nn.Dropout(conf.dropout)
def attention(self, q, k, v, mask) -> FloatTensor:
scores = q @ k.mT / math.sqrt(k.shape[-1])
# ensure that attention is only applied to the left
scores = scores.masked_fill(mask==0, float('-inf'))
attn = torch.softmax(scores, dim=-1)
attn = self.attn_dropout(attn)
attn = attn @ v
return attn
def make_QKV(self, x, batch_size, seq_len, n_attn_head, dim_embd):
q = self.W_q(x).view(batch_size, seq_len, n_attn_head, dim_embd // n_attn_head).transpose(1, 2)
k = self.W_k(x).view(batch_size, seq_len, n_attn_head, dim_embd // n_attn_head).transpose(1, 2)
v = self.W_v(x).view(batch_size, seq_len, n_attn_head, dim_embd // n_attn_head).transpose(1, 2)
return q, k, v
def forward(self, x: FloatTensor) -> FloatTensor:
batch_size, seq_len, dim_embd = x.shape
n_attn_head, dim_embd = self.conf.n_attn_head, self.conf.dim_embd
q, k, v = self.make_QKV(x, batch_size, seq_len, n_attn_head, dim_embd)
mask = self.bias[:, :, :seq_len, :seq_len]
y = self.attention(q, k, v, mask)
y = y.transpose(1, 2).contiguous().view(batch_size, seq_len, dim_embd) # re-assemble all head outputs side by side
y = self.W_out(y) # output projection
y = self.resid_dropout(y)
return y
Complete code
At this point we have hand-written a GPT, step by step. The complete code is below:
'''
Hand-made GPT, adapted from [nanoGPT](https://github.com/karpathy/nanoGPT/)
Environment:
# Name Version Build Channel
python 3.12.4 h5148396_1
pytorch 2.4.0 py3.12_cuda12.4_cudnn9.1.0_0 pytorch
cuda-runtime 12.4.0 0 nvidia
......
'''
import math
from dataclasses import dataclass
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import LongTensor, FloatTensor
@dataclass
class MyGPTConfig:
vocab_size: int = 50257
ctx_win: int = 1024
dim_embd: int = 768
n_attn_head: int = 12
n_tfm_block: int = 12
dropout: float = 0.0
use_bias: bool = True
class MyMLP(nn.Module):
def __init__(self, conf: MyGPTConfig) -> None:
super().__init__()
self.fc1 = nn.Linear(conf.dim_embd, conf.dim_embd * 4)
self.gelu = nn.GELU()
self.fc2 = nn.Linear(conf.dim_embd * 4, conf.dim_embd)
self.dropout = nn.Dropout(conf.dropout)
def forward(self, x: FloatTensor) -> FloatTensor:
x = self.fc1(x)
x = self.gelu(x)
x = self.dropout(x)
x = self.fc2(x)
return x
class MyMultiHeadAttention(nn.Module):
def __init__(self, conf: MyGPTConfig) -> None:
super().__init__()
self.conf = conf
self.W_q = nn.Linear(conf.dim_embd, conf.dim_embd)
self.W_k = nn.Linear(conf.dim_embd, conf.dim_embd)
self.W_v = nn.Linear(conf.dim_embd, conf.dim_embd)
self.W_out = nn.Linear(conf.dim_embd, conf.dim_embd)
# causal mask to ensure that attention is only applied to the left in the input sequence
bias = torch.tril(torch.ones(conf.ctx_win, conf.ctx_win)).view(1, 1, conf.ctx_win, conf.ctx_win)
self.register_buffer("bias", bias)
self.attn_dropout = nn.Dropout(conf.dropout)
self.resid_dropout = nn.Dropout(conf.dropout)
def attention(self, q, k, v, mask) -> FloatTensor:
scores = q @ k.mT / math.sqrt(k.shape[-1])
# ensure that attention is only applied to the left
scores = scores.masked_fill(mask==0, float('-inf'))
attn = torch.softmax(scores, dim=-1)
attn = self.attn_dropout(attn)
attn = attn @ v
return attn
def make_QKV(self, x, batch_size, seq_len, n_attn_head, dim_embd):
q = self.W_q(x).view(batch_size, seq_len, n_attn_head, dim_embd // n_attn_head).transpose(1, 2)
k = self.W_k(x).view(batch_size, seq_len, n_attn_head, dim_embd // n_attn_head).transpose(1, 2)
v = self.W_v(x).view(batch_size, seq_len, n_attn_head, dim_embd // n_attn_head).transpose(1, 2)
return q, k, v
def forward(self, x: FloatTensor) -> FloatTensor:
batch_size, seq_len, dim_embd = x.shape
n_attn_head, dim_embd = self.conf.n_attn_head, self.conf.dim_embd
q, k, v = self.make_QKV(x, batch_size, seq_len, n_attn_head, dim_embd)
mask = self.bias[:, :, :seq_len, :seq_len]
y = self.attention(q, k, v, mask)
y = y.transpose(1, 2).contiguous().view(batch_size, seq_len, dim_embd) # re-assemble all head outputs side by side
y = self.W_out(y) # output projection
y = self.resid_dropout(y)
return y
class MyTransformerBlock(nn.Module):
def __init__(self, conf: MyGPTConfig) -> None:
super().__init__()
self.ln1 = nn.LayerNorm(conf.dim_embd)
self.attn = MyMultiHeadAttention(conf)
self.ln2 = nn.LayerNorm(conf.dim_embd)
self.mlp = MyMLP(conf)
def forward(self, x: FloatTensor) -> FloatTensor:
'''[batch_size, seq_len, dim_embd] -> [batch_size, seq_len, dim_embd]'''
x = x + self.attn(self.ln1(x)) # layer norm + attention + residual
x = x + self.mlp(self.ln2(x)) # layer norm + MLP + residual
return x
class MyGPT(nn.Module):
def __init__(self, conf: MyGPTConfig) -> None:
super().__init__()
self.conf = conf
self.tok_embd = nn.Embedding(conf.vocab_size, conf.dim_embd)
self.pos_embd = nn.Embedding(conf.ctx_win, conf.dim_embd)
self.tfm_blocks = nn.ModuleList([MyTransformerBlock(conf) for _ in range(conf.n_tfm_block)])
self.lm_head = nn.Linear(conf.dim_embd, conf.vocab_size)
self.dropout = nn.Dropout(conf.dropout)
self.layer_norm = nn.LayerNorm(conf.dim_embd)
def forward(self, x_id: LongTensor) -> FloatTensor:
'''(padded) sequence of token ids -> logits of shape [batch_size, 1, vocab_size]'''
pos = torch.arange(x_id.shape[1], device=x_id.device)
tok_embd = self.tok_embd(x_id)
pos_embd = self.pos_embd(pos)
x_embd = tok_embd + pos_embd
x_embd = self.dropout(x_embd)
for tfm_block in self.tfm_blocks:
x_embd = tfm_block(x_embd)
x_embd = self.layer_norm(x_embd)
# note: using list [-1] to preserve the time dimension
logits = self.lm_head(x_embd[:, [-1], :])
return logits
def generate(self, x_id: LongTensor, max_new_tokens: int, temperature=1.0, top_k:int=1) -> LongTensor:
'''(padded) sequence of token ids -> sequence of token ids'''
for _ in range(max_new_tokens):
# if the sequence context is growing too long we must crop it at ctx_win
x_id_cond = x_id if x_id.size(1) <= self.conf.ctx_win else x_id[:, -self.conf.ctx_win:]
logits = self.forward(x_id_cond)
logits = logits[:, -1, :] / temperature
if top_k > 1:
v, _ = torch.topk(logits, min(top_k, logits.shape[-1])) # top_k cannot exceed vocab_size
logits[logits < v[:, [-1]]] = -float('Inf') # logits below the k-th largest are set to -Inf, so those tokens get probability 0
probs = F.softmax(logits, dim=-1)
new_tok_id = torch.multinomial(probs, num_samples=1)
x_id = torch.cat([x_id, new_tok_id], dim=1)
return x_id
if __name__ == '__main__':
conf = MyGPTConfig()
model = MyGPT(conf)
print(model)
inp = torch.LongTensor([
[1, 2, 3, 4, 5],
[6, 7, 8, 9, 10],
])
print(model(inp))
print(model.generate(inp, max_new_tokens=3))
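As an extra sanity check, we can also count the parameters at the end of the test block above; with the default config this comes out to roughly 163M, somewhat more than GPT-2 small's reported 124M because lm_head here is a separate matrix rather than being tied to tok_embd:
n_params = sum(p.numel() for p in model.parameters())
print(f'{n_params / 1e6:.1f}M parameters')  # roughly 163M with the default (untied) config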
Loading pre-trained weights
As everyone knows, pre-training a model from scratch takes an enormous amount of resources; very few individuals can manage it, and even those who could would rather not. Loading weights that someone else has already pre-trained is therefore a basic necessity of the new era, so it is worth implementing a function for it. Here we implement a `from_safetensors` function that loads the model files OpenAI uploaded to Hugging Face. The essential file is of course the weight file `model.safetensors`, but the configuration that defines the architecture is indispensable too, so `config.json` must be included as well. The function therefore does two things:
- initialize a `MyGPT` model from the configuration file
- load the pre-trained weights into that `MyGPT` model
# skipped ...
class MyGPT(nn.Module):
# skipped ...
@classmethod
def from_safetensors(cls, model_dir: str) -> 'MyGPT':
'''load model from huggingface safetensors files
Required files in `model_dir`:
- config.json
- model.safetensors
'''
# 1. read configs, and initialize a MyGPT model
# 2. load weights into the MyGPT model
pass
Before writing more code, we should look at what `config.json` contains and how its entries map onto the configuration fields we defined ourselves:
>>> config_path = os.path.join(model_dir, 'config.json')
>>> config_dict = json.load(open(config_path))
>>> print(config_dict)
{'activation_function': 'gelu_new',
'architectures': ['GPT2LMHeadModel'],
'attn_pdrop': 0.1,
'bos_token_id': 50256,
'embd_pdrop': 0.1,
'eos_token_id': 50256,
'initializer_range': 0.02,
'layer_norm_epsilon': 1e-05,
'model_type': 'gpt2',
'n_ctx': 1024,
'n_embd': 768,
'n_head': 12,
'n_layer': 12,
'n_positions': 1024,
'resid_pdrop': 0.1,
'summary_activation': None,
'summary_first_dropout': 0.1,
'summary_proj_to_labels': True,
'summary_type': 'cls_index',
'summary_use_proj': True,
'task_specific_params': {'text-generation': {'do_sample': True,
'max_length': 50}},
'vocab_size': 50257}
From these keys we can infer that the entries we need are `'n_ctx'`, `'vocab_size'`, `'n_embd'`, `'n_head'`, and `'n_layer'`. OpenAI sets every dropout probability to 0.1, so we can use `'attn_pdrop'` to stand for all of them.
class MyGPT(nn.Module):
# ......
@classmethod
def from_safetensors(cls, model_dir) -> 'MyGPT':
'''load model from huggingface safetensors files
Required files in `model_dir`:
- config.json
- model.safetensors
'''
import json
import os
# read configs, and initialize model
config_path = os.path.join(model_dir, 'config.json')
config_dict = json.load(open(config_path))
config = MyGPTConfig(ctx_win=config_dict['n_ctx'],
vocab_size=config_dict['vocab_size'],
dim_embd=config_dict['n_embd'],
n_attn_head=config_dict['n_head'],
n_tfm_block=config_dict['n_layer'],
dropout=config_dict['attn_pdrop'])
mygpt_model = cls(config)
Next we need to see what the `state_dict` stored in `model.safetensors` looks like and how it matches up with our model architecture:
>>> from safetensors import safe_open
>>> f = safe_open(os.path.join(model_dir, 'model.safetensors'), framework='pt')
>>> print(f.keys())
['h.0.attn.bias',
'h.0.attn.c_attn.bias',
'h.0.attn.c_attn.weight',
'h.0.attn.c_proj.bias',
'h.0.attn.c_proj.weight',
'h.0.ln_1.bias',
'h.0.ln_1.weight',
'h.0.ln_2.bias',
'h.0.ln_2.weight',
'h.0.mlp.c_fc.bias',
'h.0.mlp.c_fc.weight',
'h.0.mlp.c_proj.bias',
'h.0.mlp.c_proj.weight',
'h.1.attn.bias',
# ......
'ln_f.bias',
'ln_f.weight',
'wpe.weight',
'wte.weight']
We can see that OpenAI's model architecture is essentially the same as ours, so the weights can be copied over almost one-to-one. This is of course no coincidence: my `MyGPT` is modelled on nanoGPT, and nanoGPT in turn is modelled on the open-sourced GPT-2.
@classmethod
def from_safetensors(cls, model_dir) -> 'MyGPT':
'''load model from huggingface safetensors files
Required files in `model_dir`:
- config.json
- model.safetensors
'''
import json, os
# read configs, and initialize model
config_path = os.path.join(model_dir, 'config.json')
config_dict = json.load(open(config_path))
config = MyGPTConfig(ctx_win=config_dict['n_ctx'],
vocab_size=config_dict['vocab_size'],
dim_embd=config_dict['n_embd'],
n_attn_head=config_dict['n_head'],
n_tfm_block=config_dict['n_layer'],
dropout=config_dict['attn_pdrop'])
mygpt_model = cls(config)
# load weights into the MyGPT model
from safetensors import safe_open
with safe_open(os.path.join(model_dir, 'model.safetensors'), framework='pt') as f:
mygpt_model.tok_embd.weight.data = f.get_tensor('wte.weight')
mygpt_model.pos_embd.weight.data = f.get_tensor('wpe.weight')
for i, tfm_block in enumerate(mygpt_model.tfm_blocks):
tfm_block.ln1.weight.data = f.get_tensor(f'h.{i}.ln_1.weight')
tfm_block.ln1.bias.data = f.get_tensor(f'h.{i}.ln_1.bias')
tfm_block.ln2.weight.data = f.get_tensor(f'h.{i}.ln_2.weight')
tfm_block.ln2.bias.data = f.get_tensor(f'h.{i}.ln_2.bias')
...
There are still two subtle differences, though:
- In the attention, for clarity I gave the `Q`, `K`, `V` projections three separate variables, whereas OpenAI, presumably for efficiency, packs the three matrices into a single `attn.c_attn` matrix and only splits them at use time. So when loading we have to split that matrix first and assign the pieces separately.
- OpenAI's checkpoint stores these projections (and the MLP layers) as `Conv1D` modules rather than `Linear`, perhaps also for efficiency. A `Conv1D` weight is laid out as `[in_features, out_features]`, the transpose of `nn.Linear`'s `[out_features, in_features]`, so the weights must be transposed on loading. For the MLP the shapes do not even match and forgetting the transpose raises `RuntimeError: mat1 and mat2 shapes cannot be multiplied`; the attention projections are square, so forgetting to transpose them would fail silently instead.
So loading the model weights from the file into our model looks like this:
# load weights into the MyGPT model
from safetensors import safe_open
with safe_open(os.path.join(model_dir, 'model.safetensors'), framework='pt') as f:
mygpt_model.tok_embd.weight.data = f.get_tensor('wte.weight')
mygpt_model.pos_embd.weight.data = f.get_tensor('wpe.weight')
for i, tfm_block in enumerate(mygpt_model.tfm_blocks):
tfm_block.ln1.weight.data = f.get_tensor(f'h.{i}.ln_1.weight')
tfm_block.ln1.bias.data = f.get_tensor(f'h.{i}.ln_1.bias')
tfm_block.ln2.weight.data = f.get_tensor(f'h.{i}.ln_2.weight')
tfm_block.ln2.bias.data = f.get_tensor(f'h.{i}.ln_2.bias')
# split c_attn into q,k,v
c_attn_w = f.get_tensor(f'h.{i}.attn.c_attn.weight')
c_attn_b = f.get_tensor(f'h.{i}.attn.c_attn.bias')
q, k, v = c_attn_w.split(config.dim_embd, dim=1)
qb, kb, vb = c_attn_b.split(config.dim_embd)
# Conv1D weights are stored as [in_features, out_features]; nn.Linear expects [out_features, in_features], so transpose
tfm_block.attn.W_q.weight.data = q.mT
tfm_block.attn.W_q.bias.data = qb
tfm_block.attn.W_k.weight.data = k.mT
tfm_block.attn.W_k.bias.data = kb
tfm_block.attn.W_v.weight.data = v.mT
tfm_block.attn.W_v.bias.data = vb
tfm_block.attn.W_out.weight.data = f.get_tensor(f'h.{i}.attn.c_proj.weight').mT
tfm_block.attn.W_out.bias.data = f.get_tensor(f'h.{i}.attn.c_proj.bias')
tfm_block.mlp.fc1.weight.data = f.get_tensor(f'h.{i}.mlp.c_fc.weight').mT
# Conv1D stores c_fc.weight as [768, 3072]; our nn.Linear expects [3072, 768], hence the transpose
tfm_block.mlp.fc1.bias.data = f.get_tensor(f'h.{i}.mlp.c_fc.bias')
tfm_block.mlp.fc2.weight.data = f.get_tensor(f'h.{i}.mlp.c_proj.weight').mT
tfm_block.mlp.fc2.bias.data = f.get_tensor(f'h.{i}.mlp.c_proj.bias')
return mygpt_model
The complete `from_safetensors` function is as follows:
# skipped ...
class MyGPT(nn.Module):
# skipped ...
@classmethod
def from_safetensors(cls, model_dir: str) -> 'MyGPT':
'''load model from huggingface safetensors files
Required files in `model_dir`:
- config.json
- model.safetensors
'''
import json, os
# read configs, and initialize model
config_path = os.path.join(model_dir, 'config.json')
config_dict = json.load(open(config_path))
config = MyGPTConfig(ctx_win=config_dict['n_ctx'],
vocab_size=config_dict['vocab_size'],
dim_embd=config_dict['n_embd'],
n_attn_head=config_dict['n_head'],
n_tfm_block=config_dict['n_layer'],
dropout=config_dict['attn_pdrop'])
mygpt_model = cls(config)
# load weights into the MyGPT model
from safetensors import safe_open
with safe_open(os.path.join(model_dir, 'model.safetensors'), framework='pt') as f:
mygpt_model.tok_embd.weight.data = f.get_tensor('wte.weight')
mygpt_model.pos_embd.weight.data = f.get_tensor('wpe.weight')
for i, tfm_block in enumerate(mygpt_model.tfm_blocks):
tfm_block.ln1.weight.data = f.get_tensor(f'h.{i}.ln_1.weight')
tfm_block.ln1.bias.data = f.get_tensor(f'h.{i}.ln_1.bias')
tfm_block.ln2.weight.data = f.get_tensor(f'h.{i}.ln_2.weight')
tfm_block.ln2.bias.data = f.get_tensor(f'h.{i}.ln_2.bias')
# split c_attn into q,k,v
c_attn_w = f.get_tensor(f'h.{i}.attn.c_attn.weight')
c_attn_b = f.get_tensor(f'h.{i}.attn.c_attn.bias')
q, k, v = c_attn_w.split(config.dim_embd, dim=1)
qb, kb, vb = c_attn_b.split(config.dim_embd)
# Conv1D weights are stored as [in_features, out_features]; nn.Linear expects [out_features, in_features], so transpose
tfm_block.attn.W_q.weight.data = q.mT
tfm_block.attn.W_q.bias.data = qb
tfm_block.attn.W_k.weight.data = k.mT
tfm_block.attn.W_k.bias.data = kb
tfm_block.attn.W_v.weight.data = v.mT
tfm_block.attn.W_v.bias.data = vb
tfm_block.attn.W_out.weight.data = f.get_tensor(f'h.{i}.attn.c_proj.weight').mT
tfm_block.attn.W_out.bias.data = f.get_tensor(f'h.{i}.attn.c_proj.bias')
tfm_block.mlp.fc1.weight.data = f.get_tensor(f'h.{i}.mlp.c_fc.weight').mT
# Conv1D stores c_fc.weight as [768, 3072]; our nn.Linear expects [3072, 768], hence the transpose
tfm_block.mlp.fc1.bias.data = f.get_tensor(f'h.{i}.mlp.c_fc.bias')
tfm_block.mlp.fc2.weight.data = f.get_tensor(f'h.{i}.mlp.c_proj.weight').mT
tfm_block.mlp.fc2.bias.data = f.get_tensor(f'h.{i}.mlp.c_proj.bias')
return mygpt_model
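Finally, a minimal usage sketch. It assumes you have downloaded config.json and model.safetensors for the gpt2 checkpoint from Hugging Face into a local directory (here hypothetically named ./gpt2), and it uses the tiktoken library, which is not used anywhere else in this post, for the GPT-2 tokenizer:
import torch
import tiktoken  # GPT-2 BPE tokenizer, pip install tiktoken

enc = tiktoken.get_encoding('gpt2')
model = MyGPT.from_safetensors('./gpt2')  # hypothetical directory containing config.json and model.safetensors
model.eval()  # disable dropout for generation

prompt = 'Hello, my name is'
x_id = torch.LongTensor([enc.encode(prompt)])
with torch.no_grad():
    y_id = model.generate(x_id, max_new_tokens=20, temperature=0.8, top_k=50)
print(enc.decode(y_id[0].tolist()))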