Large Model Architecture: MoE

The module structure below is printed from modeling_mistral.py in the transformers library, using a tiny config (num_hidden_layers=2, intermediate_size=2) so the dump stays readable:


MistralModel(
  (embed_tokens): Embedding(32000, 4096)
  (layers): ModuleList(
    (0-1): 2 x MistralDecoderLayer(
      (self_attn): MistralSdpaAttention(
        (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
        (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
        (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (rotary_emb): MistralRotaryEmbedding()
      )
      (mlp): MistralMLP(
        (gate_proj): Linear(in_features=4096, out_features=2, bias=False)
        (up_proj): Linear(in_features=4096, out_features=2, bias=False)
        (down_proj): Linear(in_features=2, out_features=4096, bias=False)
        (act_fn): SiLU()
      )
      (input_layernorm): MistralRMSNorm()
      (post_attention_layernorm): MistralRMSNorm()
    )
  )
  (norm): MistralRMSNorm()
)
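
A couple of notes on the dump. k_proj and v_proj output 1024 features because Mistral uses grouped-query attention: 32 query heads of size 128 but only 8 key/value heads, so 8 × 128 = 1024. The MistralMLP is a gated (SwiGLU-style) feed-forward: gate_proj and up_proj run in parallel, the gate branch goes through SiLU, and the element-wise product is projected back with down_proj. Below is a minimal sketch of that computation, with illustrative names and defaults rather than the library's own code:

import torch
import torch.nn as nn

class GatedMLP(nn.Module):
    """Sketch of the gate/up/down pattern shown in the dump above."""
    def __init__(self, hidden_size=4096, intermediate_size=2):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act_fn = nn.SiLU()

    def forward(self, x):
        # down_proj(SiLU(gate_proj(x)) * up_proj(x))
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))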

Debug code:

import torch
import transformers

# Build a tiny MistralModel for debugging: 2 decoder layers, MLP intermediate size 2.
config = transformers.MistralConfig(num_hidden_layers=2, intermediate_size=2)
model = transformers.MistralModel(config)
print(model)  # prints the module structure shown above

# Forward pass on a dummy batch: one sequence of 3 token ids.
input_ids = torch.tensor([1, 2, 4]).unsqueeze(0)  # shape (1, 3)
outputs = model(input_ids)  # BaseModelOutputWithPast; last_hidden_state has shape (1, 3, 4096)
print(outputs)
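
The MistralModel printed here is dense: every token goes through the same MistralMLP in each layer. The MoE variant in the same family is Mixtral (modeling_mixtral.py in transformers), where that MLP is replaced by a sparse block that routes each token to the top 2 of 8 experts and mixes their outputs with router weights. Below is a minimal sketch of that top-k routing idea, reusing the GatedMLP sketch above as the expert; the class and parameter names are illustrative, not the library's API:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sketch of Mixtral-style top-k expert routing (illustrative, not the library code)."""
    def __init__(self, hidden_size=4096, intermediate_size=2, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)   # gating network
        self.experts = nn.ModuleList(
            [GatedMLP(hidden_size, intermediate_size) for _ in range(num_experts)]
        )

    def forward(self, x):                       # x: (num_tokens, hidden_size)
        logits = self.router(x)                 # (num_tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):             # for each token's k-th expert choice
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e           # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

Only the selected experts run for each token, so per-token compute stays close to a single dense MLP while the total parameter count grows with the number of experts.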
