The core of the Transformer architecture consists of the input/output embeddings, multi-head attention, and a feed-forward network. Earlier posts covered the embeddings and the attention mechanism; this post uses the feed-forward network to tie them together into a complete GPT model. A feedforward neural network (FNN) is the most basic neural-network structure: data flows from the input layer through hidden layers to the output layer, with no feedback loops. It maps inputs to outputs and reconciles mismatched dimensions between layers. Below, we stub out the attention mechanism and other complex components with placeholders to build a skeleton GPT model with no real content. A GPT model configuration looks like this:
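To make the input→hidden→output idea concrete, here is a minimal sketch of the kind of feed-forward block used inside a Transformer layer: it expands the embedding dimension, applies a non-linearity, and projects back, with no feedback loop. The class name and the 4x expansion factor are illustrative conventions, not part of the skeleton model below.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Minimal feed-forward block: input -> hidden -> output, no feedback loop."""
    def __init__(self, emb_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim),  # expand to the hidden dimension
            nn.GELU(),                        # non-linearity
            nn.Linear(4 * emb_dim, emb_dim),  # project back to emb_dim
        )

    def forward(self, x):
        return self.layers(x)

x = torch.randn(2, 3, 768)   # (batch, tokens, emb_dim)
out = FeedForward(768)(x)
print(out.shape)             # output dims match the input: (2, 3, 768)
```

Because the output dimension matches the input dimension, such a block can be stacked inside each Transformer layer without any shape bookkeeping.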
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-Key-Value bias
}
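As a quick sanity check on these numbers: the token-embedding matrix is vocab_size × emb_dim and the positional-embedding matrix is context_length × emb_dim, so the embeddings alone already account for a large share of the model's weights. A minimal sketch, using only the config values above:

```python
# Parameter counts implied by GPT_CONFIG_124M, embeddings only.
vocab_size, context_length, emb_dim = 50257, 1024, 768

tok_emb_params = vocab_size * emb_dim       # token-embedding matrix
pos_emb_params = context_length * emb_dim   # positional-embedding matrix

print(tok_emb_params)  # 38597376
print(pos_emb_params)  # 786432
```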
The model builds its network from these parameters. First comes the embedding layer, which adds a token embedding and a positional embedding together, followed by dropout to guard against overfitting. Next come the Transformer blocks, typically 6 to 12 of them (GPT-2 uses 12); each block contains self-attention, a feed-forward network, and residual connections. A final layer normalization then standardizes the output features (mean 0, variance 1), which speeds up convergence and stabilizes training. Finally, a linear head maps the Transformer output (dimension emb_dim) to the vocabulary size (vocab_size), producing a score (logit) for every token. From these logits we can generate text.
import torch
import torch.nn as nn

class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        # Use a placeholder for TransformerBlock
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        # Use a placeholder for LayerNorm
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # A simple placeholder

    def forward(self, x):
        # This block does nothing and just returns its input.
        return x

class LayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
        # The parameters here are just to mimic the LayerNorm interface.

    def forward(self, x):
        # This layer does nothing and just returns its input.
        return x
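The LayerNorm above is a do-nothing stand-in. A working version that matches the normalization described earlier (shift each feature vector to mean 0 and variance 1, with learnable scale and shift) could be sketched as follows; the `scale`/`shift` parameter names are illustrative, not part of the original code:

```python
import torch
import torch.nn as nn

class WorkingLayerNorm(nn.Module):
    """Layer normalization over the last (embedding) dimension."""
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps                                   # guards against division by zero
        self.scale = nn.Parameter(torch.ones(emb_dim))   # learnable gain
        self.shift = nn.Parameter(torch.zeros(emb_dim))  # learnable bias

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

# After normalization, every token's feature vector has mean ~0 and variance ~1.
x = torch.randn(2, 4, 8)
out = WorkingLayerNorm(8)(x)
print(out.mean(dim=-1))  # values close to 0
```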
With the model's output in hand we can generate text: the model produces logits over the vocabulary for every position; we take the logits of the last token, normalize them with softmax, use argmax to find the index with the highest probability, and look that index up in the vocabulary to obtain the next token.
def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is (batch, n_tokens) array of indices in the current context
    for _ in range(max_new_tokens):
        # Crop current context if it exceeds the supported context size
        # E.g., if LLM supports only 5 tokens, and the context size is 10
        # then only the last 5 tokens are used as context
        idx_cond = idx[:, -context_size:]

        # Get the predictions
        with torch.no_grad():
            logits = model(idx_cond)

        # Focus only on the last time step
        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
        logits = logits[:, -1, :]

        # Apply softmax to get probabilities
        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)

        # Get the idx of the vocab entry with the highest probability value
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)

        # Append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)

    return idx
Now we use the model to generate tokens:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

start_context = "Hello, I am"
encoded = tokenizer.encode(start_context)
print("encoded:", encoded)

encoded_tensor = torch.tensor(encoded).unsqueeze(0)
print("encoded_tensor.shape:", encoded_tensor.shape)

torch.manual_seed(123)               # fix the random weights for reproducibility
model = GPTModel(GPT_CONFIG_124M)    # instantiate the model before using it
model.eval()  # disable dropout

out = generate_text_simple(
    model=model,
    idx=encoded_tensor,
    max_new_tokens=6,
    context_size=GPT_CONFIG_124M["context_length"]
)
print("Output:", out)
print("Output length:", len(out[0]))

decoded_text = tokenizer.decode(out.squeeze(0).tolist())
print(decoded_text)
The output:
encoded: [15496, 11, 314, 716]
encoded_tensor.shape: torch.Size([1, 4])
Output: tensor([[15496, 11, 314, 716, 27018, 24086, 47843, 30961, 42348, 7267]])
Output length: 10
Hello, I am Featureiman Byeswickattribute argue
With that, we have gone from building the model to predicting text. The continuation is gibberish because the weights are still random: we have not yet solved the problem of how to train the model and obtain its parameters. We will break that down in the next chapter.