The core of the Transformer architecture consists of the input/output embeddings, multi-head attention, and a feed-forward network. Earlier posts covered the embeddings and the attention mechanism; this post uses the feed-forward network to tie them together into a complete GPT model. A feedforward neural network (FNN) is the most basic neural-network structure: data flows from the input layer through hidden layers to the output layer, with no feedback loops. It maps inputs to outputs and reconciles mismatched dimensions between layers. Below, we stub out the attention mechanism and other complex components with placeholders to build a skeleton GPT model with no real content. A GPT model configuration looks like this:
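To make the input→hidden→output idea concrete, here is a minimal sketch of the kind of feed-forward block used inside a Transformer layer: it expands the embedding dimension, applies a non-linearity, and projects back, with no feedback loop. The class name and the 4x expansion factor are illustrative conventions, not part of the skeleton model below.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Minimal feed-forward block: input -> hidden -> output, no feedback loop."""
    def __init__(self, emb_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim),  # expand to the hidden dimension
            nn.GELU(),                        # non-linearity
            nn.Linear(4 * emb_dim, emb_dim),  # project back to emb_dim
        )

    def forward(self, x):
        return self.layers(x)

x = torch.randn(2, 3, 768)   # (batch, tokens, emb_dim)
out = FeedForward(768)(x)
print(out.shape)             # output dims match the input: (2, 3, 768)
```

Because the output dimension matches the input dimension, such a block can be stacked inside each Transformer layer without any shape bookkeeping.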
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-Key-Value bias
}
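As a quick sanity check on these numbers: the token-embedding matrix is vocab_size × emb_dim and the positional-embedding matrix is context_length × emb_dim, so the embeddings alone already account for a large share of the model's weights. A minimal sketch, using only the config values above:

```python
# Parameter counts implied by GPT_CONFIG_124M, embeddings only.
vocab_size, context_length, emb_dim = 50257, 1024, 768

tok_emb_params = vocab_size * emb_dim       # token-embedding matrix
pos_emb_params = context_length * emb_dim   # positional-embedding matrix

print(tok_emb_params)  # 38597376
print(pos_emb_params)  # 786432
```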
The model builds its network from these parameters. First comes the embedding layer, which adds a token embedding and a positional embedding together, followed by dropout to guard against overfitting. Next come the Transformer blocks, typically 6 to 12 of them (GPT-2 uses 12); each block contains self-attention, a feed-forward network, and residual connections. A final layer normalization then standardizes the output features (mean 0, variance 1), which speeds up convergence and stabilizes training. Finally, a linear head maps the Transformer output (dimension emb_dim) to the vocabulary size (vocab_size), producing a score (logit) for every token. From these logits we can generate text.
import torch
import torch.nn as nn

class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        # Use a placeholder for TransformerBlock
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        # Use a placeholder for LayerNorm
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # A simple placeholder

    def forward(self, x):
        # This block does nothing and just returns its input.
        return x

class LayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
        # The parameters here are just to mimic the LayerNorm interface.

    def forward(self, x):
        # This layer does nothing and just returns its input.
        return x
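The LayerNorm above is a do-nothing stand-in. A working version that matches the normalization described earlier (shift each feature vector to mean 0 and variance 1, with learnable scale and shift) could be sketched as follows; the `scale`/`shift` parameter names are illustrative, not part of the original code:

```python
import torch
import torch.nn as nn

class WorkingLayerNorm(nn.Module):
    """Layer normalization over the last (embedding) dimension."""
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps                                   # guards against division by zero
        self.scale = nn.Parameter(torch.ones(emb_dim))   # learnable gain
        self.shift = nn.Parameter(torch.zeros(emb_dim))  # learnable bias

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

# After normalization, every token's feature vector has mean ~0 and variance ~1.
x = torch.randn(2, 4, 8)
out = WorkingLayerNorm(8)(x)
print(out.mean(dim=-1))  # values close to 0
```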
With the model's output in hand we can generate text: the model produces logits over the vocabulary for every position; we take the logits of the last token, normalize them with softmax, use argmax to find the index with the highest probability, and look that index up in the vocabulary to obtain the next token.
def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is (batch, n_tokens) array of indices in the current context
    for _ in range(max_new_tokens):
        # Crop current context if it exceeds the supported context size
        # E.g., if LLM supports only 5 tokens, and the context size is 10
        # then only the last 5 tokens are used as context
        idx_cond = idx[:, -context_size:]

        # Get the predictions
        with torch.no_grad():
            logits = model(idx_cond)

        # Focus only on the last time step
        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
        logits = logits[:, -1, :]

        # Apply softmax to get probabilities
        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)

        # Get the idx of the vocab entry with the highest probability value
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)

        # Append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)

    return idx
Now we use the model to generate tokens:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

start_context = "Hello, I am"
encoded = tokenizer.encode(start_context)
print("encoded:", encoded)

encoded_tensor = torch.tensor(encoded).unsqueeze(0)
print("encoded_tensor.shape:", encoded_tensor.shape)

torch.manual_seed(123)               # fix the random weights for reproducibility
model = GPTModel(GPT_CONFIG_124M)    # instantiate the model before using it
model.eval()  # disable dropout

out = generate_text_simple(
    model=model,
    idx=encoded_tensor,
    max_new_tokens=6,
    context_size=GPT_CONFIG_124M["context_length"]
)
print("Output:", out)
print("Output length:", len(out[0]))

decoded_text = tokenizer.decode(out.squeeze(0).tolist())
print(decoded_text)
The output:
encoded: [15496, 11, 314, 716]
encoded_tensor.shape: torch.Size([1, 4])
Output: tensor([[15496, 11, 314, 716, 27018, 24086, 47843, 30961, 42348, 7267]])
Output length: 10
Hello, I am Featureiman Byeswickattribute argue
With that, we have gone from building the model to predicting text. The continuation is gibberish because the weights are still random: we have not yet solved the problem of how to train the model and obtain its parameters. We will break that down in the next chapter.