conv1d function (convhull function), 2025

Tech Frontier • 2025-05-28 21:00 • 37 reads


To make the NF4 algorithm easier to understand, this post implements quantization and dequantization functions in PyTorch whose precision can be aligned with the CUDA NF4 implementation, and tests them on the llama-3.1-8b model. At the operator level the results essentially match the CUDA implementation (only dequantization shows a small error). The model output was also tested: 64 generated tokens are identical to those produced with the CUDA implementation.


All results below were obtained only on an RTX 3090 with llama-3.1-8b and may not be representative of other devices or models.

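In the CUDA implementation, the quantization function compares the input value with the midpoints between adjacent entries of the NF4 table, with both operands held in the same data type, and from these comparisons obtains the index of the table entry nearest to the input.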
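The midpoint comparison described above can be sketched in PyTorch as follows. This is a minimal illustration, not the bitsandbytes code: the names `NF4_TABLE`, `NF4_MIDPOINTS`, and `quantize_nf4_indices` are hypothetical, `float32` is assumed as the comparison type, and the midpoints are computed from the table here rather than taken from constants in the CUDA source, so tie handling and last-bit behavior may differ from the kernel. The input is assumed to be already scaled into [-1, 1].

```python
import torch

# The 16 NF4 code values (as published with QLoRA / bitsandbytes), in float32.
NF4_TABLE = torch.tensor([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
], dtype=torch.float32)

# Assumption: midpoints between adjacent entries, computed in float32.
NF4_MIDPOINTS = (NF4_TABLE[:-1] + NF4_TABLE[1:]) / 2


def quantize_nf4_indices(x: torch.Tensor) -> torch.Tensor:
    """Map values scaled into [-1, 1] to the index (0..15) of the nearest NF4 entry."""
    # Keep both operands of the comparison in the same dtype.
    x = x.to(torch.float32)
    # bucketize counts how many midpoints are strictly below x,
    # which is the index of the nearest table entry.
    return torch.bucketize(x, NF4_MIDPOINTS).to(torch.uint8)


# Usage: every table entry should map back to its own index.
print(quantize_nf4_indices(NF4_TABLE).tolist())
```

Using `torch.bucketize` replaces the chain of `if` comparisons with a single vectorized lookup; values outside [-1, 1] simply saturate to index 0 or 15.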


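The PyTorch implementation therefore also has to keep both operands of this comparison in that data type. Tests on real llama3 weight data show: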

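The PyTorch quantization function matches the CUDA implementation exactly, with no precision error;
The dequantization function shows a small mean absolute error, which does not affect the model output.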
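For readers who want to see how such a check could be set up, here is a hedged sketch of blockwise quantization, dequantization by table lookup, and a mean-absolute-error helper. All function names are hypothetical, the block size of 64 is an assumption, and packing two 4-bit indices into one byte is left out. No CUDA reference output is available in this sketch, so the usage compares the round trip with the original weights; that measures quantization error, not the PyTorch-versus-CUDA difference reported above.

```python
import torch

# The 16 NF4 code values, in float32.
NF4_TABLE = torch.tensor([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
], dtype=torch.float32)


def quantize_blockwise(w: torch.Tensor, blocksize: int = 64):
    """Return per-element NF4 indices and one absmax scale per block."""
    blocks = w.to(torch.float32).reshape(-1, blocksize)
    absmax = blocks.abs().amax(dim=1, keepdim=True)
    # Scale each block into [-1, 1]; the clamp avoids division by zero.
    scaled = blocks / absmax.clamp_min(1e-12)
    midpoints = (NF4_TABLE[:-1] + NF4_TABLE[1:]) / 2
    return torch.bucketize(scaled, midpoints), absmax


def dequantize_blockwise(idx: torch.Tensor, absmax: torch.Tensor, shape) -> torch.Tensor:
    """Look up each index in the table and rescale by the block's absmax."""
    return (NF4_TABLE[idx] * absmax).reshape(shape)


def mean_abs_error(a: torch.Tensor, b: torch.Tensor) -> float:
    """Mean absolute difference between two tensors, computed in float32."""
    return (a.to(torch.float32) - b.to(torch.float32)).abs().mean().item()


# Usage: round-trip a random weight matrix and report the error.
torch.manual_seed(0)
w = torch.randn(128, 64)
idx, absmax = quantize_blockwise(w)
w_hat = dequantize_blockwise(idx, absmax, w.shape)
print(mean_abs_error(w, w_hat))
```

To compare two implementations instead, the same `mean_abs_error` helper would be called on the two dequantized tensors produced from identical indices and scales.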

Replacing the CUDA implementation in bitsandbytes with these two functions produces model output that is identical over 64 tokens:

<|begin_of_text|>Once upon a time, 20 years ago, I was a young, idealistic, and naive college student. I was also a young, idealistic, and naive college student who was a member of the Young Republicans Club. I was also a young, idealistic, and naive college student who was a member of the Young Republicans Club who was

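The PyTorch implementation does come with a performance cost. Quantizing the 8B model takes about 10 s with the PyTorch implementation, up from 3 s with CUDA. Generating 64 tokens takes 28.012 s with the PyTorch version (this is affected only by the performance of the dequantization function), whereas the CUDA implementation needs only 3.65512 s.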

Precision comparison script:
