2025年性能优化篇（1）：几种简单的访存优化手段

大家好，我是讯享网，很高兴认识大家。

性能优化篇（1）：几种简单的访存优化手段

Author:stormQ

Sunday, 10. November 2019 11:36AM

目录
- 减少不必要的内存引用
- 按顺序访问数据
- 按顺序存储同时要访问的数据

减少不必要的内存引用

示例：

void poor(const int *src, int n, int *dest) { for (int i = 0; i < n; ++i) { *dest += src[i]; } } void better(const int *src, int n, int *dest) { int sum = *dest; for (int i = 0; i < n; ++i) { sum += src[i]; } *dest = sum; }

讯享网

执行时间统计：

启动程序方式	第一次执行耗时(us)	第二次执行耗时(us)	第三次执行耗时(us)	第四次执行耗时(us)	第五次执行耗时(us)
./main_Og	poor: better:67489	poor: better:69189	poor: better:67232	poor: better:68082	poor: better:69484
valgrind --tool=cachegrind ./main_Og	poor: better:	poor: better:	poor: better:	poor: better:	poor: better:

从统计结果中可以看出，better() 函数的执行速度比 poor() 函数快 2.5 倍左右。

从汇编的角度分析：

讯享网; With -Og optimization ; Dump of assembler code for function poor(int const*, int, int*): 0x000000000040080c <+0>: mov w3, #0x0 // #0 0x0000000000 <+4>: cmp w3, w1 0x0000000000 <+8>: b.ge 0x <poor(int const*, int, int*)+36> 0x0000000000 <+12>: ldr w5, [x2] 0x000000000040081c <+16>: ldr w4, [x0,w3,sxtw #2] 0x0000000000 <+20>: add w4, w5, w4 0x0000000000 <+24>: str w4, [x2] 0x0000000000 <+28>: add w3, w3, #0x1 0x000000000040082c <+32>: b 0x <poor(int const*, int, int*)+4> 0x0000000000 <+36>: ret ; With -Og optimization ; Dump of assembler code for function better(int const*, int, int*): 0x0000000000 <+0>: ldr w4, [x2] 0x0000000000 <+4>: mov w3, #0x0 // #0 0x000000000040083c <+8>: cmp w3, w1 0x0000000000 <+12>: b.ge 0x <better(int const*, int, int*)+32> 0x0000000000 <+16>: ldr w5, [x0,w3,sxtw #2] 0x0000000000 <+20>: add w4, w4, w5 0x000000000040084c <+24>: add w3, w3, #0x1 0x0000000000 <+28>: b 0x40083c <better(int const*, int, int*)+8> 0x0000000000 <+32>: str w4, [x2] 0x0000000000 <+36>: ret

从汇编代码中可以看出：

使用-Og优化选项编译时，poor() 函数的 for 循环的一次迭代中：读内存操作数量为2，写内存操作数量为1。
使用-Og优化选项编译时，better() 函数的 for 循环的一次迭代中：读内存操作数量为1，写内存操作数量为0。better() 函数的内存读写数量较少，因此执行速度更快。

从 cache 性能的角度分析：

-------------------------------------------------------------------------------- I1 cache: 16384 B, 64 B, 4-way associative D1 cache: 16384 B, 64 B, 4-way associative LL cache:  B, 64 B, 8-way associative Command: ./b_Og Data file: cachegrind.out.18226 Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Thresholds: 0.1 100 100 100 100 100 100 100 100 Include dirs: User annotated: Auto-annotation: off -------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw -------------------------------------------------------------------------------- 1,993,889,018 1,004 864 315,036,517 13,121,792 13,115,591 209,899,117 6,555,946 6,555,006 PROGRAM TOTALS -------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function -------------------------------------------------------------------------------- 838,860,804 0 0 209,715,200 6,553,601 6,553,601 104,857,600 0 0 /home/b.cpp:poor(int const*, int, int*) 629,145,606 0 0 104,857,601 6,553,601 6,553,601 1 1 1 /home/b.cpp:better(int const*, int, int*) 524,288,004 1 1 0 0 0 104,857,600 6,553,600 6,553,600 /home/b.cpp:init(int*, int)

输出结果解析：

poor() 函数的读内存操作的数量为 209,715,200（Dr 列），读内存操作在 L1 级缓存中不命中的数量为 6,553,601（D1mr 列），读内存操作在 LL 级缓存（最后一级缓存）中不命中的数量为 6,553,601（DLmr 列）。
poor() 函数的写内存操作的数量为 104,857,600（Dw 列），写内存操作在 L1 级缓存中不命中的数量为 0（D1mw 列），写内存操作在 LL 级缓存（最后一级缓存）中不命中的数量为 0（DLmw 列）。
better() 函数的读内存操作的数量为 104,857,601（Dr 列），读内存操作在 L1 级缓存中不命中的数量为 6,553,601（D1mr 列），读内存操作在 LL 级缓存（最后一级缓存）中不命中的数量为 6,553,601（DLmr 列）。
better() 函数的写内存操作的数量为 1（Dw 列），写内存操作在 L1 级缓存中不命中的数量为 1（D1mw 列），写内存操作在 LL 级缓存（最后一级缓存）中不命中的数量为 1（DLmw 列）。

从上述数据中可以看出：1）poor() 函数的读内存操作的数量为 better() 函数的两倍；2）poor() 函数的写内存操作的数量比 better() 函数多 104,857,599（104,857,599 = 104,857,600 - 1）。这也验证了better() 函数执行速度较快的原因。

按顺序存储同时要访问的数据

示例：

讯享网void poor(const int *src, int *dest, int n) { for (int i = 0; i < n; ++i) { // 同时要访问的数据（src[i]、dest[i]）在两个数组中，即同时要访问的数据不是连续存储的 dest[i] = src[i]; } } struct Vec2 { Vec2() { static long long count = 0; a = count++; } int a; int b; }; void better(struct Vec2 *data, int n) { for (int i = 0; i < n; ++i) { // 同时要访问的数据（data[i].a、data[i].b）存储在同一个结构体中，即同时要访问的数据是连续存储的 data[i].b = data[i].a; } }

执行时间统计：

启动程序方式	第一次执行耗时(us)	第二次执行耗时(us)	第三次执行耗时(us)	第四次执行耗时(us)	第五次执行耗时(us)
./main_Og	poor:8591 better:2714	poor:4657 better:1936	poor:7424 better:3436	poor:8976 better:1937	poor:4866 better:1931
valgrind --tool=cachegrind ./main_Og	poor:60573 better:52037	poor:59379 better:51119	poor:60963 better:51360	poor:59742 better:51622	poor:64674 better:52330

从统计结果中可以看出，better() 函数的执行速度比 poor() 函数快 2 倍左右。

统计 cache 使用情况：

-------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw . . . . . . . . . void poor(const int *src, int *dest, int n) . . . . . . . . . { 8,294,404 0 0 1 1 1 0 0 0 for (int i = 0; i < n; ++i) . . . . . . . . . { 6,220,800 0 0 2,073,600 129,601 129,601 2,073,600 129,601 129,601 dest[i] = src[i]; . . . . . . . . . } . . . . . . . . . } . . . . . . . . . . . . . . . . . . struct Vec2 . . . . . . . . . { . . . . . . . . . Vec2() . . . . . . . . . { . . . . . . . . . static long long count = 0; 8,294,400 0 0 2,073,600 1 1 4,147,200 259,200 259,200 a = count++; . . . . . . . . . } . . . . . . . . . int a; . . . . . . . . . int b; . . . . . . . . . }; . . . . . . . . . . . . . . . . . . void better(struct Vec2 *data, int n) . . . . . . . . . { 8,294,404 0 0 1 1 1 0 0 0 for (int i = 0; i < n; ++i) . . . . . . . . . { 8,294,400 0 0 2,073,600 259,201 259,201 2,073,600 0 0 data[i].b = data[i].a; . . . . . . . . . } . . . . . . . . . }

输出结果解析：

poor() 和 better() 函数的内存读操作的数量是相同的，而 better() 函数的 D1mr、DLmr 比 poor() 函数分别多次、次。
poor() 和 better() 函数的内存写操作的数量是相同的，而 poor() 函数的 D1mw、DLmw 比 better() 函数分别多次、次。
从内存读和写操作的不命中总数量来看，两者是相同的。但为什么 better() 函数的执行速度比 poor() 函数快 2 倍？

按顺序访问数据

按数据在内存中排列先后顺序进行访问（读或写）时，通常会降低 cache 的缺失率，即减少访问内存的次数，从而执行速度更快。比如：C/C++ 中的多维数组是以行主序（存储完一行后再存储下一行）存储在内存中的，那么在循环中访问完一行后再访问下一行的方式更高效。Fortran 中的多维数组是以列主序（存储完一列后再存储下一列）存储在内存中的，那么在循环中访问完一列后再访问下一列的方式更高效。

示例：访问二维整型数组

// 按列访问数组元素 long long poor(const int *src, int rows, int cols) { long long sum = 0; for (int i = 0; i < cols; ++i) { for (int j = 0; j < rows; ++j) { sum += *(src + j * cols + i); } } return sum; } // 按行访问数组元素 long long better(const int *src, int rows, int cols) { long long sum = 0; for (int i = 0; i < rows; ++i) { for (int j = 0; j < cols; ++j) { sum += *(src + i * cols + j); } } return sum; }

执行时间统计：

启动程序方式	第一次执行耗时(us)	第二次执行耗时(us)	第三次执行耗时(us)	第四次执行耗时(us)	第五次执行耗时(us)
./main_Og	poor:12575 better:2479	poor:12661 better:2240	poor:12687 better:2313	poor:12387 better:2241	poor:12493 better:2230
valgrind --tool=cachegrind ./main_Og	poor: better:54882	poor:99056 better:56708	poor:96098 better:57031	poor:97853 better:57691	poor:95651 better:57502

从统计结果中可以看出，better() 函数的执行速度比 poor() 函数快 5 倍左右。

统计 cache 使用情况：

-------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw . . . . . . . . . long long poor(const int *src, int rows, int cols) . . . . . . . . . { 2 0 0 0 0 0 0 0 0 long long sum = 0; 4,323 0 0 0 0 0 0 0 0 for (int i = 0; i < cols; ++i) . . . . . . . . . { 8,296,560 1 1 0 0 0 0 0 0 for (int j = 0; j < rows; ++j) . . . . . . . . . { 14,515,200 0 0 2,073,600 2,073,600 76,630 0 0 0 sum += *(src + j * cols + i); . . . . . . . . . } . . . . . . . . . } . . . . . . . . . return sum; 1 0 0 1 1 1 0 0 0 } . . . . . . . . . . . . . . . . . . long long better(const int *src, int rows, int cols) . . . . . . . . . { 2 0 0 0 0 0 0 0 0 long long sum = 0; 7,683 0 0 0 0 0 0 0 0 for (int i = 0; i < rows; ++i) . . . . . . . . . { 8,298,240 0 0 0 0 0 0 0 0 for (int j = 0; j < cols; ++j) . . . . . . . . . { 14,515,200 0 0 2,073,600 129,601 74,916 0 0 0 sum += *(src + i * cols + j); . . . . . . . . . } . . . . . . . . . } . . . . . . . . . return sum; 1 0 0 1 1 1 0 0 0 }

输出结果解析：

poor() 和 better() 函数的读内存操作的数量是相同的。
poor() 和 better() 函数在 L1 级缓存的读数据操作不命中数量差距很大，前者的读数据操作不命中数量为后者的 16 倍。这正是 better() 函数执行速度快于 poor() 函数的原因。

如果你觉得本文对你有所帮助，欢迎关注公众号，支持一下！

在这里插入图片描述

2025年性能优化篇（1）：几种简单的访存优化手段

性能优化篇（1）：几种简单的访存优化手段

减少不必要的内存引用

按顺序存储同时要访问的数据

按顺序访问数据

相关推荐