The Hardware Knowledge that Every Programmer Should Know
When pursuing high-performance code, we inevitably encounter performance bottlenecks. To understand why a piece of code is inefficient and try to improve it, we need to have a basic understanding of how hardware works. We might search for introductory guides on certain architectures, optimization techniques, or read computer science textbooks (e.g., Computer Organization). However, this approach can be too cumbersome and detailed, causing us to get lost in the intricacies and fail to apply our knowledge effectively.
This article aims to introduce common optimization details and related hardware concepts through multiple runnable benchmarks. It establishes a simple and effective hardware mental model for readers.
#### **Cache**
First, let’s discuss cache. We’ll start with a classic example from CSAPP:
```rust
pub fn row_major_traversal(arr: &mut Vec<Vec<usize>>) {
    let n = arr.len();
    for i in 0..n {
        assert!(arr[i].len() == n);
        for j in 0..n {
            arr[i][j] += j;
        }
    }
}

pub fn column_major_traversal(arr: &mut Vec<Vec<usize>>) {
    let n = arr.len();
    for i in 0..n {
        assert!(arr[i].len() == n);
        for j in 0..n {
            arr[j][i] += j;
        }
    }
}
```
In the two functions above, we iterate over the same 2D array in row-major and column-major order, respectively.
We benchmark both functions:
In the graph above, the y-axis is the average execution time and the x-axis is the array size (e.g., 2000.0 means a 2000 × 2000 array). Row-major iteration is roughly 10 times faster than column-major iteration.
In the modern memory hierarchy, there are one or more levels of cache between the CPU and main memory. Moving from registers to caches to main memory, read/write speed drops sharply (while capacity grows).
When the CPU reads data, it first looks in the cache. On a cache miss, the data is loaded from main memory into the cache and then read. Note that the CPU transfers data in units of cache lines: when reading `arr[i][j]`, the neighboring elements in the same cache line (e.g., `arr[i][j + 1]`) are loaded into the cache as well. In row-major iteration, the next element accessed, `arr[i][j + 1]`, sits right next to the current one in memory, so it is almost always served from the cache. In column-major iteration, consecutive accesses `arr[j][i]` and `arr[j + 1][i]` live in different rows and are generally far apart in memory, so nearly every access is a cache miss, which causes the large performance gap.
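The article's plots come from a benchmark harness; as a rough, self-contained way to observe the same gap, a simple `std::time::Instant` timing of the two functions above (my own sketch, not the original harness) looks like this:

```rust
use std::time::Instant;

fn main() {
    let n = 2000;
    let mut arr = vec![vec![0usize; n]; n];

    // Touch the array once so the timed runs don't pay for page faults
    // on the freshly allocated memory.
    row_major_traversal(&mut arr);

    let t = Instant::now();
    row_major_traversal(&mut arr);
    println!("row-major:    {:?}", t.elapsed());

    let t = Instant::now();
    column_major_traversal(&mut arr);
    println!("column-major: {:?}", t.elapsed());
}
```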
#### **Prefetcher**
If we no longer traverse the array in a particular order, but instead randomly traverse it, what would be the result?
```rust
pub fn row_major_traversal(arr: &mut Vec<Vec<usize>>) {
    let n = arr.len();
    for i in 0..n {
        assert!(arr[i].len() == n);
        // generate and discard a random number; black_box keeps the call
        // from being optimized away
        let ri: usize = rand::random();
        std::hint::black_box(ri);
        for j in 0..n {
            arr[i][j] += j;
        }
    }
}

pub fn column_major_traversal(arr: &mut Vec<Vec<usize>>) {
    let n = arr.len();
    for i in 0..n {
        assert!(arr[i].len() == n);
        let ri: usize = rand::random();
        std::hint::black_box(ri);
        for j in 0..n {
            arr[j][i] += j;
        }
    }
}

pub fn random_access(arr: &mut Vec<Vec<usize>>) {
    let n = arr.len();
    for i in 0..n {
        assert!(arr[i].len() == n);
        for j in 0..n {
            let ri: usize = rand::random();
            arr[j][ri % n] += j;
        }
    }
}
```
In theory, both random access and column-major traversal cause frequent cache misses, so they should perform similarly. The benchmark says otherwise:
As the graph shows, `random_access` is roughly 2 times slower than `column_major`. The reason is the CPU's hardware prefetcher.
We can refer to the information on Wikipedia:
Cache prefetching can be accomplished either by hardware or by software.
- Hardware based prefetching is typically accomplished by having a dedicated hardware mechanism in the processor that watches the stream of instructions or data being requested by the executing program, recognizes the next few elements that the program might need based on this stream and prefetches into the processor’s cache.
- Software based prefetching is typically accomplished by having the compiler analyze the code and insert additional “prefetch” instructions in the program during compilation itself.

`random_access` defeats the prefetcher, since its next address cannot be predicted from the access stream, so it runs even slower than the already cache-unfriendly `column_major` traversal.
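Software prefetching is also available to the programmer. As a purely illustrative sketch (not used in the benchmarks above, x86-64 only, and the lookahead distance of 8 elements is an arbitrary assumption), explicit hints can be issued with the `_mm_prefetch` intrinsic when the upcoming indices are known in advance:

```rust
#[cfg(target_arch = "x86_64")]
pub fn sum_with_prefetch(data: &[u64], indices: &[usize]) -> u64 {
    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

    let mut sum = 0u64;
    for (k, &i) in indices.iter().enumerate() {
        // Ask the cache hierarchy to start fetching the element we will
        // need a few iterations from now, hiding part of the memory latency.
        // `indices` is assumed to contain only valid indices into `data`.
        if let Some(&next) = indices.get(k + 8) {
            unsafe { _mm_prefetch::<_MM_HINT_T0>(data.as_ptr().add(next) as *const i8) };
        }
        sum += data[i];
    }
    sum
}
```

Whether this actually helps depends heavily on the workload; hardware prefetchers already cover regular access patterns well.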
#### **Cache Associativity**
What happens if we iterate over an array with different strides?
```rust
use std::ops::AddAssign;

pub fn iter_with_step(arr: &mut Vec<usize>, step: usize) {
    let n = arr.len();
    let mut i = 0;
    for _ in 0..1_000_000 {
        unsafe { arr.get_unchecked_mut(i).add_assign(1); }
        i = (i + step) % n;
    }
}
```
The `steps` we test are:
```rust
let steps = [
    1,
    2,
    4,
    7, 8, 9,
    15, 16, 17,
    31, 32, 33,
    61, 64, 67,
    125, 128, 131,
    253, 256, 259,
    509, 512, 515,
    1019, 1024, 1031,
];
```
We perform a test:
As can be seen, when `step` is a power of 2, there is a sudden spike in running time. Why is this the case? To answer this question we need to review a little computer organization.
The cache is much smaller than main memory, so locations in main memory must somehow be mapped to locations in the cache. Generally speaking, there are three ways to do this:
**Fully Associative Mapping**
Fully associative mapping allows a block of main memory to be placed in any line of the cache. This is maximally flexible, but looking up a block requires comparing against every cache line, which slows the cache down.
**Direct Mapping**
Direct mapping, on the other hand, maps each block of main memory to exactly one cache line. Lookup is fast, but the scheme is inflexible: blocks that map to the same line conflict with each other, which often leads to frequent cache misses.
**Set Associative Mapping**
Set associative mapping combines the advantages of the two: the cache is divided into sets, each containing several lines, and within a set the mapping is fully associative. If each set contains n lines, the scheme is called n-way set associative.
Let's go back to a real CPU, such as the AMD Ryzen 7 4700U.
We can see that the `L1 cache` is listed as 4 × 32 KB (128 KB), 8-way set associative, with a 64-byte cache line. Taking the 128 KB as a whole, that is 2048 cache lines divided into 256 sets. When we iterate with a stride `step` that is a large power of 2, the accessed addresses map to only a few of these sets; the lines in those sets keep evicting each other, so cache misses become frequent and performance drops.
(Note: my CPU is an Intel i7-10750H. Adding up its per-core L1 data caches the same way gives 384 sets, which is not a power of two and does not fit the explanation above; the numbers do work out if you instead consider a single core's 32 KB, 8-way L1 data cache, which has 64 sets.)
#### **False Sharing**
Let’s take a look at another set of benchmarks.
```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

fn increase(v: &AtomicUsize) {
    for _ in 0..10000 {
        v.fetch_add(1, Ordering::Relaxed);
    }
}
```
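The `share` and `false_share` functions themselves are not reproduced in this excerpt; a minimal sketch of what they might look like, reusing `increase` and the imports above (scoped threads, one shared atomic versus four adjacent ones), is:

```rust
pub fn share() {
    let v = AtomicUsize::new(0);
    thread::scope(|s| {
        // Four threads all hammer the same atomic variable.
        for _ in 0..4 {
            s.spawn(|| increase(&v));
        }
    });
}

pub fn false_share() {
    let a = AtomicUsize::new(0);
    let b = AtomicUsize::new(0);
    let c = AtomicUsize::new(0);
    let d = AtomicUsize::new(0);
    thread::scope(|s| {
        // Each thread gets its own atomic, but the four atomics are likely
        // to sit next to each other on the stack, inside one cache line.
        s.spawn(|| increase(&a));
        s.spawn(|| increase(&b));
        s.spawn(|| increase(&c));
        s.spawn(|| increase(&d));
    });
}
```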
In the `share` function, four threads compete for the single atomic variable `v`. In the `false_share` function, each of the four threads operates on its own atomic variable, so in theory there is no contention between threads, and we would expect `false_share` to run faster than `share`. The benchmark, however, shows an unexpected result:

As can be seen from the figure, `false_share` has a lower execution efficiency than `share`.
In the previous section we mentioned that the CPU loads data from main memory into the cache in units of one `cache line`. On my machine the `cache line` size is 64 bytes. In the `false_share` function, the four atomic variables may end up laid out on the stack like this:

The four atomic variables `a`, `b`, `c`, and `d` fall within the same `cache line`, so the four threads are in reality still competing for the same line. This phenomenon is called `false sharing`.
So how can we solve this problem? We can use the attribute `#[repr(align(64))]` (other languages have equivalent mechanisms) to tell the compiler to align each atomic variable to the `cache line` size, so that the four atomic variables no longer end up in the same `cache line`.
```rust
fn increase_2(v: &AlignAtomicUsize) {
    for _ in 0..10000 {
        v.v.fetch_add(1, Ordering::Relaxed);
    }
}

#[repr(align(64))]
struct AlignAtomicUsize {
    v: AtomicUsize,
}

impl AlignAtomicUsize {
    pub fn new(val: usize) -> Self {
        Self { v: AtomicUsize::new(val) }
    }
}
pub fn non_share() {
    let a = AlignAtomicUsize::new(0);
    let b = AlignAtomicUsize::new(0);
    let c = AlignAtomicUsize::new(0);
    let d = AlignAtomicUsize::new(0);
    thread::scope(|s| {
        let ta = s.spawn(|| increase_2(&a));
        let tb = s.spawn(|| increase_2(&b));
        let tc = s.spawn(|| increase_2(&c));
        let td = s.spawn(|| increase_2(&d));
        ta.join().unwrap();
        tb.join().unwrap();
        tc.join().unwrap();
        td.join().unwrap();
    });
}
```
We run the benchmark again, and this time the results match our prediction:
As we can see, `non_share` is nearly twice as fast as the previous two.
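As a side note, instead of hand-rolling the aligned wrapper, the `crossbeam-utils` crate provides a `CachePadded<T>` type for exactly this purpose. A small sketch, assuming that crate as a dependency:

```rust
use crossbeam_utils::CachePadded;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

fn main() {
    // Each counter is padded and aligned to a cache line, so the threads
    // below do not falsely share a line even though the values sit in
    // one contiguous Vec.
    let counters: Vec<CachePadded<AtomicUsize>> =
        (0..4).map(|_| CachePadded::new(AtomicUsize::new(0))).collect();

    thread::scope(|s| {
        for c in &counters {
            s.spawn(move || {
                for _ in 0..10_000 {
                    c.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });

    let total: usize = counters.iter().map(|c| c.load(Ordering::Relaxed)).sum();
    assert_eq!(total, 40_000);
}
```

`CachePadded` pads its contents to the platform's assumed cache-line size, which is the same idea as the `#[repr(align(64))]` wrapper above.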
#### **Pipeline**
Let's look at another benchmark:
```rust
use rand::seq::SliceRandom;

pub trait Get {
    fn get(&self) -> i32;
}

struct Foo { /* ... */ }
struct Bar { /* ... */ }

impl Get for Foo { /* ... */ }
impl Get for Bar { /* ... */ }

let mut arr: Vec<Box<dyn Get>> = vec![];
for _ in 0..10000 {
    arr.push(Box::new(Foo::new()));
}
for _ in 0..10000 {
    arr.push(Box::new(Bar::new()));
}

// At this point, the elements in the array are ordered (grouped by type)
arr.iter().filter(|v| v.get() > 0).count();

// Shuffle the array
arr.shuffle(&mut rand::thread_rng());

// Test again
arr.iter().filter(|v| v.get() > 0).count();
```
The test results are:
As we can see, the difference in efficiency between `sorted` and `unsorted` is approximately 2.67×. This is caused by frequent branch-prediction failures.
In a modern CPU, executing an instruction is split into multiple stages, and a structure called the `pipeline` lets different instructions occupy different stages at the same time.
For the `pipeline` to work efficiently, while instruction A is executing, the following instructions (B, C, D, etc.) must already be fetched. In straight-line code the `pipeline` works well, but when it reaches a branch there is a problem:
As shown in the figure, should `pipeline` read `Code A` or `Code B`? Before executing the branch statement, nobody knows, including the CPU. If we want `pipeline` to work efficiently, we're left with only one option: guess. As long as our guesses are good enough, our efficiency can approach that of a branchless situation. This "guessing" step also has a professional name - **branch prediction**.
With the help of compilers and prediction algorithms, modern CPUs reach up to 99% accuracy on some branches (as shown in the figure).
Mispredicting a branch has a cost, however. First, the instructions in the `pipeline` that should not be executed must be flushed. Then the correct instructions must be fetched into the `pipeline` one by one, and each of them still has to pass through every stage before it completes.
In our test code, when we shuffle the array, branch prediction fails frequently, leading to a decrease in execution efficiency.
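To reproduce the experiment end to end, here is a hypothetical, self-contained completion of the benchmark above; the `Foo`/`Bar` bodies and the timing harness are my own filler (the original elides them), so treat it as a sketch rather than the author's exact code:

```rust
use rand::seq::SliceRandom;
use std::time::Instant;

pub trait Get {
    fn get(&self) -> i32;
}

// Hypothetical filler types: Foo returns a positive value, Bar a negative one.
struct Foo(i32);
struct Bar(i32);

impl Get for Foo {
    fn get(&self) -> i32 { self.0 }
}
impl Get for Bar {
    fn get(&self) -> i32 { -self.0 }
}

fn count_positive(arr: &[Box<dyn Get>]) -> usize {
    arr.iter().filter(|v| v.get() > 0).count()
}

fn main() {
    let mut arr: Vec<Box<dyn Get>> = Vec::new();
    for i in 0..10_000 {
        arr.push(Box::new(Foo(i + 1)));
    }
    for i in 0..10_000 {
        arr.push(Box::new(Bar(i + 1)));
    }

    // Grouped by type: both the `get() > 0` branch and the indirect call
    // target are highly predictable.
    let t = Instant::now();
    let sorted = count_positive(&arr);
    println!("sorted:   {} in {:?}", sorted, t.elapsed());

    arr.shuffle(&mut rand::thread_rng());

    // Shuffled: the predictor guesses wrong far more often.
    let t = Instant::now();
    let shuffled = count_positive(&arr);
    println!("shuffled: {} in {:?}", shuffled, t.elapsed());
}
```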
#### **Data Dependence**
Let's take a look at another piece of code:
```rust
pub fn dependent(a: &mut Vec<i32>, b: &mut Vec<i32>, c: &Vec<i32>) {
    assert!(a.len() == 100000);
    assert!(b.len() == 100000);
    assert!(c.len() == 100000);
    for i in 0..=99998 {
        a[i] += b[i];
        b[i + 1] += c[i];
    }
    a[99999] += b[99999];
}

pub fn independent(a: &mut Vec<i32>, b: &mut Vec<i32>, c: &Vec<i32>) {
    assert!(a.len() == 100000);
    assert!(b.len() == 100000);
    assert!(c.len() == 100000);
    a[0] += b[0];
    for i in 0..=99998 {
        b[i + 1] += c[i];
        a[i + 1] += b[i + 1];
    }
}
```
In this code, we iterate over the array using two different methods to achieve the same effect. We draw the data flow graph as shown in the figure:
In the figure, arrows represent dependencies (e.g. `a[0] -> b[0]` means that the result of `a[0]` depends on `b[0]`). Black arrows mark operations outside the loop, and different colors mark different iterations. In `dependent`, arrows of different colors point into the same data-flow chain (e.g. `a[1] -> b[1] -> c[0]`), meaning that iteration n+1 depends on the result of iteration n; in `independent` this is not the case.
What effect will this have? Let’s take a look at the test results:
We can see a nearly 3× difference in efficiency, for two reasons.
First, the data dependence reduces the effectiveness of the `pipeline` and of the CPU's instruction-level parallelism.
Second, the dependency between iterations prevents the compiler from auto-vectorizing the loop. To see this, look at the equivalent C++ code (Rust 1.71's optimizer is unfortunately not able to vectorize `independent`, which is a bit disappointing):
```cpp
#include <vector>

using i32 = int;
template<typename T>
using Vec = std::vector<T>;

// dependent(...) and independent(...) mirror the Rust functions above;
// their bodies are omitted in this excerpt.
```
The generated assembly is:
```asm
dependent(...):
mov rax, rdx
mov rdx, QWORD PTR [rsi]
mov rcx, QWORD PTR [rdi]
mov rdi, QWORD PTR [rax]
xor eax, eax
.L2:
mov esi, DWORD PTR [rdx+rax]
add DWORD PTR [rcx+rax], esi
mov esi, DWORD PTR [rdi+rax]
add DWORD PTR [rdx+4+rax], esi
add rax, 4
cmp rax, 39996
jne .L2
mov eax, DWORD PTR [rdx+39996]
add DWORD PTR [rcx+39996], eax
ret
independent(...):
mov rax, QWORD PTR [rdi]
mov rcx, rdx
mov rdx, QWORD PTR [rsi]
lea rdi, [rax+4]
mov esi, DWORD PTR [rdx]
add DWORD PTR [rax], esi
lea r8, [rdx+4]
mov rsi, QWORD PTR [rcx]
lea rcx, [rdx+20]
cmp rdi, rcx
lea rdi, [rax+20]
setnb cl
cmp r8, rdi
setnb dil
or ecx, edi
mov rdi, rdx
sub rdi, rsi
cmp rdi, 8
seta dil
test cl, dil
je .L9
mov rcx, rax
sub rcx, rsi
cmp rcx, 8
jbe .L9
mov ecx, 4
.L7:
movdqu xmm0, XMMWORD PTR [rsi-4+rcx]
movdqu xmm2, XMMWORD PTR [rdx+rcx]
paddd xmm0, xmm2
movups XMMWORD PTR [rdx+rcx], xmm0
movdqu xmm3, XMMWORD PTR [rax+rcx]
paddd xmm0, xmm3
movups XMMWORD PTR [rax+rcx], xmm0
add rcx, 16
cmp rcx, 39988
jne .L7
movq xmm0, QWORD PTR [rsi+39984]
movq xmm1, QWORD PTR [rdx+39988]
paddd xmm0, xmm1
movq QWORD PTR [rdx+39988], xmm0
movq xmm1, QWORD PTR [rax+39988]
```

As we can see, the `independent` function has been successfully vectorized (its loop uses the `movdqu`/`paddd` SSE instructions), while `dependent` is stuck with scalar `add`s.
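The instruction-level-parallelism point can also be isolated from vectorization. A minimal sketch (my own example, not from the original article): summing `f64` values with one accumulator forms a single dependency chain, while four accumulators create independent chains that the CPU can keep in flight simultaneously (the results may differ in the last bits, since floating-point addition is not associative):

```rust
pub fn sum_single(v: &[f64]) -> f64 {
    let mut s = 0.0;
    for &x in v {
        // Every addition must wait for the previous one to finish.
        s += x;
    }
    s
}

pub fn sum_four(v: &[f64]) -> f64 {
    let mut s = [0.0f64; 4];
    for chunk in v.chunks_exact(4) {
        // Four independent dependency chains can overlap in the pipeline.
        s[0] += chunk[0];
        s[1] += chunk[1];
        s[2] += chunk[2];
        s[3] += chunk[3];
    }
    let tail: f64 = v.chunks_exact(4).remainder().iter().sum();
    s[0] + s[1] + s[2] + s[3] + tail
}
```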