Lab 1: When Big-O Lies — Stride, Caches, and Real Performance

Classroom Link

Purpose

In this lab, you will explore why algorithms with identical Θ(N) complexity can behave very differently on real hardware — and how careful algorithm design can recover performance without changing asymptotic complexity.

You will:

  1. Analyze algorithm behavior a priori
  2. Implement a cache-unfriendly algorithm
  3. Measure the performance impact
  4. Design and implement a cache-aware fix
  5. Explain the results in terms of memory hierarchy

This lab emphasizes a central theme of the course:

Asymptotic analysis describes growth trends, not machine performance.


Learning Objectives

By completing this lab, you will be able to:

  • Write correct Θ(N) algorithms
  • Understand and implement strided memory access
  • Explain spatial locality and cache effects
  • Benchmark programs responsibly
  • Improve performance without changing Big-O complexity
  • Connect algorithm design decisions to hardware behavior

Background: Stride and Memory Locality

Arrays are stored contiguously in memory:

data[0], data[1], data[2], data[3], …

Sequential Access (Stride = 1)

for (size_t i = 0; i < N; i++)
    sum += data[i];
  • Excellent spatial locality
  • Cache lines are reused efficiently
  • Hardware prefetchers are effective

Strided Access (Stride = x)

for (size_t offset = 0; offset < x; offset++)
    for (size_t i = offset; i < N; i += x)
        sum += data[i];
  • Elements are accessed far apart in memory
  • Cache lines are evicted before reuse
  • Prefetching becomes ineffective

Even though every element is accessed exactly once, performance degrades dramatically as stride increases.
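A quick back-of-the-envelope calculation shows why. Assuming a 64-byte cache line (typical on x86-64, but an assumption, not something the code queries) and 4-byte ints, one cache line holds 16 elements; a large stride means most of each fetched line goes unused:

```cpp
#include <cstddef>

// Assumption: 64-byte cache lines and 4-byte ints, so 16 ints per line.
constexpr std::size_t kCacheLineBytes = 64;
constexpr std::size_t kIntsPerLine = kCacheLineBytes / sizeof(int);  // 16

// Approximate fraction of each fetched cache line actually used by a
// given stride (stride >= 1). Stride 1 uses all 16 ints per line;
// any stride >= 16 uses only 1 of the 16.
constexpr double lineUtilization(std::size_t stride) {
    return 1.0 / static_cast<double>(stride < kIntsPerLine ? stride : kIntsPerLine);
}
```

By this estimate, stride 100 wastes 15 of every 16 ints fetched from memory, which is why the same Θ(N) element count can cost far more memory traffic.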


Provided Code: Link

You are given a C++ program that:

  • Generates a vector of N random integers
  • Asks you for a stride power p
  • Uses x = 10^p as the stride
  • Enforces 0 ≤ p < floor(log10(N))
  • Contains:
    • A working sequential summation
    • A stub for strided summation
    • A stub for blocked summation

Required Tasks (1–5)


Task 1 – A Priori Analysis (Before Coding)

Before modifying or running the program, answer:

  1. What is the Big-O time complexity of:
    • Sequential summation
    • Strided summation
    • Blocked summation
  2. How many total array elements are accessed in each case?
  3. Based on Big-O alone, should these algorithms have similar performance?
  4. Predict how stride size will affect runtime.

Write your predictions before running any code.


Task 2 – Implement the Strided Algorithm (Programming)

Implement:

long long sumStridedPower10(const vector<int>& data, size_t x);

Requirements

  • Visit every element exactly once
  • Do not allocate additional arrays
  • Do not change Θ(N) complexity
  • Handle arbitrary N
  • Use x = 10^p
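One possible shape (a sketch, not the only correct answer) follows the two-loop pattern from the Background section. Note that `x` here is the stride itself, already computed as 10^p by the caller:

```cpp
#include <cstddef>
#include <vector>

// Sketch of the strided sum: the outer loop walks the x possible
// starting offsets; the inner loop jumps through the array in steps
// of x. Every index 0..N-1 is visited exactly once, so this is still
// Θ(N). The extra bound on offset handles the case x >= data.size().
long long sumStridedPower10(const std::vector<int>& data, std::size_t x) {
    long long sum = 0;
    for (std::size_t offset = 0; offset < x && offset < data.size(); ++offset)
        for (std::size_t i = offset; i < data.size(); i += x)
            sum += data[i];
    return sum;
}
```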

Task 3 – Correctness Verification

Before benchmarking:

  1. Run the program for multiple values of N
  2. Compare results from:
    • Sequential (p = 0)
    • Strided (p > 0)
  3. Confirm that all versions produce identical sums

Incorrect results indicate an error in your algorithm.
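A minimal checker can compare any candidate summation against the trusted sequential result. This sketch accepts any callable matching the lab's `(data, x)` signature, so it works for both your strided and blocked versions:

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Compare a candidate summation against the sequential reference sum.
// The 0LL initial value keeps std::accumulate summing in long long,
// avoiding int overflow on large inputs.
bool matchesSequential(
        const std::vector<int>& data,
        const std::function<long long(const std::vector<int>&, std::size_t)>& candidate,
        std::size_t x) {
    long long expected = std::accumulate(data.begin(), data.end(), 0LL);
    return candidate(data, x) == expected;
}
```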


Task 4 – Benchmarking Experiment

Choose a large fixed N (e.g., 50–100 million).

Run the program multiple times, varying only p.

Suggested values:

  p    Stride (x)
  0    1
  1    10
  2    100
  3    1,000
  4    10,000
  • Run each configuration at least 3 times
  • Record the average runtime

Present results in a table.
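For the measurements themselves, std::chrono::steady_clock is a reasonable choice (it is monotonic, unlike system_clock). A sketch of one timed pass — the sequential sum here is a stand-in for whichever version you are measuring; returning and using the sum keeps the compiler from optimizing the loop away:

```cpp
#include <chrono>
#include <numeric>
#include <utility>
#include <vector>

// Time a single summation pass; returns (sum, elapsed milliseconds).
std::pair<long long, double> timedSum(const std::vector<int>& data) {
    auto start = std::chrono::steady_clock::now();
    long long sum = std::accumulate(data.begin(), data.end(), 0LL);
    auto stop = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(stop - start).count();
    return {sum, ms};
}
```

Remember to benchmark an optimized build (e.g., `-O2`) and to discard or note any outlier runs.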


Task 5 – Implement a Cache-Aware Fix (Required)

Now fix the performance problem without changing Θ(N).

Implement:

long long sumBlocked(const vector<int>& data, size_t blockSize);

Requirements

  • Process the array in contiguous blocks
  • Each element must still be visited exactly once
  • Choose a reasonable block size (justify your choice)
  • Do not allocate additional arrays
  • Do not change Big-O complexity

Example structure:

for (size_t block = 0; block < N; block += blockSize) {
    size_t end = min(block + blockSize, N);
    for (size_t i = block; i < end; i++) {
        sum += data[i];
    }
}
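Wrapped into the required signature, with the accumulator declared and returned (a sketch; the block size, and its justification, are left to you):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Same Θ(N) traversal as the sequential version, just expressed block
// by block; each element is read exactly once, in contiguous order,
// so each cache line is fully consumed before moving on.
long long sumBlocked(const std::vector<int>& data, std::size_t blockSize) {
    long long sum = 0;
    const std::size_t N = data.size();
    for (std::size_t block = 0; block < N; block += blockSize) {
        std::size_t end = std::min(block + blockSize, N);
        for (std::size_t i = block; i < end; ++i)
            sum += data[i];
    }
    return sum;
}
```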

Analysis and Explanation (Required)

In your report, explain:

  1. Why all three algorithms are Θ(N)
  2. Why strided access performs poorly on real hardware
  3. How blocking improves spatial locality
  4. Why blocking restores performance
  5. Why Big-O analysis intentionally ignores cache effects
  6. What this implies for real-world algorithm design

Your explanation must reference at least two of:

  • Cache hierarchy (L1/L2/L3)
  • Cache lines
  • Cache misses
  • Memory latency vs bandwidth
  • Hardware prefetching

Deliverables

Submit a 2–3 page report containing:

  1. A priori analysis
  2. Description of your implementations
  3. Benchmark results
  4. Explanation of cache effects
  5. Reflection on theory vs practice

Grading Breakdown

  Category                          Points
  A priori analysis                 15
  Correct strided implementation    20
  Correct blocked implementation    20
  Verification and benchmarking     20
  Cache / locality explanation      15
  Clarity and organization          10
  Total                             100