Purpose
In this lab, you will explore why algorithms with identical Θ(N) complexity can behave very differently on real hardware — and how careful algorithm design can recover performance without changing asymptotic complexity.
You will:
- Analyze algorithm behavior a priori
- Implement a cache-unfriendly algorithm
- Measure the performance impact
- Design and implement a cache-aware fix
- Explain the results in terms of memory hierarchy
This lab emphasizes a central theme of the course:
Asymptotic analysis describes growth trends, not machine performance.
Learning Objectives
By completing this lab, you will be able to:
- Write correct Θ(N) algorithms
- Understand and implement strided memory access
- Explain spatial locality and cache effects
- Benchmark programs responsibly
- Improve performance without changing Big-O complexity
- Connect algorithm design decisions to hardware behavior
Background: Stride and Memory Locality
Arrays are stored contiguously in memory:
data[0], data[1], data[2], data[3], …
Sequential Access (Stride = 1)
for (size_t i = 0; i < N; i++)
    sum += data[i];
- Excellent spatial locality
- Cache lines are reused efficiently
- Hardware prefetchers are effective
Strided Access (Stride = x)
for (size_t offset = 0; offset < x; offset++)
    for (size_t i = offset; i < N; i += x)
        sum += data[i];
- Elements are accessed far apart in memory
- Cache lines are evicted before reuse
- Prefetching becomes ineffective
Even though every element is accessed exactly once, performance degrades dramatically as stride increases.
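To make the comparison concrete, the two access patterns above can be written as small self-contained functions. This is an illustrative sketch (the function names are not part of the provided code); both perform exactly N additions, so any runtime difference comes purely from access order:

```cpp
#include <cstddef>
#include <vector>

// Stride 1: walks memory contiguously, consuming each cache line fully.
long long sumSequential(const std::vector<int>& data) {
    long long sum = 0;
    for (size_t i = 0; i < data.size(); i++)
        sum += data[i];
    return sum;
}

// Stride x: each inner pass jumps x elements at a time, so for large x
// almost every access lands on a different cache line.
long long sumStrided(const std::vector<int>& data, size_t x) {
    long long sum = 0;
    for (size_t offset = 0; offset < x; offset++)
        for (size_t i = offset; i < data.size(); i += x)
            sum += data[i];
    return sum;
}
```

Both functions return the same total; only the order of the additions differs.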
Provided Code: Link
You are given a C++ program that:
- Generates a vector of N random integers
- Asks you for a stride power p
- Uses x = 10^p as the stride
- Enforces 0 ≤ p < floor(log10(N))
- Contains:
  - A working sequential summation
  - A stub for strided summation
  - A stub for blocked summation
Required Tasks (1–5)
Task 1 – A Priori Analysis (Before Coding)
Before modifying or running the program, answer:
- What is the Big-O time complexity of:
- Sequential summation
- Strided summation
- Blocked summation
- How many total array elements are accessed in each case?
- Based on Big-O alone, should these algorithms have similar performance?
- Predict how stride size will affect runtime.
Write your predictions before running any code.
Task 2 – Implement the Strided Algorithm (Programming)
Implement:
long long sumStridedPower10(const vector<int>& data, size_t x);
Requirements
- Visit every element exactly once
- Do not allocate additional arrays
- Do not change Θ(N) complexity
- Handle arbitrary N
- Use x = 10^p
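Since the stride x = 10^p must be an exact integer, one option is to compute it by repeated multiplication rather than with floating-point `pow`, which can round. This helper (the name pow10z is hypothetical, not part of the provided code) assumes p is small enough that 10^p fits in a size_t:

```cpp
#include <cstddef>

// Compute 10^p as an exact integer (assumes 10^p fits in size_t).
size_t pow10z(unsigned p) {
    size_t x = 1;
    while (p-- > 0)
        x *= 10;
    return x;
}
```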
Task 3 – Correctness Verification
Before benchmarking:
- Run the program for multiple values of N
- Compare results from:
  - Sequential (p = 0)
  - Strided (p > 0)
- Confirm that all versions produce identical sums
Incorrect results indicate an error in your algorithm.
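One way to automate this check is to compare a candidate summation against std::accumulate over several sizes. The sketch below is one possible harness, not part of the provided code; matchesReference is a hypothetical name, and the candidate is any function with the lab's two-argument signature:

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Compare a candidate summation against std::accumulate on several sizes.
// `param` is passed through as the candidate's second argument (stride or
// block size, depending on which variant is being tested).
bool matchesReference(
    const std::function<long long(const std::vector<int>&, size_t)>& sum,
    size_t param) {
    for (size_t n : {1u, 10u, 1000u, 12345u}) {
        std::vector<int> data(n);
        std::iota(data.begin(), data.end(), 1);  // 1, 2, 3, ...
        long long expected = std::accumulate(data.begin(), data.end(), 0LL);
        if (sum(data, param) != expected)
            return false;
    }
    return true;
}
```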
Task 4 – Benchmarking Experiment
Choose a large fixed N (e.g., 50–100 million).
Run the program multiple times, varying only p.
Suggested values:
| p | Stride (x) |
|---|---|
| 0 | 1 |
| 1 | 10 |
| 2 | 100 |
| 3 | 1,000 |
| 4 | 10,000 |
- Run each configuration at least 3 times
- Record the average runtime
Present results in a table.
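A minimal timing harness using `<chrono>` might look like the sketch below (timeRunMs is a hypothetical helper, not part of the provided code). For the report, average at least three such measurements per configuration, as described above:

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Time one run of a two-argument summation, in milliseconds.
// The volatile sink keeps the compiler from optimizing the sum away.
template <typename F>
double timeRunMs(F&& sum, const std::vector<int>& data, size_t param) {
    auto start = std::chrono::steady_clock::now();
    volatile long long sink = sum(data, param);
    (void)sink;
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

steady_clock is used rather than system_clock because it is monotonic and unaffected by wall-clock adjustments during a run.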
Task 5 – Implement a Cache-Aware Fix (Required)
Now fix the performance problem without changing Θ(N).
Implement:
long long sumBlocked(const vector<int>& data, size_t blockSize);
Requirements
- Process the array in contiguous blocks
- Each element must still be visited exactly once
- Choose a reasonable block size (justify your choice)
- Do not allocate additional arrays
- Do not change Big-O complexity
Example structure:
for (size_t block = 0; block < N; block += blockSize) {
size_t end = min(block + blockSize, N);
for (size_t i = block; i < end; i++) {
sum += data[i];
}
}
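Wrapped into the required signature, the structure above might take the following shape. This is a sketch, not necessarily the implementation you should submit; it assumes blockSize ≥ 1 (a blockSize of 0 would loop forever):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Blocked summation: each pass handles one contiguous chunk, so every
// cache line that is loaded is fully consumed before moving on.
// Assumes blockSize >= 1.
long long sumBlocked(const std::vector<int>& data, size_t blockSize) {
    const size_t N = data.size();
    long long sum = 0;
    for (size_t block = 0; block < N; block += blockSize) {
        size_t end = std::min(block + blockSize, N);
        for (size_t i = block; i < end; i++)
            sum += data[i];
    }
    return sum;
}
```

For the block size, one reasonable starting point is a block that fits comfortably in L1 cache, e.g. a 32 KB L1 holds about 8192 four-byte ints; justify whatever value you choose in your report.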
Analysis and Explanation (Required)
In your report, explain:
- Why all three algorithms are Θ(N)
- Why strided access performs poorly on real hardware
- How blocking improves spatial locality
- Why blocking restores performance
- Why Big-O analysis intentionally ignores cache effects
- What this implies for real-world algorithm design
Your explanation must reference at least two of:
- Cache hierarchy (L1/L2/L3)
- Cache lines
- Cache misses
- Memory latency vs bandwidth
- Hardware prefetching
Deliverables
Submit a 2–3 page report containing:
- A priori analysis
- Description of your implementations
- Benchmark results
- Explanation of cache effects
- Reflection on theory vs practice
Grading Breakdown
| Category | Points |
|---|---|
| A priori analysis | 15 |
| Correct strided implementation | 20 |
| Correct blocked implementation | 20 |
| Verification and benchmarking | 20 |
| Cache / locality explanation | 15 |
| Clarity and organization | 10 |
| Total | 100 |
