Purpose
In this lab, you will explore why algorithms with identical Θ(N) complexity can behave very differently on real hardware — and how careful algorithm design can recover performance without changing asymptotic complexity.
You will:
- Analyze algorithm behavior a priori
- Implement a cache-unfriendly algorithm
- Measure the performance impact
- Design and implement a cache-aware fix
- Explain the results in terms of memory hierarchy
This lab emphasizes a central theme of the course:
Asymptotic analysis describes growth trends, not machine performance.
Learning Objectives
By completing this lab, you will be able to:
- Write correct Θ(N) algorithms
- Understand and implement strided memory access
- Explain spatial locality and cache effects
- Benchmark programs responsibly
- Improve performance without changing Big-O complexity
- Connect algorithm design decisions to hardware behavior
Background: Stride and Memory Locality
Arrays are stored contiguously in memory:
data[0], data[1], data[2], data[3], …
Sequential Access (Stride = 1)
for (size_t i = 0; i < N; i++)
    sum += data[i];
- Excellent spatial locality
- Cache lines are reused efficiently
- Hardware prefetchers are effective
Strided Access (Stride = x)
for (size_t offset = 0; offset < x; offset++)
    for (size_t i = offset; i < N; i += x)
        sum += data[i];
- Elements are accessed far apart in memory
- Cache lines are evicted before reuse
- Prefetching becomes ineffective
Even though every element is accessed exactly once, performance degrades dramatically as stride increases.
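To make the comparison concrete, the two access patterns above can be written as small self-contained functions. This is an illustrative sketch (the function names are not part of the provided code); both perform exactly N additions, so any runtime difference comes purely from access order:

```cpp
#include <cstddef>
#include <vector>

// Stride 1: walks memory contiguously, consuming each cache line fully.
long long sumSequential(const std::vector<int>& data) {
    long long sum = 0;
    for (size_t i = 0; i < data.size(); i++)
        sum += data[i];
    return sum;
}

// Stride x: each inner pass jumps x elements at a time, so for large x
// almost every access lands on a different cache line.
long long sumStrided(const std::vector<int>& data, size_t x) {
    long long sum = 0;
    for (size_t offset = 0; offset < x; offset++)
        for (size_t i = offset; i < data.size(); i += x)
            sum += data[i];
    return sum;
}
```

Both functions return the same total; only the order of the additions differs.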
Provided Code: Link
You are given a C++ program that:
- Generates a vector of N random integers
- Asks you for a stride power p
- Uses x = 10^p as the stride
- Enforces 0 ≤ p < floor(log10(N))
- Contains:
  - A working sequential summation
  - A stub for strided summation
  - A stub for blocked summation
Required Tasks (1–5)
Task 1 – A Priori Analysis (Before Coding)
Before modifying or running the program, answer:
- What is the Big-O time complexity of:
- Sequential summation
- Strided summation
- Blocked summation
- How many total array elements are accessed in each case?
- Based on Big-O alone, should these algorithms have similar performance?
- Predict how stride size will affect runtime.
Write your predictions before running any code.
Task 2 – Implement the Strided Algorithm (Programming)
Implement:
long long sumStridedPower10(const vector<int>& data, size_t x);
Requirements
- Visit every element exactly once
- Do not allocate additional arrays
- Do not change Θ(N) complexity
- Handle arbitrary N
- Use x = 10^p
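Since the stride x = 10^p must be an exact integer, one option is to compute it by repeated multiplication rather than with floating-point `pow`, which can round. This helper (the name pow10z is hypothetical, not part of the provided code) assumes p is small enough that 10^p fits in a size_t:

```cpp
#include <cstddef>

// Compute 10^p as an exact integer (assumes 10^p fits in size_t).
size_t pow10z(unsigned p) {
    size_t x = 1;
    while (p-- > 0)
        x *= 10;
    return x;
}
```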
Task 3 – Correctness Verification
Before benchmarking:
- Run the program for multiple values of N
- Compare results from:
  - Sequential (p = 0)
  - Strided (p > 0)
- Confirm that all versions produce identical sums
Incorrect results indicate an error in your algorithm.
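One way to automate this check is to compare a candidate summation against std::accumulate over several sizes. The sketch below is one possible harness, not part of the provided code; matchesReference is a hypothetical name, and the candidate is any function with the lab's two-argument signature:

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Compare a candidate summation against std::accumulate on several sizes.
// `param` is passed through as the candidate's second argument (stride or
// block size, depending on which variant is being tested).
bool matchesReference(
    const std::function<long long(const std::vector<int>&, size_t)>& sum,
    size_t param) {
    for (size_t n : {1u, 10u, 1000u, 12345u}) {
        std::vector<int> data(n);
        std::iota(data.begin(), data.end(), 1);  // 1, 2, 3, ...
        long long expected = std::accumulate(data.begin(), data.end(), 0LL);
        if (sum(data, param) != expected)
            return false;
    }
    return true;
}
```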
Task 4 – Benchmarking Experiment
Choose a large fixed N (e.g., 50–100 million).
Run the program multiple times, varying only p.
Suggested values:
| p | Stride (x) |
|---|---|
| 0 | 1 |
| 1 | 10 |
| 2 | 100 |
| 3 | 1,000 |
| 4 | 10,000 |
- Run each configuration at least 3 times
- Record the average runtime
Present results in a table.
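A minimal timing harness using `<chrono>` might look like the sketch below (timeRunMs is a hypothetical helper, not part of the provided code). For the report, average at least three such measurements per configuration, as described above:

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Time one run of a two-argument summation, in milliseconds.
// The volatile sink keeps the compiler from optimizing the sum away.
template <typename F>
double timeRunMs(F&& sum, const std::vector<int>& data, size_t param) {
    auto start = std::chrono::steady_clock::now();
    volatile long long sink = sum(data, param);
    (void)sink;
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

steady_clock is used rather than system_clock because it is monotonic and unaffected by wall-clock adjustments during a run.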
Task 5 – Implement a Cache-Aware Fix (Required)
Now fix the performance problem without changing Θ(N).
Implement:
long long sumBlocked(const vector<int>& data, size_t blockSize);
Requirements
- Process the array in contiguous blocks
- Each element must still be visited exactly once
- Choose a reasonable block size (justify your choice)
- Do not allocate additional arrays
- Do not change Big-O complexity
Example structure:
for (size_t block = 0; block < N; block += blockSize) {
size_t end = min(block + blockSize, N);
for (size_t i = block; i < end; i++) {
sum += data[i];
}
}
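Wrapped into the required signature, the structure above might take the following shape. This is a sketch, not necessarily the implementation you should submit; it assumes blockSize ≥ 1 (a blockSize of 0 would loop forever):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Blocked summation: each pass handles one contiguous chunk, so every
// cache line that is loaded is fully consumed before moving on.
// Assumes blockSize >= 1.
long long sumBlocked(const std::vector<int>& data, size_t blockSize) {
    const size_t N = data.size();
    long long sum = 0;
    for (size_t block = 0; block < N; block += blockSize) {
        size_t end = std::min(block + blockSize, N);
        for (size_t i = block; i < end; i++)
            sum += data[i];
    }
    return sum;
}
```

For the block size, one reasonable starting point is a block that fits comfortably in L1 cache, e.g. a 32 KB L1 holds about 8192 four-byte ints; justify whatever value you choose in your report.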
Analysis and Explanation (Required)
In your report, explain:
- Why all three algorithms are Θ(N)
- Why strided access performs poorly on real hardware
- How blocking improves spatial locality
- Why blocking restores performance
- Why Big-O analysis intentionally ignores cache effects
- What this implies for real-world algorithm design
Your explanation must reference at least two of:
- Cache hierarchy (L1/L2/L3)
- Cache lines
- Cache misses
- Memory latency vs bandwidth
- Hardware prefetching
Deliverables
Submit a 2–3 page report containing:
- A priori analysis
- Description of your implementations
- Benchmark results
- Explanation of cache effects
- Reflection on theory vs practice
Grading Breakdown
| Category | Points |
|---|---|
| A priori analysis | 15 |
| Correct strided implementation | 20 |
| Correct blocked implementation | 20 |
| Verification and benchmarking | 20 |
| Cache / locality explanation | 15 |
| Clarity and organization | 10 |
| Total | 100 |
