Goal: You will implement and benchmark three increasingly optimized versions of vector summation, starting from a provided baseline. The focus is on divide-and-conquer structure, fork–join parallelism, and SIMD vector processing via pragma.
Starter Code
You are given a starter repo/project that already:
- generates a vector of random integers
- computes a correct baseline sum
- runs a benchmark harness that times and compares implementations
- checks correctness by comparing all results to the baseline
You will modify src/sum.cpp.
What you will build
Version 0 (provided): Baseline sum
A simple loop:
- one thread
- no recursion
- no pragmas
Version 1 (you implement): Divide-and-Conquer (Sequential)
Implement a recursive sum:
- split the array into halves
- recursively sum each half
- combine results by addition
- include a cutoff threshold to avoid overhead for small subproblems
Version 2 (you implement): Divide-and-Conquer + Fork–Join
Parallelize Version 1:
- use fork–join with
std::async - spawn one side as a task, compute the other side locally, then
get()to join - include a cutoff and a concurrency limiter (max depth or task budget)
Version 3 (you implement): Divide-and-Conquer + SIMD via pragma
Add vector processing:
- in the leaf work (base case / cutoff), use a pragma to encourage SIMD
- recommended approach:
#pragma omp simd reduction(+:sum)on the leaf loop
- you may keep fork–join parallelism from Version 2, but your code must clearly include a SIMD pragma in the inner loop
Benchmark/Test Program (required)
The starter already includes a driver that:
- generates data
- runs all versions
- validates results
- prints timing comparisons
You may improve it, but at minimum it must:
- run all four versions (baseline + 3 implementations)
- verify all results match baseline
- report timing for each version
How to run
From the project directory:
make ./vecsum_demo N [seed]
Example:
./vecsum_demo 50000000 42
What to submit
- Your modified
src/sum.cpp - Any edits to
src/main.cpp(if you improved benchmarking output) - A short
REPORT.md(or PDF) answering the questions below
Report questions (short but specific)
- What cutoff did you choose for D&C? Why?
- What limiter did you use for fork–join (depth limit, budget, etc.) and why?
- For at least three N values (small/medium/large), report timings for:
- baseline
- D&C sequential
- D&C fork–join
- D&C + SIMD
- For your largest test, show:
- speedup vs baseline for each version
- Briefly explain why speedup does not scale perfectly with CPU cores (mention overheads and memory effects).
Rubric (100 points)
| Category | Points | What earns full credit |
|---|---|---|
| Correctness | 25 | All implementations match baseline for multiple N values; no overflow issues; clean handling of edge cases |
| Version 1: D&C Sequential | 15 | True recursive divide-and-conquer with cutoff; clean, readable code |
| Version 2: Fork–Join | 20 | Uses std::async fork–join correctly; includes cutoff + limiter; avoids runaway task creation |
| Version 3: SIMD Pragma | 15 | Clearly uses SIMD pragma in leaf loop (e.g., omp simd reduction); measurable effect at scale or justified in report |
| Benchmark/Test Program | 10 | Runs all versions, checks correctness, reports timings clearly, repeatable via seed |
| Report Quality | 10 | Clear tables/results, explains cutoffs/limits, reasonable analysis |
| Code Quality | 5 | Style, organization, comments, and no unnecessary complexity |
