Lab 2 – Parallel Divide-and-Conquer Vector Sum (Fork–Join + SIMD)

Classroom Link

Goal: You will implement and benchmark three increasingly optimized versions of vector summation, starting from a provided baseline. The focus is on divide-and-conquer structure, fork–join parallelism, and SIMD vector processing via pragma.

Starter Code

You are given a starter repo/project that already:

generates a vector of random integers
computes a correct baseline sum
runs a benchmark harness that times and compares implementations
checks correctness by comparing all results to the baseline

You will modify src/sum.cpp.

What you will build

Version 0 (provided): Baseline sum

A simple loop:

one thread
no recursion
no pragmas

Version 1 (you implement): Divide-and-Conquer (Sequential)

Implement a recursive sum:

split the array into halves
recursively sum each half
combine results by addition
include a cutoff threshold to avoid overhead for small subproblems

Version 2 (you implement): Divide-and-Conquer + Fork–Join

Parallelize Version 1:

use fork–join with std::async
spawn one side as a task, compute the other side locally, then get() to join
include a cutoff and a concurrency limiter (max depth or task budget)

Version 3 (you implement): Divide-and-Conquer + SIMD via pragma

Add vector processing:

in the leaf work (base case / cutoff), use a pragma to encourage SIMD
recommended approach:
- #pragma omp simd reduction(+:sum) on the leaf loop
you may keep fork–join parallelism from Version 2, but your code must clearly include a SIMD pragma in the inner loop

Benchmark/Test Program (required)

The starter already includes a driver that:

generates data
runs all versions
validates results
prints timing comparisons

You may improve it, but at minimum it must:

run all four versions (baseline + 3 implementations)
verify all results match baseline
report timing for each version

How to run

From the project directory:

make
./vecsum_demo N [seed]

Example:

./vecsum_demo 50000000 42

What to submit

Your modified src/sum.cpp
Any edits to src/main.cpp (if you improved benchmarking output)
A short REPORT.md (or PDF) answering the questions below

Report questions (short but specific)

What cutoff did you choose for D&C? Why?
What limiter did you use for fork–join (depth limit, budget, etc.) and why?
For at least three N values (small/medium/large), report timings for:
- baseline
- D&C sequential
- D&C fork–join
- D&C + SIMD
For your largest test, show:
- speedup vs baseline for each version
Briefly explain why speedup does not scale perfectly with CPU cores (mention overheads and memory effects).

Rubric (100 points)

Category	Points	What earns full credit
Correctness	25	All implementations match baseline for multiple N values; no overflow issues; clean handling of edge cases
Version 1: D&C Sequential	15	True recursive divide-and-conquer with cutoff; clean, readable code
Version 2: Fork–Join	20	Uses `std::async` fork–join correctly; includes cutoff + limiter; avoids runaway task creation
Version 3: SIMD Pragma	15	Clearly uses SIMD pragma in leaf loop (e.g., `omp simd reduction`); measurable effect at scale or justified in report
Benchmark/Test Program	10	Runs all versions, checks correctness, reports timings clearly, repeatable via seed
Report Quality	10	Clear tables/results, explains cutoffs/limits, reasonable analysis
Code Quality	5	Style, organization, comments, and no unnecessary complexity