Lab 2 – Parallel Divide-and-Conquer Vector Sum (Fork–Join + SIMD)

Classroom Link

Goal: You will implement and benchmark three increasingly optimized versions of vector summation, starting from a provided baseline. The focus is on divide-and-conquer structure, fork–join parallelism, and SIMD vector processing via pragma.

Starter Code

You are given a starter repo/project that already:

  • generates a vector of random integers
  • computes a correct baseline sum
  • runs a benchmark harness that times and compares implementations
  • checks correctness by comparing all results to the baseline

You will modify src/sum.cpp.


What you will build

Version 0 (provided): Baseline sum

A simple loop:

  • one thread
  • no recursion
  • no pragmas

Version 1 (you implement): Divide-and-Conquer (Sequential)

Implement a recursive sum:

  • split the array into halves
  • recursively sum each half
  • combine results by addition
  • include a cutoff threshold to avoid overhead for small subproblems

Version 2 (you implement): Divide-and-Conquer + Fork–Join

Parallelize Version 1:

  • use fork–join with std::async
  • spawn one side as a task, compute the other side locally, then get() to join
  • include a cutoff and a concurrency limiter (max depth or task budget)

Version 3 (you implement): Divide-and-Conquer + SIMD via pragma

Add vector processing:

  • in the leaf work (base case / cutoff), use a pragma to encourage SIMD
  • recommended approach:
    • #pragma omp simd reduction(+:sum) on the leaf loop
  • you may keep fork–join parallelism from Version 2, but your code must clearly include a SIMD pragma in the inner loop

Benchmark/Test Program (required)

The starter already includes a driver that:

  • generates data
  • runs all versions
  • validates results
  • prints timing comparisons

You may improve it, but at minimum it must:

  • run all four versions (baseline + 3 implementations)
  • verify all results match baseline
  • report timing for each version

How to run

From the project directory:

make
./vecsum_demo N [seed]

Example:

./vecsum_demo 50000000 42

What to submit

  1. Your modified src/sum.cpp
  2. Any edits to src/main.cpp (if you improved benchmarking output)
  3. A short REPORT.md (or PDF) answering the questions below

Report questions (short but specific)

  1. What cutoff did you choose for D&C? Why?
  2. What limiter did you use for fork–join (depth limit, budget, etc.) and why?
  3. For at least three N values (small/medium/large), report timings for:
    • baseline
    • D&C sequential
    • D&C fork–join
    • D&C + SIMD
  4. For your largest test, show:
    • speedup vs baseline for each version
  5. Briefly explain why speedup does not scale perfectly with CPU cores (mention overheads and memory effects).

Rubric (100 points)

CategoryPointsWhat earns full credit
Correctness25All implementations match baseline for multiple N values; no overflow issues; clean handling of edge cases
Version 1: D&C Sequential15True recursive divide-and-conquer with cutoff; clean, readable code
Version 2: Fork–Join20Uses std::async fork–join correctly; includes cutoff + limiter; avoids runaway task creation
Version 3: SIMD Pragma15Clearly uses SIMD pragma in leaf loop (e.g., omp simd reduction); measurable effect at scale or justified in report
Benchmark/Test Program10Runs all versions, checks correctness, reports timings clearly, repeatable via seed
Report Quality10Clear tables/results, explains cutoffs/limits, reasonable analysis
Code Quality5Style, organization, comments, and no unnecessary complexity
Scroll to Top