- cppreference: The best C++ online reference around.
- isocpp: Some blog posts about C++, tips and reference papers.
- Programming -- Principles and Practice Using C++ (3rd Edition): Book for C++ beginners.
- A Tour of C++, third edition: Book if you already know basics of C++, it covers C++20.
- E. A. Lee, “The Problem with Threads”: A seminal paper on why shared-state parallelism is hard.
- The Art of Multiprocessor Programming: General introduction to parallelism.
- P. E. McKenney, “Is Parallel Programming Hard, And, If So, What Can You Do About It?”: Advanced textbook on parallelism.
1-introduction-cpp/exercises/rule184.cpp
: Implement the Rule184. You write the simulation in a loop using two arrays of cells (the current array and the next array). To see the evolution of the queue, pause the program 1 second between two iterations.- Decompose the program into several functions:
print(cells);
simulate(steps, current, next);
simulate_step(current, next);
- Add a functionality to compute the longest size of consecutive '1' at any iteration of the simulation (it represents the longest traffic jam).
See the files in demo/modular
for setting up a multi-files project with headers (.hpp) and .cpp files.
1-introduction-cpp/exercises/int_vector.hpp
: Implement the necessary constructors, destructors and methods given in the file. For additionnal documentation see std::vector, we try to follow it as close as possible.1-introduction-cpp/exercises/vector.hpp
: Update the previous example using a templated class.- Extend (2) with a move constructor.
- [Optional⭐⭐] Extend (3) to allow the type
T
to lack a default constructor (e.g.Vector<S> x; x.reserve(10);
withstruct S { S(int x) {} };
should work. See placement new and std::malloc.
1-introduction-cpp/exercises/vector.hpp
: Implement a map function taking a vector and applying a functionf
to each of its component. Suppose an integer vectorv
with the values1,-5,6
, thenmap(v, [](int x) { return x * 2; })
modifiesv
in-place and double each value.1-introduction-cpp/exercises/vector.hpp
: Implement a left-fold function (aka. "reduce") which takes a vector, an accumulator and a functionf
. For instance,fold_left(v, 0, [](int accu, int x) { return accu - x; })
returns the value2
(the difference of the integers in vectorv
from left to right). More information.- [Optional⭐] Same as (2) but with
fold_right
, so it is from right to left. On the previous example, it returns-2
. - [Optional⭐] Rewrite the loops in
Rule184.cpp
usingmap
. If it is not possible, introduce a new generic combinator similar to map to achieve your aim.
2-std-thread/demo/parallel_map.cpp
: Do some experiments with different sizes of array. When is it interesting to have 2 threads? Why?2-std-thread/demo/parallel_map.cpp
: Use the functionmore_work
and redo the experiments.- [Optional⭐]
2-std-thread/demo/parallel_map.cpp
: Extendparallel_map
to any number of threads. 2-std-thread/exercises/parallel_fold_left.cpp
: Implementparallel_fold_left_commutative
where 2 threads compute on half the array, wait for completion and then both results are merged. Note that the operation must be commutative otherwise you might not get the same result as in sequential computation.- [Optional⭐]
2-std-thread/exercises/parallel_fold_left.cpp
: Extendparallel_fold_left
to any number of threads. 2-std-thread/exercises/parallel_map_reduce.cpp
: Perform a map in parallel, followed by a fold. Find a (simple) way to reuse the threads from the map to compute the fold. Think about the pros and cons of your design, can we do better?
2-std-thread/exercises/histogram.cpp
: Given an array of size N taking value in the range [0,1000], implements a sequential algorithmhistogram_seq
which computes the frequency of each value and store the result in another arrayhistogram
.- Parallelize the algorithm using thread and mutex when writing in the histogram array.
- [Optional⭐⭐] Lock-free algorithm: Use atomic to write values in the histogram, in order to avoid locks.
- Using the C++ threads, parallelize the simulation algorithm of Rule 184. Print only the final iteration and the size of the longest queue obtained.
- The goal is to perform the most iterations in 1 minute for different values of the array.
- Plot your results using Python or any tool to see the impact of 1, 2, ..., N threads on the efficiency according to the size of the array.
- Discuss your results and design in a small PDF report.
- Send the code and report in ".zip" before the next course begins (4th October 13h59) by email with the subject "[MHPC][PGC] Project".
- You can discuss your design and your results on Discord or orally, but please don't share your code.
- This is a solo project.
- Implement the dining philosopher for 2 philosophers by simply numbering the forks (you can read this code).
- Generalize your solutions for N philosophers.
- Modify the algorithm to use coarse-grained locking.
- Implement a thread-safe version of
vector.cpp
(add a mutex as a private member, and use alock_guard
to guard the operations). - Suppose you have 1 thread consuming data from the vector (
pop_back
), and another thread producing data (push_back
). Create two threads that produces and consumes integers. Is the classvector
sufficient for this situation? - Create a
buffer
class wrapping avector
to solve with two methodsproduce
andconsume
. Use a semaphore to synchronize in case the vector is empty. - With many consumers and producers, can this implementation be a bottleneck when parallelizing?
- [Optional⭐] Add an upper limit on the size of the vector, so a producer will wait when the buffer is full.
- [Optional⭐⭐] View the
vector
as a queue instead of a stack, so the first element produced will be consumed first (and not last like now).
Suppose you have a shared resources (e.g. a web page) among many readers (e.g. the people browsing the web) and writers (e.g. the developers updating the page).
- Implement a class with a method
read
and a methodwrite
. For the purpose of the exercise, the resource can simply be an integer. - Many readers can concurrently access the resource, but only one writer can modify the resource at a time (therefore no reader must be currently reading the resource).
- Does your implementation take into account the problem of many readers, such that there is always at least a reader accessing the resource? What can you do about it?
- Execute the sequential code for N from 10 to 16 and note the number of explored nodes, the number of solutions and the execution time for each instance.
- Parallelize the sequential code following a single-pool multi-thread approach.
Hints:
- The number of threads should be read from the command-line;
- Threads (
std::thread
) should be stored in astd::vector
; - Accesses to the pool must be protected with
std::mutex
.
- Re-execute question 1. considering the parallel version implemented in 2. and varying the number of threads. What do you observe? Why?
- Extend the parallel version implemented in 2. using multiple pools (one pool per thread).
Hints:
- The victim selection policy will be random (uniform);
- The work stealing granularity will be "steal-half".
- Re-execute question 1. considering the parallel version implemented in 4. and varying the number of threads. Compare with the results obtained in 3. What do you observe? Why?
5-openmp/exercises/scalar_product.cpp
: Using OpenMP pragmas, parallelize the functionscalar_product
.5-openmp/exercises/reduction_min.cpp
: Write a parallel reduction function computing the minimum value.5-openmp/exercises/filter_leq.cpp
: Write a parallel functionfilter_leq(v, k)
taking a vectorv
of integers and an integerk
which returns a new array with the sum of all elements less or equal tok
inv
.
5-openmp/exercises/nqueens.cpp
: Parallelize the N-queens code seen in the previous course using OpenMP tasks. Hint: check5-openmp/demo/tasks_fib.cpp
.
The project description is available here.
- Install Chapel on your system using the provided script (it can take some minutes). Command:
cd 6-pgas-chapel/
source chapel_install.sh
- Read the introductory file available here.
- Implement a parallel approach where each thread computes a block made of contiguous lines of the global image. What do you observe in terms of speed-up compared to the sequential version?
- The goal is to improve the load balancing between threads. Implement a variant of the previous algorithm where each processor computes a block made of alternated lines. More precisely, we will assume that the k-th thread (0 <= k < n) takes in charge the lines indexed by k + i n.
Hints:
- The number of threads
n
can be set in command-line usingCHPL_RT_NUM_THREADS_PER_LOCALE=n
; - Compile using the
--fast
optimization flag:chpl --fast foo.chpl
; - You can verify that two images are similar using the output of
diff image1 image2
.
6-pgas-chapel/exercises/nqueens.chpl
: Parallelize the N-queens code seen in a previous course using Chapel.
7-mpi/exercises/ping_pong.cpp
: GivenN
processes, we pair processes(0,1), (2,3), ...
such that0
sends "ping" to1
which replies with "pong". This is repeatednum_ping_pong
times as specified by a program's argument.7-mpi/exercises/ping_pong_ring.cpp
: GivenN
processes organized in a ring, a process send "ping" to its right, receive "ping" from its left, send "pong" to its left and receive "pong" from its right. This is repeatednum_ping_pong
times as specified by a program's argument.7-mpi/exercises/stop_when.cpp
: Given 3 processes, process 1 and 2 waits for a random number of seconds, and afterwards they send a message to the process 0 (can useMPI_ANY_SOURCE
). Once process 0 receives the message, it broadcasts a termination message to everybody, and all processes immediately stop.7-mpi/exercises/sum.cpp
: Suppose we haveN
processes. The process 0 generates a random arrayarr
. UsingMPI_Scatter
, the array is distributed among the processes. They all compute the local sum of the array. UsingMPI_Reduce
we retrieve the global sum and print it in the process 0.- Read this paper on Kahn network.
In this project, you will develop a parallel algorithm which creates a maze and test how many paths can enter and exit the maze.
- Use randomized depth-first search to create a maze, or with any technique of your choice (see here).
- Given the beginning and exit of the labyrinth, design three different algorithms with MPI (different communication and/or data-splitting strategies) finding all different paths in the labyrinth.
- Compare the paths found (if the maze is small enough) or the number of paths found (otherwise) among the algorithms. You must obtain the same results, otherwise an algorithm is wrong.
- Write a short report explaining the intuitions behind the three algorithms, and evaluate those algorithms on a set of generated maze (from small to large).
- Give hypothesis on why an algorithm is working better than another.
- The goal is to practice with MPI and different topologies/strategies. It is OK if your algorithm is slower than the others as long as you can explain why.
- For students in the Master in HPC: it is mandatory to run some experiments on the HPC with the number of nodes >= 2.
- The code you produce must be your own and you cannot copy an existing MPI algorithm from internet.
- You can work in team of 3, each member must design and implement a different algorithm, and you must explicitly say in the report who wrote which algorithm.
- Send the project before the exam at pierre.talbot@uni.lu with the subject "[MHPC][PGC] Project MPI".
- Optional: Use the Boost.MPI library to create your application.