Commit

changed message format

xLPMG committed Feb 1, 2024
1 parent f50e338 commit 3c64ac8

Showing 6 changed files with 481 additions and 400 deletions.
72 changes: 40 additions & 32 deletions docs/source/files/assignments/08.rst
@@ -1,15 +1,17 @@
################
8. Optimization
*****************
################

**********
8.1 ARA
========
**********

.. figure:: https://wiki.uni-jena.de/download/attachments/22453005/IMG_7381_0p5.JPG?version=1&modificationDate=1625042348365&api=v2

HPC-Cluster ARA. Source: https://wiki.uni-jena.de/pages/viewpage.action?pageId=22453005

8.1.1 - Uploading and running the code
----------------------------------------
========================================

First, we cloned our GitHub repository to ``beegfs`` and transferred the bathymetry and displacement data there with ``wget https://cloud.uni-jena.de/s/CqrDBqiMyKComPc/download/data_in.tar.xz -O tsunami_lab_data_in.tar.xz``.

@@ -41,9 +43,10 @@ sbatch file:
Since we only want to use one node, we set ``nodes`` and ``ntasks`` to 1 and ``cpus-per-task`` to 72.

8.1.2 - Visualizations
--------------------------
========================

**Tohoku 5000**
Tohoku 5000
-----------

.. raw:: html

@@ -52,7 +55,8 @@ Since we only want to use one node, we set ``nodes`` and ``ntasks`` to 1 and ``c
</video>


**Tohoku 1000**
Tohoku 1000
-----------

.. raw:: html

@@ -62,15 +66,17 @@ Since we only want to use one node, we set ``nodes`` and ``ntasks`` to 1 and ``c



**Chile 5000**
Chile 5000
-----------

.. raw:: html

<video width="100%" height="auto" controls>
<source src="../../_static/assets/task_8-1-2_chile_5000.mp4" type="video/mp4">
</video>

**Chile 1000**
Chile 1000
-----------

.. raw:: html

@@ -82,15 +88,15 @@ Since we only want to use one node, we set ``nodes`` and ``ntasks`` to 1 and ``c
Compared to the simulations from assignment 6, all simulations clearly behave identically.

8.1.3 - Private PC vs ARA
---------------------------
===========================

.. note::

The code was compiled using ``scons mode=benchmark opt=-O2``.
The benchmarking mode disables all file output (and also skips all imports of ``<filesystem>``).
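
As a minimal sketch of how such a switch can look (the ``TSUNAMI_BENCHMARK_MODE`` define, the function name and the paths are assumptions for illustration, not our actual build setup):

.. code:: cpp

   #include <cstddef>

   // Hypothetical compile-time switch: in benchmark mode no file is touched and
   // <filesystem>/<fstream> are never included.
   #ifndef TSUNAMI_BENCHMARK_MODE
   #include <filesystem>
   #include <fstream>
   #endif

   void writeStations(double const *i_data, std::size_t i_size)
   {
   #ifndef TSUNAMI_BENCHMARK_MODE
       std::filesystem::create_directories("stations");
       std::ofstream l_out("stations/output.bin", std::ios::binary);
       l_out.write(reinterpret_cast<char const *>(i_data),
                   static_cast<std::streamsize>(i_size * sizeof(double)));
   #else
       // benchmark mode: skip all file output
       (void)i_data;
       (void)i_size;
   #endif
   }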

Setups
^^^^^^^^^^
-------

If you are interested, you can view the used configurations here:

@@ -103,7 +109,7 @@ If you are interested, you can view the used configurations here:
:download:`tohoku1000.json <../../_static/text/tohoku1000.json>`

Results
^^^^^^^^^^
--------

.. list-table:: execution times on different devices
:header-rows: 1
@@ -187,17 +193,18 @@ Results
and stopped after the program has finished and all memory has been freed.

Observations
^^^^^^^^^^^^^^
--------------

In every scenario, ARA had a faster setup time but slower computation times.
We conclude that ARA has faster data/file access (since the setup heavily depends on how quickly data can be read from file),
while the private PC seems to have better single-core performance.

**************
8.2 Compilers
===============
**************

8.2.1 - Generic compiler support
---------------------------------
=================================

We enabled generic compiler support by adding the following lines to our ``SConstruct`` file

@@ -221,10 +228,10 @@ Now, scons can be invoked with a compiler of choice, for example by running
CXX=icpc scons
8.2.2 & 8.2.3 - Test runs
--------------------------
===========================

Time measurements
^^^^^^^^^^^^^^^^^^^^^^^^^
------------------

For each run, we used the following configuration:

@@ -313,7 +320,7 @@ We therefore ended up using ``compiler/intel/2018-Update1`` and ``gcc (GCC) 4.8.
This configuration was the only one that worked for us, as we did not manage to fix all the errors thrown by the other combinations.

Observations from the table
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
----------------------------

As one would intuitively expect, the higher the optimization level,
the quicker the program finished.
@@ -328,7 +335,7 @@ We would also need to ensure that there are no other intensive processes running
Nonetheless, using the table as a rough estimate, it seems that ``g++`` is faster with ``-O0`` and ``-Ofast``, while ``icpc`` is preferable for ``-O2``.

8.2.3 - Optimization flags
---------------------------
===========================

To allow for an easy switch between optimization flags, we added the following code to our SConstruct:

@@ -355,7 +362,7 @@ and
env.Append( CXXFLAGS = [ env['opt'] ] )
The dangers of -Ofast
^^^^^^^^^^^^^^^^^^^^^^^
----------------------
One of the options that ``-Ofast`` enables is ``-ffast-math``.
With that, a whole lot of other options get activated as well, such as
@@ -386,7 +393,7 @@ and
`<https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html>`_
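
To illustrate one of these dangers with a small standalone example (not taken from our code base): ``-Ofast`` implies ``-ffinite-math-only``, which allows the compiler to assume that NaN never occurs, so a NaN check may silently be optimized away.

.. code:: cpp

   #include <cmath>
   #include <cstdio>

   // Compiled with g++ -Ofast, this check may be folded to "false" because the
   // compiler is allowed to assume that NaN values never occur.
   bool isInvalidHeight(double i_h)
   {
       return std::isnan(i_h);
   }

   int main()
   {
       double l_h = std::nan(""); // e.g. the result of 0.0 / 0.0
       std::printf("NaN detected: %s\n", isInvalidHeight(l_h) ? "yes" : "no");
       return 0;
   }
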
8.2.4 - Compiler reports
------------------------
=========================
We added support for a compiler report flag with the following lines in our ``SConstruct``:
@@ -435,7 +442,7 @@ This snippet refers to the loops that provide our solver with data from a setup:
}
F-Wave optimization report
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
---------------------------
The full report can be found :download:`here <../../_static/text/task8-2-4_fwave_optrpt.txt>`.
@@ -484,7 +491,7 @@ For ``netUpdates``, the report tells us that
We can conclude that the compiler is able to inline our calls to ``computeEigenvalues`` and ``computeEigencoefficients``.
WavePropagation2d optimization report
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
--------------------------------------
The full report can be found :download:`here <../../_static/text/task8-2-4_waveprop2d_optrpt.txt>`.
@@ -514,12 +521,12 @@ could not be vectorized:
Lines 86 and 88 are the two for-loops over the y- and x-axis of the x-sweep, and
lines 152 and 154 are the two for-loops over the y- and x-axis of the y-sweep.
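
For reference, the loops in question have roughly the following shape (a simplified sketch, not the actual ``WavePropagation2d`` code). Whether such a nest vectorizes depends on the loop body; non-inlined solver calls or assumed dependences between neighbouring cells are common reasons for a "not vectorized" remark.

.. code:: cpp

   #include <cstddef>

   // schematic x-sweep: outer loop over rows, inner loop along the x-axis
   void xSweepSketch(float const *i_hOld, float *o_hNew,
                     std::size_t i_nx, std::size_t i_ny)
   {
       for (std::size_t l_y = 0; l_y < i_ny; l_y++)
       {
           for (std::size_t l_x = 1; l_x < i_nx; l_x++)
           {
               std::size_t l_id = l_y * i_nx + l_x;
               o_hNew[l_id] = 0.5f * (i_hOld[l_id - 1] + i_hOld[l_id]);
           }
       }
   }
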
*********************************************
8.3 Instrumentation and Performance Counters
==============================================
*********************************************
8.3.1 to 8.3.4 - VTune
-----------------------
=======================
First, we used the GUI of Intel VTune to specify our reports.
@@ -542,7 +549,7 @@ Then the following batch script was used to run the hotspots measurement:
/cluster/intel/vtune_profiler_2020.2.0.610396/bin64/vtune -collect hotspots -app-working-dir /beegfs/xe63nel/tsunami_lab/build -- /beegfs/xe63nel/tsunami_lab/build/tsunami_lab ../configs/config.json
Hotspots
^^^^^^^^^^
---------
.. image:: ../../_static/assets/task_8-3-1_hotspot_bottomUp.png
@@ -564,7 +571,7 @@ It was interesting to see (although it should not come as a surprise) that the `
of the CPU time.
Threads
^^^^^^^^^^
--------
.. image:: ../../_static/assets/task_8-3-1_threads.png
@@ -573,10 +580,10 @@ Threads
The poor result in the thread report was also expected, since we only compute sequentially.
8.3.5 - Code optimizations
---------------------------
===========================
TsunamiEvent2d speedup
^^^^^^^^^^^^^^^^^^^^^^^
-----------------------
In order to speed up this setup, we introduced a variable ``lastnegativeIndex`` for the X and Y directions of the bathymetry and displacement data.
The idea is the following:
@@ -634,7 +641,7 @@ Code snippets of the implementation:
}
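
A simplified sketch of this caching idea (class name, member names and the linear search are illustrative assumptions, not the actual ``TsunamiEvent2d`` code): because queries arrive with ascending coordinates, the search for the matching grid index can resume from the last hit instead of starting at index 0.

.. code:: cpp

   #include <cstddef>
   #include <utility>
   #include <vector>

   // queries arrive in ascending order, so the scan resumes at the cached index
   class CoordinateLookup
   {
   private:
       std::vector<float> m_coordinates;    // ascending grid coordinates
       std::size_t m_lastNegativeIndex = 0; // last index with a coordinate below the query

   public:
       explicit CoordinateLookup(std::vector<float> i_coordinates)
           : m_coordinates(std::move(i_coordinates)) {}

       // largest index whose coordinate is still smaller than the query value
       std::size_t findLowerIndex(float i_query)
       {
           std::size_t l_i = m_lastNegativeIndex;
           while (l_i + 1 < m_coordinates.size() && m_coordinates[l_i + 1] < i_query)
           {
               l_i++;
           }
           m_lastNegativeIndex = l_i;
           return l_i;
       }
   };
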
F-Wave solver optimization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
----------------------------
In ``computeEigencoefficients``, we changed
@@ -677,7 +684,7 @@ Furthermore, we established a constant for :code:`t_real(0.5) * m_g`:
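
A minimal sketch of hoisting this constant (``t_real`` and ``m_g`` mirror the solver's typedef and member name, the rest is assumed): the product is evaluated once instead of in every flux computation.

.. code:: cpp

   using t_real = float;

   class FWaveSketch
   {
   private:
       static constexpr t_real m_g     = t_real(9.80665);
       static constexpr t_real m_gHalf = t_real(0.5) * m_g; // hoisted constant

   public:
       // second component of the shallow water flux: h*u^2 + 1/2 * g * h^2
       static t_real flux(t_real i_h, t_real i_hu)
       {
           return i_hu * i_hu / i_h + m_gHalf * i_h * i_h;
       }
   };
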
Coarse Output optimization
^^^^^^^^^^^^^^^^^^^^^^^^^^^
----------------------------
Inside the ``write()`` function in ``NetCdf.cpp`` we calculated
@@ -699,8 +706,9 @@ once and then reuse it wherever we need it:
This way, the division only happens once.
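
A sketch of the pattern with assumed names: the divisor for averaging the fine cells of one coarse cell is inverted once, and every cell is then scaled by a multiplication.

.. code:: cpp

   #include <cstddef>

   // averages groups of i_k fine cells into coarse cells (assumes i_nFine % i_k == 0)
   void coarsenRow(float const *i_fine, float *o_coarse,
                   std::size_t i_nFine, std::size_t i_k)
   {
       float l_scaling = 1.0f / static_cast<float>(i_k); // the division happens once
       for (std::size_t l_c = 0; l_c < i_nFine / i_k; l_c++)
       {
           float l_sum = 0.0f;
           for (std::size_t l_f = 0; l_f < i_k; l_f++)
           {
               l_sum += i_fine[l_c * i_k + l_f];
           }
           o_coarse[l_c] = l_sum * l_scaling; // multiply instead of dividing per cell
       }
   }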
************************
Individual phase ideas
========================
************************
For the individual phase, we plan on building a graphical user interface using `ImGui <https://github.com/ocornut/imgui>`_.
29 changes: 18 additions & 11 deletions docs/source/files/assignments/09.rst
@@ -1,11 +1,13 @@
==================
9. Parallelization
********************
==================

************
9.1 OpenMP
============
************

9.1.1 - Parallelization with OpenMP
----------------------------------------
====================================

An easy way to parallelize our for loops is using

@@ -22,7 +24,7 @@ example:
...
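
As a minimal, self-contained sketch of this pattern (placeholder names, compile with ``-fopenmp``; not the actual member function):

.. code:: cpp

   #include <cstddef>

   // the outer loop over rows is split across the OpenMP threads
   void scaleHeights(float *io_height, std::size_t i_nx, std::size_t i_ny, float i_factor)
   {
   #pragma omp parallel for
       for (std::size_t l_y = 0; l_y < i_ny; l_y++)
       {
           for (std::size_t l_x = 0; l_x < i_nx; l_x++)
           {
               io_height[l_y * i_nx + l_x] *= i_factor;
           }
       }
   }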
9.1.2 - Parallelization speedup
------------------------------------------
====================================

We used the following batch script for ARA:

@@ -42,7 +44,8 @@ We have used following batch script for ara:
And got the following results:

**Without parallelization**
Without parallelization
-----------------------

.. code:: text
@@ -54,7 +57,8 @@ And got following results:
= 1941.01 seconds
= 32.3501 minutes
**With parallelization on 72 cores with 72 threads**
With parallelization on 72 cores with 72 threads
------------------------------------------------

.. code:: text
@@ -72,7 +76,8 @@ And got following results:

Speedup: :math:`\frac{1941}{75.5} = 25.7`

**With parallelization on 72 cores with 144 threads**
With parallelization on 72 cores with 144 threads
-------------------------------------------------

.. code:: text
@@ -90,7 +95,7 @@ We can see that having twice the amount of threads resulted in a much slower com
We conclude that using more threads than cores slows down performance.

9.1.3 - 2D for loop parallelization
------------------------------------------
====================================

The results above used parallelization of the outer loop.
Parallelizing the inner loops instead leads to the following time:
@@ -105,9 +110,10 @@ The parallelized inner loops leads to following time:
It is clear that parallelizing the outer loop is more efficient.
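
The difference can be seen in the following sketch of the slower variant (placeholder names): the parallel region is entered once per row, so thread start-up and scheduling overhead is paid once per row instead of once overall.

.. code:: cpp

   #include <cstddef>

   // only the inner loop is parallel, so the parallel-for overhead occurs per row
   void scaleHeightsInner(float *io_height, std::size_t i_nx, std::size_t i_ny, float i_factor)
   {
       for (std::size_t l_y = 0; l_y < i_ny; l_y++)
       {
   #pragma omp parallel for
           for (std::size_t l_x = 0; l_x < i_nx; l_x++)
           {
               io_height[l_y * i_nx + l_x] *= i_factor;
           }
       }
   }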

9.1.4 - Pinning and Scheduling
------------------------------------------
===============================

**Scheduling**
Scheduling
----------

The implementation above used the basic :code:`schedule(static)` scheduling.
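
For reference, the schedule is selected with a single clause on the pragma (a sketch with placeholder names; swapping ``static`` for ``dynamic``, ``guided`` or ``auto`` only changes this one clause):

.. code:: cpp

   #include <cstddef>

   // static scheduling: iterations are split into fixed, equally sized chunks
   void applySweep(float *io_data, std::size_t i_n)
   {
   #pragma omp parallel for schedule(static)
       for (std::size_t l_i = 0; l_i < i_n; l_i++)
       {
           io_data[l_i] += 1.0f;
       }
   }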

@@ -136,7 +142,8 @@ For :code:`scheduling(auto)` we get:
= 84.5467 seconds
= 1.40911 minutes
**Pinning**
Pinning
-------

Using :code:`OMP_PLACES={0}:36:1` we get:

16 changes: 10 additions & 6 deletions docs/source/files/assignments/project.rst
@@ -1,11 +1,13 @@
###################
10. Project Phase
********************
###################

In the project phase, we decided to implement a user-friendly GUI. The aim is to make the usage of our tsunami solver
as easy and interactive as possible.

*********************
GUI (Client-side)
==================
*********************

.. image:: ../../_static/assets/task-10-Gui_help.png

@@ -28,20 +30,22 @@ After selecting the simulation has to be recompiled with the according button be
The last tab contains further actions to interact with the simulation. First, the simulation can be started or killed here.
Files for the bathymetry and displacement can also be chosen. In addition, the user can retrieve data such as the height from the simulation.

*********************
Server-side
=============
*********************

*********************
Libraries
==============
*********************

Communicator
**************
=====================

For communication between the simulation and the GUI, we implemented a communication library.
The **Communicator.hpp** library can be used to easily create a client-server TCP connection and handle its communication and logging.
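
The snippet below is not the **Communicator.hpp** API, but a bare POSIX sketch of the kind of TCP client connection the library wraps; the host, port and message are made up for illustration.

.. code:: cpp

   #include <arpa/inet.h>
   #include <cstdio>
   #include <cstring>
   #include <netinet/in.h>
   #include <sys/socket.h>
   #include <unistd.h>

   int main()
   {
       int l_sock = socket(AF_INET, SOCK_STREAM, 0); // create a TCP socket
       if (l_sock < 0) return 1;

       sockaddr_in l_server{};
       l_server.sin_family = AF_INET;
       l_server.sin_port   = htons(8080); // assumed port
       inet_pton(AF_INET, "127.0.0.1", &l_server.sin_addr);

       if (connect(l_sock, reinterpret_cast<sockaddr *>(&l_server), sizeof(l_server)) == 0)
       {
           char const *l_msg = "GET_HEIGHT"; // hypothetical message
           send(l_sock, l_msg, std::strlen(l_msg), 0);

           char l_buffer[1024] = {};
           ssize_t l_read = recv(l_sock, l_buffer, sizeof(l_buffer) - 1, 0);
           if (l_read > 0) std::printf("received: %s\n", l_buffer);
       }
       close(l_sock);
       return 0;
   }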

Communicator API
******************
=====================

(**File: communicator_api.h**)
