Commit

changed message format

xLPMG committed Feb 1, 2024
1 parent f50e338 commit 3c64ac8

Showing 6 changed files with 481 additions and 400 deletions.
72 changes: 40 additions & 32 deletions docs/source/files/assignments/08.rst
@@ -1,15 +1,17 @@
################
8. Optimization
*****************
################

**********
8.1 ARA
========
**********

.. figure:: https://wiki.uni-jena.de/download/attachments/22453005/IMG_7381_0p5.JPG?version=1&modificationDate=1625042348365&api=v2

HPC-Cluster ARA. Source: https://wiki.uni-jena.de/pages/viewpage.action?pageId=22453005

8.1.1 - Uploading and running the code
----------------------------------------
========================================

First, we cloned our GitHub repository to ``beegfs`` and transferred the bathymetry and displacement data there with ``wget https://cloud.uni-jena.de/s/CqrDBqiMyKComPc/download/data_in.tar.xz -O tsunami_lab_data_in.tar.xz``.

@@ -41,9 +43,10 @@ sbatch file:
Since we only want to use one node, we set ``nodes`` and ``ntasks`` to 1 and ``cpus-per-task`` to 72.

8.1.2 - Visualizations
--------------------------
========================

**Tohoku 5000**
Tohoku 5000
-----------

.. raw:: html

@@ -52,7 +55,8 @@ Since we only want to use one node, we set ``nodes`` and ``ntasks`` to 1 and ``c
</video>


**Tohoku 1000**
Tohoku 1000
-----------

.. raw:: html

@@ -62,15 +66,17 @@ Since we only want to use one node, we set ``nodes`` and ``ntasks`` to 1 and ``c



**Chile 5000**
Chile 5000
-----------

.. raw:: html

<video width="100%" height="auto" controls>
<source src="../../_static/assets/task_8-1-2_chile_5000.mp4" type="video/mp4">
</video>

**Chile 1000**
Chile 1000
-----------

.. raw:: html

@@ -82,15 +88,15 @@ Since we only want to use one node, we set ``nodes`` and ``ntasks`` to 1 and ``c
Compared to the simulations from assignment 6, all simulations clearly behave identically.

8.1.3 - Private PC vs ARA
---------------------------
===========================

.. note::

The code was compiled using ``scons mode=benchmark opt=-O2``.
The benchmarking mode disables all file output (and also skips all imports of ``<filesystem>``).
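
As a minimal sketch of how such a switch can look (the ``TSUNAMI_BENCHMARK_MODE`` define, the function name and the paths are assumptions for illustration, not our actual build setup):

.. code:: cpp

   #include <cstddef>

   // Hypothetical compile-time switch: in benchmark mode no file is touched and
   // <filesystem>/<fstream> are never included.
   #ifndef TSUNAMI_BENCHMARK_MODE
   #include <filesystem>
   #include <fstream>
   #endif

   void writeStations(double const *i_data, std::size_t i_size)
   {
   #ifndef TSUNAMI_BENCHMARK_MODE
       std::filesystem::create_directories("stations");
       std::ofstream l_out("stations/output.bin", std::ios::binary);
       l_out.write(reinterpret_cast<char const *>(i_data),
                   static_cast<std::streamsize>(i_size * sizeof(double)));
   #else
       // benchmark mode: skip all file output
       (void)i_data;
       (void)i_size;
   #endif
   }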

Setups
^^^^^^^^^^
-------

If you are interested, you can view the used configurations here:

@@ -103,7 +109,7 @@ If you are interested, you can view the used configurations here:
:download:`tohoku1000.json <../../_static/text/tohoku1000.json>`

Results
^^^^^^^^^^
--------

.. list-table:: execution times on different devices
:header-rows: 1
@@ -187,17 +193,18 @@ Results
and stopped after the program has finished and all memory has been freed.

Observations
^^^^^^^^^^^^^^
--------------

In every scenario, ARA had a faster setup time but slower computation times.
We conclude that ARA has faster data/file access (since the setup heavily depends on how quickly data can be read from file),
while the private PC seems to have better single-core performance.

**************
8.2 Compilers
===============
**************

8.2.1 - Generic compiler support
---------------------------------
=================================

We enabled generic compiler support by adding the following lines to our ``SConstruct`` file

@@ -221,10 +228,10 @@ Now, scons can be invoked with a compiler of choice, for example by running
CXX=icpc scons
8.2.2 & 8.2.3 - Test runs
--------------------------
===========================

Time measurements
^^^^^^^^^^^^^^^^^^^^^^^^^
------------------

For each run, we used the following configuration:

@@ -313,7 +320,7 @@ We therefore ended up using ``compiler/intel/2018-Update1`` and ``gcc (GCC) 4.8.
This configuration was the only one that worked for us, as we did not manage to fix all the errors thrown by the other combinations.

Observations from the table
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
----------------------------

As one would intuitively expect, the higher the optimization level,
the quicker the program finished.
@@ -328,7 +335,7 @@ We would also need to ensure that there are no other intensive processes running
Nonetheless, using the table as a rough estimate, it seems that ``g++`` is faster with ``-O0`` and ``-Ofast``, while ``icpc`` is preferable for ``-O2``.

8.2.3 - Optimization flags
---------------------------
===========================

To allow for an easy switch between optimization flags, we added the following code to our SConstruct:

@@ -355,7 +362,7 @@ and
env.Append( CXXFLAGS = [ env['opt'] ] )
The dangers of -Ofast
^^^^^^^^^^^^^^^^^^^^^^^
----------------------
One of the options that ``-Ofast`` enables is ``-ffast-math``.
With that, a whole lot of other options get activated as well, such as
@@ -386,7 +393,7 @@ and
`<https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html>`_
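
To illustrate one of these dangers with a small standalone example (not taken from our code base): ``-Ofast`` implies ``-ffinite-math-only``, which allows the compiler to assume that NaN never occurs, so a NaN check may silently be optimized away.

.. code:: cpp

   #include <cmath>
   #include <cstdio>

   // Compiled with g++ -Ofast, this check may be folded to "false" because the
   // compiler is allowed to assume that NaN values never occur.
   bool isInvalidHeight(double i_h)
   {
       return std::isnan(i_h);
   }

   int main()
   {
       double l_h = std::nan(""); // e.g. the result of 0.0 / 0.0
       std::printf("NaN detected: %s\n", isInvalidHeight(l_h) ? "yes" : "no");
       return 0;
   }
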
8.2.4 - Compiler reports
------------------------
=========================
We added support for a compiler report flag with the following lines in our ``SConstruct``:
@@ -435,7 +442,7 @@ This snippet refers to the loops that provide our solver with data from a setup:
}
F-Wave optimization report
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
---------------------------
The full report can be found :download:`here <../../_static/text/task8-2-4_fwave_optrpt.txt>`.
@@ -484,7 +491,7 @@ For ``netUpdates``, the report tells us that
We can conclude that the compiler is able to inline our calls to ``computeEigenvalues`` and ``computeEigencoefficients``.
WavePropagation2d optimization report
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
--------------------------------------
The full report can be found :download:`here <../../_static/text/task8-2-4_waveprop2d_optrpt.txt>`.
@@ -514,12 +521,12 @@ could not be vectorized:
Lines 86 and 88 are the two for-loops over the y- and x-axis of the x-sweep, and
lines 152 and 154 are the two for-loops over the y- and x-axis of the y-sweep.
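
For reference, the loops in question have roughly the following shape (a simplified sketch, not the actual ``WavePropagation2d`` code). Whether such a nest vectorizes depends on the loop body; non-inlined solver calls or assumed dependences between neighbouring cells are common reasons for a "not vectorized" remark.

.. code:: cpp

   #include <cstddef>

   // schematic x-sweep: outer loop over rows, inner loop along the x-axis
   void xSweepSketch(float const *i_hOld, float *o_hNew,
                     std::size_t i_nx, std::size_t i_ny)
   {
       for (std::size_t l_y = 0; l_y < i_ny; l_y++)
       {
           for (std::size_t l_x = 1; l_x < i_nx; l_x++)
           {
               std::size_t l_id = l_y * i_nx + l_x;
               o_hNew[l_id] = 0.5f * (i_hOld[l_id - 1] + i_hOld[l_id]);
           }
       }
   }
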
*********************************************
8.3 Instrumentation and Performance Counters
==============================================
*********************************************
8.3.1 to 8.3.4 - VTune
-----------------------
=======================
First, we used the GUI of Intel VTune to specify our reports.
@@ -542,7 +549,7 @@ Then the following batch script was used to run the hotspots measurement:
/cluster/intel/vtune_profiler_2020.2.0.610396/bin64/vtune -collect hotspots -app-working-dir /beegfs/xe63nel/tsunami_lab/build -- /beegfs/xe63nel/tsunami_lab/build/tsunami_lab ../configs/config.json
Hotspots
^^^^^^^^^^
---------
.. image:: ../../_static/assets/task_8-3-1_hotspot_bottomUp.png
@@ -564,7 +571,7 @@ It was interesting to see (although it should not come as a surprise) that the `
of the CPU time.
Threads
^^^^^^^^^^
--------
.. image:: ../../_static/assets/task_8-3-1_threads.png
@@ -573,10 +580,10 @@ Threads
The poor result in the thread report was also expected, since we only compute sequentially.
8.3.5 - Code optimizations
---------------------------
===========================
TsunamiEvent2d speedup
^^^^^^^^^^^^^^^^^^^^^^^
-----------------------
In order to speed up this setup, we introduced a variable ``lastnegativeIndex`` for the X and Y directions of the bathymetry and displacement data.
The idea is the following:
@@ -634,7 +641,7 @@ Code snippets of the implementation:
}
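
A simplified sketch of this caching idea (class name, member names and the linear search are illustrative assumptions, not the actual ``TsunamiEvent2d`` code): because queries arrive with ascending coordinates, the search for the matching grid index can resume from the last hit instead of starting at index 0.

.. code:: cpp

   #include <cstddef>
   #include <utility>
   #include <vector>

   // queries arrive in ascending order, so the scan resumes at the cached index
   class CoordinateLookup
   {
   private:
       std::vector<float> m_coordinates;    // ascending grid coordinates
       std::size_t m_lastNegativeIndex = 0; // last index with a coordinate below the query

   public:
       explicit CoordinateLookup(std::vector<float> i_coordinates)
           : m_coordinates(std::move(i_coordinates)) {}

       // largest index whose coordinate is still smaller than the query value
       std::size_t findLowerIndex(float i_query)
       {
           std::size_t l_i = m_lastNegativeIndex;
           while (l_i + 1 < m_coordinates.size() && m_coordinates[l_i + 1] < i_query)
           {
               l_i++;
           }
           m_lastNegativeIndex = l_i;
           return l_i;
       }
   };
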
F-Wave solver optimization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
----------------------------
In ``computeEigencoefficients``, we changed
@@ -677,7 +684,7 @@ Furthermore, we established a constant for :code:`t_real(0.5) * m_g`:
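
A minimal sketch of hoisting this constant (``t_real`` and ``m_g`` mirror the solver's typedef and member name, the rest is assumed): the product is evaluated once instead of in every flux computation.

.. code:: cpp

   using t_real = float;

   class FWaveSketch
   {
   private:
       static constexpr t_real m_g     = t_real(9.80665);
       static constexpr t_real m_gHalf = t_real(0.5) * m_g; // hoisted constant

   public:
       // second component of the shallow water flux: h*u^2 + 1/2 * g * h^2
       static t_real flux(t_real i_h, t_real i_hu)
       {
           return i_hu * i_hu / i_h + m_gHalf * i_h * i_h;
       }
   };
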
Coarse Output optimization
^^^^^^^^^^^^^^^^^^^^^^^^^^^
----------------------------
Inside the ``write()`` function in ``NetCdf.cpp`` we calculated
@@ -699,8 +706,9 @@ once and then reuse it wherever we need it:
This way, the division only happens once.
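
A sketch of the pattern with assumed names: the divisor for averaging the fine cells of one coarse cell is inverted once, and every cell is then scaled by a multiplication.

.. code:: cpp

   #include <cstddef>

   // averages groups of i_k fine cells into coarse cells (assumes i_nFine % i_k == 0)
   void coarsenRow(float const *i_fine, float *o_coarse,
                   std::size_t i_nFine, std::size_t i_k)
   {
       float l_scaling = 1.0f / static_cast<float>(i_k); // the division happens once
       for (std::size_t l_c = 0; l_c < i_nFine / i_k; l_c++)
       {
           float l_sum = 0.0f;
           for (std::size_t l_f = 0; l_f < i_k; l_f++)
           {
               l_sum += i_fine[l_c * i_k + l_f];
           }
           o_coarse[l_c] = l_sum * l_scaling; // multiply instead of dividing per cell
       }
   }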
************************
Individual phase ideas
========================
************************
For the individual phase, we plan on building a graphical user interface using `ImGui <https://github.com/ocornut/imgui>`_.
29 changes: 18 additions & 11 deletions docs/source/files/assignments/09.rst
@@ -1,11 +1,13 @@
==================
9. Parallelization
********************
==================

************
9.1 OpenMP
============
************

9.1.1 - Parallelization with OpenMP
----------------------------------------
====================================

An easy way to parallelize our for loops is using

@@ -22,7 +24,7 @@ example:
...
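
As a minimal, self-contained sketch of this pattern (placeholder names, compile with ``-fopenmp``; not the actual member function):

.. code:: cpp

   #include <cstddef>

   // the outer loop over rows is split across the OpenMP threads
   void scaleHeights(float *io_height, std::size_t i_nx, std::size_t i_ny, float i_factor)
   {
   #pragma omp parallel for
       for (std::size_t l_y = 0; l_y < i_ny; l_y++)
       {
           for (std::size_t l_x = 0; l_x < i_nx; l_x++)
           {
               io_height[l_y * i_nx + l_x] *= i_factor;
           }
       }
   }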
9.1.2 - Parallelization speedup
------------------------------------------
====================================

We used the following batch script for ARA:

@@ -42,7 +44,8 @@ We have used following batch script for ara:
And got the following results:

**Without parallelization**
Without parallelization
-----------------------

.. code:: text
@@ -54,7 +57,8 @@ And got following results:
= 1941.01 seconds
= 32.3501 minutes
**With parallelization on 72 cores with 72 threads**
With parallelization on 72 cores with 72 threads
------------------------------------------------

.. code:: text
@@ -72,7 +76,8 @@ And got following results:

Speedup: :math:`\frac{1941}{75.5} = 25.7`

**With parallelization on 72 cores with 144 threads**
With parallelization on 72 cores with 144 threads
-------------------------------------------------

.. code:: text
@@ -90,7 +95,7 @@ We can see that having twice the amount of threads resulted in a much slower com
We conclude that using more threads than cores slows down performance.

9.1.3 - 2D for loop parallelization
------------------------------------------
====================================

The results above used parallelization of the outer loop.
Parallelizing the inner loops instead leads to the following time:
@@ -105,9 +110,10 @@ The parallelized inner loops leads to following time:
It is clear that parallelizing the outer loop is more efficient.
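
The difference can be seen in the following sketch of the slower variant (placeholder names): the parallel region is entered once per row, so thread start-up and scheduling overhead is paid once per row instead of once overall.

.. code:: cpp

   #include <cstddef>

   // only the inner loop is parallel, so the parallel-for overhead occurs per row
   void scaleHeightsInner(float *io_height, std::size_t i_nx, std::size_t i_ny, float i_factor)
   {
       for (std::size_t l_y = 0; l_y < i_ny; l_y++)
       {
   #pragma omp parallel for
           for (std::size_t l_x = 0; l_x < i_nx; l_x++)
           {
               io_height[l_y * i_nx + l_x] *= i_factor;
           }
       }
   }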

9.1.4 - Pinning and Scheduling
------------------------------------------
===============================

**Scheduling**
Scheduling
----------

The implementation above used the basic :code:`schedule(static)` scheduling.
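
For reference, the schedule is selected with a single clause on the pragma (a sketch with placeholder names; swapping ``static`` for ``dynamic``, ``guided`` or ``auto`` only changes this one clause):

.. code:: cpp

   #include <cstddef>

   // static scheduling: iterations are split into fixed, equally sized chunks
   void applySweep(float *io_data, std::size_t i_n)
   {
   #pragma omp parallel for schedule(static)
       for (std::size_t l_i = 0; l_i < i_n; l_i++)
       {
           io_data[l_i] += 1.0f;
       }
   }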

@@ -136,7 +142,8 @@ For :code:`scheduling(auto)` we get:
= 84.5467 seconds
= 1.40911 minutes
**Pinning**
Pinning
-------

Using :code:`OMP_PLACES={0}:36:1` we get:

16 changes: 10 additions & 6 deletions docs/source/files/assignments/project.rst
@@ -1,11 +1,13 @@
###################
10. Project Phase
********************
###################

In the project phase, we decided to implement a user-friendly GUI. The aim is to make the usage of our tsunami solver
as easy and interactive as possible.

*********************
GUI (Client-side)
==================
*********************

.. image:: ../../_static/assets/task-10-Gui_help.png

@@ -28,20 +30,22 @@ After selecting the simulation has to be recompiled with the according button be
The last tab contains further actions to interact with the simulation. First, the simulation can be started or killed here.
Files for the bathymetry and displacement can also be chosen. In addition, the user can retrieve data such as the height from the simulation.

*********************
Server-side
=============
*********************

*********************
Libraries
==============
*********************

Communicator
**************
=====================

For communication between the simulation and the GUI, we implemented a communication library.
The **Communicator.hpp** library can be used to easily create a client-server TCP connection and handle its communication and logging.
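
The snippet below is not the **Communicator.hpp** API, but a bare POSIX sketch of the kind of TCP client connection the library wraps; the host, port and message are made up for illustration.

.. code:: cpp

   #include <arpa/inet.h>
   #include <cstdio>
   #include <cstring>
   #include <netinet/in.h>
   #include <sys/socket.h>
   #include <unistd.h>

   int main()
   {
       int l_sock = socket(AF_INET, SOCK_STREAM, 0); // create a TCP socket
       if (l_sock < 0) return 1;

       sockaddr_in l_server{};
       l_server.sin_family = AF_INET;
       l_server.sin_port   = htons(8080); // assumed port
       inet_pton(AF_INET, "127.0.0.1", &l_server.sin_addr);

       if (connect(l_sock, reinterpret_cast<sockaddr *>(&l_server), sizeof(l_server)) == 0)
       {
           char const *l_msg = "GET_HEIGHT"; // hypothetical message
           send(l_sock, l_msg, std::strlen(l_msg), 0);

           char l_buffer[1024] = {};
           ssize_t l_read = recv(l_sock, l_buffer, sizeof(l_buffer) - 1, 0);
           if (l_read > 0) std::printf("received: %s\n", l_buffer);
       }
       close(l_sock);
       return 0;
   }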

Communicator API
******************
=====================

(**File: communicator_api.h**)
