Convert DG-RePlAce algorithm to Kokkos #5352

kamilrakoczy · 2024-07-08T09:41:44Z

This MR converts the DG-RePlAce algorithm, which was originally written for CUDA, to Kokkos.

Kokkos provides an abstraction for writing parallel code that can be compiled for several backends, including CUDA, OpenMP, and C++ threads.

Tested with a single run on an RTX 3090 and an i7-8700 CPU @ 3.20GHz using the ariane133 design.

	original placer	CUDA implementation	Kokkos (CUDA backend)	Kokkos (OpenMP backend)	Kokkos (Threads backend)
ariane133 global place time	11:27.39	0:57.70	1:33.49	3:24.12	6:08.94

github-actions

clang-tidy made some suggestions

There were too many comments to post at once. Showing the first 25 out of 52. Check the log or trigger a new build to see more.

github-actions · 2024-07-08T09:49:01Z

src/gpl2/src/MakeDgReplace.cpp

+// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+///////////////////////////////////////////////////////////////////////////////
+
+#include "gpl2/MakeDgReplace.h"


warning: 'gpl2/MakeDgReplace.h' file not found [clang-diagnostic-error]

#include "gpl2/MakeDgReplace.h" ^

github-actions · 2024-07-08T09:49:01Z

src/gpl2/src/dct.cpp

+//
+///////////////////////////////////////////////////////////////////////////////
+
+#include <Kokkos_Core.hpp>


warning: 'Kokkos_Core.hpp' file not found [clang-diagnostic-error]

#include <Kokkos_Core.hpp> ^

github-actions · 2024-07-08T09:49:01Z

src/gpl2/src/dct.h

+//
+//
+///////////////////////////////////////////////////////////////////////////////
+#include <Kokkos_Core.hpp>


warning: 'Kokkos_Core.hpp' file not found [clang-diagnostic-error]

#include <Kokkos_Core.hpp> ^

github-actions · 2024-07-08T09:49:01Z

src/gpl2/src/dct.h

+///////////////////////////////////////////////////////////////////////////////
+#include <Kokkos_Core.hpp>
+
+void dct_2d_fft(const int M,


warning: parameter 'M' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]

Suggested change

void dct_2d_fft(const int M,

void dct_2d_fft(int M,
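The check fires because top-level `const` on a by-value parameter is not part of the function type, so it only matters in the definition. A minimal sketch of that rule, using a hypothetical `scale` function rather than the real `dct_2d_fft` signature:

```cpp
#include <cassert>
#include <type_traits>

// Declaration without const: this is all that callers see.
int scale(int m, int factor);

// Top-level const on by-value parameters is dropped from the function type,
// so both spellings below name exactly the same type.
static_assert(
    std::is_same<int(int, int), int(const int, const int)>::value,
    "top-level const is ignored in the function type");

// Definition with const: it matches the declaration above, and the qualifier
// only prevents the body from modifying its local copies.
int scale(const int m, const int factor)
{
  assert(factor != 0);  // illustrative precondition, not from the PR
  return m * factor;
}
```

Calling `scale(3, 4)` resolves to the single function above, whichever spelling the header uses.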

github-actions · 2024-07-08T09:49:01Z

src/gpl2/src/dct.h

+#include <Kokkos_Core.hpp>
+
+void dct_2d_fft(const int M,
+                const int N,


warning: parameter 'N' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]

Suggested change

const int N,

int N,

github-actions · 2024-07-08T09:49:03Z

src/gpl2/src/placerBase.cpp

+    binCntY_ = 512;
+  }
+
+  binSizeX_ = ceil(static_cast<float>((ux_ - lx_)) / binCntX_);


warning: call to 'ceil' promotes float to double [performance-type-promotion-in-math-fn]

src/gpl2/src/placerBase.cpp:40:

- #include <cstdio>
+ #include <cmath>
+ #include <cstdio>

Suggested change

binSizeX_ = ceil(static_cast<float>((ux_ - lx_)) / binCntX_);

binSizeX_ = std::ceil(static_cast<float>((ux_ - lx_)) / binCntX_);
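The reason the `std::` prefix matters is that `std::ceil` has a `float` overload, so the value is never widened to `double`. A small sketch, assuming an integer bin size and a hypothetical `binSize` helper that mirrors the expression above:

```cpp
#include <cassert>
#include <cmath>
#include <type_traits>

// std::ceil called with a float returns a float: no promotion to double.
static_assert(std::is_same<decltype(std::ceil(1.0f)), float>::value,
              "std::ceil(float) stays in single precision");

// Hypothetical helper mirroring the binSizeX_ computation: the extent of the
// core area divided by the bin count, rounded up.
int binSize(int lower, int upper, int binCount)
{
  assert(binCount > 0 && upper >= lower);
  const float extent = static_cast<float>(upper - lower);
  return static_cast<int>(std::ceil(extent / binCount));
}
```

For example, an extent of 1000 split into 512 bins gives a bin size of 2.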

github-actions · 2024-07-08T09:49:03Z

src/gpl2/src/placerBase.cpp

+  }
+
+  binSizeX_ = ceil(static_cast<float>((ux_ - lx_)) / binCntX_);
+  binSizeY_ = ceil(static_cast<float>((uy_ - ly_)) / binCntY_);


warning: call to 'ceil' promotes float to double [performance-type-promotion-in-math-fn]

Suggested change

binSizeY_ = ceil(static_cast<float>((uy_ - ly_)) / binCntY_);

binSizeY_ = std::ceil(static_cast<float>((uy_ - ly_)) / binCntY_);

github-actions · 2024-07-08T09:49:03Z

src/gpl2/src/placerBase.h

+#include <string>
+#include <vector>
+
+#include "db_sta/dbNetwork.hh"


warning: 'db_sta/dbNetwork.hh' file not found [clang-diagnostic-error]

#include "db_sta/dbNetwork.hh" ^

github-actions · 2024-07-08T09:49:03Z

src/gpl2/src/placerBase.h

+  int64_t nesterovInstsArea() const
+  {
+    return stdInstsArea_
+           + static_cast<int64_t>(round(macroInstsArea_ * targetDensity_));


warning: call to 'round' promotes float to double [performance-type-promotion-in-math-fn]

src/gpl2/src/placerBase.h:38:

- #include <memory>
+ #include <cmath>
+ #include <memory>

Suggested change

+ static_cast<int64_t>(round(macroInstsArea_ * targetDensity_));

+ static_cast<int64_t>(std::round(macroInstsArea_ * targetDensity_));

github-actions · 2024-07-08T09:49:03Z

src/gpl2/src/placerObjects.cpp

+///////////////////////////////////////////////////////////////
+// Instance
+Instance::Instance()
+    : inst_(nullptr),


warning: member initializer for 'inst_' is redundant [modernize-use-default-member-init]

Suggested change

: inst_(nullptr),

: ,
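The automated suggestion above does not compile as written. What the check is asking for is roughly the following: give the member a default in the class body and drop it from the constructor's initializer list. A sketch with hypothetical, simplified types (`DbInst` and `InstanceSketch` are stand-ins, not the real gpl2 classes):

```cpp
#include <cassert>

// Hypothetical stand-in for the database instance type.
struct DbInst
{
};

class InstanceSketch
{
 public:
  // Nothing to list here: the defaults below already cover every member.
  InstanceSketch() = default;

  // Only members that differ from their default need an initializer.
  explicit InstanceSketch(DbInst* inst) : inst_(inst)
  {
    assert(inst != nullptr);
  }

  const DbInst* inst() const { return inst_; }
  bool isFixed() const { return isFixed_; }

 private:
  DbInst* inst_ = nullptr;  // default member initializer
  bool isFixed_ = false;    // default member initializer
};
```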

maliberty · 2024-07-08T15:30:55Z

Earlier the runtime difference was reported to be minimal, but 0:57.70 vs 1:33.49 is more substantial. Is this expected?

kamilrakoczy · 2024-07-09T09:03:40Z

Earlier the runtime difference was reported to be minimal, but 0:57.70 vs 1:33.49 is more substantial. Is this expected?

Earlier measurements were done when some parts were still using native CUDA, and on a different design (black-parrot).
These measurements are from a single run on a local machine that was also being used for other things, so they are not very accurate.

I'd expect it to be possible to achieve a similar runtime using Kokkos. These results might suggest that there are some unnecessary memory copies between host and device, but this needs to be investigated further.

maliberty · 2024-07-09T17:30:25Z

Please try to get a more precise measure of the runtime difference as this is important in deciding whether Kokkos is a good alternative to direct CUDA coding.

Do all the various versions produce the same result? That is also important.

maliberty · 2024-07-09T17:33:28Z

What was the thinking behind making kokkos a dependency but kokkos-fft a submodule? It seems like they could both be build dependencies (and added to the DependencyInstaller with an option).

QuantamHD · 2024-07-09T18:31:57Z

Please try to get a more precise measure of the runtime difference as this is important in deciding whether Kokkos is a good alternative to direct CUDA coding.

I think I would say direct CUDA coding isn't really a viable option. I would be personally opposed to its inclusion. I think Kokkos or something like it is the only viable path forward. The runtime differences don't look significant if you compare it to the overall speedup achieved.

We're going for a pragmatic path forward, and to me this meets my bar for the goals we set out.

Do all the various versions produce the same result? That is also important.

Agree that this is important to check. We may need to order the floats to get identical/sufficiently similar results.
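To illustrate why ordering matters: float addition is not associative, so the same values summed in a different order (as different backends may do) can give different totals. A minimal sketch in plain C++ (not Kokkos code; `sumInOrder` and `sumSorted` are hypothetical names), where sorting first makes the result independent of the input order:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Adds the values in the order given. Because float addition is not
// associative, the total depends on that order.
float sumInOrder(const std::vector<float>& values)
{
  float total = 0.0f;
  for (float v : values) {
    total += v;
  }
  return total;
}

// Sorts a copy first, so every permutation of the same values is added in
// the same sequence and yields the same total.
float sumSorted(std::vector<float> values)
{
  assert(std::none_of(values.begin(), values.end(), [](float v) {
    return std::isnan(v);
  }));
  std::sort(values.begin(), values.end());
  return sumInOrder(values);
}
```

With `{1e8f, 1.0f, -1e8f}` the in-order sum loses the `1.0f`, while `{1e8f, -1e8f, 1.0f}` keeps it; `sumSorted` returns the same value for both. Sorting makes the result reproducible, not more accurate.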

maliberty · 2024-07-09T20:01:25Z

I think I would say direct CUDA coding isn't really a viable option. I would be personally opposed to its inclusion.

You personally pushed for the inclusion of gpuSolver.cu and said it was valuable as a template for future development. Shall we delete it? I was never in favor.

A 50% overhead is worth exploring to at least understand if not eliminate.
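For reference, the overhead implied by the single-run timings from the first table (0:57.70 for CUDA, 1:33.49 for the Kokkos CUDA backend) can be worked out directly. A small sketch with hypothetical helper names, assuming the `m:ss.ff` time format used in that table:

```cpp
#include <cassert>
#include <string>

// Converts a "m:ss.ff" string, as used in the timing tables, to seconds.
double toSeconds(const std::string& text)
{
  const std::string::size_type colon = text.find(':');
  assert(colon != std::string::npos);
  const double minutes = std::stod(text.substr(0, colon));
  const double seconds = std::stod(text.substr(colon + 1));
  return minutes * 60.0 + seconds;
}

// Relative overhead of `candidate` over `baseline`; 0.5 means 50% slower.
double overhead(const std::string& baseline, const std::string& candidate)
{
  return toSeconds(candidate) / toSeconds(baseline) - 1.0;
}
```

`overhead("0:57.70", "1:33.49")` comes out at roughly 0.62, i.e. about 62% for that single run.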

QuantamHD · 2024-07-09T21:50:06Z

You personally pushed for the inclusion of gpuSolver.cu and said it was valuable as a template for future development. Shall we delete it? I was never in favor.

I think that seems like the right move at this point. With more time and context I don't think it's viable for us to maintain two codebases.

A 50% overhead is worth exploring to at least understand if not eliminate.

+1. I just want to point out that if this is the fastest we can go, that seems fast enough to me.

kamilrakoczy · 2024-07-18T13:56:48Z

Do all the various versions produce the same result? That is also important.

No, they don't, and it was quite surprising, as I expected that the original code and Kokkos with the CUDA backend would produce the same result.
We investigated this, and it turned out to be because Kokkos passes all files that depend on it through nvcc_wrapper. This wrapper converts host compiler options (g++) to nvcc options and uses nvcc to compile all Kokkos-dependent sources. This is done to allow device code in a single .cpp file instead of a separate .cu file.

NVCC should do pre-processing and compilation for device code, produce the CUDA binary, and leave host code to the host compiler.

We checked that when nvcc is used to compile InitialPlace, Eigen's solveWithGuess returns different results for exactly the same inputs compared to using g++ directly.

I suspect that this issue isn't only related to Eigen: when I disabled initial placement, the runtimes of Kokkos and the original code were almost the same, but the results were still different (I haven't investigated the reason for this).

What was the thinking behind making kokkos a dependency but kokkos-fft a submodule? It seems like they could both be build dependencies (and added to the DependencyInstaller with an option).

kokkos-fft is a header-only interface library that translates FFT calls to the proper backend by detecting the backends enabled in Kokkos, but I agree: if preferred, both kokkos and kokkos-fft could be dependencies.

A 50% overhead is worth exploring to at least understand if not eliminate.

I think this overhead is due to the different initial placement; when initial placement is disabled, the runtime is very similar:

	CUDA implementation	Kokkos (CUDA backend)
ariane133 global place time without initial placement	0:55.52	0:58.25

I also did more precise measurements using an RTX 3080, an 8 vCPU i9-12900 @ 2.42 GHz, and 32GB of RAM, with 10 runs on the ariane133 design:

	min time [m:ss]	avg time [m:ss]	med time [m:ss]	max time [m:ss]
CUDA implementation	0:45	0:48	0:47	0:53
Kokkos (CUDA backend)	1:53	1:57	1:57	2:00
Kokkos (OpenMP backend)	1:50	2:04	1:54	2:37
Kokkos (threads backend)	3:42	3:43	3:43	3:45

maliberty · 2024-07-18T15:10:28Z

Thanks for the analysis. It would be good to get to the bottom of the difference as it will make regression testing hard otherwise. Is nvcc calling g++ with different flags?

kamilrakoczy · 2024-07-19T07:33:04Z

Is nvcc calling g++ with different flags?

The arguments passed to nvcc, which nvcc should pass on to g++, are the same.
I haven't yet investigated how (with what flags) g++ is invoked from nvcc.

maliberty · 2024-07-19T15:38:07Z

Another possibility is that it is invoking a different g++ binary from another path.

maliberty · 2024-10-14T04:42:11Z

Converted to a draft due to no progress.

jbylicki · 2025-01-07T13:17:36Z

I've rebased this branch onto latest master and started resolving the mentioned issues:

Eigen’s solveWithGuess() behaves differently on the Kokkos branch (with a suggestion that this is caused by nvcc_wrapper, the part of Kokkos responsible for redirecting compilations that do not pertain to CUDA to the host compiler):

I've found that not to be the case. Earlier, I recreated the same condition (where Eigen was running slowly) using clang++ as the Kokkos compiler and confirmed that nvcc_wrapper was not used then. The problem was that Eigen, on detecting CUDA availability, was trying to use it. Nevertheless, I saw no peak in GPU usage while initial_place was running, so I disabled it and saw the numbers return to baseline (the same as in the CUDA-native implementation).

What is the performance difference between Kokkos and CUDA-native implementations?

To prioritize merging of GPU-accelerated placement, the focus was on getting the branch issue-free before optimizing. In my testing, the Kokkos-based algorithm on black-parrot spends about 10 seconds in libcuda.so, whereas the CUDA-native implementation spends around 5. All other timings are comparable, making the entire run about 5 seconds longer.

Future / subsequent work:

Make Kokkos a submodule: Due to varying conditions on host machines, most Kokkos libraries available as packages ship without either CUDA or OMP support. A dependency that has to be manually compiled and configured correctly to get a functioning and fast implementation might introduce complexity for the end user. Therefore, I suggest not migrating kokkos-fft to be a dependency and instead using kokkos, which is already cloned as a submodule of kokkos-fft, as an in-tree library. The issue I'm currently facing is that internal deprecations of CMake symbols are triggered when Kokkos is compiled as a child project rather than as the parent.
Optimize memory accesses and the Kokkos implementation itself: I've confirmed that memory copying is one of the causes of the algorithm being slower, and fixes are in development, waiting for the more pressing issues to be resolved.

jbylicki · 2025-01-09T17:11:51Z

I added a configuration option to etc/Build.sh, -use_gpl2, that includes the gpl2 subdirectory and launches the compilation of kokkos via kokkos-fft in CMake. I additionally made the -gpu flag of the build script enable the CUDA backend in Kokkos.

maliberty · 2025-01-09T17:37:09Z

I would prefer to see kokkos as part of the dependency installer rather than as a submodule. There should be no need to compile it for each workspace on a machine.

jbylicki · 2025-01-09T17:56:27Z

With the current setup, it would be possible to support both compilation schemes, with priority given to the DependencyInstaller: if a system-wide Kokkos installation is detected, it will be used during compilation. I would suggest keeping the option to use in-tree Kokkos and kokkos-fft (if kokkos-fft is also moved to be downloaded via DependencyInstaller), as the script is tailored only towards Ubuntu users. If a system-wide package is not detected, both dependencies can be installed via FetchContent and built in-tree.

maliberty · 2025-01-09T18:01:00Z

If someone wants to put a local copy in-tree that's fine but I'd like to avoid having a submodule.

jbylicki · 2025-01-09T18:07:24Z

I'll add support for kokkos and kokkos-fft via the DependencyInstaller then. The submodule could be deleted while keeping in-tree support: if a system-wide package is absent, CMake would handle the download through the FetchContent directive, and the build would have conditionals in place to link correctly.

jbylicki · 2025-02-12T16:51:35Z

I've added nested parallelism to the most time consuming kernel - computeBCPosNegKernel. After rebasing both branches to the same base commit, the performance results are as follows for the black-parrot design with the CUDA backend:

CUDA-native: 24.606 seconds (total time: 114.50 s, skipped initial place: 94.49 s)
Kokkos: 23.614 seconds (total time: 114.42 s, skipped initial place: 95.07 s)

Additionally, a concern was raised about the non-deterministic results returned from Kokkos, depending on the compute device used for processing. To validate the flow, each variant was run from synthesis to the final step. While it's true that the results vary, the variation has minimal impact on the actual parameters of the finished flow. Additionally, the results are deterministic on a per-device basis, even when the compute device is under heavy external load (especially applicable to GPUs).

Test subjects were:

master branch commit 7e0fce872123, as baseline and base for other branches
cuda-native, the original CUDA-native implementation, rebased onto the same base as other branches
kokkos-cpu, the Kokkos-based flow, run on the OpenMP backend
kokkos-gpu, the Kokkos-based flow, run on the CUDA backend

Metrics collected were taken from the final report and log, and were:

Total Negative Slack (tns)
Worst Negative Slack (wns)
Total power
Design area and utilization

Results:

Branch	TNS	WNS	Design area, utilization	Total Power
`master`	-2.42	-2.42	760397 u^2 45% utilization	2.57e-01 W
`cuda-native`	-2.40	-2.40	753511 u^2 44% utilization	2.49e-01 W
`kokkos-cpu`	-2.49	-2.49	753608 u^2 44% utilization	2.50e-01 W
`kokkos-gpu`	-2.44	-2.44	753674 u^2 44% utilization	2.50e-01 W

maliberty · 2025-02-12T17:43:48Z

Very nice! How is the cpu vs gpu runtime with your latest changes? Is this ready for review?

github-actions

clang-tidy made some suggestions

There were too many comments to post at once. Showing the first 25 out of 45. Check the log or trigger a new build to see more.

github-actions · 2025-02-12T17:48:37Z

src/gpl2/src/dct.h

+///////////////////////////////////////////////////////////////////////////////
+#include <Kokkos_Core.hpp>
+
+void dct_2d_fft(const int M,


warning: parameter 'M' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]

Suggested change

void dct_2d_fft(const int M,

void dct_2d_fft(int M,

github-actions · 2025-02-12T17:48:38Z

src/gpl2/src/dct.h

+#include <Kokkos_Core.hpp>
+
+void dct_2d_fft(const int M,
+                const int N,


warning: parameter 'N' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]

Suggested change

const int N,

int N,

github-actions · 2025-02-12T17:48:38Z

src/gpl2/src/dct.h

+                const Kokkos::View<Kokkos::complex<float>*>& fft,
+                const Kokkos::View<float*>& post);
+
+void idct_2d_fft(const int M,


warning: parameter 'M' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]

Suggested change

void idct_2d_fft(const int M,

void idct_2d_fft(int M,

github-actions · 2025-02-12T17:48:38Z

src/gpl2/src/dct.h

+                const Kokkos::View<float*>& post);
+
+void idct_2d_fft(const int M,
+                 const int N,


warning: parameter 'N' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]

Suggested change

const int N,

int N,

github-actions · 2025-02-12T17:48:38Z

src/gpl2/src/dct.h

+                 const Kokkos::View<float*>& ifft,
+                 const Kokkos::View<float*>& post);
+
+void idxst_idct(const int M,


warning: parameter 'M' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]

Suggested change

void idxst_idct(const int M,

void idxst_idct(int M,

github-actions · 2025-02-12T17:48:40Z

src/gpl2/src/placerObjects.cpp

+      densityScale_(0.0),
+      haloWidth_(0),
+      type_(InstanceType::FILLER),
+      isFixed_(false)


warning: member initializer for 'isFixed_' is redundant [modernize-use-default-member-init]

Suggested change

isFixed_(false)

github-actions · 2025-02-12T17:48:40Z

src/gpl2/src/placerObjects.cpp

+  int lx = 0.0;
+  int ly = 0.0;
+  inst->getLocation(lx, ly);
+  int ux = lx + floor(bbox->getDX() / 2) * 2;


warning: result of integer division used in a floating point context; possible loss of precision [bugprone-integer-division]

int ux = lx + floor(bbox->getDX() / 2) * 2; ^

github-actions · 2025-02-12T17:48:40Z

src/gpl2/src/placerObjects.cpp

+  int ly = 0.0;
+  inst->getLocation(lx, ly);
+  int ux = lx + floor(bbox->getDX() / 2) * 2;
+  int uy = ly + floor(bbox->getDY() / 2) * 2;


warning: result of integer division used in a floating point context; possible loss of precision [bugprone-integer-division]

int uy = ly + floor(bbox->getDY() / 2) * 2; ^
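The warning points at `floor` being applied to a value that integer division has already truncated. A sketch of the two things the expression could have meant, assuming `getDX()` returns a non-negative integer (the helper names are hypothetical):

```cpp
#include <cassert>
#include <cmath>

// Integer division truncates toward zero, so for non-negative dx the result
// of dx / 2 is already rounded down and a floor() around it does nothing.
// This rounds dx down to the nearest even number using integers only.
int roundDownToEven(int dx)
{
  assert(dx >= 0);
  return (dx / 2) * 2;
}

// If a fractional half-width was intended instead, the division itself has
// to happen in floating point.
float halfWidth(int dx)
{
  return static_cast<float>(dx) / 2.0f;
}
```

For `dx = 7`, `roundDownToEven` gives 6 (the same value as `floor(7 / 2) * 2`), while `halfWidth` gives 3.5.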

github-actions · 2025-02-12T17:48:41Z

src/gpl2/src/placerObjects.cpp

+  }
+}
+
+void Instance::dbSetPlacementStatus(odb::dbPlacementStatus ps)


warning: the parameter 'ps' is copied for each invocation but only used as a const reference; consider making it a const reference [performance-unnecessary-value-param]

Suggested change

void Instance::dbSetPlacementStatus(odb::dbPlacementStatus ps)

void Instance::dbSetPlacementStatus(const odb::dbPlacementStatus& ps)

src/gpl2/src/placerObjects.h:105:

- void dbSetPlacementStatus(odb::dbPlacementStatus ps);
+ void dbSetPlacementStatus(const odb::dbPlacementStatus& ps);

github-actions · 2025-02-12T17:48:41Z

src/gpl2/src/placerObjects.cpp

+////////////////////////////////////////////////////////
+// Pin
+Pin::Pin()
+    : pin_(nullptr),


warning: member initializer for 'pin_' is redundant [modernize-use-default-member-init]

Suggested change

: pin_(nullptr),

: ,

jbylicki · 2025-02-13T19:44:33Z

Yes, it's ready for review. I've applied the suggested clang-tidy fixes and added the missing RockyLinux9 package.

The performance difference between CUDA and OpenMP backends on black_parrot is:

CUDA: 85.38 s (dg_global_place call time: 20.46 s)
OpenMP: 96.58 s (dg_global_place call time: 29.83 s)

The test setup is an Intel i7-8700 and an NVIDIA GTX 1080Ti.

github-actions

clang-tidy made some suggestions

github-actions · 2025-02-14T05:40:27Z

src/gpl2/src/placerObjects.cpp

+////////////////////////////////////////////////////////////////////////////////////////////////
+// Net
+Net::Net()
+    : net_(nullptr),


warning: member initializer for 'net_' is redundant [modernize-use-default-member-init]

Suggested change

: net_(nullptr),

: ,

github-actions · 2025-02-14T05:40:27Z

src/gpl2/src/placerObjects.cpp

+// Net
+Net::Net()
+    : net_(nullptr),
+      netId_(-1),


warning: member initializer for 'netId_' is redundant [modernize-use-default-member-init]

Suggested change

netId_(-1),

,

github-actions · 2025-02-14T05:40:27Z

src/gpl2/src/placerObjects.cpp

+      ly_(0),
+      ux_(0),
+      uy_(0),
+      isDontCare_(false),


warning: member initializer for 'isDontCare_' is redundant [modernize-use-default-member-init]

Suggested change

isDontCare_(false),

,

github-actions · 2025-02-14T05:40:27Z

src/gpl2/src/placerObjects.cpp

+      ux_(0),
+      uy_(0),
+      isDontCare_(false),
+      virtualWeight_(0.0),


warning: member initializer for 'virtualWeight_' is redundant [modernize-use-default-member-init]

Suggested change

virtualWeight_(0.0),

,

github-actions · 2025-02-14T05:40:28Z

src/gpl2/src/placerObjects.cpp

+      uy_(0),
+      isDontCare_(false),
+      virtualWeight_(0.0),
+      weight_(1.0)


warning: member initializer for 'weight_' is redundant [modernize-use-default-member-init]

Suggested change

weight_(1.0)

github-actions · 2025-02-14T05:40:29Z

src/gpl2/src/poissonSolver.h

+                    Kokkos::View<float*> electroForceY);
+
+  // Compute Potential Only (not Electric Force) the row-major order
+  void solvePoissonPotential(const Kokkos::View<float*> binDensity, Kokkos::View<float*> potential);


warning: parameter 'binDensity' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]

Suggested change

void solvePoissonPotential(const Kokkos::View<float*> binDensity, Kokkos::View<float*> potential);

void solvePoissonPotential(Kokkos::View<float*> binDensity, Kokkos::View<float*> potential);

github-actions · 2025-02-14T05:40:29Z

src/gpl2/src/routeBase.cpp

+// RouteBase
+
+RouteBase::RouteBase()
+    : rbVars_(), db_(nullptr), grouter_(nullptr), nbc_(nullptr), log_(nullptr)


warning: initializer for member 'rbVars_' is redundant [readability-redundant-member-init]

Suggested change

: rbVars_(), db_(nullptr), grouter_(nullptr), nbc_(nullptr), log_(nullptr)

: , db_(nullptr), grouter_(nullptr), nbc_(nullptr), log_(nullptr)

github-actions · 2025-02-14T05:40:29Z

src/gpl2/src/routeBase.cpp

+RouteBase::RouteBase(RouteBaseVars rbVars,
+                           odb::dbDatabase* db,
+                           grt::GlobalRouter* grouter,
+                           std::shared_ptr<PlacerBaseCommon> nbc,


warning: the parameter 'nbc' is copied for each invocation but only used as a const reference; consider making it a const reference [performance-unnecessary-value-param]

Suggested change

std::shared_ptr<PlacerBaseCommon> nbc,

const std::shared_ptr<PlacerBaseCommon>& nbc,
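The cost this check refers to is the reference-count update made on every by-value copy of a `std::shared_ptr`. A small sketch that makes the copy visible through `use_count()`; `PlacerBaseCommonSketch` and both function names are hypothetical stand-ins:

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the real PlacerBaseCommon class.
struct PlacerBaseCommonSketch
{
  int numInsts = 0;
};

// By value: the parameter is a second owner for the duration of the call,
// which costs an atomic increment and decrement of the reference count.
long ownersWhenPassedByValue(std::shared_ptr<PlacerBaseCommonSketch> nbc)
{
  assert(nbc != nullptr);
  return nbc.use_count();
}

// By const reference: no copy is made, so the owner count is unchanged.
long ownersWhenPassedByConstRef(
    const std::shared_ptr<PlacerBaseCommonSketch>& nbc)
{
  assert(nbc != nullptr);
  return nbc.use_count();
}
```

With a single owner at the call site, the first function reports 2 owners and the second reports 1. If a constructor stores the pointer in a member, taking it by value and moving it into the member is another common option.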

github-actions · 2025-02-14T05:40:29Z

src/gpl2/src/routeBase.cpp

+  nbVec_ = std::move(nbVec);
+}
+
+RouteBase::~RouteBase()


warning: use '= default' to define a trivial destructor [modernize-use-equals-default]

src/gpl2/src/routeBase.cpp:98:

- {
- }
+ = default;

github-actions · 2025-02-14T05:40:29Z

src/gpl2/src/timingBase.cpp

+}
+
+TimingBase::TimingBase(std::shared_ptr<PlacerBaseCommon> nbc,
+                             rsz::Resizer* rs,


warning: the parameter 'nbc' is copied for each invocation but only used as a const reference; consider making it a const reference [performance-unnecessary-value-param]

Suggested change

rsz::Resizer* rs,

TimingBase::TimingBase(const std::shared_ptr<PlacerBaseCommon>& nbc,

rsz::Resizer* rs,

maliberty

Partial review

maliberty · 2025-02-14T05:43:47Z

CMakeLists.txt

+if(CMAKE_CXX_COMPILER_ID STREQUAL "Clang" AND NOT CMAKE_CXX_COMPILER_VERSION VERSION_LESS "19")
+  link_libraries(stdc++)
+endif()


What necessitates this?

After bumping clang to version 19, it started defaulting to linking against libc++ in CUDA code. It only affected gpl2; there was a missing libstdc++ definition when linking only this specific module. Should I add a CUDA/GPL2 conditional there as well?

maliberty · 2025-02-14T05:45:12Z

etc/Build.sh

@@ -121,6 +122,9 @@ while [ "$#" -gt 0 ]; do
            echo "${1} requires an argument" >&2
            _help
            ;;
+        -use_gpl2)


Add a description to the usage message

maliberty · 2025-02-14T05:46:17Z

etc/DependencyInstaller.sh

@@ -76,6 +76,7 @@ _installCommonDev() {
    gtestChecksum="a1279c6fb5bf7d4a5e0d0b2a4adb39ac"
    bisonVersion=3.8.2
    bisonChecksum="1e541a097cda9eca675d29dd2832921f"
+    kokkosfftVersion="2c616d29a7ad0c390259efeb9224115bfa6910fd"


Why isn't this using a release tag and checksum?

It accomplishes the same goal in a single definition and eliminates the need to hash the directory. If it's preferred to keep the convention, I can change it.

It does not as there is no easy way to know what release (if any) we are using from a commit id.

Changed to a version tag. Other dependencies managed by git were not getting hashed either, as that would require tarring the directory.

maliberty · 2025-02-14T05:48:33Z

etc/DependencyInstaller.sh

+            # Older version of g++ is needed for compatibility with NVCC
+            ARGS_KOKKOSFFT+=" -DKokkos_ENABLE_CUDA=ON -DCMAKE_CXX_COMPILER=g++-10"


https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html suggests it works with many default compiler versions. Is this really necessary?

NVlabs/instant-ngp#119 presents the same issue as I have encountered. Ubuntu is using the supposedly supported gcc 11.4, but it does not compile Kokkos properly.

maliberty · 2025-02-14T05:49:23Z

etc/DependencyInstaller.sh

+            # Older version of g++ is needed for compatibility with NVCC
+            ARGS_KOKKOSFFT+=" -DKokkos_ENABLE_CUDA=ON -DCMAKE_CXX_COMPILER=g++-10"


Can you enable both openmp and cuda in one build?

Yes, it will default to CUDA if it's detected but both can be compiled in at once.

maliberty · 2025-02-14T05:59:27Z

src/gpl2/LICENSE

There is no need for a separate LICENSE file here.

maliberty · 2025-02-14T06:00:12Z

src/gpl2/include/gpl2/DgReplace.h

+  // The three main functions
+  void doInitialPlace();
+  int doNesterovPlace(int start_iter = 0);


I only see two

Fixed the comment. It has been there since the CUDA-native implementation, and there's no clear indication of what the third function would have been.

maliberty · 2025-02-14T06:01:17Z

src/gpl2/include/gpl2/DgReplace.h

+  // We should only have one placerBaseCommon, timingBase and routeBase
+  // But we need multiple placerBases to handle fences and multiple domains


This comment seems to have become separate from its context (move down 4 lines)

maliberty · 2025-02-14T06:02:44Z

src/gpl2/src/dct.cpp

Did you author this code?

It was originally written by @ZhiangWang033 (as can be seen in the commit history), then ported to Kokkos by the other committers (including me).

maliberty · 2025-02-14T06:03:46Z

src/gpl2/src/densityOp.h

+  int coreUx_;
+  int coreUy_;
+
+  // We need to store all the statictis information for each bin


typo: statictis

jbylicki · 2025-02-20T14:50:02Z

Currently, the mainline gpl runs the global_place call for 653.90s with the total run time being 729.30s

Signed-off-by: ZhiangWang033 <zhw033@ucsd.edu>

Co-authored-by: Kamil Rakoczy <krakoczy@antmicro.com> Co-authored-by: Jan Bylicki <jbylicki@antmicro.com> Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com> Signed-off-by: Kamil Rakoczy <krakoczy@antmicro.com> Signed-off-by: Jan Bylicki <jbylicki@antmicro.com>

Signed-off-by: Jan Bylicki <jbylicki@antmicro.com>

…ackends Signed-off-by: Jan Bylicki <jbylicki@antmicro.com>

Signed-off-by: Jan Bylicki <jbylicki@antmicro.com>

Removes interpretation of LayoutRight data as LayoutLeft. Fixes `input and output extents must be the same except for the transform axis` gpl2 error caused by an incomplete workaround against differing default 2d data layouts between CUDA and CPU. I considered alternative approaches (such as getting rid of LayoutRight specific code entirely) but they turned out to be disproportionately complex. Signed-off-by: Szymon Gizler <sgizler@antmicro.com>

Always calculate fft on host to avoid differing results between impls NOTE: This may have some performance repercussions. Signed-off-by: Szymon Gizler <sgizler@antmicro.com>

Device and host may use different implementations of math functions, giving different results, which is not desirable in OpenROAD. The fix relies on the (possibly wrong) assumption that the error of a double-precision built-in function is less than the precision of float. Signed-off-by: Szymon Gizler <sgizler@antmicro.com>

Replace non-deterministic parallel reduces with serial loops that give the same result regardless of platform. NOTE: This change results in serious performance degradation. Signed-off-by: Szymon Gizler <sgizler@antmicro.com>

computeWeightedHPWL() suffered from an implicit lossy (in the case of big numbers) conversion of int64_t to float. After fixing it, the summation can be made parallel without introducing inconsistencies between kokkos configurations. NOTE: computeHPWL() never suffered from this issue, but I had excessively deparallelized it before. In this fix, I added safeguards to both computeWeightedHPWL() and computeHPWL() for consistency. Signed-off-by: Szymon Gizler <sgizler@antmicro.com>
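The lossy conversion mentioned in this commit message is easy to reproduce: a `float` has a 24-bit significand, so not every `int64_t` above 2^24 survives the cast. A sketch in plain C++ (the helper names are hypothetical and this is not the actual gpl2 code), assuming per-net values that are integers:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// float has a 24-bit significand, so integers above 2^24 are rounded to the
// nearest representable value when converted.
float lossyToFloat(std::int64_t value)
{
  return static_cast<float>(value);
}

// Integer addition is associative, so any reduction order (serial or
// parallel) produces the same total. Convert to floating point only once,
// after the sum, if a float is needed at all.
std::int64_t sumHpwl(const std::vector<std::int64_t>& perNet)
{
  std::int64_t total = 0;
  for (std::int64_t v : perNet) {
    assert(v >= 0);
    total += v;
  }
  return total;
}
```

For example, 16777217 (2^24 + 1) becomes 16777216 after a round trip through `float`, whereas the integer sum keeps every unit.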

Replace dNetWith and dNetHeight with single dNetWidthPlusHeight The time improvement is small (it could be measurement error as well) Signed-off-by: Szymon Gizler <sgizler@antmicro.com>

Serial code is an order of magnitude slower to execute on GPU than on CPU. Signed-off-by: Szymon Gizler <sgizler@antmicro.com>

I'm a bit surprised, but this simple change reduced the time from 2m38s to 2m30s. Signed-off-by: Szymon Gizler <sgizler@antmicro.com>

Calculate individual distances in parallel, then sum them serially Signed-off-by: Szymon Gizler <sgizler@antmicro.com>

Before, we executed a serial reduction for X and Y separately. Now we have a parallel calculation of a view with abs(X)+abs(Y), and one serial reduction of it. Signed-off-by: Szymon Gizler <sgizler@antmicro.com>
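The pattern described in the last two commit messages (independent per-element work first, then a single reduction in a fixed order) can be sketched in plain C++ as follows. This is an illustration only, with a hypothetical function name and an ordinary loop standing in for the parallel step; it is not the Kokkos implementation:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

float sumAbsDistances(const std::vector<float>& dx,
                      const std::vector<float>& dy)
{
  assert(dx.size() == dy.size());

  // Step 1: element-wise work. Every entry is independent of the others, so
  // this loop could run in parallel (e.g. as a Kokkos::parallel_for) without
  // changing any individual value. A plain loop stands in for it here.
  std::vector<float> dist(dx.size());
  for (std::size_t i = 0; i < dx.size(); ++i) {
    dist[i] = std::fabs(dx[i]) + std::fabs(dy[i]);
  }

  // Step 2: one serial reduction in index order, so the rounding sequence is
  // the same on every backend.
  float total = 0.0f;
  for (float d : dist) {
    total += d;
  }
  return total;
}
```

The trade-off is that the reduction no longer scales with the number of cores, but only one pass over one buffer remains serial.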

Signed-off-by: Szymon Gizler <sgizler@antmicro.com>

NOTE: I used static vars for storing plans. While simple and convenient, this assumes that N and M won't change between calls (changing them would result in a runtime error). Signed-off-by: Szymon Gizler <sgizler@antmicro.com>
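One possible way to lift the fixed-size assumption noted above would be to key the stored plans by their dimensions. The sketch below is hypothetical: `Plan` is a placeholder struct, not a kokkos-fft type, and `PlanCache` is not part of this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Placeholder for whatever state an FFT plan holds for an M x N transform.
struct Plan
{
  int M;
  int N;
  std::vector<float> scratch;
};

// Creates a plan the first time a given (M, N) pair is requested and reuses
// it afterwards, so a change of dimensions builds a new plan instead of
// reusing a mismatched one.
class PlanCache
{
 public:
  const Plan& get(int M, int N)
  {
    assert(M > 0 && N > 0);
    const std::pair<int, int> key(M, N);
    auto it = plans_.find(key);
    if (it == plans_.end()) {
      const std::size_t cells = static_cast<std::size_t>(M) * N;
      it = plans_.emplace(key, Plan{M, N, std::vector<float>(cells)}).first;
    }
    return it->second;
  }

  std::size_t size() const { return plans_.size(); }

 private:
  std::map<std::pair<int, int>, Plan> plans_;
};
```

Repeated calls with the same dimensions return the same plan object; a new pair of dimensions adds one entry.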

github-actions bot reviewed Jul 8, 2024

maliberty marked this pull request as draft October 14, 2024 04:41

jbylicki force-pushed the convert-gpl2-kokkos branch 2 times, most recently from 04d428f to 925dd93 Compare January 7, 2025 13:14

jbylicki force-pushed the convert-gpl2-kokkos branch from 072e3b1 to 2dcac77 Compare January 10, 2025 17:19

jbylicki force-pushed the convert-gpl2-kokkos branch from 2dcac77 to 960ec72 Compare February 12, 2025 16:49

github-actions bot reviewed Feb 12, 2025

jbylicki force-pushed the convert-gpl2-kokkos branch from a1b101b to 1d136de on February 13, 2025 19:44

maliberty marked this pull request as ready for review February 14, 2025 05:34

github-actions bot reviewed Feb 14, 2025

maliberty requested changes Feb 14, 2025

jbylicki force-pushed the convert-gpl2-kokkos branch 5 times, most recently from 974d4c0 to e0f9b1f on February 20, 2025 13:20

ZhiangWang033 and others added 21 commits March 6, 2025 12:49

DGRePlAce

993cc70

Signed-off-by: ZhiangWang033 <zhw033@ucsd.edu>

Add Kokkos in-tree building and configuration

9a4fda5

Signed-off-by: Jan Bylicki <jbylicki@antmicro.com>

Moved Kokkos/Kokkos-fft to be primarily handled via DependencyInstaller

232a0d5

Signed-off-by: Jan Bylicki <jbylicki@antmicro.com>

gpl2: Utilize nested parallelism in computeBCPosNegKernel

013113f

Signed-off-by: Jan Bylicki <jbylicki@antmicro.com>

gpl2: Fix lint violations in implementation

b1dabea

Signed-off-by: Jan Bylicki <jbylicki@antmicro.com>

gpl2: Remove redundant LICENSE file

3736616

Signed-off-by: Jan Bylicki <jbylicki@antmicro.com>

gpl2: Make computeBCPosNegKernel deterministic on different compute backends

faa3ab9

Signed-off-by: Jan Bylicki <jbylicki@antmicro.com>

gpl2: Move unnecessary operations out of loop bodies

efa26c7

Signed-off-by: Jan Bylicki <jbylicki@antmicro.com>

Always calculate fft on host for consistency

cdabaa1

Always calculate the FFT on the host to avoid differing results between implementations. NOTE: This may have some performance repercussions. Signed-off-by: Szymon Gizler <sgizler@antmicro.com>

Replace non-deterministic parallel reduces with serial loops

04d5522

Replace non-deterministic parallel reduces with serial loops that give the same result regardless of platform. NOTE: This change results in serious performance degradation. Signed-off-by: Szymon Gizler <sgizler@antmicro.com>
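
As background for why a parallel reduce can differ between backends: floating-point addition is not associative, so the grouping of terms changes the rounded result. The sketch below is illustrative only. `serialSum` and `chunkedSum` are hypothetical names, and `chunkedSum` merely imitates one way a parallel reduce might split the work into partial sums; real backends may partition differently.

```cpp
#include <cstddef>
#include <vector>

// Fixed left-to-right order: the result is the same on every platform.
float serialSum(const std::vector<float>& v) {
  float sum = 0.0f;
  for (const float x : v) {
    sum += x;
  }
  return sum;
}

// Imitates a parallel reduce: each chunk is summed on its own, then the
// partial sums are combined. A different chunk count regroups the additions.
float chunkedSum(const std::vector<float>& v, std::size_t chunks) {
  const std::size_t n = v.size();
  float total = 0.0f;
  for (std::size_t c = 0; c < chunks; ++c) {
    const std::size_t begin = c * n / chunks;
    const std::size_t end = (c + 1) * n / chunks;
    float partial = 0.0f;
    for (std::size_t i = begin; i < end; ++i) {
      partial += v[i];
    }
    total += partial;
  }
  return total;
}
```

For the input {1e8f, 1.0f, -1e8f, 1.0f}, the serial loop returns 1 while the two-chunk version returns 0, because each 1.0f is absorbed when added to a value of magnitude 1e8.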

Replace two redundant views with single one

3f6ff18

Replace dNetWidth and dNetHeight with a single dNetWidthPlusHeight. The time improvement is small (it could be measurement error as well). Signed-off-by: Szymon Gizler <sgizler@antmicro.com>

Execute serial reduces on CPU rather than GPU

858497e

Serial code is an order of magnitude slower to execute on GPU than on CPU. Signed-off-by: Szymon Gizler <sgizler@antmicro.com>

Use fabs() rather than handmade ternary

da5c9cd

I'm a bit surprised, but this simple change reduced the time from 2m38s to 2m30s. Signed-off-by: Szymon Gizler <sgizler@antmicro.com>
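
For reference, a small sketch of the two forms. The exact ternary used in the PR is not shown here, so `absTernary` is an assumed shape of it, and both function names are hypothetical. The two are not strictly equivalent: the ternary keeps the sign of negative zero, while `std::fabs` clears it.

```cpp
#include <cmath>

// Assumed shape of a hand-written absolute value using a ternary.
inline float absTernary(float x) {
  return x < 0.0f ? -x : x;
}

// Library absolute value; always clears the sign bit.
inline float absLibrary(float x) {
  return std::fabs(x);
}
```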

Split getDistance to parallel and serial part

0646f14

Calculate individual distances in parallel, then sum them serially Signed-off-by: Szymon Gizler <sgizler@antmicro.com>

Sum X and Y gradients in one go

baa4b70

Before, we executed a serial reduction for X and Y separately. Now we have a parallel calculation of a view with abs(X)+abs(Y), and one serial reduction of it. Signed-off-by: Szymon Gizler <sgizler@antmicro.com>
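
The two-step pattern described in this and the previous commit can be sketched as follows. This is plain C++ standing in for the Kokkos code, with hypothetical names (`absSum`, `serialReduce`): the element-wise step has no cross-element dependency and so could run as a parallel kernel, while the single reduction keeps a fixed order for reproducibility.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Step 1 (data-parallel in principle): every output element depends only on
// the matching inputs. It is written as a plain loop here for brevity.
std::vector<float> absSum(const std::vector<float>& gradX,
                          const std::vector<float>& gradY) {
  assert(gradX.size() == gradY.size());
  std::vector<float> out(gradX.size());
  for (std::size_t i = 0; i < gradX.size(); ++i) {
    out[i] = std::fabs(gradX[i]) + std::fabs(gradY[i]);
  }
  return out;
}

// Step 2 (serial): one fixed-order reduction over the combined values,
// instead of one reduction for X and another for Y.
float serialReduce(const std::vector<float>& values) {
  float sum = 0.0f;
  for (const float v : values) {
    sum += v;
  }
  return sum;
}
```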

Make summations autovectorizable

76ec154

Signed-off-by: Szymon Gizler <sgizler@antmicro.com>

Reuse fft plan

76e5a69

NOTE: I used static vars for storing plans. While simple and convenient, it assumes that N and M won't change between calls (changing them would result in a runtime error). Signed-off-by: Szymon Gizler <sgizler@antmicro.com>
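
A rough sketch of the trade-off mentioned in this note, under stated assumptions: `FftPlan`, `getPlan`, and `plansBuilt` are hypothetical and do not correspond to the FFT library's API or to the PR's code. A function-local static is built once on the first call and reused afterwards, so a later call with different dimensions has to be rejected explicitly.

```cpp
#include <stdexcept>

// Counts how many plans were constructed (for demonstration only).
int plansBuilt = 0;

// Hypothetical plan type standing in for an expensive-to-build FFT plan.
struct FftPlan {
  int m;
  int n;
  FftPlan(int rows, int cols) : m(rows), n(cols) { ++plansBuilt; }
};

// Returns the cached plan. The static is initialized only on the first call,
// so the dimensions of that call are baked in for the rest of the run.
const FftPlan& getPlan(int m, int n) {
  static const FftPlan plan(m, n);
  if (plan.m != m || plan.n != n) {
    throw std::invalid_argument("FFT plan was created for different M/N");
  }
  return plan;
}
```

If M and N can legitimately change between calls, a cache keyed on the dimensions would be an alternative to a single static.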

sgizler force-pushed the convert-gpl2-kokkos branch from d841e44 to 76e5a69 on March 6, 2025 12:15

	binSizeX_ = ceil(static_cast<float>((ux_ - lx_)) / binCntX_);
	binSizeX_ = std::ceil(static_cast<float>((ux_ - lx_)) / binCntX_);

	binSizeY_ = ceil(static_cast<float>((uy_ - ly_)) / binCntY_);
	binSizeY_ = std::ceil(static_cast<float>((uy_ - ly_)) / binCntY_);

	+ static_cast<int64_t>(round(macroInstsArea_ * targetDensity_));
	+ static_cast<int64_t>(std::round(macroInstsArea_ * targetDensity_));

	void Instance::dbSetPlacementStatus(odb::dbPlacementStatus ps)
	void Instance::dbSetPlacementStatus(const odb::dbPlacementStatus& ps)

	void solvePoissonPotential(const Kokkos::View<float> binDensity, Kokkos::View<float> potential);
	void solvePoissonPotential(Kokkos::View<float> binDensity, Kokkos::View<float> potential);

	: rbVars_(), db_(nullptr), grouter_(nullptr), nbc_(nullptr), log_(nullptr)
	: db_(nullptr), grouter_(nullptr), nbc_(nullptr), log_(nullptr)

	std::shared_ptr<PlacerBaseCommon> nbc,
	const std::shared_ptr<PlacerBaseCommon>& nbc,
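
For context on this suggestion, the difference between the two signatures can be observed through the reference count. This is a generic sketch with hypothetical function names, not code from the PR: passing by value copies the `std::shared_ptr` and bumps the count for the duration of the call, while passing by const reference does not.

```cpp
#include <memory>

// By value: the parameter is a copy, so the count is incremented on entry
// and decremented on exit.
long useCountByValue(std::shared_ptr<int> p) {
  return p.use_count();
}

// By const reference: no copy is made and the count is unchanged.
long useCountByConstRef(const std::shared_ptr<int>& p) {
  return p.use_count();
}
```

If the callee stores the pointer in a member, a copy still happens at that point; the const reference only avoids the extra one at the call boundary.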

	rsz::Resizer* rs,
	TimingBase::TimingBase(const std::shared_ptr<PlacerBaseCommon>& nbc,
	rsz::Resizer* rs,

		# Older version of g++ is needed for compatibility with NVCC
		ARGS_KOKKOSFFT+=" -DKokkos_ENABLE_CUDA=ON -DCMAKE_CXX_COMPILER=g++-10"

		// We should only have one placerBaseCommon, timingBase and routeBase
		// But we need multiple placerBases to handle fences and multiple domains
