
Causes of vastly different ICP runtimes #596 (Open)

kam3k opened this issue Jan 8, 2025 · 19 comments

kam3k commented Jan 8, 2025

What would cause the ICP runtime to vary significantly, despite very similar inputs?

More specifically, I am running scan matching on a sequence of scans from the same 2D lidar. There is < 20 cm of displacement and < 2 deg of rotation between the two point clouds, and I provide ICP with an initial guess from odometry. In other words, the reading and reference clouds begin nearly aligned, and ICP is just making a small adjustment.

My data points filters (applied to both reading and reference) are:

  1. SimpleSensorNoiseDataPointsFilter
  2. ObservationDirectionDataPointsFilter
  3. SurfaceNormalDataPointsFilter
  4. OrientNormalsDataPointsFilter

The matcher is KDTreeMatcher, the outlier filters are TrimmedDistOutlierFilter and SurfaceNormalOutlierFilter, the error minimizer is PointToPlaneWithCovErrorMinimizer, the transformation checkers are DifferentialTransformationChecker and CounterTransformationChecker, and a RigidTransformation transformation is specified.
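
For reference, here is a rough sketch of how a chain like this can be assembled programmatically (in the style of the icp_customized example; the parameter values below are placeholders rather than the exact settings used):

#include "pointmatcher/PointMatcher.h"

typedef PointMatcher<double> PM;

// Assemble an ICP chain matching the configuration listed above.
// Parameter values are illustrative placeholders.
void configureIcp(PM::ICP& icp)
{
    // Data point filters, applied to both the reading and reference clouds
    const char* filterNames[] = {
        "SimpleSensorNoiseDataPointsFilter",
        "ObservationDirectionDataPointsFilter",
        "SurfaceNormalDataPointsFilter",
        "OrientNormalsDataPointsFilter"};
    for (const char* name : filterNames)
    {
        icp.readingDataPointsFilters.push_back(PM::get().DataPointsFilterRegistrar.create(name));
        icp.referenceDataPointsFilters.push_back(PM::get().DataPointsFilterRegistrar.create(name));
    }

    // Matcher and outlier filters
    icp.matcher = PM::get().MatcherRegistrar.create("KDTreeMatcher", {{"knn", "1"}});
    icp.outlierFilters.push_back(
        PM::get().OutlierFilterRegistrar.create("TrimmedDistOutlierFilter", {{"ratio", "0.85"}}));
    icp.outlierFilters.push_back(
        PM::get().OutlierFilterRegistrar.create("SurfaceNormalOutlierFilter"));

    // Error minimizer, transformation checkers, transformation model
    icp.errorMinimizer = PM::get().ErrorMinimizerRegistrar.create("PointToPlaneWithCovErrorMinimizer");
    icp.transformationCheckers.push_back(
        PM::get().TransformationCheckerRegistrar.create(
            "CounterTransformationChecker", {{"maxIterationCount", "40"}}));
    icp.transformationCheckers.push_back(
        PM::get().TransformationCheckerRegistrar.create("DifferentialTransformationChecker"));
    icp.transformations.push_back(PM::get().TransformationRegistrar.create("RigidTransformation"));
    icp.inspector = PM::get().InspectorRegistrar.create("NullInspector");
}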

Both the reading and reference clouds have roughly 1000 points each (these could be downsampled further if needed).

When I run ICP with this setup, the alignment time varies significantly, fluctuating between ~8 ms and ~300 ms. My lidar data arrives at 25 Hz, which means this could run online if ICP took < 40 ms (which it does sometimes).

I was wondering: what is the likely cause of these big fluctuations in runtime?

I took a look at the logs; the pre-processing times vary too, but not as much as ICP itself (which is sometimes ~50x slower than other times).

As an aside, the log reports the pre-processing times of the clouds and the time taken by ICP, but the times don't make sense. For example, it will say:

PointMatcher::icp - reference pre-processing took 0.112477 [s]
PointMatcher::icp - reading pre-processing took 0.133762 [s]
PointMatcher::icp - 4 iterations took 0.905763 [s]

which, summed up, comes to more than a second. But measuring how long the call to the ICP function took (with std::chrono::steady_clock) gives a much smaller number; in this case it was 120 ms.
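
For reference, the external measurement is done roughly like this (a minimal sketch; the helper and variable names are made up):

#include <chrono>
#include <iostream>

#include "pointmatcher/PointMatcher.h"

typedef PointMatcher<double> PM;

// Time a single icp() call with a monotonic clock, independently of the
// library's own logging. T_init is the initial guess from odometry.
PM::TransformationParameters alignTimed(PM::ICP& icp,
                                        const PM::DataPoints& reading,
                                        const PM::DataPoints& reference,
                                        const PM::TransformationParameters& T_init)
{
    const auto start = std::chrono::steady_clock::now();
    const PM::TransformationParameters T = icp(reading, reference, T_init);
    const auto end = std::chrono::steady_clock::now();

    const double ms = std::chrono::duration<double, std::milli>(end - start).count();
    std::cout << "icp() call took " << ms << " ms\n";
    return T;
}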

At the end of the day, my goal is to have ICP run quickly (< 40 ms) and, perhaps more importantly, consistently. Are there data filters and/or parts of the pipeline that can be configured, added, or removed to help achieve a consistently fast runtime?

Any help would be greatly appreciated. Thank you.

boxanm (Collaborator) commented Jan 8, 2025

Hi @kam3k !

While it is possible that the runtimes oscillate, the processing times you are reporting are oddly long. As a reference, we normally process point clouds two orders of magnitude larger while still being real-time (although at 10 Hz). Are you running the code on some edge hardware or a standard computer? Is it natively installed or a containerized application?
Also, an obligatory question: was the library built in Release mode (cmake -DCMAKE_BUILD_TYPE=Release)?

kam3k (Author) commented Jan 8, 2025

@boxanm Thanks for the reply.

This is running natively (not containerized) on a developer laptop. I built both libpointmatcher (commit 08729a, September 2024) and libnabo (commit cd95c8b, May 2024) from source, both with cmake -DCMAKE_BUILD_TYPE=Release. I just verified this.

I'd like to add that I did some more testing which shows that the runtime varies significantly despite identical inputs and an identical setup. That is, just calling ICP in a loop and measuring the runtime produces very different results (e.g., 20 ms vs. 200 ms).

But your response is very encouraging, suggesting that there is a solution somewhere that will easily make this runnable in real time.

boxanm (Collaborator) commented Jan 8, 2025

Can you please also report the output of listVersionsUbuntu.sh?
#585 also mentions inconsistencies in time measured through libpointmatcher's logging.
Other than that, can you test your code with a different minimizer than the PointToPlaneWithCovErrorMinimizer, preferably the PointToPoint? Since your use case is 2-dimensional, some instabilities might be related to the planes and covariance calculations.
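
For example, swapping just the minimizer looks something like this (a minimal sketch; it assumes the rest of the chain is already set up, and "PointToPointErrorMinimizer" is the registered name of the point-to-point minimizer):

#include "pointmatcher/PointMatcher.h"

typedef PointMatcher<double> PM;

// Swap only the error minimizer and keep the rest of the chain unchanged.
void usePointToPointMinimizer(PM::ICP& icp)
{
    icp.errorMinimizer =
        PM::get().ErrorMinimizerRegistrar.create("PointToPointErrorMinimizer");
}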

kam3k (Author) commented Jan 8, 2025

@boxanm

Name            | Version
----------------|-------------------------------
ubuntu:         | Description: Ubuntu 20.04.6 LTS
architecture:   | 64-bit
gcc:            | gcc (Ubuntu 10.5.0-1ubuntu1~20.04) 10.5.0
git:            | git version 2.47.1
cmake:          | cmake version 3.16.3 CMake suite maintained and supported by Kitware (kitware.com/cmake).
boost:          | Version: 1.71.0.0ubuntu2
eigen3:         | Version: 3.3.7-2
doxygen:        | Version: 1.8.17-0ubuntu2

No change in behaviour when using PointToPoint.

One thing I'm doing "differently" is that the configuration is loaded into the ICP object each time, but not from file (it's done the same way as in this example). But that appears to take < 1 ms, and the variations in timing I'm talking about are specifically in the call to PointMatcher<double>::ICP::operator()(...). The reason I bring this up is that I'm wondering whether some configuration loading happens the first time you call ICP, which is then cached so that subsequent calls are much faster. Since I'm "setting up" the ICP configuration every time, could that be the problem (i.e., every call clears this cache and needs to rebuild it)?
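
To make the alternative concrete, configuring once and reusing the same ICP object would look something like the following sketch (the class and method names are made up):

#include "pointmatcher/PointMatcher.h"

typedef PointMatcher<double> PM;

// Configure the ICP chain once, then reuse it for every scan pair.
class ScanMatcher
{
public:
    ScanMatcher()
    {
        icp_.setDefault();  // or loadFromYaml(), or manual setup as in icp_customized
    }

    PM::TransformationParameters align(const PM::DataPoints& reading,
                                       const PM::DataPoints& reference,
                                       const PM::TransformationParameters& T_init)
    {
        return icp_(reading, reference, T_init);  // only the registration runs per call
    }

private:
    PM::ICP icp_;
};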

kam3k (Author) commented Jan 8, 2025

I've just run some tests where the configuration isn't done every time, and it had no effect. It sounds like there is something going on with the build. Perhaps one of the dependencies is built in debug? But none of them (Boost, Eigen) are built from source; they're just installed via apt.

kam3k (Author) commented Jan 8, 2025

For another data point, I completely uninstalled libpointmatcher and libnabo, installed them instead via apt (ros-noetic-libpointmatcher), and I have the same problem.

boxanm (Collaborator) commented Jan 9, 2025

@kam3k I would suggest removing all nonessential filters, both input and outlier filters. Then, switch to the PointToPoint minimizer to maximally simplify the setup. From the logs you reported, it seems that most of the time is taken by the ICP loop itself. I suspect the fluctuating runtimes come mostly from a varying number of iterations. What value is your CounterTransformationChecker set to? Do the registration results you are getting make sense? Are the scans well aligned? Running a profiler and identifying the function call that takes the most time might be our best option here.

@nicolaslauzon did you experience anything similar when testing libpointmatcher for the f1tenth?

nicolaslauzon (Contributor) commented:

That would be @aguenette who tested libpointmatcher on the f1tenth.

kam3k (Author) commented Jan 9, 2025

@boxanm

To really try and narrow things down I've just used the default configuration (i.e., icp.setDefault()) to see what would happen. And then I called ICP on the exact same inputs in a loop (10 times). I did this for a number of inputs (i.e., providing two clouds, calling ICP ten times, then doing the same with subsequent pairs of clouds).

What happened was sometimes all ten calls would be very fast. But about 40% of the time, just the first call would be slow, and the remaining nine would be fast.

For example, most of the time things would look good for the ten calls:

PointMatcher::icp - 4 iterations took 0.000476553 [s]
PointMatcher::icp - 4 iterations took 0.000475422 [s]
PointMatcher::icp - 4 iterations took 0.00350221 [s]
PointMatcher::icp - 4 iterations took 0.000376781 [s]
PointMatcher::icp - 4 iterations took 0.00106319 [s]
PointMatcher::icp - 4 iterations took 0.000383776 [s]
PointMatcher::icp - 4 iterations took 0.029914 [s]
PointMatcher::icp - 4 iterations took 0.00141966 [s]
PointMatcher::icp - 4 iterations took 0.000381071 [s]
PointMatcher::icp - 4 iterations took 0.000371113 [s]

But very often, just the first call would be substantially slower (even though the number of iterations didn't change):

PointMatcher::icp - 4 iterations took 0.656374 [s]
PointMatcher::icp - 4 iterations took 0.000944552 [s]
PointMatcher::icp - 4 iterations took 0.00347905 [s]
PointMatcher::icp - 4 iterations took 0.000404512 [s]
PointMatcher::icp - 4 iterations took 0.0014445 [s]
PointMatcher::icp - 4 iterations took 0.0296904 [s]
PointMatcher::icp - 4 iterations took 0.0013557 [s]
PointMatcher::icp - 4 iterations took 0.0228575 [s]
PointMatcher::icp - 5 iterations took 0.00135517 [s]
PointMatcher::icp - 4 iterations took 0.00439387 [s]

Note that it was always the first call that was slower. So when I remove the loop (and only call ICP once per cloud pair), it's always "the first call" and therefore I get sporadic runtimes.
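
For completeness, the loop test described above was essentially the following (a minimal sketch; the file names are placeholders and no initial guess is passed):

#include "pointmatcher/PointMatcher.h"

#include <chrono>
#include <iostream>

typedef PointMatcher<double> PM;

// Register the same pair of clouds ten times with the default chain and
// print the wall-clock time of each call. File names are placeholders.
int main()
{
    const PM::DataPoints reference(PM::DataPoints::load("reference.csv"));
    const PM::DataPoints reading(PM::DataPoints::load("reading.csv"));

    PM::ICP icp;
    icp.setDefault();  // default configuration, set up once

    for (int i = 0; i < 10; ++i)
    {
        const auto start = std::chrono::steady_clock::now();
        icp(reading, reference);  // identical inputs on every call
        const auto end = std::chrono::steady_clock::now();
        std::cout << std::chrono::duration<double>(end - start).count() << " [s]\n";
    }
    return 0;
}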

boxanm (Collaborator) commented Jan 9, 2025

Alright, that's interesting! Seeing values in the range of milliseconds is the expected result. I'll run a similar test on my side, but probably not before the end of the week.

kam3k (Author) commented Jan 10, 2025

Here's a test that anyone can run using the examples that come with libpointmatcher. After building, go inside the directory build/examples and run

for run in {1..10}; do ./icp_customized ../../examples/data/2D_twoBoxes.csv ../../examples/data/2D_oneBox.csv; done

which will run the icp_customized example ten times, outputting results each time. When I do this, I get a wide variety of reported runtimes, most being small (e.g., 0.002 s) but several that are orders of magnitude slower (e.g., 0.16 s, 0.07 s, 0.39 s).

For example,

> for run in {1..10}; do ./icp_customized ../../examples/data/2D_twoBoxes.csv ../../examples/data/2D_oneBox.csv; done | grep "iterations took"

PointMatcher::icp - 6 iterations took 0.122271 [s]
PointMatcher::icp - 5 iterations took 0.0616906 [s]
PointMatcher::icp - 7 iterations took 0.00202791 [s]
PointMatcher::icp - 7 iterations took 0.00122573 [s]
PointMatcher::icp - 7 iterations took 0.267806 [s]
PointMatcher::icp - 7 iterations took 0.00148203 [s]
PointMatcher::icp - 6 iterations took 0.00135091 [s]
PointMatcher::icp - 8 iterations took 0.00196917 [s]
PointMatcher::icp - 10 iterations took 0.00152176 [s]
PointMatcher::icp - 8 iterations took 0.00131205 [s]

boxanm (Collaborator) commented Jan 11, 2025

Here are the results I get for icp_customized on the 2D_twoBoxes data. I ran your command with 10k iterations, and while the execution times differ, it is still less than a 10x increase in the worst case. I'll try the same tests on Ubuntu 20, but it appears that, at least on macOS with an M2 chip, the issues you are reporting do not happen.
[plot: icp_iter_time_analysis]

boxanm (Collaborator) commented Jan 12, 2025

@kam3k Can you please also confirm that you are seeing the same issue when using non-POSIX timers? See the recent PR: #598

kam3k (Author) commented Jan 13, 2025

@boxanm Yes, I can confirm I was seeing the same issue you're fixing in #598, and after using the new flag that issue is fixed. Thanks for that; the durations in the log make much more sense when using your new branch.

Here is my plot (1000 samples) for the same dataset. The mean was 0.0053 s with a standard deviation of 0.0228 s. My median is still slower than your slowest time (0.00042 s), but of the same order of magnitude. I don't know what to make of these massive and frequently occurring outliers.
[plot: 2025-01-13_09-52]

kam3k (Author) commented Jan 13, 2025

I'm looking around for anything else I can think of that could be the cause. One is the USE_OPEN_MP flag in CMake. I've noticed no real change whether I include it or not, but interestingly, when I do include it, i.e.,

> cmake -DCMAKE_BUILD_TYPE=Release -DUSE_OPEN_MP=On -DDISABLE_POSIX_TIMERS=On ..
-- Found Boost: /usr/lib/x86_64-linux-gnu/cmake/Boost-1.71.0/BoostConfig.cmake (found version "1.71.0") found components: thread system program_options date_time
-- Found Boost: /usr/lib/x86_64-linux-gnu/cmake/Boost-1.71.0/BoostConfig.cmake (found version "1.71.0") found components: thread system program_options date_time chrono
-- libnabo found, version 1.1.2 (Config mode,  libs=libnabo::nabo)
-- Found OpenMP_CXX: -fopenmp
-- Found OpenMP: TRUE
-- Looking for yaml-cpp on system...
-- OpenMP found, parallel computer enabled
-- /usr/lib/gcc/x86_64-linux-gnu/10/libgomp.so;/usr/lib/x86_64-linux-gnu/libpthread.so
-- API Documentation (doxygen): disabled
-- Configuring done
-- Generating done

I still get these warnings during compilation:

warning: ignoring ‘#pragma omp parallel’ [-Wunknown-pragmas]

boxanm (Collaborator) commented Jan 13, 2025

I ran the same script on Ubuntu 20 and Ubuntu 22 devices (both x86), and while I'm also seeing some peaks, it looks like the issue only arises on Ubuntu 20.
[plot: ubuntu20]
[plot: ubuntu22]
Thanks for bringing the OpenMP issue to my attention, I'll take a look.

boxanm (Collaborator) commented Jan 14, 2025

@kam3k As for OpenMP, I think CMake has a problem finding the full OpenMP installation. Here's what the output looks like with USE_OPEN_MP=ON on my Ubuntu 20:

-- Found Boost: /usr/lib/x86_64-linux-gnu/cmake/Boost-1.71.0/BoostConfig.cmake (found version "1.71.0") found components: thread system program_options date_time 
-- Found Boost: /usr/lib/x86_64-linux-gnu/cmake/Boost-1.71.0/BoostConfig.cmake (found version "1.71.0") found components: thread system program_options date_time chrono 
-- libnabo found, version 1.1.2 (Config mode,  libs=libnabo::nabo)
-- Found OpenMP_CXX: -fopenmp (found version "4.5") 
-- Found OpenMP: TRUE (found version "4.5")  
-- Looking for yaml-cpp on system...
-- OpenMP found, parallel computer enabled
-- /usr/lib/gcc/x86_64-linux-gnu/9/libgomp.so;/usr/lib/x86_64-linux-gnu/libpthread.so 
-- API Documentation (doxygen): disabled
-- Configuring done
-- Generating done
-- Build files have been written to: /home/norlab/Libraries/libpointmatcher/build

kam3k (Author) commented Jan 14, 2025

I did some profiling with Intel VTune and noticed a bunch of time spent in the KD-tree, so I turned my attention to libnabo as the possible source of the issue. To test this, I modified examples/trivial.cpp in libnabo to increase the number of points and added timing around the query (i.e., the timing does not include creation of the KD-tree). Please also note that a random seed is not being used, so the matrix M and vector q are the same every time this program is run. The new trivial.cpp looks like:

#include "nabo/nabo.h"

#include <chrono>
#include <iostream>

int main()
{
	using namespace Nabo;
	using namespace Eigen;
	
	// 2000 points in 3D
	MatrixXf M = MatrixXf::Random(3, 2000);
	// 1 query point
	VectorXf q = VectorXf::Random(3);
	
	// create a kd-tree for M, note that M must stay valid during the lifetime of the kd-tree
	NNSearchF* nns = NNSearchF::createKDTreeLinearHeap(M);

	const auto start = std::chrono::steady_clock::now();
	
	// look for the 5 nearest neighbours of the single-point query
	const int K = 5;
	VectorXi indices(K);
	VectorXf dists2(K);
	nns->knn(q, indices, dists2, K);

	const auto end = std::chrono::steady_clock::now();
	const std::chrono::duration<double> elapsed(end - start);
	std::cout << elapsed.count() << "\n";
	
	// cleanup kd-tree
	delete nns;

	return 0;
}

Then I ran this 1000 times, recording the duration of each run to a file:

> for run in {1..1000}; do ./examples/trivial; done > times.txt

And the results are very similar to what we saw when running ICP:
[plot: 2025-01-14_12-41]

kam3k (Author) commented Jan 14, 2025

Issue found!

I started profiling the above libnabo program and found that the CPU time taken by OpenMP threads could vary pretty significantly, so I started looking into that and found this: https://stackoverflow.com/a/70134444

Setting the environment variable OMP_WAIT_POLICY=passive before running the ICP example program (box2D) appears to have (mostly) fixed the problem; there are still some outliers, but it is much better than before:

[plot: 2025-01-14_13-08]

Setting this environment variable changes GOMP_SPINCOUNT from '300000' to '0'.

But right now I don't know the consequences of setting GOMP_SPINCOUNT and whether or not it should be explicitly set in libnabo.
