diff --git a/BENCHMARK.md b/BENCHMARK.md
index 78e70d5..452c2be 100644
--- a/BENCHMARK.md
+++ b/BENCHMARK.md
@@ -48,7 +48,7 @@ All benchmarks were conducted on the KITTI 00 sequence.
 
 ```bash
 cd small_gicp/scripts
-./run_downsampling_benchmark.sh
+./run_downsampling_benchmark.sh /path/to/kitti/velodyne
 python3 plot_downsampling.py
 ```
 
@@ -67,12 +67,13 @@ python3 plot_downsampling.py
 
 ```bash
 cd small_gicp/scripts
-./run_kdtree_benchmark.sh
+./run_kdtree_benchmark.sh /path/to/kitti/velodyne
 python3 plot_kdtree.py
 ```
 
 - Multi-threaded implementation (TBB and OMP) can be up to **4x faster** than the single-threaded one (All the implementations are based on nanoflann).
-- The processing speed gets faster as the number of threads increases, but the speed gain is not monotonic sometimes (because of the scheduling algorithm or some CPU(AMD 5995WX)-specific issues?).
+- ~~The processing speed gets faster as the number of threads increases, but the speed gain is not monotonic sometimes (because of the scheduling algorithm or some CPU(AMD 5995WX)-specific issues?)~~.
+- The new KdTree implementation scales well with the number of threads thanks to its well-balanced task assignment.
 - This benchmark only compares the construction time (query time is not included). 
 
 ![kdtree_time](docs/assets/kdtree_time.png)
@@ -81,7 +82,7 @@ python3 plot_kdtree.py
 
 ```bash
 cd small_gicp/scripts
-./run_odometry_benchmark.sh
+./run_odometry_benchmark.sh /path/to/kitti/velodyne
 python3 plot_odometry.py
 ```
 
diff --git a/README.md b/README.md
index 5cd3eb0..4e2c058 100644
--- a/README.md
+++ b/README.md
@@ -2,11 +2,11 @@
 
 **small_gicp** is a header-only C++ library that offers efficient and parallelized algorithms for fine point cloud registration (ICP, Point-to-Plane ICP, GICP, VGICP, etc.). It is a refined and optimized version of its predecessor, [fast_gicp](https://github.com/SMRT-AIST/fast_gicp), re-written from scratch with the following features.
 
-- **Highly Optimized** : The implementation of the core registration algorithm is further optimized from that in fast_gicp. It enables up to **2x speed gain** compared to fast_gicp.
-- **All parallerized** : small_gicp offers parallelized implementations of several preprocessing algorithms to make the entire registration process parallelized (Downsampling, KdTree construction, Normal/covariance estimation). As a parallelism backend, either (or both) [OpenMP](https://www.openmp.org/) and [Intel TBB](https://github.com/oneapi-src/oneTBB) can be used. 
-- **Minimum dependency** : Only [Eigen](https://eigen.tuxfamily.org/) (and bundled [nanoflann](https://github.com/jlblancoc/nanoflann) and [Sophus](https://github.com/strasdat/Sophus)) are required at a minimum. Optionally, it provides the [PCL](https://pointclouds.org/) registration interface so that it can be used as a drop-in replacement in many systems.
+- **Highly Optimized** : The implementation of the core registration algorithm is further optimized from that in fast_gicp. It enables up to **2x speed gain**.
+- **All parallelized** : small_gicp offers parallel implementations of several preprocessing algorithms (downsampling, KdTree construction, normal/covariance estimation) so that the entire registration process is parallelized. Either or both of [OpenMP](https://www.openmp.org/) and [Intel TBB](https://github.com/oneapi-src/oneTBB) can be used as the parallelism backend.
+- **Minimum dependency** : Only [Eigen](https://eigen.tuxfamily.org/) (and bundled [nanoflann](https://github.com/jlblancoc/nanoflann) and [Sophus](https://github.com/strasdat/Sophus)) are required at a minimum. Optionally, it provides the [PCL](https://pointclouds.org/) registration interface so that it can be used as a drop-in replacement.
 - **Customizable** : small_gicp allows feeding any custom point cloud class to the registration algorithm via traits. Furthermore, the template-based implementation enables customizing the registration process with your original correspondence estimator and registration factors.
-- **Python bindings** : The isolation from PCL makes small_gicp's python bindings more portable and connectable to other libraries (e.g., Open3D) without problems. 
+- **Python bindings** : The isolation from PCL makes small_gicp's Python bindings more portable and easier to use with other libraries (e.g., Open3D).
 
 Note that GPU-based implementations are NOT included in this package.
 
@@ -22,7 +22,7 @@ This library uses some C++17 features. The PCL interface is not compatible with
 ## Dependencies
 
 - [Mandatory] [Eigen](https://eigen.tuxfamily.org/), [nanoflann](https://github.com/jlblancoc/nanoflann) ([bundled](include/small_gicp/ann/kdtree.hpp)), [Sophus](https://github.com/strasdat/Sophus) ([bundled](include/small_gicp/util/lie.hpp))
-- [Optional] [OpenMP](https://www.openmp.org/), [Intel TBB](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onetbb.html), [PCL](https://pointclouds.org/), [Iridescence](https://github.com/koide3/iridescence)
+- [Optional] [OpenMP](https://www.openmp.org/), [Intel TBB](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onetbb.html), [PCL](https://pointclouds.org/)
 
 ## Installation
 
@@ -344,6 +344,11 @@ open3d.visualization.draw_geometries([target_o3d, source_o3d])
 
 </details>
 
+
+### Cookbook
+
+- [Scan-to-scan and scan-to-model GICP matching odometry on KITTI](src/example/kitti_odometry.py)
+
 ## [Benchmark](BENCHMARK.md)
 
 Processing speed comparison between small_gicp and Open3D ([youtube]((https://youtu.be/LNESzGXPr4c?feature=shared))).
@@ -360,8 +365,9 @@ Processing speed comparison between small_gicp and Open3D ([youtube]((https://yo
 
 ### KdTree construction
 
-- Multi-threaded implementation (TBB and OMP) can be up to **4x faster** than the single-threaded one (All the implementations are based on nanoflann).
-- The processing speed gets faster as the number of threads increases, but the speed gain is not monotonic sometimes (because of the scheduling algorithm or some CPU(AMD 5995WX)-specific issues?).
+- The multi-threaded implementations (TBB and OMP) can be up to **6x faster** than the single-threaded one. The single-threaded version performs almost on par with nanoflann.
+- ~~The processing speed gets faster as the number of threads increases, but the speed gain is not monotonic sometimes (because of the scheduling algorithm or some CPU(AMD 5995WX)-specific issues?)~~.
+- The new KdTree implementation scales well with the number of threads thanks to its well-balanced task assignment.
 - This benchmark only compares the construction time (query time is not included). 
 
 ![kdtree_time](docs/assets/kdtree_time.png)
diff --git a/docs/assets/kdtree_time.png b/docs/assets/kdtree_time.png
index aa16c94..2f3b532 100644
Binary files a/docs/assets/kdtree_time.png and b/docs/assets/kdtree_time.png differ
diff --git a/include/small_gicp/ann/kdtree.hpp b/include/small_gicp/ann/kdtree.hpp
index df0240f..2141f70 100644
--- a/include/small_gicp/ann/kdtree.hpp
+++ b/include/small_gicp/ann/kdtree.hpp
@@ -96,11 +96,11 @@ struct KdTreeBuilder {
   NodeIndexType create_node(KdTree& kdtree, size_t& node_count, const PointCloud& points, IndexConstIterator global_first, IndexConstIterator first, IndexConstIterator last)
     const {
     const size_t N = std::distance(first, last);
+    const NodeIndexType node_index = node_count++;
+    auto& node = kdtree.nodes[node_index];
+
     // Create a leaf node.
     if (N <= max_leaf_size) {
-      const NodeIndexType node_index = node_count++;
-      auto& node = kdtree.nodes[node_index];
-
       // std::sort(first, last);
       node.node_type.lr.first = std::distance(global_first, first);
       node.node_type.lr.last = std::distance(global_first, last);
@@ -115,8 +115,6 @@ struct KdTreeBuilder {
     std::nth_element(first, median_itr, last, [&](size_t i, size_t j) { return proj(traits::point(points, i)) < proj(traits::point(points, j)); });
 
     // Create a non-leaf node.
-    const NodeIndexType node_index = node_count++;
-    auto& node = kdtree.nodes[node_index];
     node.node_type.sub.proj = proj;
     node.node_type.sub.thresh = proj(traits::point(points, *median_itr));
 
diff --git a/include/small_gicp/ann/kdtree_omp.hpp b/include/small_gicp/ann/kdtree_omp.hpp
index e1e8d96..2968a5e 100644
--- a/include/small_gicp/ann/kdtree_omp.hpp
+++ b/include/small_gicp/ann/kdtree_omp.hpp
@@ -55,11 +55,11 @@ struct KdTreeBuilderOMP {
     IndexConstIterator first,
     IndexConstIterator last) const {
     const size_t N = std::distance(first, last);
+    const NodeIndexType node_index = node_count++;
+    auto& node = kdtree.nodes[node_index];
+
     // Create a leaf node.
     if (N <= max_leaf_size) {
-      const NodeIndexType node_index = node_count++;
-      auto& node = kdtree.nodes[node_index];
-
       // std::sort(first, last);
       node.node_type.lr.first = std::distance(global_first, first);
       node.node_type.lr.last = std::distance(global_first, last);
@@ -74,8 +74,6 @@ struct KdTreeBuilderOMP {
     std::nth_element(first, median_itr, last, [&](size_t i, size_t j) { return proj(traits::point(points, i)) < proj(traits::point(points, j)); });
 
     // Create a non-leaf node.
-    const NodeIndexType node_index = node_count++;
-    auto& node = kdtree.nodes[node_index];
     node.node_type.sub.proj = proj;
     node.node_type.sub.thresh = proj(traits::point(points, *median_itr));
 
diff --git a/include/small_gicp/ann/kdtree_tbb.hpp b/include/small_gicp/ann/kdtree_tbb.hpp
index d6a71a7..890fbb8 100644
--- a/include/small_gicp/ann/kdtree_tbb.hpp
+++ b/include/small_gicp/ann/kdtree_tbb.hpp
@@ -37,11 +37,11 @@ struct KdTreeBuilderTBB {
     IndexConstIterator first,
     IndexConstIterator last) const {
     const size_t N = std::distance(first, last);
+    const NodeIndexType node_index = node_count++;
+    auto& node = kdtree.nodes[node_index];
+
     // Create a leaf node.
     if (N <= max_leaf_size) {
-      const NodeIndexType node_index = node_count++;
-      auto& node = kdtree.nodes[node_index];
-
       // std::sort(first, last);
       node.node_type.lr.first = std::distance(global_first, first);
       node.node_type.lr.last = std::distance(global_first, last);
@@ -56,8 +56,6 @@ struct KdTreeBuilderTBB {
     std::nth_element(first, median_itr, last, [&](size_t i, size_t j) { return proj(traits::point(points, i)) < proj(traits::point(points, j)); });
 
     // Create a non-leaf node.
-    const NodeIndexType node_index = node_count++;
-    auto& node = kdtree.nodes[node_index];
     node.node_type.sub.proj = proj;
     node.node_type.sub.thresh = proj(traits::point(points, *median_itr));
 
diff --git a/scripts/plot_kdtree.py b/scripts/plot_kdtree.py
index a30d9c1..7ead743 100644
--- a/scripts/plot_kdtree.py
+++ b/scripts/plot_kdtree.py
@@ -54,14 +54,14 @@ def main():
   fig, axes = pyplot.subplots(1, 2, figsize=(12, 3))
   
   num_threads = [1, 2, 3, 4, 5, 6, 7, 8, 16, 32, 64, 128]
-  axes[0].plot(num_points, results['small_1'], label='kdtree (nanoflann)', marker='o', linestyle='--')
-  for idx in [1, 3, 5, 7, 8]:
+  axes[0].plot(num_points, results['small_1'], label='kdtree (single-thread)', marker='o', linestyle='--')
+  for idx in [1, 2, 3, 5, 7, 8, 9]:
     N = num_threads[idx]
     axes[0].plot(num_points, results['omp_{}'.format(N)], label='kdtree_omp (%d threads)' % N, marker='s')
-    axes[0].plot(num_points, results['tbb_{}'.format(N)], label='kdtree_tbb (%d threads)' % N, marker='^')
+    # axes[0].plot(num_points, results['tbb_{}'.format(N)], label='kdtree_tbb (%d threads)' % N, marker='^')
 
   baseline = numpy.array(results['small_1'])
-  axes[1].plot([num_threads[0], num_threads[-1]], [1.0, 1.0], label='kdtree (nanoflann)', linestyle='--')
+  axes[1].plot([num_threads[0], num_threads[-1]], [1.0, 1.0], label='kdtree (single-thread)', linestyle='--')
   for idx in [5]:
     threads = num_threads[idx]
     N = num_points[idx]
diff --git a/scripts/run_kdtree_benchmark.sh b/scripts/run_kdtree_benchmark.sh
index 79fc832..cf88254 100755
--- a/scripts/run_kdtree_benchmark.sh
+++ b/scripts/run_kdtree_benchmark.sh
@@ -3,7 +3,7 @@ dataset_path=$1
 exe_path=../build/kdtree_benchmark
 
 mkdir results
-num_threads=(1 2 3 4 5 6 7 8 16 32 64 128)
+num_threads=(1 2 3 4 5 6 7 8 16 32 64 92 128)
 
 $exe_path $dataset_path --num_threads 1 --num_trials 1000 --method small | tee results/kdtree_benchmark_small_$N.txt