Simple Monocular SLAM using OpenCV and OpenCV Contrib.
Simultaneous Localisation and Mapping (SLAM) is a complex task in Robotics & Computer Vision. A typical example is a robot vacuum cleaner building a map of an unknown environment. The robot software must be able to reconstruct the 3D scene of an environment from 2D pictures (the available data may also include various artifacts: depth maps, heatmaps, sensor readings, etc.).
The most classic SLAM variant works with a single camera. The task is to build a 3D scene (say, a point cloud) from pictures taken with a single camera that moves inside some environment.
The dataset and formulation were prepared by the YSDA faculty, with the support of Timur Ibadov and Ivan Malin (from GeoCV), and Alexander Velizhnev (a member of the IBM Research Center in Zurich, velizhev@gmail.com).
The data comes from the TU Munich RGB-D SLAM Dataset.
To understand the underlying concepts of SLAM, this study-oriented project builds a simple version of SLAM using pictures from a single camera. About half of the provided images come with camera coordinates and rotation matrices (i.e., we know where the camera was when a specific shot was taken). The task is to determine the coordinates and rotation matrices for the rest of the images. The camera intrinsics are given. The support image data is provided with some noise to bring the problem closer to real-world situations.
The images with known poses are called support images; the images whose poses must be determined are called unknown images.
Let's denote the support images as $S = \{s_i\}_{i=1}^{n}$, the unknown images as $U = \{u_i\}_{i=1}^{m}$, and all images as $X = S \cup U$.
The pipeline is as follows (minimal Python sketches of the key steps are given after the list):

1. For each image $x_i \in X$, find a set of keypoints $K(x_i)$. Then, using the ORB algorithm (an alternative to SIFT), compute descriptors $D_i = D(K(x_i))$.
2. For each pair of support images $\{s_i, s_j\}$, $i < j$:
   - Match the descriptors of the two images in descriptor space.
   - Use RANSAC to filter out wrong matches (an inlier is the same scene point observed in both images).
   - Save all inliers for the given pair of support images, $I(\{D_i, D_j\}) = I_{ij}$.
3. Build tracks between images. A track is a sequence of inliers corresponding to the same scene point across pairs of images.
4. Triangulate a 3D point for each track, using the known poses of the support images. After this step, we have a cloud of 3D points.
5. Filter noisy 3D points. Each 3D point in the cloud is reprojected back to the images where it was found, and a reprojection error is computed for each image; if the maximum error is above a threshold, the point (and its whole track) is discarded from the cloud.
6. For each unknown image $u_i \in U$:
   - Match the descriptors of $u_i$ with those of $s_j \in S$ for all $j = 1, \dots, n$, following the same procedure as in step 2, but against the inliers from the support images.
   - The resulting 3D-2D point correspondences are sufficient to solve the PnP equation system and find the pose of $u_i$.
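As a minimal sketch of steps 1 and 2, assuming the images are loaded as grayscale NumPy arrays and the intrinsics matrix `K` is known (function names and the RANSAC threshold are illustrative, not taken from the project code):

```python
import cv2
import numpy as np

N_KEYPOINTS = 500  # the "Keypoints per image" hyperparameter from the statistics below

orb = cv2.ORB_create(nfeatures=N_KEYPOINTS)

def detect(image):
    """Step 1: ORB keypoints and binary descriptors for one image."""
    keypoints, descriptors = orb.detectAndCompute(image, None)
    return keypoints, descriptors

def match_pair(kp_i, des_i, kp_j, des_j, K):
    """Step 2: match descriptors, then keep only the RANSAC inliers."""
    # Hamming distance is the natural metric for binary ORB descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_i, des_j)

    pts_i = np.float32([kp_i[m.queryIdx].pt for m in matches])
    pts_j = np.float32([kp_j[m.trainIdx].pt for m in matches])

    # RANSAC on the essential matrix (the intrinsics are given) rejects wrong matches.
    _, mask = cv2.findEssentialMat(pts_i, pts_j, K, method=cv2.RANSAC, threshold=1.0)
    return [m for m, keep in zip(matches, mask.ravel()) if keep]
```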
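Steps 4 and 5 rely on the known support poses. In the sketch below, `pose_i` and `pose_j` are assumed to be 3x4 world-to-camera matrices $[R \mid t]$, and the reprojection threshold is illustrative:

```python
import cv2
import numpy as np

def triangulate(pts_i, pts_j, pose_i, pose_j, K):
    """Step 4: triangulate 3D points from two support views with known poses."""
    P_i, P_j = K @ pose_i, K @ pose_j        # 3x4 projection matrices
    points_4d = cv2.triangulatePoints(P_i, P_j, pts_i.T, pts_j.T)
    return (points_4d[:3] / points_4d[3]).T  # homogeneous -> Euclidean, shape (N, 3)

def reprojection_ok(point_3d, observations, K, max_error=2.0):
    """Step 5: keep a point only if every reprojection error is below the threshold."""
    for pose, pt_2d in observations:         # (3x4 pose, observed 2D point) per image
        rvec = cv2.Rodrigues(pose[:, :3])[0] # rotation matrix -> axis-angle
        projected, _ = cv2.projectPoints(point_3d.reshape(1, 3),
                                         rvec, pose[:, 3], K, None)
        if np.linalg.norm(projected.ravel() - pt_2d) > max_error:
            return False                     # discard the point and its whole track
    return True
```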
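Step 6 reduces to the Perspective-n-Point (PnP) problem. A RANSAC-based solver, sketched below, is a common choice for tolerating residual mismatches (the project's exact solver settings are an assumption here):

```python
import cv2
import numpy as np

def estimate_pose(points_3d, points_2d, K):
    """Step 6: recover an unknown image's pose from 3D-2D correspondences via PnP."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.float32(points_3d), np.float32(points_2d), K, None)
    if not ok:
        raise RuntimeError("PnP failed: not enough consistent correspondences")
    R, _ = cv2.Rodrigues(rvec)  # axis-angle -> rotation matrix
    return R, tvec              # world-to-camera rotation and translation
```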
Examples of images:
Some statistics of the process:

| Statistic                  | Value  |
|----------------------------|--------|
| # of images                | 100    |
| # of support images        | 50     |
| Keypoints per image        | 500.00 |
| # of support pairs         | 1225   |
| # of pairs with inliers    | 535    |
| Inliers per support pair   | 47.64  |
| Tracks found               | 9289   |
| Tracks per inlier pair     | 17.36  |
| Scene points found         | 9219   |
| Scene points per track     | 0.99   |
| # of pairs with inliers2   | 774    |
| Inliers2 per support pair  | 44.64  |
The number of keypoints per image is a hyperparameter of the descriptor matcher. Overall, the algorithm found 9289 tracks, which yielded 9219 scene points after filtering.
The obtained metrics are as follows:

| Metric                         | Value |
|--------------------------------|-------|
| Maximum translation error (tr) | 0.03  |
| Maximum rotation error (rot)   | 1.38  |

Each error is taken as the maximum deviation, over all unknown images, between the predicted and the ground-truth camera pose. The metrics show a very small difference between the ground-truth and predicted matrices, confirming that the algorithm is capable of solving the task.
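The exact error formulas are not spelled out above; a common convention, assumed in the sketch below, is the Euclidean distance between camera positions for the translation error and the geodesic angle between rotation matrices for the rotation error, each maximised over the unknown images:

```python
import numpy as np

def translation_error(t_pred, t_gt):
    """Euclidean distance between predicted and ground-truth camera positions."""
    return np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt))

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle (in degrees) between two rotation matrices."""
    # For a perfect estimate, R_pred^T @ R_gt is the identity; its rotation
    # angle measures the deviation.
    cos_angle = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

# tr  = max over unknown images of translation_error(...)
# rot = max over unknown images of rotation_error_deg(...)
```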

