In the context of drone deliveries, it is crucial not only to detect obstacles in real time but also to accurately determine their distances. However, mounting traditional LiDAR sensors on drones is challenging because of their high cost and weight, which makes them impractical for on-board obstacle detection and distance estimation. Consequently, an alternative is needed that allows drones to avoid obstacles, plan optimal paths, and build a comprehensive understanding of their surroundings. Providing accurate obstacle distance measurements with a cost-effective, lightweight alternative to LiDAR is therefore essential for safe and efficient drone deliveries.
This project presents an economical alternative to LiDAR sensors for per-pixel depth estimation in drone applications. Using the Block Matching and Semi-Global Block Matching (SGBM) algorithms, we show how stereo computer vision techniques can accurately recover depth information: block matching efficiently establishes correspondences between the stereo camera images, while SGBM refines the disparity map estimation. Experiments validate the approach and demonstrate per-pixel depth estimation at a fraction of the cost of a traditional LiDAR system. Our findings indicate that such a pseudo-LiDAR setup is a promising, cost-effective replacement for LiDAR sensors in drone applications, enabling affordable depth perception and expanding the possibilities for aerial data collection and analysis.
For this project, the KITTI dataset is used, specifically its stereo camera images. The dataset includes a comprehensive 3D object detection benchmark comprising 7,481 training images and 7,518 test images, accompanied by corresponding point clouds, for a total of 80,256 labeled objects. The synchronized and rectified raw data from KITTI is also employed during the inference stage. This rich dataset allows a thorough evaluation of the proposed methods and ensures the robustness of the project's findings.
Before proceeding, we need to develop a method to estimate the camera's internal and external parameters accurately. This requires creating a linear camera model, which simplifies the estimation process compared to nonlinear models.
In the image below, we have a point P in the world coordinate frame W and a camera with its own coordinate frame C. If we know the relative position and orientation of the camera coordinate frame with respect to the world coordinate frame, we can write an expression that takes us all the way from the point P in the world coordinate frame to its projection P' onto the image plane. That complete mapping is what we refer to as the forward imaging model.
We assume that the point is defined in the camera coordinate frame as (x_c, y_c, z_c). Using the perspective projection equations:

x_i = f * x_c / z_c,    y_i = f * y_c / z_c

where (x_i, y_i) are the coordinates of the projection of P onto the image plane and f is the focal length, the distance between the Centre of Projection (COP) and the image plane of the camera.
So far we have expressed the image-plane coordinates in millimeters (mm), the same units as the camera coordinate frame. In reality, however, the image sensor is indexed in pixels (px), so we need to convert the coordinates of P from mm to pixel coordinates using the pixel densities m_x and m_y (px/mm) in the x and y directions respectively:

u = m_x * x_i,    v = m_y * y_i
So far we have treated the midpoint of the image plane as the origin, but generally the top-left corner of the image sensor is taken as the origin. (o_x, o_y) is the principal point, where the optical axis pierces the sensor:

u = m_x * f * x_c / z_c + o_x,    v = m_y * f * y_c / z_c + o_y
We rewrite the equations above with f_x = m_x * f and f_y = m_y * f, the focal lengths in pixels in the x and y directions:

u = f_x * x_c / z_c + o_x,    v = f_y * y_c / z_c + o_y
- (f_x, f_y, o_x, o_y): intrinsic parameters of the camera.
- They represent the camera's internal geometry.
- The equations tell us that objects farther away appear smaller in the image.
The equations for (u, v) are non-linear, since we are dividing by z_c.
We need to go from this non-linear model to a linear model, and we will use homogeneous coordinates to do so: we transform (u, v) from pixel coordinates to homogeneous coordinates.
The homogeneous representation of a 2D point (u, v) is a 3D point (ũ, ṽ, w̃), where u = ũ/w̃ and v = ṽ/w̃; the third coordinate is fictitious.
Similarly, we multiply both sides of the projection equations by z_c so that the right-hand side becomes linear in (x_c, y_c, z_c):
We now have a 3 x 4
matrix which contains all the internal parameters of the camera multiplied by the homogeneous coordinates of the 3-dimensional point P
defined in the camera coordinate frame. This gives us a Linear Model for Perspective Projection.
Note that this 3 x 4 matrix is called the Intrinsic Matrix:

M_int = [ f_x   0    o_x   0 ]
        [  0   f_y   o_y   0 ]
        [  0    0     1    0 ]

Its left 3 x 3 block is an upper-right triangular matrix, which we can separate out as the calibration matrix K:

K = [ f_x   0    o_x ]
    [  0   f_y   o_y ]
    [  0    0     1  ]

so that M_int = [K | 0].
Hence, M_int is what takes us from the homogeneous representation of a 3D point in the camera coordinate frame to its homogeneous pixel coordinates in the image, (ũ, ṽ, w̃).
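As a quick illustration, here is a minimal sketch (with made-up intrinsic values, not the KITTI calibration) that builds the intrinsic matrix and projects a camera-frame point to pixel coordinates:

import numpy as np

# Hypothetical intrinsics: focal lengths and principal point, all in pixels
fx, fy = 700.0, 700.0
ox, oy = 640.0, 360.0

K = np.array([[fx, 0.0, ox],
              [0.0, fy, oy],
              [0.0, 0.0, 1.0]])

# Intrinsic 3 x 4 matrix: M_int = [K | 0]
M_int = np.hstack([K, np.zeros((3, 1))])

# A 3D point in the camera coordinate frame, in homogeneous coordinates
P_cam = np.array([1.0, 0.5, 10.0, 1.0])

# Linear perspective projection, then divide by the third (fictitious) coordinate
uvw = M_int @ P_cam
u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
print(u, v)  # pixel coordinates of the projection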
Now we need the mapping of a point from the world coordinates to the camera coordinates: 3D to 3D. That can be done using the position, c_w, and orientation, R, of the camera coordinate frame. The position and orientation of the camera in the world coordinate frame W are the camera's Extrinsic Parameters.
The orientation, or rotation matrix, R is orthonormal: R R^T = R^T R = I.
If we now map P from the world coordinate frame to the camera coordinate frame using the rotation matrix R and the translation vector t = -R c_w:

x_c = R x_w + t
We can write the equation above using homogeneous coordinates:
The Extrinsic Matrix is the 4 x 4 matrix:

M_ext = [ R   t ]
        [ 0   1 ]
And this is how we transformed a point from the world coordinates frame to the camera coordinates frame:
We now have our mapping from world to camera using the Extrinsic Matrix:
And the mapping of the camera to the image using the Intrinsic Matrix:
Therefore, we can combine these two to get a direct mapping from a point in the world coordinate frame to a pixel location in the image:
We then have a 3 x 4 matrix called the Projection Matrix:

P = M_int M_ext = K [R | t]
Hence, if we wish to calibrate the camera, all we need to know is the projection matrix. We can then go from any point in 3D to its projection in pixels in the image.
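To make this mapping concrete, here is a minimal sketch (with an arbitrary pose, not the KITTI one) that composes P = K [R | t] and projects a world point to pixel coordinates:

import numpy as np

# Hypothetical intrinsics (same K as in the earlier sketch)
K = np.array([[700.0, 0.0, 640.0],
              [0.0, 700.0, 360.0],
              [0.0, 0.0, 1.0]])

# Hypothetical extrinsics: identity rotation and a small translation
R = np.eye(3)
t = np.array([[0.1], [0.0], [0.0]])

# Projection matrix P = K [R | t]
P = K @ np.hstack([R, t])

# Project a world point (homogeneous) to homogeneous pixel coordinates
X_w = np.array([2.0, 1.0, 15.0, 1.0])
u_h = P @ X_w
u, v = u_h[:2] / u_h[2]
print(u, v)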
By employing our calibration method, we can obtain a precise estimate of the projection matrix. We can then go a step further and decompose the projection matrix into its constituent parts: the intrinsic matrix, which encompasses all internal parameters, and the extrinsic matrix, which captures the external parameters of the camera.
Looking at how the projection matrix is built, its first 3 columns are the product of the calibration matrix K and the rotation matrix R. Since K is an upper-right triangular matrix and R is an orthonormal matrix, it is possible to uniquely "decouple" K and R from their product using QR (more precisely, RQ) factorization.
Similarly, the last column of the projection matrix is the calibration matrix K multiplied by the translation vector t:

p_4 = K t

Therefore:

t = K^-1 p_4
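As a sanity check of this decomposition, here is a minimal sketch using scipy.linalg.rq (the use of SciPy here is an assumption on tooling; OpenCV's decomposeProjectionMatrix, used later, performs the same job):

import numpy as np
from scipy.linalg import rq

def decompose_P(P):
    """Split a 3x4 projection matrix P = K [R | t] into K, R, and t."""
    M = P[:, :3]
    K, R = rq(M)                              # M = K @ R, K upper triangular, R orthonormal
    # The RQ factorization is only unique up to signs; make the diagonal of K positive
    S = np.diag(np.sign(np.diag(K)))
    K, R = K @ S, S @ R
    # P is only defined up to scale; normalise so that K[2, 2] = 1
    scale = K[2, 2]
    K = K / scale
    t = np.linalg.inv(K) @ (P[:, 3] / scale)  # last column of P is K @ t
    return K, R, t

# K_left, R_left, t_left = decompose_P(P_left)   # e.g. on the KITTI P2 matrix shown later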
Note that a pinhole does not introduce distortion, but lenses do. We may encounter radial or tangential distortion, which are non-linear effects, and to account for them we need to incorporate distortion coefficients into our intrinsic model.
Now we want to recover the 3-dimensional structure of a scene from two images. Before we dive in, let's ask ourselves: given a calibrated camera, can we find the 3D scene point from a single 2D image? The answer is no. But we do know that the corresponding 3D point must lie on an outgoing ray (shown in green), and since the camera is calibrated, we know the equation of that ray.

So in order to reconstruct a 3D scene, we need two images captured from two different locations. Below is an example of a stereo camera, where the right camera is identical to the left camera but displaced along the horizontal direction by a distance b called the baseline.
In the image below we have the projection of a scene point in the left camera and the projection of the same point in the right camera. We shoot a ray out through each projected point, and wherever those two rays intersect is where the physical point corresponding to the two image points lies. This is called the triangulation problem.
Assuming we know the position of the point in the left and right image planes, (u_l, v_l) and (u_r, v_r), the perspective projection equations give:

u_l = f * x / z + o_x,    u_r = f * (x - b) / z + o_x,    v_l = v_r = f * y / z + o_y

Notice that the position of the point in the vertical direction of the image plane is the same for both the left and right cameras, meaning we have no disparity in the vertical direction. Solving for (x, y, z):

x = b * (u_l - o_x) / (u_l - u_r),    y = b * (v_l - o_y) / (u_l - u_r),    z = b * f / (u_l - u_r)
- Disparity is the difference in the u-coordinates of the same scene point in the two images: d = u_l - u_r.
- Using the disparity we can calculate the depth z of the point in the scene.
- Depth z is inversely proportional to the disparity: a point very close to the camera has a large disparity, while a point far from the camera has a small disparity.
- The disparity is zero for a point at infinity; at infinity a point has exactly the same position on the left and right image planes.
- Disparity is proportional to the baseline: the further apart the two cameras are, the greater the disparity.
- When designing a stereo system, we therefore want a configuration with a large baseline, as a larger baseline makes the disparity measurements more precise (see the short numeric example below).
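As a quick numeric sanity check, using the KITTI values that appear later in this write-up (focal length of roughly 721.54 px and a baseline of roughly 0.54 m), depth falls off quickly as disparity shrinks:

focal_length = 721.54   # px, KITTI left colour camera
baseline = 0.54         # m, approximate stereo baseline

for disparity in [10, 20, 50, 100]:                 # disparity in pixels
    depth = focal_length * baseline / disparity     # z = f * b / d
    print(f"disparity {disparity:>3} px -> depth {depth:5.1f} m")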
So far we have assumed that we know where a point from the left image lands in the right image; that is, we assumed we knew the correspondence between points in the left and right images. Now we need to find that correspondence, and that is called stereo matching.

As mentioned before, we do NOT have disparity in the vertical direction, which means that corresponding points must lie on the same horizontal line in both images. Hence our search is 1D, and we can use template matching to find the corresponding point in the right image.
- We use the window T as a template and match it against all the windows along the same scan line L in the right image using a similarity metric (e.g., sum of absolute differences or sum of squared differences).
- The point where it matches best (where the error is lowest) is the match.
- That point is the correspondence we use to estimate disparity (a minimal sketch of this scan-line search is shown after the window-size discussion below).
Now, how big should the window T be?

- If the window is very small, we may get good localization but high sensitivity to noise: the smaller the window, the less descriptive the pattern.
- If we use a large window, we may get more robust matches, but the disparity map will be more blurred: poor localization.
- Another option is an adaptive window: for each point, we match using windows of multiple sizes and keep the disparity that gives the best similarity measure.
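Below is a minimal sketch of this 1D scan-line search using a sum-of-absolute-differences cost. It is illustrative only: it assumes grayscale numpy images and a pixel far enough from the image border, and it is far slower than the OpenCV matchers used later.

import numpy as np

def match_along_scanline(left, right, row, col, block_size=11, max_disparity=64):
    """Estimate the disparity of pixel (row, col) by SAD template matching along one scan line."""
    half = block_size // 2
    template = left[row - half:row + half + 1, col - half:col + half + 1].astype(np.float32)
    best_disparity, best_cost = 0, np.inf
    for d in range(max_disparity):
        c = col - d                                   # candidate column in the right image
        if c - half < 0:
            break
        window = right[row - half:row + half + 1, c - half:c + half + 1].astype(np.float32)
        cost = np.abs(template - window).sum()        # sum of absolute differences
        if cost < best_cost:
            best_cost, best_disparity = cost, d
    return best_disparity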
Stereo-matching algorithms can be categorized as either local
or global
methods, depending on how they handle the disparity optimization step.
- Global methods incorporate a pairwise smoothness term in the cost function, which encourages spatial continuity across pixels and consistent assignments along edges.
- These global methods generally perform better than local methods in handling object boundaries and resolving ambiguous matches.
- Global optimization problems are often computationally challenging and can be classified as NP-hard.
- On the other hand, local methods have simpler computations but may struggle with object boundaries and ambiguous matches.
In order to get good stereo matching, we want to avoid the following:

- Texture-less surfaces: we expect the surface to have texture. With an image of a smooth wall there is little texture, and a small window T may match many locations between the left and right images.
- Repetitive texture: if the image does have texture, it should not be repetitive. A repetitive pattern produces multiple matches in the right image that are all equally good.
- Foreshortening: an inherent problem of stereo matching whereby the projected area of a window in the left image differs from its projected area in the right image, so we are not matching the same brightness pattern but a warped or distorted version of it.
Now we will try to find the distances of objects using the KITTI Dataset. For that, we first need to know the configuration of the cameras on the autonomous car. Thankfully, we have a very detailed explanation describing the position and the baseline between the cameras as shown below.
When downloading the dataset for 3D Object Detection, we have the following files:
- Left Images
- Right Images
- Calibration Data
- Labels
Below is an example of what the calibration data looks like. Since we are using the left and right color images, we need to extract P2 and P3 to obtain their projection matrices.
P0: 7.215377000000e+02 0.000000000000e+00 6.095593000000e+02 0.000000000000e+00 0.000000000000e+00 7.215377000000e+02 1.728540000000e+02 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 0.000000000000e+00
P1: 7.215377000000e+02 0.000000000000e+00 6.095593000000e+02 -3.875744000000e+02 0.000000000000e+00 7.215377000000e+02 1.728540000000e+02 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 0.000000000000e+00
P2: 7.215377000000e+02 0.000000000000e+00 6.095593000000e+02 4.485728000000e+01 0.000000000000e+00 7.215377000000e+02 1.728540000000e+02 2.163791000000e-01 0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 2.745884000000e-03
P3: 7.215377000000e+02 0.000000000000e+00 6.095593000000e+02 -3.395242000000e+02 0.000000000000e+00 7.215377000000e+02 1.728540000000e+02 2.199936000000e+00 0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 2.729905000000e-03
R0_rect: 9.999239000000e-01 9.837760000000e-03 -7.445048000000e-03 -9.869795000000e-03 9.999421000000e-01 -4.278459000000e-03 7.402527000000e-03 4.351614000000e-03 9.999631000000e-01
Tr_velo_to_cam: 7.533745000000e-03 -9.999714000000e-01 -6.166020000000e-04 -4.069766000000e-03 1.480249000000e-02 7.280733000000e-04 -9.998902000000e-01 -7.631618000000e-02 9.998621000000e-01 7.523790000000e-03 1.480755000000e-02 -2.717806000000e-01
Tr_imu_to_velo: 9.999976000000e-01 7.553071000000e-04 -2.035826000000e-03 -8.086759000000e-01 -7.854027000000e-04 9.998898000000e-01 -1.482298000000e-02 3.195559000000e-01 2.024406000000e-03 1.482454000000e-02 9.998881000000e-01 -7.997231000000e-01
Note that we get a left and right projection matrix:
Left P Matrix
[[ 721.54 0 609.56 44.857]
[ 0 721.54 172.85 0.21638]
[ 0 0 1 0.0027459]]
Right P Matrix
[[ 721.54 0 609.56 -339.52]
[ 0 721.54 172.85 2.1999]
[ 0 0 1 0.0027299]]
R0 to Rect Matrix
[[ 0.99992 0.0098378 -0.007445]
[ -0.0098698 0.99994 -0.0042785]
[ 0.0074025 0.0043516 0.99996]]
Velodyne to Camera Matrix
[[ 0.0075337 -0.99997 -0.0006166 -0.0040698]
[ 0.014802 0.00072807 -0.99989 -0.076316]
[ 0.99986 0.0075238 0.014808 -0.27178]]
IMU to Velodyne Matrix
[[ 1 0.00075531 -0.0020358 -0.80868]
[ -0.0007854 0.99989 -0.014823 0.31956]
[ 0.0020244 0.014825 0.99989 -0.79972]]
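The matrices above were read from the calibration text file. A minimal parser for the format shown (each line is a key followed by 9 or 12 space-separated values) might look like this, where the file path is just a placeholder:

import numpy as np

def read_kitti_calib(calib_path):
    """Parse a KITTI calibration file into a dict of matrices keyed by name (P0, P1, P2, P3, ...)."""
    matrices = {}
    with open(calib_path) as f:
        for line in f:
            if ':' not in line:
                continue
            key, values = line.split(':', 1)
            numbers = np.array([float(v) for v in values.split()])
            # Projection and Tr_* matrices are 3x4 (12 values); R0_rect is 3x3 (9 values)
            matrices[key] = numbers.reshape(3, -1)
    return matrices

# calib = read_kitti_calib("000000.txt")        # hypothetical path to one calibration file
# P_left, P_right = calib["P2"], calib["P3"]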
Next, we need to extract the Intrinsic Matrix and the Extrinsic Matrix from our Projection Matrix. Recall the equation from before: P = K [R | t].
We will use OpenCV's decomposeProjectionMatrix
function for that:
# Imports used throughout the code snippets below
import cv2
import numpy as np
import matplotlib.pyplot as plt

def decompose_projection_matrix(projection_matrix):
"""
Decompose a projection matrix into camera matrix, rotation matrix, and translation vector.
Args:
projection_matrix (numpy.ndarray): 3x4 projection matrix.
Returns:
tuple: Tuple containing the decomposed components:
- camera_matrix (numpy.ndarray): Camera matrix. [ fx 0 cx ]
[ 0 fy cy ]
[ 0 0 1 ]
- rotation_matrix (numpy.ndarray): Rotation matrix.
- translation_vector (numpy.ndarray): Translation vector.
"""
# Decompose the projection matrix
camera_matrix, rotation_matrix, translation_vector, _, _, _, _ = cv2.decomposeProjectionMatrix(projection_matrix)
translation_vector = translation_vector/translation_vector[3]
# Return the decomposed components
return camera_matrix, rotation_matrix, translation_vector
Below is what we get:
Camera Matrix Left:
[[ 721.54 0 609.56]
[ 0 721.54 172.85]
[ 0 0 1]]
Rotation Matrix Left:
[[ 1 0 0]
[ 0 1 0]
[ 0 0 1]]
Translation Vector Left:
[[ -0.059849]
[ 0.00035793]
[ -0.0027459]
[ 1]]
Camera Matrix Right:
[[ 721.54 0 609.56]
[ 0 721.54 172.85]
[ 0 0 1]]
Rotation Matrix Right:
[[ 1 0 0]
[ 0 1 0]
[ 0 0 1]]
Translation Vector Right:
[[ 0.47286]
[ -0.002395]
[ -0.0027299]
[ 1]]
Observe how the Rotation Matrix is just an Identity Matrix. We need to extract the Focal Length from the Intrinsic Matrix and calculate the baseline from the translation vectors as such:
# Extract the focal length and baseline
focal_length_x = camera_matrix_right[0, 0]
focal_length_y = camera_matrix_right[1, 1]
baseline = abs(translation_vector_left[0] - translation_vector_right[0])
Output:
Focal Length (x-direction): 721.5377
Focal Length (y-direction): 721.5377
Baseline: [ 0.53271]
Observe that the baseline is 0.53271 m, the same as in the camera configuration on the autonomous car.
We get a label file for each pair of images. Below is our label.txt file with each object having 15 values.
Car 0.00 0 -1.56 564.62 174.59 616.43 224.74 1.61 1.66 3.20 -0.69 1.69 25.01 -1.59
Car 0.00 0 1.71 481.59 180.09 512.55 202.42 1.40 1.51 3.70 -7.43 1.88 47.55 1.55
Car 0.00 0 1.64 542.05 175.55 565.27 193.79 1.46 1.66 4.05 -4.71 1.71 60.52 1.56
Cyclist 0.00 0 1.89 330.60 176.09 355.61 213.60 1.72 0.50 1.95 -12.63 1.88 34.09 1.54
DontCare -1 -1 -10 753.33 164.32 798.00 186.74 -1 -1 -1 -1000 -1000 -1000 -10
DontCare -1 -1 -10 738.50 171.32 753.27 184.42 -1 -1 -1 -1000 -1000 -1000 -10
#Values Name Description
----------------------------------------------------------------------------
1 type Describes the type of object: 'Car', 'Van', 'Truck',
'Pedestrian', 'Person_sitting', 'Cyclist', 'Tram',
'Misc' or 'DontCare'
1 truncated Float from 0 (non-truncated) to 1 (truncated), where
truncated refers to the object leaving image boundaries
1 occluded Integer (0,1,2,3) indicating occlusion state:
0 = fully visible, 1 = partly occluded
2 = largely occluded, 3 = unknown
1 alpha Observation angle of object, ranging [-pi..pi]
4 bbox 2D bounding box of object in the image (0-based index):
contains left, top, right, bottom pixel coordinates
3 dimensions 3D object dimensions: height, width, length (in meters)
3 location 3D object location x,y,z in camera coordinates (in meters)
1 rotation_y Rotation ry around Y-axis in camera coordinates [-pi..pi]
1 score Only for results: Float, indicating confidence in
detection, needed for p/r curves, higher is better.
Credit to DarylClimb for this explanation.
Note that DontCare labels denote regions in which objects have not been labeled, for example because they were too far away from the laser scanner. We need to extract the bounding box coordinates and the distance of each object from the labels so that we can compare our predictions with the ground truth.
def ground_truth_bbox(labels_file, object_class):
lines = labels_file.split('\n') # Split the input string into lines
bounding_boxes = []
for line in lines:
line = line.strip() # Remove leading/trailing whitespace
if line:
data = line.split()
# Extract relevant values for each object
obj_type = data[0]
#print(obj_type)
truncated = float(data[1])
occluded = int(data[2])
alpha = float(data[3])
bbox = [float(val) for val in data[4:8]]
dimensions = [float(val) for val in data[8:11]]
location = [float(val) for val in data[11:14]]
distance = float(data[13])
rotation_y = float(data[14])
if obj_type in object_class: # Check if obj_type is in the desired classes
# Append bounding box dimensions and distance to the list
bounding_boxes.append((bbox, distance))
# Print the 3D bounding box information
print("Object Type:", obj_type)
print("Truncated:", truncated)
print("Occluded:", occluded)
print("Alpha:", alpha)
print("Bounding Box:", bbox)
print("Dimensions:", dimensions)
print("Location:", location)
print("True Distance:", distance)
print("Rotation Y:", rotation_y)
print("------------------------")
return bounding_boxes
Output:
Object Type: Car
Truncated: 0.0
Occluded: 0
Alpha: 1.71
Bounding Box: [481.59, 180.09, 512.55, 202.42]
Dimensions: [1.4, 1.51, 3.7]
Location: [-7.43, 1.88, 47.55]
True Distance: 47.55
Rotation Y: 1.55
We start by displaying the left and right images of the KITTI dataset. Note that the baseline is 54 cm, which is quite large compared to a compact stereo module such as the OAK-D, so we get a larger disparity and can therefore compute the depth map more accurately. A small loading and display sketch is shown below.
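A minimal sketch for loading and showing one stereo pair (the file paths are hypothetical placeholders for the KITTI image_2 / image_3 folders):

import cv2
import matplotlib.pyplot as plt

left_image = cv2.imread("image_2/000000.png")    # left colour camera (P2)
right_image = cv2.imread("image_3/000000.png")   # right colour camera (P3)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
ax1.imshow(cv2.cvtColor(left_image, cv2.COLOR_BGR2RGB)); ax1.set_title("Left")
ax2.imshow(cv2.cvtColor(right_image, cv2.COLOR_BGR2RGB)); ax2.set_title("Right")
plt.show()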
Next, we will use OpenCV's Block Matching (StereoBM_create) and Semi-Global Block Matching (StereoSGBM_create) functions to calculate the disparity. We will need to fine-tune a few parameters: num_disparities, block_size, and window_size.
def compute_disparity(left_img, right_img, num_disparities=6 * 16, block_size=11, window_size=6, matcher="stereo_sgbm", show_disparity=True):
"""
Compute the disparity map for a given stereo-image pair.
Args:
        left_img (numpy.ndarray): Left image of the stereo pair.
        right_img (numpy.ndarray): Right image of the stereo pair.
num_disparities (int): Maximum disparity minus minimum disparity.
block_size (int): Size of the block window. It must be an odd number.
window_size (int): Size of the disparity smoothness window.
matcher (str): Matcher algorithm to use ("stereo_bm" or "stereo_sgbm").
show_disparity (bool): Whether to display the disparity map using matplotlib.
Returns:
numpy.ndarray: The computed disparity map.
"""
if matcher == "stereo_bm":
# Create a Stereo BM matcher
matcher = cv2.StereoBM_create(numDisparities=num_disparities, blockSize=block_size)
elif matcher == "stereo_sgbm":
# Create a Stereo SGBM matcher
matcher = cv2.StereoSGBM_create(
minDisparity=0, numDisparities=num_disparities, blockSize=block_size,
P1=8 * 3 * window_size ** 2, P2=32 * 3 * window_size ** 2,
mode=cv2.STEREO_SGBM_MODE_SGBM_3WAY
)
# Convert the images to grayscale
left_gray = cv2.cvtColor(left_img, cv2.COLOR_BGR2GRAY)
right_gray = cv2.cvtColor(right_img, cv2.COLOR_BGR2GRAY)
# Compute the disparity map
disparity = matcher.compute(left_gray, right_gray).astype(np.float32) / 16
if show_disparity:
# Display the disparity map using matplotlib
plt.imshow(disparity, cmap="CMRmap_r") #CMRmap_r # cividis
plt.title('Disparity map with SGBM', fontsize=12)
plt.show()
return disparity
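For reference, the two disparity maps compared below can be produced by calling compute_disparity once with each matcher (this assumes left_image and right_image have already been loaded, e.g., as in the earlier loading sketch):

disparity_bm = compute_disparity(left_image, right_image, matcher="stereo_bm", show_disparity=True)
disparity_sgbm = compute_disparity(left_image, right_image, matcher="stereo_sgbm", show_disparity=True)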
Observe how the disparity from Block Matching is noisier than from SGBM. We will use the SGBM algorithm from now on.

Note that we get a white strip on the left of the images, as there are no features there to match from the left image to the right image; its width depends on the matcher parameters (mainly num_disparities).
Next, we will use the disparity map to produce a depth map, using the equation below:

z = (focal_length * baseline) / disparity

We take in the baseline, the focal length, and the disparity map in order to calculate the depth of each pixel:
def calculate_depth_map(disparity, baseline, focal_length, show_depth_map=True):
"""
Calculates the depth map from a given disparity map, baseline, and focal length.
Args:
disparity (numpy.ndarray): Disparity map.
baseline (float): Baseline between the cameras.
focal_length (float): Focal length of the camera.
Returns:
numpy.ndarray: Depth map.
"""
# Replace all instances of 0 and -1 disparity with a small minimum value (to avoid div by 0 or negatives)
disparity[disparity == 0] = 0.1
disparity[disparity == -1] = 0.1
# Initialize the depth map to match the size of the disparity map
depth_map = np.ones(disparity.shape, np.single)
# Calculate the depths
depth_map[:] = focal_length * baseline / disparity[:]
if show_depth_map:
# Display the disparity map using matplotlib
plt.imshow(depth_map, cmap="cividis") #CMRmap_r # cividis
plt.show()
return depth_map
Our output is a depth map that gives the depth of every pixel, i.e., per-pixel depth estimation. However, we are interested in the depths, or distances, of specific obstacles such as cars or pedestrians. Hence, we will use an object detection algorithm to segment our depth map and extract the distances of these objects only.
output_depth_1.mp4
The next step is to use YOLOv8
to detect objects in our frames. We want to get the coordinates of the bounding boxes.
def get_bounding_box_center_frame(frame, model, names, object_class, show_output=True):
bbox_coordinates = []
frame_copy = frame.copy()
# Perform object detection on the input frame using the specified model
results = model(frame)
# Iterate over the results of object detection
for result in results:
# Iterate over each bounding box detected in the result
for r in result.boxes.data.tolist():
# Extract the coordinates, score, and class ID from the bounding box
x1, y1, x2, y2, score, class_id = r
x1 = int(x1)
x2 = int(x2)
y1 = int(y1)
y2 = int(y2)
# Get the class name based on the class ID
class_name = names.get(class_id)
# Check if the class name matches the specified object_class and the detection score is above a threshold
if class_name in object_class and score > 0.5:
bbox_coordinates.append([x1, y1, x2, y2])
# Draw bounding box on the frame
cv2.rectangle(frame_copy, (x1, y1), (x2, y2), (0, 255, 0), 2)
if show_output:
# Convert frame from BGR to RGB
frame_rgb = cv2.cvtColor(frame_copy, cv2.COLOR_BGR2RGB)
# Show the output frame with bounding boxes
cv2.imshow("Output", frame_rgb)
cv2.waitKey(0)
cv2.destroyAllWindows()
# Return the list of center coordinates
return bbox_coordinates
We decide to detect cars, persons, and bicycles only.
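The model and names arguments used above come from the ultralytics YOLOv8 package. A minimal setup sketch (the checkpoint name is just an example; any YOLOv8 detection model would do):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")                      # example YOLOv8 detection checkpoint
names = model.names                             # {class_id: class_name} mapping
object_class = ["car", "person", "bicycle"]

# bbox_coordinates = get_bounding_box_center_frame(left_image, model, names, object_class)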
Now that we have the bounding box coordinates and the depth map for each frame, we can combine the two and extract the depth of each obstacle: we get the distance of an obstacle by indexing the depth map at the center of its bounding box.
def calculate_distance(bbox_coordinates, frame, depth_map, disparity_map, show_output=True):
frame_copy = frame.copy()
# Normalize the disparity map to [0, 255]
disparity_map_normalized = cv2.normalize(disparity_map, None, 0, 255, cv2.NORM_MINMAX, cv2.CV_8U)
# Apply a colorful colormap to the disparity map
colormap = cv2.COLORMAP_JET
disparity_map_colored = cv2.applyColorMap(disparity_map_normalized, colormap)
# Normalize the depth map to [0, 255]
depth_map_normalized = cv2.normalize(depth_map, None, 0, 255, cv2.NORM_MINMAX, cv2.CV_8U)
# Apply a colorful colormap to the depth map
colormap = cv2.COLORMAP_BONE
depth_map_colored = cv2.applyColorMap(depth_map_normalized, colormap)
for bbox_coor in bbox_coordinates:
x1, y1, x2, y2 = bbox_coor
center_x = (x1 + x2) // 2
center_y = (y1 + y2) // 2
distance = depth_map[center_y][center_x]
print("Calculated distance:", distance)
# Convert distance to string
distance_str = f"{distance:.2f} m"
# Draw bounding box on the frame
cv2.rectangle(frame_copy, (x1, y1), (x2, y2), (0, 255, 0), 2)
# Draw bounding box on the frame
cv2.rectangle(disparity_map_colored, (x1, y1), (x2, y2), (0, 255, 0), 2)
# Draw bounding box on the frame
cv2.rectangle(depth_map_colored, (x1, y1), (x2, y2), (0, 255, 0), 2)
# Calculate the text size
text_size, _ = cv2.getTextSize(distance_str, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 2)
# Calculate the position for placing the text
text_x = center_x - text_size[0] // 2
text_y = y1 - 10 # Place the text slightly above the bounding box
# Calculate the rectangle coordinates
rect_x1 = text_x - 5
rect_y1 = text_y - text_size[1] - 5
rect_x2 = text_x + text_size[0] + 5
rect_y2 = text_y + 5
# Draw white rectangle behind the text
cv2.rectangle(frame_copy, (rect_x1, rect_y1), (rect_x2, rect_y2), (255, 255, 255), cv2.FILLED)
# Put text at the center of the bounding box
cv2.putText(frame_copy, distance_str, (text_x, text_y),
cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 2, cv2.LINE_AA)
# Draw white rectangle behind the text
cv2.rectangle(disparity_map_colored, (rect_x1, rect_y1), (rect_x2, rect_y2), (255, 255, 255), cv2.FILLED)
# Put text at the center of the bounding box
cv2.putText(disparity_map_colored, distance_str, (text_x, text_y),
cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 2, cv2.LINE_AA)
# Draw white rectangle behind the text
cv2.rectangle(depth_map_colored, (rect_x1, rect_y1), (rect_x2, rect_y2), (255, 255, 255), cv2.FILLED)
# Put text at the center of the bounding box
cv2.putText(depth_map_colored, distance_str, (text_x, text_y),
cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 2, cv2.LINE_AA)
if show_output:
# Convert frame from BGR to RGB
#frame_rgb = cv2.cvtColor(frame_copy, cv2.COLOR_BGR2RGB)
# Show the output frame with bounding boxes
cv2.imshow("Output disparity map", disparity_map_colored)
cv2.imshow("Output frame", frame_copy)
cv2.imshow("Output depth map", depth_map_colored)
cv2.waitKey(0)
cv2.destroyAllWindows()
return disparity_map_colored, frame_copy, depth_map_colored
We draw the bounding boxes, and the distance of each object from the camera is displayed above its box in meters.
We can also compare our predicted distances with the ground truth. The image on the left (red) is the ground truth, whereas the image on the right is our prediction using SGBM. We still have a discrepancy of 0.41 m, which is significant in the context of self-driving cars. In this second example, we have a difference of 0.61 m.
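To quantify this discrepancy over a whole frame rather than a single box, one possible helper (a sketch, not part of the original evaluation) matches each ground-truth box from ground_truth_bbox to the nearest predicted box centre and averages the absolute depth error:

import numpy as np

def depth_errors(gt_boxes, pred_boxes, depth_map):
    """Compare label distances with depths sampled at the centres of the predicted boxes."""
    if not pred_boxes:
        return None
    errors = []
    for gt_bbox, gt_distance in gt_boxes:                    # output of ground_truth_bbox()
        gx = (gt_bbox[0] + gt_bbox[2]) / 2
        gy = (gt_bbox[1] + gt_bbox[3]) / 2
        # Find the predicted box whose centre is closest to the ground-truth centre
        best = min(pred_boxes,
                   key=lambda b: (gx - (b[0] + b[2]) / 2) ** 2 + (gy - (b[1] + b[3]) / 2) ** 2)
        cx, cy = int((best[0] + best[2]) / 2), int((best[1] + best[3]) / 2)
        errors.append(abs(depth_map[cy, cx] - gt_distance))
    return np.mean(errors)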
Finally, we want to create a pipeline
function whereby:
- We take in all left and right images
- Calculate the disparity
- Create a depth map
- Run an object detection
- Display the distances with their bounding boxes.
def pipeline(left_image, right_image, object_class):
"""
Performs a pipeline of operations on stereo images to obtain a colored disparity map, RGB frame, and colored depth map.
Input:
- left_image: Left stereo image (RGB format)
- right_image: Right stereo image (RGB format)
- object_class: List of object classes of interest for bounding box retrieval
Output:
- disparity_map_colored: Colored disparity map (RGB format)
- frame_rgb: RGB frame
- depth_map_colored: Colored depth map (RGB format)
"""
global focal_length
# Calculate the disparity map
disparity_map = compute_disparity(left_image, right_image, num_disparities=90, block_size=5, window_size=5,
matcher="stereo_sgbm", show_disparity=False)
# Calculate the depth map
depth_map = calculate_depth_map(disparity_map, baseline, focal_length, show_depth_map=False)
# Get bounding box coordinates for specified object classes
bbox_coordinates = get_bounding_box_center_frame(left_image, model, names, object_class, show_output=False)
# Calculate colored disparity map, RGB frame, and colored depth map
disparity_map_colored, frame_rgb, depth_map_colored = calculate_distance(bbox_coordinates, left_image, depth_map, disparity_map, show_output=False)
return disparity_map_colored, frame_rgb, depth_map_colored
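To produce the result clips below, the pipeline can simply be run over every image pair in a sequence. The paths here are placeholders and the video-writing step is omitted; this assumes the globals baseline, focal_length, model, and names are set as shown earlier:

import glob
import cv2

left_paths = sorted(glob.glob("image_2/*.png"))
right_paths = sorted(glob.glob("image_3/*.png"))

for left_path, right_path in zip(left_paths, right_paths):
    left_image = cv2.imread(left_path)
    right_image = cv2.imread(right_path)
    disparity_colored, frame, depth_colored = pipeline(left_image, right_image,
                                                       object_class=["car", "person", "bicycle"])
    # Stack the annotated frame and depth map side by side and write them to a video here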
Below are the results:
concat_1.mp4
concat_2.mp4
concat_3.mp4
concat_4.mp4
In conclusion, while traditional techniques such as Local, Global, and Semi-Global Matching algorithms have been employed in this project, Deep Learning offers alternative approaches to estimating disparity. For example:

- MC-CNN: an algorithm that uses Siamese-network logic to find disparity
- PSM-Net: an algorithm that uses pyramid architectures
- AnyNet: an algorithm optimized for mobile devices
- RAFT-Stereo: an optical flow algorithm adapted to stereo vision
- CreStereo: compatible with the OAK-D camera
By employing advanced neural network architectures like MC-CNN, PSM-Net, AnyNet, RAFT-Stereo, and CreStereo, we can achieve more accurate and reliable depth maps, enabling UAVs to better perceive their surroundings. This improved distance estimation is crucial for obstacle detection during UAV package delivery operations. With precise depth perception, UAVs can effectively identify and locate obstacles such as buildings, trees, or other objects that may obstruct their flight path. By leveraging deep learning, UAVs can analyze scenes in real time, enabling them to make informed decisions and adjust their trajectory to avoid potential collisions or navigate complex environments safely.
[1] Mileyan, I. (n.d.). AnyNet. GitHub. https://github.com/mileyan/AnyNet
[2] Chang, J. R. (n.d.). PSMNet. GitHub. https://github.com/JiaRenChang/PSMNet
[3] Autonomous Vision Group. (n.d.). Unimatch. GitHub. https://github.com/autonomousvision/unimatch
[4] Zhang, M. (2016). Scene Understanding and Mapping in Robotics: Algorithms and System Integration. Massachusetts Institute of Technology. https://groups.csail.mit.edu/commit/papers/2016/min-zhang-meng-thesis.pdf
[5] Eigen, D., Puhrsch, C., & Fergus, R. (2015). Depth Map Prediction from a Single Image using a Multi-Scale Deep Network. arXiv:1510.05970. https://arxiv.org/abs/1510.05970
[6] Kendall, A., & Gal, Y. (2018). What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? arXiv:1803.08669. https://arxiv.org/pdf/1803.08669.pdf
[7] Author(s) not specified. (2022). Title not specified. arXiv:2203.11483. https://arxiv.org/abs/2203.11483
[8] Episci Inc. (n.d.). Swarmsense. https://www.episci.com/product/swarmsense/
[9] Skybrush. (n.d.). https://skybrush.io/
[10] ModalAI. (n.d.). https://www.modalai.com/
[11] Climb, D. (n.d.). Projections. GitHub. https://github.com/darylclimb/cvml_project/tree/master/projections
[12] Jain, S. (n.d.). Depth Estimation 1: Basics and Intuition. Towards Data Science. https://towardsdatascience.com/depth-estimation-1-basics-and-intuition-86f2c9538cd1
[13] Author not specified. (n.d.). Uncertainty in Depth Estimation. Towards Data Science. https://towardsdatascience.com/uncertainty-in-depth-estimation-c3f04f44f9
[14] Author not specified. (n.d.). Camera-LiDAR Projection: Navigating between 2D and 3D. Medium. https://medium.com/swlh/camera-lidar-projection-navigating-between-2d-and-3d-911c78167a94
[15] Geiger, A., Lenz, P., & Urtasun, R. (2013). Are we ready for autonomous driving? The KITTI vision benchmark suite. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3354-3361. https://www.mrt.kit.edu/z/publ/download/2013/GeigerAl2013IJRR.pdf
[16] Sevensense Robotics. (n.d.). Monocular 3D Object Detection and Box Fitting Trained End-to-End from Image Labels. [Video file]. https://www.youtube.com/watch?v=AYjgeaQR8uQ&ab_channel=SevensenseRobotics
[17] Author(s) not specified. (2020). Title not specified. arXiv:2007.10743. https://arxiv.org/pdf/2007.10743.pdf