In this paper, a novel type of neural network is designed that directly consumes point clouds, providing a unified architecture for applications ranging from object classification and part segmentation to scene semantic parsing.
Typical convolutional architectures require highly regular input data formats, such as image grids or 3D voxels, in order to perform weight sharing and other kernel optimizations. Since point clouds are not in a regular format, they are typically transformed into regular 3D voxel grids or collections of images before being fed to a deep network. This transformation renders the resulting data unnecessarily voluminous.
PointNet is a unified architecture that directly takes point clouds as input and outputs either a class label for the entire input or a per-point segment label for each point of the input.
The basic architecture of the network is simple: in the initial stages, each point is processed identically and independently. In the basic setting, each point is represented by just its three coordinates (x, y, z). Additional dimensions may be added by computing normals and other local or global features.
PointNet is a deep learning framework that directly consumes unordered point sets as inputs. A point cloud is represented as a set of 3D points, where each point P is a vector of its (x, y, z) coordinates plus extra feature channels such as color and normal.
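As a minimal sketch of this representation, a point cloud can be held as an array with one row per point. The point count, the random values, and the choice of RGB color as the extra channels are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024                                        # hypothetical number of points

xyz = rng.uniform(-1.0, 1.0, size=(n, 3))       # (x, y, z) coordinates
rgb = rng.uniform(0.0, 1.0, size=(n, 3))        # extra feature channels (color)

# Each row is one point P: coordinates followed by its extra channels.
cloud = np.concatenate([xyz, rgb], axis=1)
print(cloud.shape)  # (1024, 6)
```

The row order carries no meaning: any permutation of the rows describes the same point set.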
- For the object classification task, the input point cloud is either directly sampled from a shape or pre-segmented from a scene point cloud.
- For semantic segmentation, the input can be a single object for part region segmentation or a sub-volume from a 3D scene for object region segmentation.
The network has three key modules:
- The max pooling layer as a symmetric function to aggregate information from all the points
- A local and global information combination structure
- Two joint alignment networks that align both the input points and the point features.
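One way to picture the local and global combination is to append the max-pooled global feature to every per-point feature, so that each point sees both its own description and a summary of the whole set. The sketch below assumes this concatenation scheme; the function name `combine_local_global` and the feature sizes are hypothetical.

```python
import numpy as np

def combine_local_global(point_features: np.ndarray) -> np.ndarray:
    """Append a max-pooled global feature to each per-point feature.

    point_features: (n, k) array of per-point (local) features.
    Returns an (n, 2k) array: local features followed by the global feature.
    """
    global_feature = point_features.max(axis=0)                     # (k,)
    tiled = np.broadcast_to(global_feature, point_features.shape)   # (n, k)
    return np.concatenate([point_features, tiled], axis=1)

rng = np.random.default_rng(0)
local = rng.normal(size=(5, 4))      # 5 points, 4 feature channels (illustrative)
combined = combine_local_global(local)
print(combined.shape)  # (5, 8)
```

A per-point segmentation head could then be applied to the rows of `combined`.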
To make a model invariant to input permutation, three strategies exist:
- Sort input into a canonical order
- Treat the input as a sequence to train an RNN
- Use a simple symmetric function to aggregate the information from each point.
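The symmetric function takes n vectors as input and outputs a new vector that is invariant to the input order. The idea is to approximate a general function defined on a point set by applying a symmetric function on transformed elements in the set:

$$f(\{x_1, \dots, x_n\}) \approx g(h(x_1), \dots, h(x_n)),$$

where $f : 2^{\mathbb{R}^N} \rightarrow \mathbb{R}$, $h : \mathbb{R}^N \rightarrow \mathbb{R}^K$, and $g : \underbrace{\mathbb{R}^K \times \cdots \times \mathbb{R}^K}_{n} \rightarrow \mathbb{R}$ is a symmetric function.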
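The idea of a shared per-point transform followed by a symmetric aggregation can be sketched in a few lines. Here the per-point transform is a single random ReLU layer standing in for a trained multi-layer perceptron, and the aggregation is an element-wise maximum; the weights, sizes, and function names are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 16))   # hypothetical weights of the shared transform
b = rng.normal(size=16)

def h(points: np.ndarray) -> np.ndarray:
    """Shared per-point transform R^3 -> R^16 (one ReLU layer as a stand-in)."""
    return np.maximum(points @ W + b, 0.0)

def global_signature(points: np.ndarray) -> np.ndarray:
    """Symmetric aggregation: element-wise max over the point axis."""
    return h(points).max(axis=0)

cloud = rng.normal(size=(100, 3))
shuffled = cloud[rng.permutation(len(cloud))]

# Reordering the points leaves the aggregated vector unchanged.
same = np.allclose(global_signature(cloud), global_signature(shuffled))
print(same)  # True
```

Because the maximum ignores the order of its arguments, no sorting or sequence model is needed to obtain permutation invariance.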
The output of the symmetric function is a vector that serves as a global signature of the input set. An SVM or a multi-layer perceptron classifier can be trained on these global shape features for classification.
The semantic labeling of a point cloud has to be invariant when the point cloud undergoes certain geometric transformations, such as a rigid transformation.
A natural solution is to align every input set to a canonical space before feature extraction.
The point cloud input format allows this goal to be achieved in a much simpler way. There is no need to invent any new layers, and no aliasing is introduced as in the image case. An affine transformation matrix is predicted by a mini-network and applied directly to the coordinates of the input points. The mini-network itself resembles the big network and is composed of basic modules for point-independent feature extraction.
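The alignment step might look roughly like the following: a small network built from the same ingredients (a shared per-point layer and max pooling) produces nine numbers that are reshaped into a 3x3 matrix and multiplied with the coordinates. The layer sizes, the random untrained weights, and predicting the matrix as an offset from the identity are assumptions of this sketch, not details taken from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(3, 32))    # shared per-point layer (hypothetical size)
W2 = rng.normal(scale=0.01, size=(32, 9))   # maps the pooled feature to 9 matrix entries

def predict_transform(points: np.ndarray) -> np.ndarray:
    """Mini-network: per-point features, max pooling, then a 3x3 matrix."""
    per_point = np.maximum(points @ W1, 0.0)   # (n, 32), same weights for every point
    pooled = per_point.max(axis=0)             # (32,), independent of point order
    # Assumption: predict an offset from the identity so the matrix starts near "no change".
    return np.eye(3) + (pooled @ W2).reshape(3, 3)

def align(points: np.ndarray) -> np.ndarray:
    """Apply the predicted matrix directly to the point coordinates."""
    return points @ predict_transform(points)

cloud = rng.normal(size=(64, 3))
aligned = align(cloud)
print(aligned.shape)  # (64, 3)
```

Since the matrix is computed from a max-pooled feature, it does not depend on the order in which the points are listed.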
We constrain the feature transformation matrix to be close to an orthogonal matrix by adding a regularization term to the training loss:

$$L_{reg} = \lVert I - A A^{T} \rVert_F^2,$$

where A is the feature alignment matrix predicted by the mini-network.
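A penalty of the form $\lVert I - A A^{T} \rVert_F^2$ measures how far a matrix A is from orthogonal, and it can be computed directly. The function name `orthogonality_penalty` and the 2x2 test matrices below are hypothetical and chosen only to make the behavior easy to check by hand.

```python
import numpy as np

def orthogonality_penalty(A: np.ndarray) -> float:
    """Squared Frobenius norm of I - A A^T; zero exactly when A is orthogonal."""
    k = A.shape[0]
    residual = np.eye(k) - A @ A.T
    return float(np.sum(residual ** 2))

theta = 0.3
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
scaled = 2.0 * np.eye(2)

print(orthogonality_penalty(rotation))  # close to 0: a rotation is orthogonal
print(orthogonality_penalty(scaled))    # 18.0: I - 4I = -3I, and 2 * (-3)^2 = 18
```

In training, a term like this would be added to the task loss with a weighting factor, so that the predicted matrix is encouraged to preserve lengths and angles in feature space.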