🚆 2nd Place Solution of AI Journey Contest 2021: AITrain 🚆
The goal of the competition is to create a computer vision system for Semantic Rail Scene Understanding. Developing an accurate and robust algorithm is a clear way to enhance rail traffic safety. Successful models can be incorporated in real-time applications to warn train drivers about possible collisions with potentially hazardous objects.
The dataset consists of over 7000 images from the ego-perspective of trains.
Each image is annotated with bounding boxes of 11 different types of objects (such as `car`, `human`, `wagon`, `trailing switch`) and dense pixel-wise semantic labeling for 3 different classes.
The quality metric of the competition is a weighted average of mAP@.5 and meanIoU:
competition_metric = 0.7 * mAP@.5 + 0.3 * meanIoU
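As a quick sanity check, the score can be reproduced from the per-task values reported in the results table below; this tiny Python snippet is purely illustrative and not part of the evaluation code.

```python
# Illustrative only: combine the two task metrics into the competition score.
def competition_metric(map_50: float, mean_iou: float) -> float:
    return 0.7 * map_50 + 0.3 * mean_iou

# Run 1 from the results table: 0.7 * 0.583 + 0.3 * 0.8778 ≈ 0.6714
print(round(competition_metric(0.583, 0.8778), 4))
```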
This is a code competition so the testing time and resources are limited:
- Time for inference: 15 min for 300 images;
- 1 Tesla V100 GPU with 32 GB of memory;
- 3 vCPUs;
- 94 GB RAM.
Solutions are run in a Docker container in offline mode.
Two main architectures of the solution are `Panoptic FPN` and `YOLOv5`. We don't train separate models for the semantic segmentation task but rely solely on `Panoptic FPN` and multitask learning. In a nutshell, `Panoptic FPN` is an extended version of `Mask-RCNN` with an additional semantic segmentation branch:
Panoptic FPN architecture. Image source.
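As a rough illustration (not the competition code), a single Detectron2 `Panoptic FPN` forward pass already returns both detection and semantic segmentation outputs; the config below is a standard model-zoo entry and the image path is hypothetical.

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
# Standard COCO Panoptic FPN (ResNet101 + FPN) config from the Detectron2 model zoo.
cfg.merge_from_file(model_zoo.get_config_file("COCO-PanopticSegmentation/panoptic_fpn_R_101_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-PanopticSegmentation/panoptic_fpn_R_101_3x.yaml")

predictor = DefaultPredictor(cfg)
outputs = predictor(cv2.imread("data/raw/images/example.png"))  # hypothetical image path

boxes = outputs["instances"].pred_boxes     # object detection branch
sem_seg = outputs["sem_seg"].argmax(dim=0)  # semantic segmentation branch (per-pixel classes)
```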
`YOLOv5` is a high-performing, lightweight and very popular object detection framework. Its simple codebase allows one to quickly train a model on a custom dataset, making `YOLOv5` an attractive choice for CV competitions.
The solution is an ensemble of 6 models:

1. `Panoptic FPN` with `ResNet101` backbone and standard `Faster-RCNN` ROI head. The shortest image side size is chosen from `[1024, 1536]` with a step of `64` (see the config sketch after this list).
2. `Panoptic FPN` with `ResNet50` backbone and `Cascade-RCNN` ROI head. Image size: `[1024, 1536]` with a step of `64`.
3. `RetinaNet` with `ResNet50` backbone. Image size: `[1280, 1796]` with a step of `64`.
4. `YOLOv5m6` with `2048` image size.
5. `YOLOv5m6` with `2560` image size.
6. `YOLOv5l6` with `1536` image resolution and label smoothing of `0.1`.
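For reference, the multi-scale image sizes listed above map directly onto Detectron2's input settings; the snippet below is only a sketch of the idea, the actual values are set in the repository's configs.

```python
from detectron2.config import get_cfg

cfg = get_cfg()
# Shortest image side is sampled from 1024, 1088, ..., 1536 (step 64), one value per iteration.
cfg.INPUT.MIN_SIZE_TRAIN = tuple(range(1024, 1536 + 1, 64))
cfg.INPUT.MIN_SIZE_TRAIN_SAMPLING = "choice"
cfg.INPUT.MIN_SIZE_TEST = 1536  # illustrative inference size, not taken from the repository
```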
To ensemble different models we use Weighted Boxes Fusion (WBF) for object detection and a simple average for semantic segmentation.
We also tried NMS, Soft-NMS and Non-Maximum Weighted, but WBF demonstrated superior performance. We set `iou_threshold=0.6` and assign equal weights to all models.
Weighted Boxes Fusion. Image source.
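Below is a minimal sketch of the ensembling step, assuming the `ensemble_boxes` package is used for WBF (the repository may implement it differently); the dummy inputs are purely illustrative and box coordinates are normalized to `[0, 1]`.

```python
import numpy as np
from ensemble_boxes import weighted_boxes_fusion

# Dummy detections from two models for one image: normalized [x1, y1, x2, y2] boxes.
boxes_list = [[[0.10, 0.10, 0.40, 0.40]], [[0.12, 0.11, 0.42, 0.39]]]
scores_list = [[0.90], [0.80]]
labels_list = [[1], [1]]

boxes, scores, labels = weighted_boxes_fusion(
    boxes_list, scores_list, labels_list,
    weights=None,  # equal weights for all models
    iou_thr=0.6,   # the iou_threshold used in the solution
)

# Semantic segmentation: average per-model class probability maps [C, H, W], then take argmax.
probs_list = [np.random.rand(3, 64, 64) for _ in range(2)]
sem_seg = np.mean(np.stack(probs_list), axis=0).argmax(axis=0)  # final [H, W] class map
```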
A bag of tricks, tweaks and freebies is used to improve the performance:
- Multitask learning: all Detectron2 models are trained to solve both the object detection and semantic segmentation tasks. Multitask learning, if applied correctly, improves generalization and reduces overfitting. Moreover, solving both tasks at once makes inference more efficient.
- Test time augmentations: for each model we run inference on several augmented versions of the original images. We use image resizing with `[0.8, 1, 1.2]` scales with respect to the maximum training image size (see the sketch after this list).
- High image resolution for both training and inference. The dataset contains quite a large amount of tiny objects, so it is crucial to use high-resolution images.
- Multi-scale training: using different image resolutions during training enhances the final performance.
- Light augmentations: the list of used augmentations is limited to Random Crop, Random Brightness, Random Contrast and Random Saturation. Flips are not used since some classes depend on the side (e.g. `facing switch left` or `facing switch right`). Harder color and spatial augmentations hurt the performance, probably due to the vast amount of tiny objects and of objects whose class can only be recognized by color (e.g. a traffic light that is permitting or not).
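A hedged sketch of how the multi-scale, flip-free TTA can be expressed with Detectron2's built-in test-time augmentation settings (the `1536` maximum training size is illustrative; the actual values are set per model in the configs):

```python
from detectron2.config import get_cfg

cfg = get_cfg()
max_train_size = 1536  # illustrative maximum training image size

cfg.TEST.AUG.ENABLED = True
# Resize TTA with [0.8, 1, 1.2] scales relative to the maximum training image size.
cfg.TEST.AUG.MIN_SIZES = tuple(int(s * max_train_size) for s in (0.8, 1.0, 1.2))
cfg.TEST.AUG.FLIP = False  # flips would break side-dependent classes (facing switch left/right)
```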
The implementation is heavily based on Detectron2 and YOLOv5 frameworks.
The results in the table correspond to inference without TTA unless specified otherwise.
| Run № | Model | mAP@.5 local | mIoU local | Metric local | mAP@.5 public LB | mIoU public LB | Metric public LB |
|---|---|---|---|---|---|---|---|
| 1 | Panoptic FPN, ResNet50 | 0.583 | 0.8778 | 0.6716 | 0.375 | 0.892 | 0.530 |
| 2 | Panoptic FPN, ResNet101 | 0.604 | 0.8885 | 0.6893 | — | — | — |
| 4 | Panoptic FPN, ResNet50, Cascade ROI head | 0.606 | 0.8626 | 0.6832 | — | — | — |
| 5 | RetinaNet, ResNet50 | 0.594 | — | — | — | — | — |
| 6 | YOLOv5m6, TTA, img_size=2048 | 0.619 | — | — | — | — | — |
| 7 | YOLOv5m6, TTA, img_size=2560 | 0.606 | — | — | — | — | — |
| 9 | YOLOv5l6, TTA, img_size=1536, label_smoothing=0.1 | 0.607 | — | — | — | — | — |
| Ensembled run numbers | | | | | | | |
| 2 + 4 + 5 | | 0.642 | 0.8855 | 0.7153 | 0.415 | 0.897 | 0.560 |
| 2 + 4 + 6 | | 0.657 | 0.8855 | 0.7255 | — | — | — |
| 2 + 4 + 5 + 6 | | 0.669 | 0.8855 | 0.7341 | 0.421 | 0.897 | 0.564 |
| 2 + 4 + 5 + 6 + 7 | | 0.676 | 0.8855 | 0.7393 | 0.440 | 0.897 | 0.577 |
| 2 + 4 + 5 + 6 + 7 with TTA | | 0.667 | 0.8875 | 0.7336 | 0.453 | 0.899 | 0.587 |
| 2 + 4 + 5 + 6 + 7 + 9 | | 0.685 | 0.8855 | 0.7449 | 0.434 | 0.897 | 0.573 |
| 2 + 4 + 5 + 6 + 7 + 9 with TTA | | 0.674 | 0.8875 | 0.7384 | 0.447 | 0.899 | 0.583 |
Start a Docker container via docker-compose:

```bash
JUPYTER_PORT=8888 GPUS=all docker-compose -p $USER up -d --build
```
All the following steps are supposed to be run in the container.
Download and unpack the data into the `data/raw` directory:

```
data/
├── raw/
│   ├── bboxes
│   ├── images
│   └── masks
```
Run the following commands to prepare the dataset for Detectron2 models:

```bash
PYTHONPATH=$(pwd)/src python3 -m data.data2coco
PYTHONPATH=$(pwd)/src/baseline python3 -m evaluation.masks2json --path_to_masks data/raw/masks --path_to_save test.json
PYTHONPATH=$(pwd)/src python3 -m data.prepare_masks
PYTHONPATH=$(pwd)/src python3 -m data.split
```
To prepare the dataset for YOLOv5 use the baseline notebook provided by organizers.
The data structure after the data preparation should look as follows:

```
data/
├── raw/
│   ├── bboxes
│   ├── images
│   ├── masks
│   ├── detection_coco.json
│   └── segmentation_coco.json
├── processed/
│   ├── train
│   ├── val
│   ├── masks
│   └── test_filenames.json
├── yolo/
│   ├── images
│   └── labels
```
You can take a look at the processed dataset with the visualization notebook.
The configs for Detectron2 models are located here.
For example, to train a `Panoptic FPN` with a `ResNet101` backbone run the following command:

```bash
bash train_dt2.sh my-sota-run main-v100
```
To train a `YOLOv5` model run the following commands:

```bash
cd src/baseline/yolov5
python3 train.py --rect --img 2048 --batch 16 --epochs 100 --data aitrain_dataset.yaml --weights yolov5m6.pt --hyp data/hyps/hyp_aitrain.yaml --name my-sota-run
```
Run this notebook to evaluate the model and to also run a grid search for inference parameters. To visualize and look at the predictions use this notebook.
The training results (model weights and configs) should be located in the `outputs/` directory.
Modify the solution file to select the required runs and run:

```bash
./make_submission.sh "dt2-model-1,dt2-model2,dt2-model3" "yolo-model-1,yolo-model-2"
```