The long-standing goal of computer vision is to gain a high-level understanding of digital images and videos, similar to how the human brain perceives objects, movement, and depth in its visual field. Recently, advances in convolutional neural networks have pushed computer vision forward dramatically. By dividing the task of human-like visual perception into bite-sized challenges such as classification, segmentation, and tracking, and by establishing benchmarks, significant progress has been made, even outperforming humans in certain areas. However, there is still a long way to go toward a unified model that performs all of these tasks at once and also works robustly in real-world scenarios, not only on a given dataset. The proposed robust one-stage segmentation and tracking model furthers this quest by unifying the tasks of panoptic segmentation and tracking. Furthermore, our goal is a model that is not bound to any specific benchmark dataset but remains robust on real-world examples. We accomplish this by extending Panoptic-DeepLab [Che+20] with a previous-offset branch, enabling it to track objects across video frames. We also train this model on multiple datasets simultaneously without tuning hyperparameters for any specific dataset.
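The ID propagation enabled by the previous-offset branch can be sketched as follows. This is an illustrative NumPy sketch under our own assumptions, not the actual implementation: we assume each detected instance has a center in the current frame and a predicted offset pointing back to its position in the previous frame, and track IDs are propagated by greedy nearest-center matching.

```python
import numpy as np

def propagate_ids(prev_centers, prev_ids, curr_centers, prev_offsets, max_dist=50.0):
    """Assign track IDs to current-frame instances.

    Each current center is warped back by its predicted previous-frame
    offset and greedily matched to the nearest unmatched previous center.
    Instances without a close-enough match start a new track.
    """
    next_id = max(prev_ids, default=-1) + 1
    curr_ids = []
    taken = set()
    for center, offset in zip(curr_centers, prev_offsets):
        warped = np.asarray(center, dtype=float) - np.asarray(offset, dtype=float)
        best, best_d = None, max_dist
        for j, pc in enumerate(prev_centers):
            if j in taken:
                continue
            d = np.linalg.norm(warped - np.asarray(pc, dtype=float))
            if d < best_d:
                best, best_d = j, d
        if best is None:  # no previous instance close enough: new track
            curr_ids.append(next_id)
            next_id += 1
        else:
            taken.add(best)
            curr_ids.append(prev_ids[best])
    return curr_ids
```

For example, an instance at (105, 102) with predicted previous-frame offset (5, 2) warps back to (100, 100) and inherits the ID of the previous instance found there, while an instance with no nearby previous center receives a fresh ID.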
Install Detectron2 following the instructions.

To train a model, run:

```
cd /path/to/detectron2/projects/Post
python train_net.py --config-file configs/KITTI-MOTS/post_R_52_os16_mg124_poly_200k_bs1_kitti_mots_crop_384_dsconv.yaml
```

Model evaluation can be done similarly:

```
cd /path/to/detectron2/projects/Post
python train_net.py --config-file configs/KITTI-MOTS/panoptic_deeplab_R_52_os16_mg124_poly_200k_bs64_crop_640_640_kitti_mots_dsconv.yaml --inference-only MODEL.WEIGHTS models/model_final_5e6da2.pkl INPUT.CROP.ENABLED False
```

If you want to benchmark the network speed without post-processing, you can run the evaluation script with `MODEL.PANOPTIC_DEEPLAB.BENCHMARK_NETWORK_SPEED True`:

```
cd /path/to/detectron2/projects/Post
python train_net.py --config-file configs/KITTI-MOTS/panoptic_deeplab_R_52_os16_mg124_poly_200k_bs64_crop_640_640_kitti_mots_dsconv.yaml --eval-only MODEL.WEIGHTS /path/to/model_checkpoint MODEL.PANOPTIC_DEEPLAB.BENCHMARK_NETWORK_SPEED True
```

Cityscapes models are trained with ImageNet pretraining.
| Method | Backbone | Output resolution | PQ | SQ | RQ | mIoU | AP | Memory (M) | model id | download |
|---|---|---|---|---|---|---|---|---|---|---|
| Panoptic-DeepLab | R50-DC5 | 1024×2048 | 58.6 | 80.9 | 71.2 | 75.9 | 29.8 | 8668 | - | model \| metrics |
| Panoptic-DeepLab | R52-DC5 | 1024×2048 | 60.3 | 81.5 | 72.9 | 78.2 | 33.2 | 9682 | 30841561 | model \| metrics |
| Panoptic-DeepLab (DSConv) | R52-DC5 | 1024×2048 | 60.3 | 81.0 | 73.2 | 78.7 | 32.1 | 10466 | 33148034 | model \| metrics |
Note:
- R52: a ResNet-50 with its first 7×7 convolution replaced by three 3×3 convolutions. This modification has been used in most semantic segmentation papers. We pre-train this backbone on ImageNet using the default recipe of the PyTorch examples.
- DC5 means using dilated convolution in res5.
- We use a smaller training crop size (512×1024) than the original paper (1025×2049); we find that a larger crop size (1024×2048) can further improve PQ by 1.5% but also degrades AP by 3%.
- The implementation with regular Conv2d in the ASPP module and head is much heavier than the original paper's.
- This implementation does not include the optimized post-processing code needed for deployment. Post-processing the network outputs currently takes a similar amount of time to the network forward pass itself. Please refer to the speed reported in the original paper for comparison.
- DSConv refers to using DepthwiseSeparableConv2d in the ASPP module and decoder. The implementation with DSConv is identical to the original paper.
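As a rough illustration of why the DSConv variant is lighter, the following sketch compares parameter counts of a standard k×k convolution against a depthwise separable one (a depthwise k×k convolution with one filter per input channel, followed by a 1×1 pointwise convolution). Biases are ignored for simplicity; the channel sizes below are illustrative, not taken from the model configs.

```python
def conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def dsconv_params(c_in, c_out, k):
    """Depthwise k x k convolution (one filter per input channel)
    followed by a 1 x 1 pointwise convolution (bias ignored)."""
    return c_in * k * k + c_in * c_out

# Example: a 3x3 convolution mapping 256 -> 256 channels.
regular = conv_params(256, 256, 3)      # 589,824 parameters
separable = dsconv_params(256, 256, 3)  # 67,840 parameters
print(regular, separable, round(regular / separable, 1))
```

For a 3×3 kernel at these widths the separable form uses roughly 8.7× fewer parameters, which is consistent with the note above that the regular-Conv2d heads are much heavier.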
COCO models are trained with ImageNet pretraining on 16 V100s.
| Method | Backbone | Output resolution | PQ | SQ | RQ | Box AP | Mask AP | model id | download |
|---|---|---|---|---|---|---|---|---|---|
| Panoptic-DeepLab (DSConv) | R52-DC5 | 640×640 | 35.5 | 77.3 | 44.7 | 18.6 | 19.7 | 246448865 | model \| metrics |
Note:
- R52: a ResNet-50 with its first 7×7 convolution replaced by three 3×3 convolutions. This modification has been used in most semantic segmentation papers. We pre-train this backbone on ImageNet using the default recipe of the PyTorch examples.
- DC5 means using dilated convolution in res5.
- This reproduced number matches the original paper (35.5 vs. 35.1 PQ).
- This implementation does not include the optimized post-processing code needed for deployment. Post-processing the network outputs currently takes a similar amount of time to the network forward pass itself.
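The post-processing referred to in the notes groups "thing" pixels into instances: each foreground pixel is shifted by its predicted offset and assigned to the nearest predicted instance center. The following is a minimal NumPy sketch of that grouping step under assumed array shapes, not the optimized implementation shipped with the project:

```python
import numpy as np

def group_pixels(centers, offsets, mask):
    """Assign each foreground pixel to the nearest instance center.

    centers: (K, 2) array of predicted instance centers as (y, x).
    offsets: (2, H, W) array of per-pixel offsets pointing toward the
             pixel's instance center.
    mask:    (H, W) boolean array of 'thing' (foreground) pixels.
    Returns an (H, W) instance-id map (0 = background, ids start at 1).
    """
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    # Shift every foreground pixel by its predicted offset.
    shifted = np.stack([ys + offsets[0, ys, xs],
                        xs + offsets[1, ys, xs]], axis=1)  # (N, 2)
    # Distance of each shifted pixel to each candidate center.
    dists = np.linalg.norm(shifted[:, None, :] - centers[None, :, :], axis=2)
    ids = np.zeros((h, w), dtype=np.int64)
    ids[ys, xs] = np.argmin(dists, axis=1) + 1
    return ids
```

Because this computes an N×K distance matrix with plain NumPy broadcasting, it is easy to read but slow, which is consistent with the note that unoptimized post-processing takes about as long as the network itself.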
Detectron2 is Facebook AI Research's next-generation software system that implements state-of-the-art object detection algorithms. It is a ground-up rewrite of the previous version, Detectron, and it originates from maskrcnn-benchmark.

- It is powered by the PyTorch deep learning framework.
- Includes more features such as panoptic segmentation, DensePose, Cascade R-CNN, rotated bounding boxes, PointRend, DeepLab, etc.
- Can be used as a library to support different projects built on top of it. We'll open-source more research projects in this way.
- It trains much faster.
- Models can be exported to TorchScript or Caffe2 format for deployment.
See our blog post to see more demos and learn about detectron2.
See INSTALL.md.
Follow the installation instructions to install detectron2.
See Getting Started with Detectron2, and the Colab Notebook to learn about basic usage.
Learn more at our documentation. And see projects/ for some projects that are built on top of detectron2.
We provide a large set of baseline results and trained models available for download in the Detectron2 Model Zoo.
Detectron2 is released under the Apache 2.0 license.
If you use Detectron2 in your research or wish to refer to the baseline results published in the Model Zoo, please use the following BibTeX entry.
@misc{wu2019detectron2,
author = {Yuxin Wu and Alexander Kirillov and Francisco Massa and
Wan-Yen Lo and Ross Girshick},
title = {Detectron2},
howpublished = {\url{https://github.com/facebookresearch/detectron2}},
year = {2019}
}