detrex: Benchmarking Detection Transformers


Project Lead. *Equal Contribution. Corresponding Author.

1 International Digital Economy Academy (IDEA); 2 University of Science and Technology of China; 3 Peking University; 4 Microsoft Research Asia; 5 Microsoft Research, Redmond

Introduction

The DEtection TRansformer (DETR) algorithm has received considerable attention in the research community and is gradually emerging as a mainstream approach for object detection and other perception tasks. However, the field currently lacks a unified and comprehensive benchmark tailored to DETR-based models. To address this issue, we develop a unified, highly modular, and lightweight codebase called detrex, which supports most mainstream DETR-based instance recognition algorithms and covers fundamental tasks including object detection, segmentation, and pose estimation. We conduct extensive experiments under detrex and perform a comprehensive benchmark of DETR-based models. Moreover, we improve the performance of detection transformers by refining their training hyper-parameters, providing strong baselines for the supported algorithms. We hope that detrex can offer the research community a standardized and unified platform to evaluate and compare different DETR-based models, while fostering a deeper understanding of, and driving advancements in, DETR-based instance recognition.

The modular design for DETR-based algorithms under detrex

The design comparisons of detrex with other codebases

Main Results

Benchmarking performance of DETR variants

Benchmarking the performance of DETR variants with a ResNet-50 backbone on COCO val2017. The best and second-best results are highlighted in bold and underlined, respectively.

Benchmarking performance of various backbones

Comparison of the effectiveness of various backbones based on the DINO-4scale detector.

Ablation on NMS post-processing for DETR

Ablation study on DETR variants with NMS post-processing. We set the default NMS threshold to 0.8.
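As a rough illustration of what this post-processing step does, the sketch below applies greedy non-maximum suppression with an IoU threshold of 0.8. It is a minimal, dependency-free example and not the detrex implementation: the function names `box_iou` and `nms`, the `(x1, y1, x2, y2)` box format, and the class-agnostic behaviour are all assumptions made for illustration.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2). Hypothetical helper."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.8):
    """Greedy, class-agnostic NMS (an assumption; real pipelines are
    often per-class). Returns kept indices, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box
        # by more than the threshold.
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (0, 0, 10, 9.5), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the second box has IoU 0.95 with the first
```

With a threshold as high as 0.8, only near-duplicate predictions are removed, so a box that merely overlaps a higher-scoring one is left untouched.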

Ablation on backbone frozen stages

Ablation study on DINO and DETA with different frozen stages based on a ResNet-50 backbone. Frozen stage "0" means that no layer in the backbone is frozen, frozen stage "1" means that only the stem is frozen, and frozen stage "2" means that the stem and the first residual stage are frozen.
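The frozen-stage numbering can be read as "freeze the first N parts of the backbone". The sketch below spells out that mapping under stated assumptions: the function name `frozen_modules` and the part names `stem`, `res2` ... `res5` are hypothetical labels chosen for illustration, not identifiers taken from detrex.

```python
def frozen_modules(freeze_at, parts=("stem", "res2", "res3", "res4", "res5")):
    """Return the backbone parts that stay frozen for a given stage index.

    `parts` lists the backbone from input to output; the names are
    illustrative assumptions. freeze_at=0 freezes nothing, 1 freezes
    the stem only, 2 freezes the stem and the first residual stage.
    """
    if not 0 <= freeze_at <= len(parts):
        raise ValueError(f"freeze_at must be in [0, {len(parts)}]")
    return list(parts[:freeze_at])


for stage in (0, 1, 2):
    print(stage, frozen_modules(stage))
```

In a real training setup, the parameters of the returned parts would be excluded from gradient updates; how that is done depends on the framework in use.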

Ablation on learning rate and hyper-parameters

Ablation studies on hyper-parameters for DETR variants. For a fair comparison, we use ResNet-50 as the default backbone and freeze the stem in the backbone.

Comparison between detrex and original implementations

Comparison of the performance of DETR variants between the detrex implementations and their original implementations.

BibTeX


        @misc{ren2023detrex,
          title={detrex: Benchmarking Detection Transformers}, 
          author={Tianhe Ren and Shilong Liu and Feng Li and Hao Zhang and Ailing Zeng and Jie Yang and Xingyu Liao and Ding Jia and Hongyang Li and He Cao and Jianan Wang and Zhaoyang Zeng and Xianbiao Qi and Yuhui Yuan and Jianwei Yang and Lei Zhang},
          year={2023},
          eprint={2306.07265},
          archivePrefix={arXiv},
          primaryClass={cs.CV}
        }