Week 3: Introduction to Object Detection, Mask R-CNN, and Detectron2

Mar 12, 2021

Hi everyone! Thanks for checking out my third blog post. This week, I will cover the architectures that I am using in my project. Since I aim to track seizure patients in videos, I need to train an object detection model to recognize patients. Thankfully, I do not need to implement one from scratch. Instead, I can fine-tune an existing state-of-the-art model on my own dataset. The library that I used in my project is Detectron2, and the model I fine-tuned within it is Mask R-CNN. In this short post, I will explain the main ideas behind object detection. Next week, we will dive deeper into my code and how I applied Detectron2 to a dataset of seizure patient videos (including results)! Let’s begin by discussing Mask R-CNN, the model at the center of my Detectron2 setup.

Mask R-CNN Basics

Convolutional neural networks (CNNs) have become the standard approach to image recognition and other computer vision tasks in deep learning. Tasks like object detection and semantic segmentation have also become crucial due to applications in sectors like autonomous driving and robotics. Detectron2 (Source) is a powerful library from Facebook AI Research that provides various high-performance models for these tasks. The model that I have chosen to use in my project is Mask R-CNN, which is provided as part of the Detectron2 library.

Mask R-CNN is a conceptually simple and flexible framework built by Facebook AI Research for object instance segmentation. Instance segmentation goes one step beyond object detection: in addition to a class label and bounding box for each object, the model predicts a pixel-level mask for it. Mask R-CNN builds on Faster R-CNN, an existing detection model, by adding a branch for predicting object masks in parallel with the existing branch for bounding box recognition. On the COCO benchmark, it outperformed the winners of the COCO 2016 challenge.

The original Detectron framework was written in Python and powered by Caffe2, a deep learning framework that has since been merged into PyTorch. Detectron2 is a rewrite built on PyTorch. It adds functionality such as DensePose, Cascade R-CNN, rotated bounding boxes, and panoptic segmentation, and it also trains much faster. Its modular design and its ability to train on single or multiple GPU servers make it a solid option, and its smooth integration with COCO-format data makes it a good fit for my project. Overall, Detectron2 features a collection of state-of-the-art object detection and segmentation algorithms beyond Mask R-CNN. These are outlined here:

Architecture of Mask R-CNN

While Detectron2 implements Mask R-CNN as one of its generalized R-CNN models, it is still worth reviewing the original Mask R-CNN design. Mask R-CNN is a two-stage procedure. It builds on Faster R-CNN, in which every candidate object has two outputs: a class label and a bounding box offset. The first stage, a Region Proposal Network (RPN), is adopted from Faster R-CNN and proposes candidate boxes. In the second stage, features are extracted from each candidate box, and the class, the box offset, and (new in Mask R-CNN) a binary mask are predicted in parallel. Faster R-CNN extracts these features with RoIPool, which quantizes (rounds) the box coordinates to the grid of the feature map. Mask R-CNN replaces it with RoIAlign, which skips the rounding and uses bilinear interpolation to preserve pixel-to-pixel alignment. Note that RoI stands for “Region of Interest.” RoIAlign is presented below:
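To make the difference between the two operations concrete, here is a minimal pure-Python sketch of the RoIAlign idea: each output bin averages a few bilinearly interpolated samples taken at exact floating-point positions, with no rounding of the box. The function names, the list-of-lists feature map, and the convention that `feature[y][x]` sits at the point (x, y) are simplifying assumptions for illustration. Library implementations differ in details such as pixel-center offsets.

```python
import math


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2-D feature map (list of lists) at float (y, x)."""
    h, w = len(feature), len(feature[0])
    # Clamp the sample point so it stays inside the feature map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, box, out_size=2, samples=2):
    """Pool a box (x1, y1, x2, y2), given in float feature-map coordinates,
    into an out_size x out_size grid without rounding any coordinate."""
    x1, y1, x2, y2 = box
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0.0
            # Regularly spaced sample points inside bin (i, j).
            for sy in range(samples):
                for sx in range(samples):
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    total += bilinear_sample(feature, y, x)
            row.append(total / (samples * samples))
        pooled.append(row)
    return pooled
```

For example, on a 4 x 4 feature map whose value at (x, y) is x + 10 * y, `roi_align(feature, (0.5, 0.5, 2.5, 2.5))` returns the values at the four bin centers, `[[11.0, 12.0], [21.0, 22.0]]`, even though the box corners fall between grid points. RoIPool would first round those corners to whole cells, which is exactly the misalignment RoIAlign avoids.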


Mask R-CNN is also special because it outputs a binary mask for each RoI in parallel with the class and box predictions, in contrast to most earlier instance segmentation systems, where classification depends on mask predictions. For each RoI, a multi-task loss is defined as L = L_cls + L_box + L_mask, the sum of the classification, bounding box, and mask losses. The mask representation is valuable to the performance of Mask R-CNN because it encodes an input object’s spatial layout. Convolutions preserve that layout through pixel-to-pixel correspondence, so the mask branch can be fully convolutional, which needs fewer parameters than fully connected layers.

Mask R-CNN has two main parts: a convolutional backbone for feature extraction over the entire image, and a network head that performs bounding box recognition (classification and regression) and mask prediction on each RoI. To reiterate, RoIPool is an operation that extracts a small feature map (e.g., 7 x 7) from each RoI, and in doing so it quantizes a floating-point RoI to the “discrete granularity of the feature map.” That rounding misaligns the extracted features with the RoI. The RoIAlign layer (visualized above) removes the quantization rather than correcting for it afterward, so the extracted features stay aligned with the input.

The backbone is a ResNet of either 50 or 101 layers. In the original design, features are taken from the final convolutional layer of the 4th stage (known as C4), so a 50-layer backbone of this kind goes by the name ResNet-50-C4. Alternatively, a Feature Pyramid Network (FPN) can be added to the backbone. FPN uses a top-down architecture with lateral connections to build feature maps at multiple scales, and each RoI is assigned to a pyramid level according to its size. To summarize this section, Mask R-CNN is a neural network that extracts features with a backbone (optionally with an FPN), proposes RoIs (regions of interest), and predicts a class, a box, and a mask for each one. The heads of these networks are shown in a figure from the Mask R-CNN paper, credited at the end of this post.
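Returning to the multi-task loss from this section, the mask term can be sketched in a few lines. In the Mask R-CNN paper, the mask loss is the average per-pixel binary cross-entropy (with a sigmoid on each pixel), and only the mask of the RoI's ground-truth class contributes. The function names and the nested-list layout below are hypothetical choices for illustration, and the classification and box losses are simply passed in as numbers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mask_loss(mask_logits, gt_mask, gt_class):
    """Average binary cross-entropy over one RoI's m x m mask.

    mask_logits[k][i][j] holds the logit for class k at pixel (i, j);
    gt_mask[i][j] is 0 or 1. Only the ground-truth class's mask is used,
    so masks predicted for other classes do not compete with it.
    Note: this naive form is not numerically safe for very large logits.
    """
    logits = mask_logits[gt_class]
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, gt_mask):
        for z, t in zip(row_logits, row_targets):
            p = sigmoid(z)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count


def multi_task_loss(l_cls, l_box, l_mask):
    """L = L_cls + L_box + L_mask for a single RoI."""
    return l_cls + l_box + l_mask
```

Because the loss looks only at the ground-truth class's mask, the mask branch never has to decide which class a pixel belongs to. That decision is left to the classification branch, which is the decoupling described above.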

Applications to Object Detection

Let’s step back from this deep analysis of Mask R-CNN. While it’s definitely fascinating to understand how Mask R-CNN works at a very granular level, it’s also crucial to see how this tool fits into my larger model pipeline. Once the data for our model is prepared in the COCO format and registered, we can train an object detection model on that dataset. We can customize the parameters to maximize performance and accuracy. The code below shows the main parameters for the model. The Mask R-CNN model is selected by loading its config file into a config (cfg) object. We can then adjust parameters such as the number of images per batch (cfg.SOLVER.IMS_PER_BATCH), the number of RoIs sampled per image to train the heads (cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE), the learning rate (cfg.SOLVER.BASE_LR), the number of classes (cfg.MODEL.ROI_HEADS.NUM_CLASSES), etc. In short, all of these parameters factor into the accuracy of the model. We will explore more on this next week. See the code below for more:
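A minimal sketch of this kind of setup is shown here, following the pattern in Detectron2's own tutorials. It assumes Detectron2 is installed. The helper name `build_mask_rcnn_cfg`, its arguments, and every numeric value are placeholders for illustration rather than the settings used in the project.

```python
# Model zoo config for Mask R-CNN with a ResNet-50 + FPN backbone.
MODEL_ZOO_CONFIG = "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"


def build_mask_rcnn_cfg(train_name, annotations_json, image_dir, num_classes=1):
    """Register a COCO-format dataset and return a fine-tuning config.

    All numbers set below are illustrative placeholders, not tuned values.
    """
    # Imported here so this file still loads where Detectron2 is missing.
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.data.datasets import register_coco_instances

    # Make the COCO-format annotations available under `train_name`.
    register_coco_instances(train_name, {}, annotations_json, image_dir)

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(MODEL_ZOO_CONFIG))
    # Start from COCO-pretrained weights instead of training from scratch.
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(MODEL_ZOO_CONFIG)

    cfg.DATASETS.TRAIN = (train_name,)
    cfg.DATASETS.TEST = ()
    cfg.SOLVER.IMS_PER_BATCH = 2                      # images per training batch
    cfg.SOLVER.BASE_LR = 0.00025                      # learning rate
    cfg.SOLVER.MAX_ITER = 300                         # training iterations
    cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128    # RoIs sampled per image
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = num_classes     # e.g. 1 for "patient"
    return cfg
```

The returned `cfg` can then be handed to `DefaultTrainer` from `detectron2.engine`: create `DefaultTrainer(cfg)`, call `resume_or_load(resume=False)`, and then call `train()`.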


I hope you enjoyed this short yet in-depth overview of Mask R-CNN and Detectron2. Next week, we will explore how I customized these models for my specific task (detecting seizure patients) and what results I achieved. Thanks for sticking with me through this post!



  • Images 1, 2, and 3: He, Kaiming, et al. “Mask R-CNN.” ArXiv.org, 24 Jan. 2018, arxiv.org/abs/1703.06870v3.
  • Image 4: my own code

