Week 2: Diving Deep into Data Collection, Annotation, Preprocessing, and the COCO Format

Mar 05, 2021


Hi everyone! Welcome to my second senior project blog. I initially anticipated that I would write this blog about the computer vision and object detection algorithms used in my project (Detectron2, object detection, etc.), but I have instead chosen to first dive deep into data collection and processing. This blog is packed with a fair amount of info, since I will present a high-level overview of the data collection and preprocessing methods currently being used in my project.

As noted in my first blog, my project aims to track seizure patients in the clinical setting and identify anomalies in their movement. Since this is an end-to-end computer vision task, my project can be divided into three stages. The first stage is data annotation (data collection). The majority of this stage was completed in the summer before I began my project: I used an online annotation tool known as CVAT (Computer Vision Annotation Tool) to manually identify patients in longer videos, annotating roughly 600 frames in each of 50 videos. The next phase of my project focuses on data preprocessing (cleaning the data). In this stage, I used the COCO data format to store my annotations and separated my data into training, validation, and testing sets. The third stage involves fine-tuning a model to detect the patient from the preprocessed, COCO-formatted video data (that's the material for next week's blog). Let's dive into the data-related sections of the project below.

CVAT, Data Annotation, Data Preparation: 

My project broadly aims to track seizure patients via deep learning. In other words, I want to train a computer vision model that can track patients and recognize their positions. Note that this is a reasonably difficult problem, since patients are typically covered in blankets and wiring. To train a model for this task, we need to give it labeled examples so that it can learn to recognize seizure patients over time (with more videos and images). Thus, the first step of my pipeline involves gathering and annotating seizure video data. To annotate data means to manually construct a bounding box that encloses each object of interest. In this case, the only object of interest in a seizure video is the patient, so I needed to begin by annotating patients manually in seizure videos. Let's get into how I did it.

Given that I needed to annotate several videos efficiently, I first needed to choose an annotation tool, and I decided to go with CVAT. CVAT (Computer Vision Annotation Tool) is an online tool for image and video analysis, publicly documented on GitHub under the OpenCV organization. It lets users designate labels and attributes to annotate many different objects within either image or video input, and it has applications in all fields directly related to computer vision and video processing. Once a user is logged into the platform at cvat.org, they can access and configure tasks; each task has labels and a constructor. I was able to try the demo without a local installation. Once the task is initiated, users can annotate in a format as shown below:

To create an annotation task, the user presses Create new task and specifies its parameters. There is a basic configuration (the task name) along with labels. The constructor is a tool that allows the user to add and adjust labels: the name of a label is set with its name field, and users can also add attributes. After an attribute's name is set, input types like radio, checkbox, text, and number can be chosen. Labels can also be configured in raw mode, which lets the user edit and copy labels as JSON rather than through the constructor's interface. After the basic configuration is complete, the user can specify more details with the advanced configuration settings. These control the z-order (the drawing order of overlapping shapes), data compression via zip chunks, image quality (ranging from 1 to 95), overlap size (for interpolation of videos or annotation of independent images), segment size, start/stop frame, frame step, etc. After a task is created, it appears on the CVAT dashboard. The dashboard is where users can edit or open tasks through the Open icon and the Actions menu; users can also Dump Annotations or Export as a dataset. Tasks are sorted by creation order, and CVAT highlights the task type: a task using CVAT for images is in annotation mode, whereas a task using CVAT for video is in interpolation mode. For the purposes of this project, we will mostly be using CVAT for video annotation (interpolation). One can open the task details to reveal an interface like this:

To begin the annotation process, we can follow a link inside the Jobs section. The number of links depends on the size of the task, as determined by the Overlap Size and Segment Size parameters. The annotator itself is a simple drag-and-drop interface. Frames are loaded in the background, and frames/images can be annotated one by one. If any issues arise, the user can always return to the dashboard to adjust the job name, frames, status, duration, etc. CVAT also allows jobs to be assigned and copied.

Once a user opens the main annotation tool, they are presented with a panel. The main part of the panel is the workspace (where the current image/frame is shown); the tool overall consists of a header, a top panel, the workspace, a controls sidebar, and an objects sidebar. To navigate through the annotation task, arrows move to the next/previous frame, and a scroll-bar slider moves through all frames. For interpolation (video annotation), there are also controls to play/pause the video. The image/frame itself can be moved via buttons on the controls sidebar, or by holding the left mouse button inside an area without annotated objects. Once an object is highlighted in the tool, it is considered annotated. The main type of annotation is shape annotation. There are five basic shapes that can be used for annotation purposes: Rectangle/Bounding Box, Polygon, Polyline, Points, and Cuboid, plus a frame-level Tag. The various methods are presented below:

All the shapes are presented in the annotation bar. The shapes can be used to create new annotations for a set of images or to add/modify/delete objects in existing annotations. Adding shapes requires CVAT to be in shape mode. Let's walk through the process of adding a bounding box. The user selects Rectangle on the controls sidebar; before proceeding, they must select the correct Label and the appropriate Drawing Method, for example, "car" and "2 points." They then click two opposite corner points, and the bounding box is ready. Compared to shape mode, track mode allows the user to create new annotations for a sequence of frames and to add/modify/delete objects in existing annotations; users can also edit tracks or even merge multiple shapes. To use track mode, the user selects "Track" at the bottom of the menu for new annotations. In the case of a bounding box, the rectangle will be automatically interpolated onto the following frames. One final useful mode is attribute annotation mode, which lets the user edit attributes with fast navigation between objects and frames. It can be accessed from the drop-down list in the top panel; a special panel appears showing the attribute and its values, with shortcuts for changing the attribute, and the user can switch between attributes using the up and down arrows. To save annotations, the user can click the save button or use the ctrl+s shortcut. Once saved, annotations can be dumped in a format such as COCO, YOLO, etc.

Now that we have discussed a broad overview of CVAT, I will provide insight into what I did for my project. I used CVAT to annotate 50 videos of seizure patients. These videos were usually quite long (over 50,000 frames), so I only manually annotated 600 frames of each. I annotated patients as one of two categories: male and female. I also stored separate attributes such as age (infant, toddler, child, teenager, adult, etc.) and skin color (using the Fitzpatrick scale, from F1, very fair, to F6, the darkest). The seizure patient video data itself is secure and comes from Stanford Medicine. I downloaded the videos via FileZilla and annotated them on a remote computer (which I can ssh into). The 50 videos were then downloaded to the same remote computer and organized by folder. To understand how these videos were cleaned and stored for later use in the model, we need to discuss the COCO data format.

COCO Data Format, Data Preprocessing

To use the CVAT annotations, our computer needs to understand them. In other words, a computer vision model should be able to read the coordinates of a bounding box and attach them to a certain image. While there are multiple data formats for computer vision tasks, in my project I chose the COCO (Common Objects in Context – Link) data format. Remember that in any computer vision or machine learning problem, it's crucial to have labeled data, since this forms the basis of what we want our model to recognize. Let's talk about the COCO data format. It supports five different annotation types: object detection, keypoint detection, stuff segmentation, panoptic segmentation, and image captioning. Since I hope to detect the patient in the image, I use the object detection type. What is the COCO data format composed of? A COCO export of a video is typically a folder with two subfolders: annotations and images. The images folder can be either empty or contain the list of images (usually JPEGs) that were annotated in the video; I will explain later why it can be empty. The annotations subfolder contains a JSON file. You may be wondering – what is a JSON file? A JSON file is essentially a nested structure of dictionaries and lists; here, it stores the metadata and properties of the video. Ideally, one could look through this JSON file and find which parts of the video were annotated. Here are the basic building blocks of a COCO JSON file:

Info: general facts about the dataset

Licenses: notes the list of licenses for the images in the file

Categories: list of categories used in the annotation

Images: contains info and references of images/frames used in the video

Annotations: lists all annotations in the dataset. Each annotation has fields such as segmentation, image_id, bbox, id, etc. Segmentation stores the region outline (as RLE or a polygon), image_id carries a reference to the original image, and bbox gives the coordinates of the bounding box (x, y, width, height)
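To make these building blocks concrete, here is a toy sketch of a COCO object-detection file. The field names are the real COCO keys, but every value (the frame name, IDs, and the patient categories) is made up for illustration and is not my actual project data:

```python
import json

# A pared-down COCO object-detection file as a Python dict. The keys mirror
# the real COCO schema; the values are illustrative placeholders.
coco = {
    "info": {"description": "toy example", "version": "1.0"},
    "licenses": [{"id": 1, "name": "placeholder license"}],
    "categories": [
        {"id": 1, "name": "male_patient"},
        {"id": 2, "name": "female_patient"},
    ],
    "images": [
        {"id": 1, "file_name": "images/video_x/frame_000123.jpg",
         "width": 640, "height": 480},
    ],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 2,
         # bbox is [x, y, width, height] in pixels from the top-left corner
         "bbox": [112.0, 60.0, 280.0, 310.0],
         "area": 280.0 * 310.0,
         "iscrowd": 0,
         "segmentation": []},
    ],
}

# Round-trip through JSON text, exactly as the annotations file is stored.
text = json.dumps(coco, indent=2)
loaded = json.loads(text)
print(loaded["annotations"][0]["bbox"])  # [112.0, 60.0, 280.0, 310.0]
```

Notice how the annotation points back to its image via image_id and to its label via category_id: the JSON never embeds pixels, only references, which is also why the images folder can be empty.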

Now that we have broken down the COCO data format, let's explain why it is necessary for my project. For object detection, a model needs to be trained to recognize the patient from annotated bounding boxes across a variety of videos, and COCO gives the model and the computer vision algorithms a standard way to understand where the patient is in an annotated video. The COCO dataset itself is also a large-scale computer vision dataset with over 1.5 million object instances across 80 object categories.

Now that I have broken down the COCO format, let's discuss how I cleaned and preprocessed the video data. After the COCO export was downloaded for each video, the COCO data (a JSON file and images) for all 50 videos was stored in the same folder. My next goal was to combine the videos and conduct a split. In machine learning tasks, we need to separate our data into training, validation, and testing sets. The training set is what the model trains on (i.e., it gains an understanding of seizure patients and learns to identify them). The validation set is how we tune our model and improve its performance (i.e., help the model become better at recognizing where the seizure patient is in a video). The testing set is unseen data that lets us evaluate the performance of our model (i.e., test the model on unseen patient videos). Currently, I have split my data into 60% training, 20% validation, and 20% testing.
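A 60/20/20 split like this can be sketched in a few lines. This is a generic illustration, not my actual split script; the function name and the fixed seed are my own choices for the example:

```python
import random

def split_ids(image_ids, seed=0):
    """Shuffle IDs deterministically, then split them 60/20/20
    into training, validation, and testing sets."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # fixed seed -> reproducible split
    n = len(ids)
    n_train = int(0.6 * n)
    n_val = int(0.2 * n)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test

train, val, test = split_ids(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

One caveat worth keeping in mind for video data: adjacent frames of the same video look nearly identical, so it is generally safer to split by video rather than by individual frame, to keep near-duplicates from leaking between the training and testing sets.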

To accomplish this, I had to create several Python scripts to ensure the data was clean. First, I created a script to extract only the annotated images from each existing COCO file. This involved iterating through the JSON file for each video and dumping only the annotated frames into an entirely new JSON file. After I collected these new JSON files (i.e., references to just the annotated images), I proceeded to combine them. Thankfully, I did not need to do this manually: I used a library known as Datumaro (datum for short), which can merge COCO data (JSON files) into one large JSON file. I chose Datumaro because it also takes care of image IDs and names through reindexing. In other words, if I combined two videos, the frame IDs of the second video would not start from 0 but rather from the size of the first video plus 1 (e.g., 70,000). It also fixes images that have the same name. The final script I needed to write was one that cleans the image paths in the merged COCO JSON file. For some reason, 3 of the 30 videos in the merged training set had their file paths reverted to a cryptic name (e.g., “tmp/…….” instead of “images/video_x/….”). I manually identified which videos had this problem and fixed it by cleaning the paths in the JSON file. Following these scripts, the video data was fully preprocessed and ready for initial experiments.
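The first of those scripts, keeping only the annotated frames, boils down to a set-membership filter over the COCO dict. Here is a hedged sketch of the idea (the function name and the tiny inline example are mine, not the project's actual code):

```python
import json

def keep_annotated_frames(coco):
    """Return a copy of a COCO dict whose 'images' list keeps only
    the images referenced by at least one annotation."""
    annotated_ids = {ann["image_id"] for ann in coco["annotations"]}
    out = dict(coco)
    out["images"] = [img for img in coco["images"]
                     if img["id"] in annotated_ids]
    return out

# Tiny example: frame 1 has no annotation, frame 2 has one.
coco = {
    "images": [{"id": 1, "file_name": "frame_1.jpg"},
               {"id": 2, "file_name": "frame_2.jpg"}],
    "annotations": [{"id": 10, "image_id": 2, "bbox": [0, 0, 10, 10]}],
}
slim = keep_annotated_frames(coco)
print([img["id"] for img in slim["images"]])  # [2]
print(json.dumps(slim)[:40])  # the slimmed dict serializes right back to JSON
```

The path-cleaning script follows the same pattern: one pass over the merged file rewriting each image's file_name field before dumping the JSON back out.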

To recap this longer blog, we covered data preparation, annotation, COCO formatting, and data cleaning (all falling under preprocessing). Next week's blog will be an exciting journey, since we will go in-depth into the specific video experiments and some of my first results for patient tracking. We will also break down object detection algorithms like DETR and Detectron2:

Detectron2 Model


Thanks so much for sticking all the way through this blog and see you next week! 


All credit for these images belongs to the authors.

  • CVAT content (images 1, 2, 3) – OpenVinoToolkit: Source
  • COCO Images (images 4, 5, 6, 7, 8) – TowardsDataScience (Renu Khandelwal): Source
  • Detectron2 (image 9) – Facebook AI Research: Source


4 Replies to “Week 2: Diving Deep into Data Collection, Annotation, Preprocessing, and the COCO Format”

  1. Shreyas B. says:

    This project looks really interesting! Looks like you put a lot of effort into this post, and I really like how in-depth your explanations are!

  2. Advit D. says:

    Impressive progress this week! 50,000 frames seems like a lot of work

    Are there any alternatives to manually labelling/annotating videos? Given larger and larger data sets, this process of non-automated labelling seems like it wouldn’t be very fun of a task to do…

  3. Hriday C. says:

    Hey Sid!

    This seems like it took a lot of work like Advit mentioned? I was wondering if there was any way in which you could make use of any form of unsupervised learning to increase efficiency with classification?

  4. Ronith G. says:

    Hey Sid!

    Amazing work this week. It looks like you put a lot of time into using CVAT to manually annotate the videos of seizure patients. Like Advit mentioned, do you think there’s an easier way to label the frames to train the model? I think that would save a lot of time and make your work a lot easier.
