Week 5: Exploring Inference, Video Tracking, and Data Augmentation

Mar 26, 2021

Hi everyone! Welcome back to my senior project blog. This week’s blog will be a bit shorter, but I hope to cover an entirely new segment of the project: post-processing and additional pre-processing. Given that I am getting my first results and some interesting examples of my model in action, I will mainly be sharing some post-processing techniques with you in today’s blog. Let’s get right into it!

Recap of Existing Work:

To briefly recap last week’s blog, we discussed the code of my pipeline and how I trained a model that can detect seizure patients from videos of their activity in a clinical setting. Last week, I attained the initial results for this model, and I have since improved those results through tuning: the process of adjusting hyperparameters in our model. In my case, I experimented with different learning rates and found that the best value for my model was 10^-5. I also settled on a batch size of 512. For definitions of these hyperparameters and how they impact the performance of the model, be sure to check out last week’s blog! Last week, I also mentioned the importance of validation loss in regard to overfitting. Overfitting is when our model essentially memorizes the training set; this is bad because the model isn’t actually learning how to recognize seizure patients. As mentioned in last week’s blog, I initially attained a validation loss of 1.70. After tuning the model with the specified learning rate and batch size, I was able to lower the validation loss to 0.59. This is a strong improvement since it indicates that the model is detecting seizure patients to a degree even on unseen data (our validation dataset of 10 videos). Let’s examine why the validation loss is still relatively high.
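For readers following along in code, hyperparameters like these are set on Detectron2’s config object. Here is a minimal sketch, not my exact training script; note that I’m showing the batch size of 512 as the ROI head’s per-image sampling batch (a common place for that value in Detectron2), though it could equally be set as images per batch via `cfg.SOLVER.IMS_PER_BATCH`:

```python
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.SOLVER.BASE_LR = 1e-5                       # learning rate found via tuning
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 512  # batch size of 512
```
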

For most object detection problems, models like Detectron2 will achieve high performance: typically an Average Precision above 0.8. For the problem of seizure patient detection, things are more difficult than usual because patients are covered in wiring (electrodes that are part of their headgear) and blankets, and are often surrounded by miscellaneous objects (like a teddy bear) or other people (like a nurse). In several videos, the patient’s head is barely visible and their body is covered by a blanket. For an object detection system, this is a very difficult problem since there are no immediately obvious features. Note that features are the identifiers that a model uses for detection. To enhance the performance of our model on a difficult problem like seizure patient detection, we will need to get a bit creative! In the next section, we will examine some approaches that I applied for both data pre-processing and post-processing.

Inference and Video Tracking

While it may be a bit counterintuitive, let’s first discuss some post-processing techniques. After we train our model, we can use Detectron2’s “predictor” function to get visualizations of our model’s performance on real data. In my case, I started off by doing inference on a few random patient images. You may be wondering: what is inference? Inference is a process in machine learning where we input new data points into the model to calculate an output or prediction. The model has not yet seen the data points (new images/videos) that we input, so we can get a real sense of its predictive power. To perform inference for my particular task, I show the model an image of a seizure patient that it has not seen before. The output is a predicted bounding box of the patient. I wanted to show some inference results in this blog, but given that this data is PHI (Protected Health Information), it is unfortunately not possible to show patient images. To get a better sense of what these videos/images look like, here is an example that is publicly available:

Note that the example above is a much clearer case of a seizure patient; most of the data I am using is not as clear. To see the results of my model on entire videos, I had to develop a simple patient tracking system. This is a technique whereby I record inference results frame-by-frame and then stitch them into a video (at a certain FPS). Currently, I am using 15 frames per second. Once I finished the code for the tracking system, I created 10 MP4 files (one for each video in the validation dataset). Let’s discuss the results of these videos.
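To make the frame-by-frame tracking idea concrete, here is a minimal, self-contained sketch. A dummy detector stands in for Detectron2’s `DefaultPredictor` (which would normally return predicted boxes for each frame), and the annotated frames are collected into an array rather than written to disk; in my actual pipeline, the results are written out as an MP4 at 15 FPS.

```python
import numpy as np

def dummy_predict(frame):
    """Stand-in for Detectron2's DefaultPredictor: returns one
    bounding box (x0, y0, x1, y1) for the 'patient' in the frame."""
    h, w = frame.shape[:2]
    return (w // 4, h // 4, 3 * w // 4, 3 * h // 4)

def draw_box(frame, box, color=(0, 255, 0)):
    """Draw a rectangle outline onto an RGB frame in place."""
    x0, y0, x1, y1 = box
    frame[y0:y1, x0] = color
    frame[y0:y1, x1 - 1] = color
    frame[y0, x0:x1] = color
    frame[y1 - 1, x0:x1] = color
    return frame

def track_video(frames):
    """Run inference on each frame, then stitch the annotated
    frames back into a video."""
    out = []
    for frame in frames:
        box = dummy_predict(frame)
        out.append(draw_box(frame.copy(), box))
    return np.stack(out)  # in practice: write these frames to an MP4 at 15 FPS

# 30 blank 64x64 RGB frames, i.e. ~2 seconds of video at 15 FPS
video = np.zeros((30, 64, 64, 3), dtype=np.uint8)
annotated = track_video(video)
```

With a real model, `dummy_predict` would be replaced by a `DefaultPredictor` call and the box read from its predicted instances; the stitching logic stays the same.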

The results of my model vary across the type of video and clinical setting. Even in a relatively small dataset of 50 videos, there is a diverse selection of patients and their respective environments (beds, tables, nurses, machines, etc.). In the simplest cases, the video is in color and the patient is clearly visible without blankets or other devices (they make up the majority of the frame). In these videos, my model performs quite well (detecting the patient with >99% accuracy). The next type of video is a medium difficulty case. In these color videos, the patient is usually covered in multiple blankets and is smaller relative to the entire frame. My model does relatively well on these videos (>95% accuracy) but does occasionally miss the patient in a few frames. The final category of videos is the most difficult case. In these cases, the video is usually grayscale and the patient is almost entirely covered in blankets or headgear. Even I find it difficult to spot the patient in these videos due to the various occlusions (obstacles in the frame)! Another issue that reduces performance is that of other adults entering the frame. In some cases, a parent lies next to the patient, and this creates additional difficulty for the model. Based on all these factors, it’s clear that the model can do a bit better with the appropriate measures. Let’s discuss how we can enable the model to improve beyond tuning.

Data Augmentation

Your neural network is only as good as the data you feed it.

In the simplest terms possible, data augmentation is a method that enables us to increase the diversity of our training set. If we add more variance to the training set, the model can become better at predicting different kinds of test cases. It’s a common rule in machine learning that a larger dataset equates to greater performance. For a state-of-the-art architecture like Detectron2, there are millions of parameters (see Blog #3 for a detailed explanation of Detectron2 and Mask R-CNN). The question for us is how can we get as much data as possible so that we can make use of these parameters? We can do that through data augmentation. Data augmentation techniques include translation, rotation, scaling, flipping, etc. For example, if you see the image below, it features 3 different tennis balls:

In this case, the second and third images are simply translations of the first. A neural network would view this sequence of images as three distinct tennis balls. The reason why data augmentation works is due to a property of convolutional neural networks (CNNs) known as spatial invariance: even if an object is placed in different positions or orientations, a CNN like ResNet can still robustly classify it. Now let’s discuss how data augmentation can help us out in a task like seizure detection. Remember how we talked about the fact that patients are sometimes too small in the frame, or that videos can be entirely grayscale? Data augmentation can help us solve these problems since we can expand our dataset to address these deficiencies. For example, we can scale images so that the patient looks bigger or smaller. We can also apply slight rotations to enable the network to better understand a patient whose head is tilted. We can even turn existing color videos into black-and-white videos to help the network perform better on grayscale data! I aim to start doing data augmentation next week.
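As a concrete illustration, here is a minimal NumPy sketch of three of the augmentations mentioned above: horizontal flipping, grayscale conversion, and a simple nearest-neighbor scale. (In practice I would use Detectron2’s built-in `detectron2.data.transforms` utilities, which also adjust the bounding-box annotations to match; the sketch below transforms only the image.)

```python
import numpy as np

def hflip(img):
    """Mirror the image left-to-right."""
    return img[:, ::-1]

def to_grayscale(img):
    """Convert an RGB image to 3-channel grayscale using the
    standard luminance weights."""
    gray = (img @ np.array([0.299, 0.587, 0.114])).astype(img.dtype)
    return np.stack([gray] * 3, axis=-1)

def scale(img, factor):
    """Nearest-neighbor resize by the given factor."""
    h, w = img.shape[:2]
    rows = (np.arange(int(h * factor)) / factor).astype(int)
    cols = (np.arange(int(w * factor)) / factor).astype(int)
    return img[rows][:, cols]

img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
augmented = [hflip(img), to_grayscale(img), scale(img, 0.5)]
```

Each augmented copy would be added to the training set alongside the original, which is how a dataset of 50 videos can be stretched to cover cases (grayscale footage, small patients) that the raw data under-represents.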


Today, we explored more results for my model and the possibilities of video tracking and data augmentation. Video tracking is mainly a post-processing technique used to understand the shortcomings of our model while data augmentation is a preprocessing technique that will improve the network’s performance in the future. Next week, I hope to spend time doing more tuning and data augmentation. Thanks for sticking through the blog and I will see you next week! 


  • Image 1 (Seizure Patient): G, Mr. “Groundbreaking Seizure Surgery in Long Island Gives Boy New Hope.” PIX11, 5 Apr. 2019, pix11.com/news/its-a-g-thing/groundbreaking-seizure-surgery-in-long-island-gives-boy-new-hope/.
  • Image 2 (Tennis Balls) and Quote: Gandhi, Arun. “Data Augmentation: How to Use Deep Learning When You Have Limited Data.” AI & Machine Learning Blog, 8 Mar. 2021, nanonets.com/blog/data-augmentation-how-to-use-deep-learning-when-you-have-limited-data-part-2/.

One Reply to “Week 5: Exploring Inference, Video Tracking, and Data Augmentation”

  1. Advit D. says:

    Nice progress! Just out of curiosity, would you say that your model generally gives more false negatives or false positives? Are there any benefits or downsides of leaning more towards one side?
