Project Title: Deep Learning Benchmark for Discovering Transcription Factor Motifs
BASIS Advisor: Mr. Linhares
Internship Location: Stanford University
Onsite Mentor: Dr. Kundaje
Diseases like cancer, Alzheimer’s, and cardiovascular disease are linked to the misregulation of gene expression. Gene regulation is managed by a large variety of transcription factor (TF) proteins, each of which binds to short sequence-specific sites on the genome called motifs. However, the exact sequences to which TFs bind are not easily discovered by scientists. Machine-learning models can analyze vast quantities of genomic data and learn these intricate motif patterns. At Stanford University, I will develop a benchmark for the rigorous evaluation of deep learning methods used for discovering transcription factor binding sites on the human genome. The benchmark will score each method on several axes, including fidelity, sensitivity, specificity, stability, and computational complexity. It will be used to score methods such as in silico mutagenesis, DeepLIFT, Integrated Gradients, and SHAP. The end product of this research will be a synthetic genetic dataset and a set of software tools, such that any interpretation method can be run on the dataset and its results scored for accuracy using the developed tools. Discovering motifs is essential for scientists to better understand TF-controlled diseases and to develop molecularly targeted treatments for them.
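To illustrate the core scoring idea (a hypothetical sketch only, not the actual benchmark code): given a synthetic sequence whose true motif positions are known, an interpretation method’s per-base importance scores can be compared against that ground truth using auROC and auPRC.

```python
# Illustrative sketch only, not the actual benchmark code.
# Assumes a synthetic sequence where the true motif positions are known,
# and an interpretation method that assigns one importance score per base.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

seq_len = 400
rng = np.random.default_rng(0)

# Hypothetical ground truth: a planted motif occupies positions 180-195.
true_motif_mask = np.zeros(seq_len, dtype=int)
true_motif_mask[180:196] = 1

# Hypothetical per-base importance scores from some interpretation method.
importance_scores = rng.normal(size=seq_len)
importance_scores[180:196] += 2.0  # a good method scores motif bases higher

# auROC / auPRC measure how well high scores line up with true motif bases.
print("auROC:", roc_auc_score(true_motif_mask, importance_scores))
print("auPRC:", average_precision_score(true_motif_mask, importance_scores))
```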
My Posts
Week 12
Hey guys! This week I basically just wrapped up a lot of small things here and there, as well as worked on my presentation. Here’s a summary of what I did: I designed a CNN for the SPI1 dataset. I used the “momma_dragonn” framework, which is based on Keras (a Python deep learning library). The CNN […]
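For anyone curious what a CNN like that looks like, here’s a rough Keras sketch. It is not my actual momma_dragonn configuration; the layer sizes are made up just to show the general shape of a DNA-sequence model.

```python
# A generic Keras CNN for one-hot-encoded DNA, sketched for illustration.
# The real model was configured through momma_dragonn; all layer sizes here
# are made up.
from tensorflow.keras import layers, models

def build_cnn(seq_len=400):
    model = models.Sequential([
        # Input: (seq_len, 4) one-hot DNA (A, C, G, T channels).
        layers.Conv1D(16, kernel_size=15, activation="relu",
                      input_shape=(seq_len, 4)),
        layers.MaxPooling1D(pool_size=4),
        layers.Conv1D(16, kernel_size=15, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(16, activation="relu"),
        # Single sigmoid output: does the sequence contain a binding site?
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn()
model.summary()
```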
Week 11
Hi guys! I spent a lot of this week working on my slides, but I did figure out that error I had with displaying the correct metrics! I also spent a considerable amount of time adding three new interpretation methods to my benchmarking process: integrated_gradients, reveal_cancel, and guided_backprop. My numbers are looking a lot […]
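To give a feel for one of these methods: integrated gradients basically averages the model’s gradients along a path from a blank baseline to the real sequence. Here is a minimal from-scratch sketch of that idea for a Keras model (my actual runs used existing implementations, so treat this as illustration only).

```python
# Minimal from-scratch sketch of integrated gradients for a Keras model.
# Idea: average the gradients along a straight path from a baseline
# (all zeros) to the actual one-hot sequence, then weight by the input.
import numpy as np
import tensorflow as tf

def integrated_gradients(model, one_hot_seq, steps=50):
    baseline = np.zeros_like(one_hot_seq)                  # (seq_len, 4)
    alphas = np.linspace(0.0, 1.0, steps)[:, None, None]   # (steps, 1, 1)
    # Interpolated inputs between the baseline and the real sequence.
    interpolated = baseline + alphas * (one_hot_seq - baseline)
    inputs = tf.convert_to_tensor(interpolated, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(inputs)
        preds = model(inputs)
    grads = tape.gradient(preds, inputs).numpy()           # (steps, seq_len, 4)
    avg_grads = grads.mean(axis=0)
    # Attribution per base = (input - baseline) * average gradient.
    return ((one_hot_seq - baseline) * avg_grads).sum(axis=-1)  # (seq_len,)
```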
Week 10
Hi guys! I ran DeepLIFT and got new auROCs and auPRCs for the SPI1 dataset. Unfortunately, my numbers were a lot higher for only some sequences and not others. The (kind of) good thing I realized, though, is that there is almost definitely an error in the way that I am plotting my auROCs […]
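Since I can’t show the real plots yet, here is roughly the kind of plotting code I mean: a histogram of per-sequence auROCs with made-up numbers, just to show the idea.

```python
# Hypothetical example of plotting a distribution of per-sequence auROCs.
# Not the actual plotting code from this week, just the general shape of it.
import numpy as np
import matplotlib.pyplot as plt

per_sequence_aurocs = np.random.default_rng(1).uniform(0.3, 0.9, size=200)

plt.hist(per_sequence_aurocs, bins=20)
plt.axvline(0.5, linestyle="--", label="random baseline (0.5)")
plt.xlabel("per-sequence auROC")
plt.ylabel("number of sequences")
plt.legend()
plt.show()
```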
Week 9
Hi guys! Not-so-great news: the auROCs and auPRCs that I got were not high at all. That’s a problem because low values for these metrics mean that the net is not always placing high importance on the right positions (there should be high importance on the portions of the sequences that […]
Week 8
Hi! So I finished running momma_dragonn and DeepLIFT and got better results (yay)! My numbers were in the 0.5-0.6 range this time, which seems to be a lot better than before. However, my mentor thinks I should analyze my results a little more closely and wants me to quantify the auROC and auPRC for individual […]
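For reference, quantifying auROC and auPRC per sequence looks something like the sketch below (hypothetical variable names), where each sequence’s importance scores are compared against its known motif positions.

```python
# Sketch of "auROC and auPRC for individual sequences": loop over sequences
# and score each one separately. Variable names are hypothetical; true_masks
# marks the known motif positions in each sequence.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def per_sequence_metrics(importance_scores, true_masks):
    """Both inputs are arrays of shape (n_sequences, seq_len)."""
    aurocs, auprcs = [], []
    for scores, mask in zip(importance_scores, true_masks):
        if mask.sum() == 0 or mask.sum() == len(mask):
            continue  # auROC is undefined unless both classes are present
        aurocs.append(roc_auc_score(mask, scores))
        auprcs.append(average_precision_score(mask, scores))
    return np.array(aurocs), np.array(auprcs)
```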
Week 7
Hi – I’m not stuck anymore! I stepped through each one of my lines of code (big oof) and found that my problem was an unnecessary if statement I wrote a while ago. However, my numbers for my motif importance scores are still not looking the best – they’re only ~0.3 when they should be […]
Week 6
Hey guys. I’M. STILL. STUCK. That’s okay though, because while I kept debugging (many strange graphs and one sequence at a time), I also spent a lot of time reading my text on critical Python libraries and testing them out. I especially spent a lot of time getting familiar with numpy (np), which provides […]
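Here is a tiny example of the kind of numpy work that comes up constantly in this project: one-hot encoding a DNA string into a (length, 4) array.

```python
# One-hot encode a DNA string into a (length, 4) array of A/C/G/T channels.
import numpy as np

BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(seq):
    one_hot = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in BASE_TO_INDEX:          # leave N/other bases as all zeros
            one_hot[i, BASE_TO_INDEX[base]] = 1.0
    return one_hot

print(one_hot_encode("ACGTN"))
```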
Week 5
Hey guys! So I’ve been trying to work through my roadblock by rerunning my analysis individually on a select few sequences. My goal was that, for a specific sequence, I would find the importance score at each of the 400 base-pair positions in the sequence by running DeepLIFT. Then, I would locate the […]
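The “locate the motif” part boils down to finding the window of positions with the highest total importance. Here is a small numpy sketch of that step (the window width is a made-up parameter).

```python
# Sketch of the "locate the motif" step: given one importance score per
# base-pair position (e.g. from DeepLIFT), slide a window across the 400 bp
# sequence and report where the summed importance is highest.
import numpy as np

def top_window(importance_scores, window=15):
    scores = np.asarray(importance_scores)
    # Summed importance for every possible window start position.
    window_sums = np.convolve(scores, np.ones(window), mode="valid")
    start = int(np.argmax(window_sums))
    return start, start + window   # [start, end) of the highest-scoring window

scores = np.random.default_rng(2).normal(size=400)
scores[180:195] += 2.0             # pretend the motif sits around position 180
print(top_window(scores))
```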
Week 4 – Problems (yay)
Hi guys! This week I ran into some problems. I was able to run DeepLIFT and extract a motif in the form of position weight matrices (I’ll ask my mentor if I’m allowed to attach pictures next week). In order to determine the quality of the motifs and assess the performance of DeepLIFT (aka benchmark […]
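For some context, a position weight matrix is just a table of base frequencies at each motif position. Here is a toy sketch of building one from aligned motif instances (the example sequences are made up).

```python
# Toy sketch of building a position weight matrix (PWM) from aligned motif
# instances: count each base at each position, then normalize to frequencies.
import numpy as np

BASES = "ACGT"

def build_pwm(aligned_seqs, pseudocount=0.1):
    length = len(aligned_seqs[0])
    counts = np.full((length, 4), pseudocount)
    for seq in aligned_seqs:
        for i, base in enumerate(seq.upper()):
            counts[i, BASES.index(base)] += 1
    return counts / counts.sum(axis=1, keepdims=True)   # each row sums to 1

pwm = build_pwm(["GGAAGT", "GGAAGT", "AGAAGT", "GGAAGC"])  # made-up instances
print(np.round(pwm, 2))
```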
Week 3: Training momma_dragonn
Hi guys! This week I trained the momma_dragonn convolutional neural network on the K562 and universal DNase data (aka the two BED files). It took a long time to figure out how to initialize all the hyperparameters and set up all the necessary files from GitHub. It also took a lot of waiting until the convolutional […]
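For anyone unfamiliar with BED files: they are basically tab-separated lists of genomic intervals (chromosome, start, end). Below is a minimal sketch of reading one, with a hypothetical filename; it is just an illustration, separate from whatever loaders momma_dragonn itself provides.

```python
# Minimal sketch of reading a BED file into (chromosome, start, end) tuples.
def read_bed(path):
    intervals = []
    with open(path) as handle:
        for line in handle:
            # Skip headers, track lines, and blank lines.
            if line.startswith(("#", "track", "browser")) or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, start, end = fields[0], int(fields[1]), int(fields[2])
            intervals.append((chrom, start, end))
    return intervals

# Example usage (hypothetical filename):
# peaks = read_bed("K562_DNase_peaks.bed")
```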
Week 2
Hi guys! This week, I finished reading the “Transcriptional Enhancers” paper. I learned about a wet lab method for finding transcription factor binding sites called chromatin immunoprecipitation sequencing (ChIP-seq). ChIP-seq uses antibody proteins that recognize a specific TF while it is attached to the genome. The antibodies literally “mark the spots” where the TFs are. […]
Week 1 – Background Information on Transcription Factors
Hi guys! Before I go over what I did this week, I think it would be fitting for me to explain what my senior project is about. I want to couple an understanding of genomics and cell physiology with a deep knowledge of machine learning algorithms to discover the DNA patterns associated with terminal and […]