  • Project Title: Deep Learning Benchmark for Discovering Transcription Factor Motifs

  • BASIS Advisor: Mr. Linhares

  • Internship Location: Stanford University

  • Onsite Mentor: Dr. Kundaje

Diseases such as cancer, Alzheimer’s, and cardiovascular disease are often linked to the misregulation of gene expression. Gene regulation is managed by a large variety of transcription factor (TF) proteins, each of which binds to genomic sites that match a specific sequence pattern called a motif. However, the exact sequences to which TFs bind are difficult for scientists to determine. Machine-learning models can analyze vast quantities of genomic data and detect complex motif patterns, but it is hard to tell how faithfully a given interpretation method recovers the patterns a model has learned. At Stanford University, I will develop a benchmark for the rigorous evaluation of deep learning methods used to discover transcription factor binding sites on the human genome. The benchmark will score each method on several axes, including fidelity, sensitivity, specificity, stability, and computational complexity. It will be used to score methods such as in silico mutagenesis, DeepLIFT, Integrated Gradients, and SHAP. The end product of this research will be a synthetic genetic dataset and a set of software tools, so that any method can be run on the dataset and its results scored for accuracy with those tools. Better motif discovery will help scientists understand TF-controlled diseases and develop molecular treatments for them.
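The idea of scoring a method against a synthetic dataset can be illustrated with a minimal sketch. It is not the project's actual benchmark: the motif string, the toy scoring function standing in for a trained model, and every function name below are assumptions made for illustration. The sketch plants a known motif in background DNA, runs in silico mutagenesis (substituting each base in turn and recording the largest drop in the model's output), and then checks how many of the true motif positions rank among the most important ones.

```python
import random

BASES = "ACGT"

def make_synthetic_sequence(length, motif, rng):
    """Random background DNA with `motif` planted at a random position."""
    start = rng.randrange(0, length - len(motif) + 1)
    background = [rng.choice(BASES) for _ in range(length)]
    background[start:start + len(motif)] = motif
    return "".join(background), start

def toy_model(seq, motif):
    """Hypothetical stand-in for a trained model: the best number of
    matching bases between `motif` and any window of `seq`."""
    k = len(motif)
    return max(
        sum(a == b for a, b in zip(seq[i:i + k], motif))
        for i in range(len(seq) - k + 1)
    )

def in_silico_mutagenesis(seq, score_fn):
    """Importance of each position: the largest score drop over the
    three possible single-base substitutions at that position."""
    base_score = score_fn(seq)
    importance = []
    for i, ref in enumerate(seq):
        drops = [
            base_score - score_fn(seq[:i] + alt + seq[i + 1:])
            for alt in BASES if alt != ref
        ]
        importance.append(max(drops))
    return importance

def recall_at_k(importance, motif_start, motif_len):
    """Fraction of true motif positions found among the `motif_len`
    highest-ranked positions (one simple sensitivity-style score)."""
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    top = set(ranked[:motif_len])
    truth = set(range(motif_start, motif_start + motif_len))
    return len(top & truth) / motif_len

# Deterministic example: motif planted at position 10 in an all-A background.
motif = "TGACTCA"  # illustrative motif string, chosen for this sketch
seq = "A" * 10 + motif + "A" * 10
scores = in_silico_mutagenesis(seq, lambda s: toy_model(s, motif))
print(recall_at_k(scores, 10, len(motif)))  # 1.0
```

Because the motif location is known by construction, any attribution method that returns one importance value per base could be passed through the same `recall_at_k` check. A fuller benchmark would add analogous scores for the other axes, such as agreement across repeated runs for stability or wall-clock time for computational cost.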