Project Title: "All Data is Good Data- Or Not" An Analysis Within the GTEx Consortium
BASIS Advisor: Mr. Thomas
Internship Location: Stanford SCGPM
Onsite Mentor: Mr. Akshay Sanghi
In the world of complex genetic analysis, databases of gene expression data are invaluable to the progress of modern medical research. One such database is GTEx (Genotype-Tissue Expression), which contains “normal” genotypic data from thousands of patients, organized by tissue of origin. It is widely used in big data analyses of DNA and RNA data to study tissue-specific genetic variation. However, the phenotypic data associated with GTEx reveals that not every recorded patient has a “normal” phenotype. For my project, I will work with the Stanford Center for Genomics and Personalized Medicine to determine, through statistical approaches to gene-expression data, whether the GTEx dataset truly represents the gene expression of normal tissues. After a complete analysis of the 28 tissues in this dataset, and of each patient's expression data alongside their phenotypic data, we will better understand the composition of this database. Knowing whether the GTEx dataset is truly normal will help validate studies that use it, make future studies built on its data more accurate, and allow comparisons involving the diseased tissue within the dataset.
My Posts
10- I’m running out of titles with the word “Data”
Hey Guys! Guess what? I fell sick- again!!!!!! When I wasn’t desperately clinging to my blankets for warmth and survival, I was coding! I found a way to replicate the PCA plot from last week with a new software, but ran into errors when representing the “Normal” and diseased tissues within the plot. I’m still […]
9- Data and the Deathly Hallows
Sup guys! So this week I (tried to) read more about PCA, and learned how I could potentially figure out the driving variables in the PCA clustering. HOWEVER, a dire sickness diverted my attention from this honorable goal. Not making any accusations, but when I wore my Berkeley engineer shirt to my lab at Stanford, […]
8- Data for Dummies
This week I read a lot about PCA plots and decided to create one for the thyroid tissue dataset within GTEx. Basically, PCA plots serve to demonstrate variation within a dataset, between different samples. They also help to cluster similar data together. For the thyroid data I used, I separated the samples into “normal” phenotype […]
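To make the PCA idea a bit more concrete, here is a minimal sketch that finds the first principal component of a tiny samples-by-genes table using power iteration. The numbers and the function name `first_principal_component` are made up for illustration; this is not the software or the thyroid data used in the post, and a real analysis would normally lean on a statistics library.

```python
from math import sqrt

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for PC1 of a samples x features table."""
    n, p = len(rows), len(rows[0])
    # Center every feature (column) on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix between features.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration: repeatedly multiplying by cov pulls v toward the
    # direction of greatest variance (assumes the start vector is not
    # orthogonal to that direction).
    v = [1.0 / sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Project each centered sample onto the component to get its PC1 score.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores

# Hypothetical toy data: 4 samples, 2 genes, second gene is twice the first.
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores = first_principal_component(samples)
```

Plotting the scores of the first two components against each other gives the usual PCA scatter plot, where similar samples land close together.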
7- Back at it again with the Data
So this week consisted of a clustering analysis of the tissue types in my analysis. Based on the Pearson correlation matrix I generated last week, I used Ward’s hierarchical clustering in order to group similar tissues together. My result ended up looking like this: It’s not very aesthetically pleasing, but it does the job. Looking […]
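For the curious, here is a rough sketch of what Ward's hierarchical clustering does under the hood, written with the Lance-Williams update. Everything in it is an assumption for illustration: the correlation values are invented, the function name `ward_linkage` is hypothetical, and turning a correlation r into a distance as 1 - r is just one common choice (Ward's method formally assumes Euclidean distances). It is not the code that produced the figure in the post.

```python
def ward_linkage(names, dist):
    """Agglomerative clustering with Ward's criterion.

    dist[a][b] is the distance between items a and b.
    Returns a list of merges (cluster_a, cluster_b, height),
    where each cluster is a tuple of item names.
    """
    clusters = [(n,) for n in names]
    # Ward's update works on squared distances between clusters.
    d2 = {}
    for i, a in enumerate(clusters):
        for b in clusters[i + 1:]:
            d2[frozenset((a, b))] = dist[a[0]][b[0]] ** 2
    merges = []
    while len(clusters) > 1:
        # Merge the pair of clusters that is currently closest.
        a, b = min(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda pair: d2[frozenset(pair)],
        )
        merged = a + b
        merges.append((a, b, d2[frozenset((a, b))] ** 0.5))
        rest = [c for c in clusters if c not in (a, b)]
        # Lance-Williams formula for the distance from the new cluster
        # to every remaining cluster.
        for c in rest:
            na, nb, nc = len(a), len(b), len(c)
            d2[frozenset((merged, c))] = (
                (na + nc) * d2[frozenset((a, c))]
                + (nb + nc) * d2[frozenset((b, c))]
                - nc * d2[frozenset((a, b))]
            ) / (na + nb + nc)
        clusters = rest + [merged]
    return merges

# Hypothetical tissue-to-tissue Pearson correlations (made-up values).
names = ["Thyroid", "Lung", "Heart", "Muscle"]
correlations = {
    ("Thyroid", "Lung"): 0.9, ("Heart", "Muscle"): 0.8,
    ("Thyroid", "Heart"): 0.2, ("Thyroid", "Muscle"): 0.1,
    ("Lung", "Heart"): 0.15, ("Lung", "Muscle"): 0.05,
}
dist = {n: {} for n in names}
for (a, b), r in correlations.items():
    dist[a][b] = dist[b][a] = 1.0 - r  # assumption: distance = 1 - r

merges = ward_linkage(names, dist)
```

Each entry in `merges` is one join in the dendrogram, listed from the most similar pair up to the final merge of everything.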
6- The Chronicles of Data
So this week I began the initial analysis for my now normalized and (hopefully) perfectly edited dataset. I generated a Pearson correlation matrix between all the tissues as a starting point, and will use this in order to cluster the tissue types later on. I also presented to the head of the lab, and got […]
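As a small illustration of the correlation step, here is a minimal sketch of a Pearson correlation matrix between tissues. The expression profiles are invented toy numbers (one value per gene), and the helper names `pearson` and `correlation_matrix` are hypothetical; the real analysis ran on the normalized GTEx data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Note: a profile with zero variance would cause a division by zero here.
    return cov / sqrt(var_x * var_y)

def correlation_matrix(profiles):
    """Correlate every tissue profile with every other one."""
    names = list(profiles)
    return {a: {b: pearson(profiles[a], profiles[b]) for b in names}
            for a in names}

# Hypothetical per-tissue expression profiles (one value per gene).
profiles = {
    "Thyroid": [1.0, 2.0, 3.0, 4.0],
    "Liver": [2.0, 4.0, 6.0, 8.0],
    "Lung": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(profiles)
```

Values near 1 mean two tissues express genes in a similar pattern, and values near -1 mean the patterns are opposite.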
5- Data is how we roll
So I recently crawled out of the rut I was in with regards to my senior project. Glad to be among the living again! Couple updates: So I FINALLY got that large dataset downloaded, and was able to visualize it. I condensed it by taking the medians of each tissue type for every gene, and […]
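Here is a minimal sketch of the "take the median of each tissue type for every gene" step. The gene names, sample IDs, and values are placeholders, and `median_by_tissue` is a hypothetical helper, not the script used on the actual dataset.

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy input: expression[gene][sample_id] = normalized count.
expression = {
    "GENE_A": {"s1": 10.0, "s2": 14.0, "s3": 3.0, "s4": 5.0, "s5": 100.0},
    "GENE_B": {"s1": 1.0, "s2": 2.0, "s3": 8.0, "s4": 6.0, "s5": 7.0},
}
# Hypothetical annotation mapping each sample to its tissue of origin.
sample_tissue = {
    "s1": "Thyroid", "s2": "Thyroid",
    "s3": "Liver", "s4": "Liver", "s5": "Liver",
}

def median_by_tissue(expression, sample_tissue):
    """Collapse a gene x sample table into a gene x tissue table of medians."""
    condensed = {}
    for gene, per_sample in expression.items():
        grouped = defaultdict(list)
        for sample, value in per_sample.items():
            grouped[sample_tissue[sample]].append(value)
        condensed[gene] = {tissue: median(values)
                           for tissue, values in grouped.items()}
    return condensed

medians = median_by_tissue(expression, sample_tissue)
```

Using the median rather than the mean keeps a single extreme sample (like `s5` for `GENE_A` above) from dominating a tissue's summary value.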
4- A series of unfortunate data
Looks like it’s been a pretty rough week for our senior projects. Let’s just say that I’ve now become a strong believer in Murphy’s Law. So I have this dataset. Just a harmless dataset of normalized RNA-seq counts which is ever so slightly on the larger side. But after countless tries, I am not able […]
3- The one with the Data
Since you last had the incomparable pleasure of reading my senior project blog, about 17.5 quintillion bytes of data were generated. Needless to say, your digital footprint is much more extensive (and much more public) than your carbon footprint. What is privacy, anyways? Moving on: This past week I spent reveling in the world of […]
2- “I’m lazy so I’ll use deep learning.”
Oof. But data cleaning REALLY is a drag. Take this scenario. I have clinical data for around 200-500 patients for each tissue in the GTEx database. And there are almost 30 different tissue types. If I have to separate all the patients in each of these tissue types into “Healthy” and “Unhealthy” categories based on […]
Got Data?
This just about sums it up. Kinda sus, right? Data has been the driving source of productivity and innovation in many fields for the past 15 years. If you ask any company, data is beautiful. In fact, that’s the first search result when you type “data is…” on Google: But I’ll tell you what […]