Got Data?

Feb 19, 2019

This just about sums it up. Kinda sus, right?

Data has been the driving force behind productivity and innovation in many fields for the past 15 years. If you ask any company, data is beautiful. In fact, that’s the first search result when you type “data is…” into Google.


But I’ll tell you what data is. Data is garbage. Yes, you heard me. GARBAGE. Most of it is, anyways.

At first, data was a huge hit for big businesses. It brought efficiency, increased customer satisfaction, and gave companies a competitive advantage. “Big Data” became a buzzword, soon to be followed by “Deep Learning.” Data Scientist became the “sexiest job of the 21st century.” That’s right. Want to be sexy? Maybe data science is for you. Data was god. Whatever it was, and wherever it came from, it had to be useful, right? Life at Facebook probably went a little like this:

“Hey Mark, we just got that new data on which people are most likely to wash their hands before exiting the bathroom!”

“Really!? Add that right in… Oh, and while you’re at it, give Russia a call.”

Meanwhile, you’re sitting there wondering why Facebook is advertising a new brand of “Lenin Soap”. Hmm.

But in recent years, data has arguably become more trouble than it’s worth. Problems like incompleteness, incompatibility, and even inaccuracy have infiltrated the system, yielding no real improvements and, in some cases, customer dissatisfaction. There is no clear line between data that is “good” and data that is “bad”. And in some cases, that’s more harmful than usual.

Data has facilitated strides in medical research in recent years, thanks to the wealth of “omics” data that high-tech machines can produce. Analysis algorithms have been optimized to further our understanding of disease and health, and large repositories of medical data, or “biobanks,” have been created for use in such analyses. One such biobank is the GTEx (Genotype-Tissue Expression) Portal, a website created by the Broad Institute that provides access to data such as gene expression and histological images from tissues of donors with a “normal” phenotype. This database is widely used in genomic analyses as a “normal” reference, or control group, for comparison against gene expression in diseased tissue. Of course, “normal” would seem to mean disease-free tissue, but the associated pathology data suggests this is not always the case. Many samples actually show signs of autoimmune disease and, in some cases, even cancer. This observation brought to my attention the blurred line between “good” and “bad” data, and made me question whether the data in GTEx is actually suitable for use as a control group.

To investigate this, my project will examine whether the GTEx database truly represents the gene expression of normal tissues, using statistical approaches to gene-expression data. I will build a patient classifier using NLP (Natural Language Processing) and a deep-learning approach (yay buzzwords!!!), and test its accuracy on GTEx pathology data. Then I will assess whether samples classified as “normal/healthy” differ significantly from those classified as “abnormal/unhealthy,” and determine whether this difference poses a threat when using GTEx as a control group.
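As a rough illustration of those two steps (label samples from their pathology notes, then compare expression between the groups), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the keyword list, the function names `label_note` and `permutation_test`, and the toy notes and expression values are invented and are not GTEx data. A plain keyword match stands in for the NLP/deep-learning classifier, and a permutation test stands in for whatever statistical test the project ends up using.

```python
import random
from statistics import mean

# Hypothetical keywords; a real classifier would learn its signals from labeled notes.
ABNORMAL_TERMS = ("hashimoto", "carcinoma", "fibrosis", "inflammation", "tumor")


def label_note(note: str) -> str:
    """Label a free-text pathology note 'abnormal' if it mentions any flagged term."""
    text = note.lower()
    return "abnormal" if any(term in text for term in ABNORMAL_TERMS) else "normal"


def permutation_test(a, b, n_perm=5000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Returns an estimated p-value: the share of random relabelings whose
    mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Toy data: (pathology note, expression of one hypothetical gene). Invented values.
samples = [
    ("No abnormalities noted", 5.1),
    ("Well preserved, no lesions", 4.9),
    ("Unremarkable", 5.3),
    ("Clean specimen", 5.0),
    ("Hashimoto thyroiditis present", 8.2),
    ("Focal fibrosis and inflammation", 7.9),
    ("Small carcinoma identified", 8.5),
    ("Chronic inflammation", 8.1),
]

# Step 1: split expression values by the label assigned to each note.
groups = {"normal": [], "abnormal": []}
for note, expression in samples:
    groups[label_note(note)].append(expression)

# Step 2: test whether the two groups differ in mean expression.
p_value = permutation_test(groups["normal"], groups["abnormal"])
print(f"normal n={len(groups['normal'])}, abnormal n={len(groups['abnormal'])}, p={p_value:.4f}")
```

A keyword match like this would wrongly flag a note such as “no tumor seen,” which is exactly the kind of case a real NLP model has to handle. And since a real analysis would test thousands of genes rather than one, the resulting p-values would need a multiple-testing correction.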

6 Replies to “Got Data?”

  1. Rishi A. says:

    The perspective you have on big data is really interesting and refreshing. I never thought to look at data the way you described here, and now I am very suspicious of whether data is actually as useful as it is marketed to be. I have personally never heard of GTex before but it sounds as if it could be faulty if proven by your projects. If so, what will be the implications? The cartoons on this page also brought a smile to my face 🙂

  2. Ivana B says:

    I am so excited about this project! If you prove that Gtex data is good – that’s a good thing, as I understand it is used a lot in research, and it would just solidify/reinforce the findings that came from using that dataset. However, if you prove that the data is not good – it’s gonna be “yuge”. 🙂

  3. William Thomas says:

    This is a fantastic first blog post. The misuse of statistics is one of the biggest travesties we face today. Mostly because the information is true, but what it is used for doesn’t correlate.

  4. Serina K. says:

    Shreya, I love how open you are about “big data” and “data learning”. I can’t wait to see your progress on making a patient classifier!

  5. Eva P. says:

    Shreya, your project sounds really interesting! Combining biological data analysis with NLP is a pretty unique and revolutionary way of utilizing ML.

  6. Cindy K. says:

    Hi Shreya, your post is so exciting and I like how you challenged conventional views of data. I really enjoyed those cartoons!
