This just about sums it up. Kinda sus, right?
Data has been the driving force of productivity and innovation in many fields for the past 15 years. If you ask any company, data is beautiful. In fact, that’s the first search result when you type “data is…” into Google:
But I’ll tell you what data is. Data is garbage. Yes, you heard me. GARBAGE. Most of it is, anyways.
At first, data was a huge hit for big businesses. It brought efficiency, increased customer satisfaction, and gave companies a competitive advantage. “Big Data” became a buzzword, soon to be followed by “Deep Learning.” Data Scientist became the “sexiest job of the 21st century.” That’s right. Want to be sexy? Maybe data science is for you. Data was god. Whatever it was, and wherever it came from, it had to be useful, right? Life at Facebook probably went a little like this:
“Hey Mark, we just got that new data on which people are most likely to wash their hands before exiting the bathroom!”
“Really!? Add that right in… Oh, and while you’re at it– Give Russia a call.”
Meanwhile, you’re sitting there wondering why Facebook is advertising a new brand of “Lenin Soap.” Hmm. But in recent years, data has arguably become more trouble than it’s worth. Problems like data incompleteness, incompatibility, and even inaccuracy have crept into the system, providing no real improvement and, in some cases, causing customer dissatisfaction. There is no clear line between “good” and “bad” data. And in some cases, that ambiguity does real harm.
Data has facilitated strides in medical research in recent years, thanks to the wealth of ‘omics’ data that high-throughput machines can produce. Analysis algorithms have been optimized to further our understanding of disease and health, and large repositories of medical data, or “biobanks,” have been created for use in such analyses. One such biobank is the GTEx (Genotype-Tissue Expression) Portal, a web portal created by the Broad Institute that provides access to data such as gene expression and histological images from tissues of “normal”-phenotype patients. This database is widely used in genomic analyses as a “normal,” or control, group for comparison with diseased-tissue gene expression. Of course, “normal” seems to mean disease-free tissue, but when you look at the associated pathological data, this is not the case. Many samples actually show signs of autoimmune disease, and in some cases, even cancer. This observation brought to my attention the blurred line between “good” and “bad” data, and made me question whether the data in GTEx is actually suitable for use as a control group.
To investigate this, my project will determine whether the GTEx database truly represents the gene expression of normal tissues, using statistical approaches to gene-expression data. I will build a patient classifier using NLP (natural language processing) and a deep-learning approach (yay, buzzwords!!!), and test its accuracy on GTEx pathological data. Then I will assess whether samples classified as “normal/healthy” differ significantly from those classified as “abnormal/unhealthy,” and determine whether this difference poses a threat when using GTEx as a control group.
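To make the classifier idea concrete, here is a minimal sketch of the first step: labeling samples as normal or abnormal from free-text pathology notes. This is not my final deep-learning model, just a TF-IDF plus logistic-regression baseline, and the toy notes and labels below are invented for illustration; the real input would be the GTEx pathology annotations.

```python
# Baseline sketch: classify pathology notes as normal (0) or abnormal (1).
# Toy data only -- the real notes would come from GTEx pathology fields.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

notes = [
    "no significant pathologic abnormality",
    "unremarkable, well preserved tissue",
    "chronic inflammation with lymphocytic infiltrate",
    "focal carcinoma in situ identified",
]
labels = [0, 0, 1, 1]  # 0 = normal/healthy, 1 = abnormal/unhealthy

# TF-IDF turns each note into a weighted bag of words/bigrams;
# logistic regression then learns which terms signal "abnormal".
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(notes, labels)

# Score an unseen note (toy example, not a GTEx record).
print(clf.predict(["dense lymphocytic infiltrate noted"]))
```

A deep-learning model would replace this pipeline with learned embeddings, but a simple baseline like this is useful for sanity-checking the labels before anything fancier.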