Oof. But data cleaning REALLY is a drag. Take this scenario. I have clinical data for around 200-500 patients for each tissue in the GTEx database. And there are almost 30 different tissue types. If I have to separate all the patients in each of these tissue types into “Healthy” and “Unhealthy” categories based on their clinical descriptions, I might actually go crazy. Or more crazy. Whatever.
You might’ve heard the phrase “Necessity is the mother of invention.” I think that’s wrong. At least partially. If you ask me, laziness is the mother of invention. And my mother doesn’t need to tell me that I am NOT sorting through over 10,000 patients on a nice Sunday afternoon. Or any other afternoon.
Instead, I’ve decided to build a deep learning model as a patient classifier, using Keras and TensorFlow in Python. It’ll probably take only 100 more hours than the time it would’ve taken me to sort through all the patients manually.
This week, I curated my dataset and began building the model. I used a 75/25 train/test split, and decided to add an extra step to vectorize the patient descriptions in the data I’m using. I ran into a couple of issues here, but I still have my sanity. I also read a paper on a cell-identifying tool I might be using called CIBERSORT.
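In case you’re curious what the split-and-vectorize step looks like, here’s a minimal sketch. The toy descriptions and labels below are made up (my actual GTEx clinical fields are different), and I’m using scikit-learn’s TF-IDF vectorizer as one reasonable way to turn free text into numbers before it ever touches a Keras model:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy clinical descriptions and labels (0 = Healthy, 1 = Unhealthy).
descriptions = [
    "no significant medical history, non-smoker",
    "type 2 diabetes, hypertension, former smoker",
    "healthy adult, no chronic conditions",
    "chronic kidney disease, on dialysis",
] * 25  # repeat to fake a 100-patient dataset
labels = [0, 1, 0, 1] * 25

# Vectorize the free-text descriptions into TF-IDF feature vectors.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(descriptions)

# 75/25 train/test split, stratified so both classes appear in each half.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels
)

print(X_train.shape[0], X_test.shape[0])
```

With 100 toy patients this leaves 75 rows for training and 25 for testing, and the resulting sparse matrices can be fed (densified) straight into a Keras classifier.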