8- Data for Dummies
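This week I read a lot about PCA plots and decided to create one for the thyroid tissue dataset within GTEx. Basically, a PCA (principal component analysis) plot projects each sample onto the few axes along which the dataset varies the most, so you can see the variation between samples in just two dimensions. PCA doesn’t cluster anything by itself, but samples with similar data land close together on the plot, which makes groups easier to spot.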
For the thyroid data I used, I separated the samples into “normal” phenotype samples and samples with thyroid-related autoimmune disease, based on the associated clinical data. Then, I processed the data and created this:
The blue samples represent “normal” thyroid tissue, and the red ones represent samples with an autoimmune disorder. Each axis of the plot is a principal component, so the plot summarizes variation across many dimensions in just two. What I saw here is that the thyroid samples with autoimmune disease do cluster slightly away from the normal samples, but what really stands out is that the thyroid samples as a whole split into two separate clusters.
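For anyone curious how a plot like this gets made, here is a minimal sketch of the general idea, not the exact pipeline I used. It runs on made-up random numbers rather than GTEx data, and the names `pca_project`, `normal`, and `disease` are just placeholders for illustration. It centers the data, uses NumPy's SVD to find the principal components, and returns each sample's coordinates on the first two.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X (samples x features) onto the top principal components."""
    Xc = X - X.mean(axis=0)  # center each feature (e.g. each gene) at zero
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]  # sample coordinates on each PC
    explained = (S ** 2) / np.sum(S ** 2)             # fraction of variance per PC
    return scores, explained[:n_components]

# Synthetic stand-in for expression data (NOT real GTEx values):
# 40 "normal" samples and 10 "autoimmune" samples, 50 features each.
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(40, 50))
disease = rng.normal(0.0, 1.0, size=(10, 50))
disease[:, :5] += 3.0  # assumption: a handful of features differ in the disease group

X = np.vstack([normal, disease])
labels = np.array(["normal"] * 40 + ["autoimmune"] * 10)

scores, explained = pca_project(X)
print("variance explained by PC1, PC2:", explained)
```

Plotting `scores[:, 0]` against `scores[:, 1]` and coloring the points by `labels` gives the kind of scatter plot described above. With real expression data you would usually log-transform and filter the values first, and a library such as scikit-learn offers a ready-made `PCA` class that does the same job.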
I’ll be reading more about PCA and what tools I can use to identify the clusters properly next week. Hopefully I’ll be able to explain it in a less vague way soon :). I’ll also try to pull out the sources of variation in this graph and see what’s driving the clusters.
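One common way to look for the sources of variation is to check the loadings, meaning how much each feature contributes to a principal component. The sketch below is only an illustration of that idea on synthetic numbers, not an analysis I have run on the thyroid data. The function name `top_loading_features` and the `gene_0`, `gene_1`, ... labels are made up for the example.

```python
import numpy as np

def top_loading_features(X, feature_names, component=0, k=5):
    """Return the k features with the largest absolute loading on one PC."""
    Xc = X - X.mean(axis=0)                       # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[component]                      # one weight per feature
    order = np.argsort(np.abs(loadings))[::-1][:k]
    return [(feature_names[i], float(loadings[i])) for i in order]

# Synthetic example (NOT GTEx data): 30 samples x 20 features,
# where the first three features are shifted in half of the samples.
rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(30, 20))
X[:15, :3] += 4.0
names = [f"gene_{i}" for i in range(20)]  # hypothetical feature names

top = top_loading_features(X, names, component=0, k=3)
for name, weight in top:
    print(name, round(weight, 3))
```

The features with the largest absolute loadings on PC1 are the ones pulling samples apart along that axis, so they are a reasonable first place to look when asking what separates two clusters.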
That sounds cool Shreya! I’m kind of wondering why the two clusters overlap so much if one has the disease and one doesn’t? Maybe I’m misunderstanding, but I thought they would be farther apart?
The graph looks exciting! I’m curious about the two groups of “normal” thyroid tissue and I can’t wait to hear more about what you find!