4- A series of unfortunate data

Mar 19, 2019

Looks like it’s been a pretty rough week for our senior projects. Let’s just say that I’ve now become a strong believer in Murphy’s Law.

So I have this dataset. Just a harmless dataset of normalized RNA-seq counts which is ever so slightly on the larger side. But after countless tries, I still can't access this data on the genomics cluster, much less modify it before analysis. Apparently all it takes is 8 GB of data to crash any server I launch. I'll probably be troubleshooting this into next week, but in the meantime, I've researched and saved the code I need to start my analysis. This includes steps to further normalize the Transcripts Per Million (TPM) values of the RNA-seq data by first log-transforming them and then centering each row on its median. After this, I will use a number of statistical tests and clustering methods to extract tissue-specific genes from the data. Needless to say, it takes a lot of statistical manipulation to even come close to figuring out your genes. Hopefully, it's worth it!
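For anyone curious what the log-transform and median-centering step might look like, here is a minimal sketch in Python with NumPy. It is not the code from my actual pipeline. It assumes rows are genes or isoforms and columns are samples, and the log base 2, the pseudocount of 1 (to avoid taking the log of zero), and the function name `log_median_center` are all illustrative choices.

```python
import numpy as np

def log_median_center(tpm, pseudocount=1.0):
    """Log2-transform a TPM matrix, then subtract each row's median.

    Assumes rows are genes/isoforms and columns are samples.
    The pseudocount (an assumption, here 1.0) keeps log2 defined for zero TPM.
    """
    tpm = np.asarray(tpm, dtype=float)
    logged = np.log2(tpm + pseudocount)
    # keepdims=True keeps the medians as a column so they broadcast across samples
    row_medians = np.median(logged, axis=1, keepdims=True)
    return logged - row_medians

# Toy 2-gene x 3-sample matrix (made-up values, not real data)
tpm = np.array([[0.0, 1.0, 3.0],
                [7.0, 15.0, 31.0]])
centered = log_median_center(tpm)
print(centered)
```

After centering, every row has a median of zero, so each value reads as "how far above or below this gene's typical expression" a given sample is. On a matrix as big as mine, this would probably have to be done in chunks of rows rather than all at once.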

TL;DR : Above is a semi-rant which concerns the frustrations of dealing with large datasets. Enjoy the cartoons!

4 Replies to “4- A series of unfortunate data”

  1. Eva P. says:

    Wow Shreya, I am so glad we are failing together! How many sequences are in your data files?

    1. Shreya S. says:

      I basically have a matrix of RNA-seq counts for 12,000 samples and 196,000 isoforms. I'll have to reduce this later, if I can only get it downloaded :(.

  2. Cindy K. says:

    Looks like the three of us are all experiencing some difficult times… Hopefully we’ll get things sorted out soon, though? (One can only hope.)

    1. Shreya S. says:

