Hi guys, this week I started coding!
I used the Genome Browser to download some data to work with. In truth, it’s a bit more than just “some” data… It’s a whopping 89,513 KB file of data!! [1] It took a ridiculously long time to download [2] and an even longer time to figure out how to use the file. See, it downloaded as something called an INTERVAL file, which I had not encountered before, and despite my many attempts to open it, it refused to make itself useful. I spent hours trying to read the file and finally decided to convert it into something I could actually work with: a comma-separated values (CSV) file. [3]
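For anyone curious, a conversion like that can be sketched in a few lines of Python. This is only a sketch, not necessarily how the conversion was actually done: it assumes the INTERVAL file is plain tab-delimited text (one record per line), and the function name `interval_to_csv` is made up for illustration.

```python
import csv


def interval_to_csv(src_path, dst_path):
    """Copy a tab-delimited text file to a comma-separated one.

    Assumption: the source file is plain text with tab-separated fields.
    Rows are copied one at a time, so the whole file never sits in memory.
    Returns the number of rows written (including any header row).
    """
    rows = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst)
        for row in reader:
            writer.writerow(row)
            rows += 1
    return rows
```

Because it works row by row, the same function should behave the same way on a tiny test file and on the full-size download.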
This week, I had three main goals:
1. Select and display certain columns of data (there are a total of 17 columns): genoName, genoStart, genoEnd, strand, repName, repClass, and repFamily.
2. Count the number of distinct classes (repClass), families (repFamily), and subfamilies (repName).
3. Create a histogram representing goal #2.
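Goals 1 and 2 can be sketched with nothing but Python's standard library. This is a minimal sketch under a couple of assumptions: the CSV file has a header row containing the column names listed above, and the function name `summarize` and the `preview` parameter are made up for illustration.

```python
import csv
from collections import Counter

# Columns to display (goal 1) and columns to count distinct values in (goal 2).
WANTED = ["genoName", "genoStart", "genoEnd", "strand",
          "repName", "repClass", "repFamily"]
COUNTED = ["repClass", "repFamily", "repName"]


def summarize(csv_path, preview=5):
    """Return (first few rows of the wanted columns, per-column Counters).

    Assumption: the first line of the file is a header with these names.
    The file is read one row at a time, so even a very large file is not
    loaded into memory all at once.
    """
    counts = {col: Counter() for col in COUNTED}
    shown = []
    with open(csv_path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f)):
            if i < preview:
                shown.append([row[col] for col in WANTED])
            for col in COUNTED:
                counts[col][row[col]] += 1
    return shown, counts
```

Calling `shown, counts = summarize("some_file.csv")` (with a hypothetical file name) gives back a few rows to print, and `len(counts["repClass"])` is then the number of distinct classes; the same goes for `repFamily` and `repName`.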
This was a really exciting week. Forgetting just how large the file was [4], I used it to test my code and ended up crashing Shell (the platform I run my Python programs on). Fun times! Now, I’m using a considerably smaller file when testing my code. [5] With goals one and two complete, I attempted to tackle number three. This was a bit of a challenge for me because I had never used code to make graphs before. [6] Many websites later, I finally made my own baby histogram! Just look at it. Isn’t it adorable?
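For readers who also haven't made graphs with code before, here is one way a chart like this could be drawn. It is a sketch, not the code behind the picture above: it assumes the third-party matplotlib library is installed, it draws a bar chart of category counts (which is what a histogram of classes or families amounts to), and the names `top_counts` and `plot_counts` as well as the sample numbers are made up for illustration.

```python
from collections import Counter


def top_counts(counts, n=10):
    """Return (labels, values) for the n most common items, largest first."""
    pairs = Counter(counts).most_common(n)
    labels = [label for label, _ in pairs]
    values = [value for _, value in pairs]
    return labels, values


def plot_counts(counts, title, out_path, n=10):
    """Save a bar chart of the n most common categories to out_path."""
    # Imported here so the counting helper works without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # draw to a file instead of opening a window
    import matplotlib.pyplot as plt

    labels, values = top_counts(counts, n)
    fig, ax = plt.subplots()
    ax.bar(labels, values)
    ax.set_title(title)
    ax.set_xlabel("category")
    ax.set_ylabel("number of entries")
    ax.tick_params(axis="x", labelrotation=45)
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)


# Made-up counts, only to show the shape of the input.
sample = Counter({"SINE": 12, "LINE": 9, "LTR": 6, "DNA": 4, "Simple_repeat": 2})
```

With that in place, `plot_counts(sample, "Entries per repClass", "repclass.png")` would write the chart to a PNG file. Limiting the chart to the `n` most common categories keeps it readable when a column such as `repName` has a large number of distinct values.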
Now I’m just working on making the “official” histograms for my project, and in the meantime, I’m still reading those research/review articles.
Until next time!
* * *
[1] To put this into context, according to this source, a typical five-page word-processor document or a typical HTML web page takes up about 30 KB of data (1 KB = 1,000 bytes). That means that the file I’m using is almost 3,000x the size of that five-page document. That’s equivalent to almost 15,000 pages of a word-processor document or 3,000 HTML web pages. Imagine that!
[2] Fine, it took maybe ten minutes max.
[3] Those from the Data Structures class might remember this as a .csv file from our time making glossaries for our ERDs. ☺
[4] It has 17,825,792 entries in spreadsheet form!!
[5] This one has 340 entries and is only 2 KB in size.
[6] I’d always just use a spreadsheet editor, which was way easier to figure out how to use.