week two
Hi guys, this week I started coding!
I used the Genome Browser to download some data to work with. In truth, it’s a bit more than just “some” data… It’s a whopping 89,513 KB file of data!!¹ It took a ridiculously long time to download² and an even longer time to figure out how to use the file. See, it downloaded as something called an INTERVAL file, which I had not encountered before, and despite my many attempts to open it, it refused to make itself useful. I ended up spending hours trying to read the file, and finally just decided to convert it into something I could actually work with: a Comma Separated Values file.³
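For anyone facing the same problem, here is a minimal sketch of one way such a conversion could look in Python. It assumes the interval file is plain tab-separated text, which this post does not confirm, and the function name `tab_to_csv` and the two sample rows are made up for illustration. It is not necessarily how the conversion was actually done this week.

```python
import csv
import io

def tab_to_csv(src, dst):
    """Copy tab-separated rows from src to dst as comma-separated rows.

    Assumption: the input is plain tab-delimited text.
    Rows are handled one at a time, so the whole file never sits in memory.
    """
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, lineterminator="\n")
    count = 0
    for row in reader:
        writer.writerow(row)
        count += 1
    return count  # number of rows written

# Hypothetical two-row input standing in for the real download.
sample = io.StringIO("chr1\t100\t400\t+\nchr2\t50\t350\t-\n")
out = io.StringIO()
n = tab_to_csv(sample, out)
print(n)
print(out.getvalue())
```

With real files, the same function should work on handles from `open(path, newline="")`, one opened for reading and one for writing.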
This week, I had three main goals:
1. Select and display certain columns of data (there are a total of 17 columns): genoName, genoStart, genoEnd, strand, repName, repClass, and repFamily.
2. Count the number of distinct classes (repClass), families (repFamily), and subfamilies (repName).
3. Create a histogram representing goal #2.
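For the curious, the three goals above could be sketched roughly like this in Python. It is only an illustration, not the exact program from this week: the sample rows and their values are invented, the helper names are hypothetical, and the "histogram" is a plain text bar chart rather than a real plotting library, just to keep the example self-contained.

```python
import csv
import io
from collections import Counter

# The seven columns named in goal 1.
WANTED = ["genoName", "genoStart", "genoEnd", "strand",
          "repName", "repClass", "repFamily"]

# Invented sample rows standing in for the real CSV file.
# swScore is here only as an example of a column that gets dropped.
SAMPLE_CSV = """swScore,genoName,genoStart,genoEnd,strand,repName,repClass,repFamily
1200,chr1,100,400,+,AluY,SINE,Alu
3400,chr1,900,1500,-,L1PA2,LINE,L1
1100,chr2,50,350,+,AluSx,SINE,Alu
800,chr2,700,1200,+,MIR,SINE,MIR
"""

def select_columns(rows, wanted=WANTED):
    """Goal 1: keep only the wanted columns from each row."""
    return [{name: row[name] for name in wanted} for row in rows]

def count_distinct(rows, column):
    """Goal 2: number of distinct values found in one column."""
    return len({row[column] for row in rows})

def text_histogram(rows, column):
    """Goal 3 (simplified): one bar of '#' per value, most common first."""
    counts = Counter(row[column] for row in rows)
    return [f"{value:<6}{'#' * n} ({n})" for value, n in counts.most_common()]

rows = select_columns(csv.DictReader(io.StringIO(SAMPLE_CSV)))
distinct = {col: count_distinct(rows, col)
            for col in ("repClass", "repFamily", "repName")}
print(distinct)
for line in text_histogram(rows, "repClass"):
    print(line)
```

On the invented sample this reports 2 classes, 3 families, and 4 subfamilies. The real file would be read with `open(...)` instead of `io.StringIO`.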
This was a really exciting week. Forgetting just how large the file was⁴, I used it to test my code and ended up crashing Shell (the platform I run my Python programs on). Fun times! Now, I’m using a considerably smaller file when testing my code.⁵ With goals one and two complete, I attempted to tackle number three. This was a bit of a challenge for me because I had never used code to make graphs before.⁶ Many websites later, I finally made my own baby histogram! Just look at it. Isn’t it adorable?

I guess this is a pre-evolved histogram? It still has a long way to go before it can be considered “useful”, after all.
Now I’m just working on making the “official” histograms for my project, and in the meantime, I’m still reading those research/review articles.
Until next time!
* * *
¹ To put this into context, according to this source, a typical five-page word-processor document or a typical HTML web page takes up about 30 KB of data (1 KB = 1,000 bytes). That means that the file I’m using is almost 3,000 times the size of that five-page document. That’s equivalent to almost 15,000 pages of a word-processor document or 3,000 HTML web pages. Imagine that!
² Fine, it took maybe ten minutes max.
³ Those from the Data Structures class might remember this as a .csv file from our time making glossaries for our ERDs. ☺
⁴ It has 17,825,792 entries in spreadsheet form!!
⁵ This one has 340 entries and is only 2 KB in size.
⁶ I’d always just use a spreadsheet editor, which was way easier to figure out how to use.
Hi Cindy! I like your histogram 🙂 I’m also using interval files for genomic regions :O It’s much easier to understand your histogram than interval files.
Hey Cindy! I love your baby histogram, and can’t wait to see the evolved versions!