I’m interested in data and information visualization, but unfortunately I know next to nothing about statistics or data analysis. I’m trying to learn some stats on my own, so I’ve been casting about for some “interesting” data to mine.
While using iTunes one day, I noticed that quite a few of the ~2900 songs in my library had never been played even once. I was curious about the distribution of play counts, and while I was at it I wondered how strong the correlation was between the length of time a song had been in the library and the number of times it had been played. I exported the library and wrote some Python scripts to extract the data (using this helpful library to parse the plist-in-XML file that iTunes exports).
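For anyone who wants to try the same thing, here is a minimal sketch of the extraction step. It is not the script I used: it swaps in Python's standard `plistlib` module for the third-party parser, and it assumes the export has a top-level `Tracks` dictionary whose entries carry `Play Count` and `Date Added` keys, with `Play Count` simply missing for songs that were never played. The function name `extract_tracks` and the `now` parameter are made up for illustration.

```python
import plistlib
from datetime import datetime


def extract_tracks(plist_bytes, now):
    """Return (play_count, age_in_days) pairs for each track in an exported library.

    Assumptions (not confirmed by the original scripts): tracks live under a
    top-level "Tracks" dict, and unplayed tracks have no "Play Count" key.
    """
    library = plistlib.loads(plist_bytes)
    rows = []
    for track in library.get("Tracks", {}).values():
        plays = track.get("Play Count", 0)  # a missing key is treated as zero plays
        added = track.get("Date Added")
        if added is None:
            continue  # skip entries with no date; age cannot be computed
        rows.append((plays, (now - added).days))
    return rows
```

With a real export you would call it as `extract_tracks(open(path, "rb").read(), datetime.utcnow())`, where `path` points at the exported XML file. By default `plistlib` returns naive datetimes, so `now` should be naive as well.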
It turns out I have 208 unplayed songs in my library, plus lots of songs with play counts in the low single digits. Here’s an (ugly, Excel-generated) histogram:
It doesn’t really follow a power law well: the curve seems to level off for a while, and there’s a dip at zero play counts.
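The histogram itself came out of Excel, but the same tally is easy to sketch in Python. The second function below is a crude power-law check of my own choosing, not something from the original analysis: it fits a straight line to the histogram on log-log axes, where a true power law would look linear. Zero play counts have to be dropped because log(0) is undefined, and a least-squares fit on binned data is only a rough eyeball test. Both function names are hypothetical.

```python
import math
from collections import Counter


def play_count_histogram(play_counts):
    """Map each play count to the number of songs that have it."""
    return Counter(play_counts)


def loglog_slope(histogram):
    """Least-squares slope of log(songs) against log(play count).

    A rough check only: a power law appears as a straight line on log-log
    axes, and this returns that line's slope. Zero play counts are skipped.
    """
    points = [(math.log(k), math.log(n))
              for k, n in histogram.items() if k > 0 and n > 0]
    if len(points) < 2:
        raise ValueError("need at least two non-zero play counts")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx
```

A slope alone does not say how good the fit is, so it is still worth plotting the points and looking for the levelling-off by eye.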
While I was delving around, I figured I would see if there’s any correlation between the length of time a song has been in my library and the number of times it’s been played. The dot plot turned out to be interesting.
Looks like there’s a weak positive correlation between age and play count, which is to be expected. What intrigued me more is the vertical lines of dots, which seem to indicate music being added in significant bunches that, at least at first glance, look bigger than one album.
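To put a number on "weak positive correlation" and to find the bunches without squinting at the plot, something like the following sketch could work. It computes the plain Pearson coefficient by hand and counts how many songs were added on each calendar day. The function names and the `min_tracks` threshold are made up for illustration, and what counts as "bigger than one album" is a guess you would have to tune.

```python
import math
from collections import Counter


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def bulk_add_days(dates_added, min_tracks=20):
    """Return (day, track_count) for days on which at least min_tracks were added.

    min_tracks is a hypothetical cutoff for "more than one album".
    """
    per_day = Counter(d.date() for d in dates_added)
    return sorted((day, n) for day, n in per_day.items() if n >= min_tracks)
```

Feeding `pearson_r` the ages in days and the play counts gives a value between -1 and 1. The days returned by `bulk_add_days` should line up with the vertical lines in the plot.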
One of these days I’ll have to slap together something interactive so I can see what songs those clusters actually are.