Classical Music Has Open Data Sets?
Back in October, the Washington Post offered some blog coverage of Suby Raman’s first big data and the arts post. While the dramatic headline made it sound like Raman had written a death knell for opera, what he’d actually done was take reams of detailed data about the Metropolitan Opera’s performances and analyze them for trends. Because he’s a composer, he knew what to look for in the data and what would matter to people. Because he’s a programmer, he knew how to handle the big data set itself.
In that data, he found a lot of really interesting stories to tell–not about the death of opera itself, but about how the Metropolitan Opera has been producing more and more works by dead composers. As the repertoire has solidified, the average age of a composition has gone pretty far up. At New Music USA, we find that less than awesome, and we’re grateful for the clear explanation, complete with charts and graphs and labelled axes. (The labelled axes are incredibly important.)
Some of the criticism lobbed at Raman, mostly in various comment threads, attacked him for looking only at this one opera company instead of the entire field. That position completely misses the reality of what data science in the arts is actually like. Suby Raman didn’t pick the Metropolitan Opera out of a hat, try to say it’s a representative opera company, or argue that it’s more important than any other house. He picked it because they’re the ones who released all their historical data in a neat and usable package. Here’s his own disclaimer:
About the data: data was acquired from the Met Opera Database, in a timeframe from 1905 to present. One “performance” is a night of an identifiable opera performance. Opera performances data was scraped from the HTML and matched up to scraped composer/opera data from Wikipedia. The process of scraping/matching may have introduced some error.
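Raman hasn’t published the scraping code alongside that disclaimer, but the matching step he describes (normalizing scraped performance titles and looking them up against composer data scraped from Wikipedia) might look roughly like this sketch. Every name and data structure here is hypothetical, and the toy data is illustrative, not real Met or Wikipedia output:

```python
# Hypothetical sketch of the title-matching step: scraped Met performance
# titles are normalized, then looked up against opera/composer records
# scraped separately from Wikipedia. Mismatches go into an "unmatched"
# bucket -- the kind of scraping/matching error the disclaimer warns about.

def normalize(title: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace so minor
    formatting differences between the two sources don't break a match."""
    cleaned = "".join(ch for ch in title.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())

# Toy stand-in for composer/opera data scraped from Wikipedia.
wikipedia_operas = {
    normalize("La Bohème"): {"composer": "Puccini", "premiere": 1896},
    normalize("Don Giovanni"): {"composer": "Mozart", "premiere": 1787},
}

# Toy stand-in for performance titles scraped from the Met's HTML.
met_performances = ["La Bohème", "Don  Giovanni", "Unknown Premiere"]

matched, unmatched = [], []
for perf in met_performances:
    info = wikipedia_operas.get(normalize(perf))
    if info:
        matched.append((perf, info["composer"]))
    else:
        unmatched.append(perf)
```

The point of the normalization pass is that two independently scraped sources almost never agree on formatting, so exact string comparison would silently drop valid records.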
Data may be big and getting bigger, but it’s not exactly thick on the ground in the performing arts. There is no IMDb for string quartets, composers, ballets, or even plays (though in theater there are some who are trying to make one). Where clear data sets exist, there’s a lot we can learn from them, and we should definitely encourage our big institutions to make more of their data transparently available to the public. A better headline for that Washington Post piece might have been “Holy crap, even the Metropolitan Opera is opening up their data?”
While the data available for analysis are limited in scope, they are still valuable, and Suby Raman did some great science by being clear about the limitations of his method, and about how more could be done with more information.
Plus he used a lot of animated gifs in that blog post. Who doesn’t love a good animated gif?
If you want more evidence of Raman’s competence to analyze data in the performing arts, check out the disclaimer text from his most recent blog post. This one’s on the gender diversity of major American orchestras:
Note: Dataset is 1,833 unique orchestra performers, taken from current orchestra roster pages. “Laureate” and “Emeritus” musicians were not included. “Top 20 Orchestras” was defined as the top 20 orchestras ranked by base salary. Librarians and other personnel were not included; you guys are fantastic, but this just examines musicians. I will post the dataset to GitHub in a few days, check here for the link.
Because of variances in doubling vs. unique performers, “ancillary” instruments like piccolo, English horn, etc. were included in their more common section (flute, oboe, etc).
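The cleaning rules in that note (drop “Laureate” and “Emeritus” musicians, fold ancillary instruments into their more common section) are simple enough to sketch in a few lines. This is not Raman’s code; the function, field names, and sample roster are all hypothetical, and the instrument-folding table below only covers the examples his note mentions:

```python
# Hypothetical sketch of the roster-cleaning rules described in the note:
# exclude Laureate/Emeritus titles, and fold "ancillary" instruments
# (piccolo, English horn, etc.) into their more common section.

SECTION_FOR = {
    "piccolo": "flute",        # piccolo counts in the flute section
    "english horn": "oboe",    # English horn counts in the oboe section
}

EXCLUDED_TITLES = ("laureate", "emeritus")

def clean_roster(roster):
    """roster: list of (name, title, instrument) tuples scraped from
    an orchestra's roster page. Returns (name, section) pairs."""
    kept = []
    for name, title, instrument in roster:
        if any(word in title.lower() for word in EXCLUDED_TITLES):
            continue  # honorary positions are excluded from the dataset
        inst = instrument.lower()
        kept.append((name, SECTION_FOR.get(inst, inst)))
    return kept

# Illustrative sample roster, not data from any real orchestra.
sample = [
    ("A. Player", "Principal Flute", "Flute"),
    ("B. Player", "Piccolo", "Piccolo"),
    ("C. Player", "Concertmaster Emeritus", "Violin"),
]
print(clean_roster(sample))
```

Folding doubled instruments into one section is what keeps the per-section counts comparable across orchestras that list their players differently.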
He’s brought both a good knowledge of data scraping and a good knowledge of the inner workings of orchestras to the job. I can think of few data scientists with the music background to let them do this. My one quibble, and it is not insignificant, is that there isn’t anything in here about how the gender of each player was determined. Was it by name? By appearance? It probably wasn’t by self-identification, and there do seem to be only two kinds of gender in this analysis, which is its own problem. But the stories Raman is trying to tell here are also clear, also compelling, and also a strong, evidence-based call for change. One particularly good point, revealed by proper analysis like this, is how many principal flutists are men, despite most flutists in the study’s orchestras being women. Wow. This post does only have one animated gif, though.
Did I mention that Raman also writes music?