The Rio Olympics officially kick off tomorrow evening. Despite the concerns about water, housing, and the general lack of readiness in Rio, it appears that the show will go on. The United States is sending 554 athletes in sports ranging from Track and Field to Canoe. The other day, I spent some time looking at the list of athletes and my curious nature got the best of me. I started to ask questions about the team:

• Is there any particular state or college that produced more Olympic athletes than others?
• Which age group is most common?
• Which states are most heavily represented in particular sports?

So, I decided to get some data and analyze it in order to answer these questions. Fortunately for me, the Team USA website provides a listing of the athletes in Excel format (http://www.teamusa.org/road-to-rio-2016/team-usa/athletes). As is typical of any data analysis project, I first had to do some data cleansing—the states, birth dates, and colleges all required quite a bit of cleansing and normalization. After enhancing the data with various calculations and rankings, I was ready to analyze it. To visualize the data, I used a web-based visualization product called RAW, which was built by Density Design Lab (http://app.raw.densitydesign.org). RAW is a great tool and, if you’ll bear with me until the end, I’ll talk more about it.

Sport Analysis
First, I created a Bubble Chart (RAW calls it a Circle Packing chart) based on sport.

I knew that Track and Field had a lot of events and a lot of athletes, but I was quite surprised to see how much larger its team was than other sports. With 129 athletes, Track and Field is almost three times the size of the next largest sport, Swimming.

School/College Analysis
Next, I wanted to see how various schools and colleges are represented.

As you can see, California universities (Stanford, UC Berkeley, and USC) account for a large number of the athletes. The highest number of athletes comes from Stanford (29), with UC Berkeley right on its heels (27). Perhaps the most interesting observation is that homeschoolers are quite heavily represented. My presumption is that many parents of athletes decided to homeschool in order to give their children the flexibility needed for their grueling training schedules.

Age Analysis
From here, I wanted to analyze the ages of the athletes.

The bulk of the athletes are between 19 and 31, with the largest group being 26, but I was glad to see some team members in their 40s and even some in their 50s! Interestingly, these ages align pretty nicely with the aging curves of various sports. I’ve written a bit about aging curves before (see my analysis of Tiger Woods’s decline: https://www.linkedin.com/pulse/tiger-woodss-decline-what-does-data-say-kenneth-flerlage-1), but I’ll explain again here.

The first aging curve was created by a man named Bill James (I refer to him as “the original Nate Silver”). He was a true baseball statistics geek who created the voluminous Baseball Abstract starting in 1977. He was also instrumental in the creation of “sabermetrics”, which is the analysis of baseball and baseball players through the use of statistics. The real Nate Silver, as well as numerous others, has applied sabermetric models to other areas (for Nate Silver, most notably politics).

Through his study of baseball statistics, James developed the idea of aging curves. These curves essentially show the average performance improvement or decline that should be expected of a player as he or she ages. Below is an example aging curve for baseball (taken from http://www.fangraphs.com/library/the-beginners-guide-to-aging-curves/).

Since then, people have created similar aging curves for numerous sports (my post on Tiger Woods, for instance, used a golf aging curve as evidence that Tiger’s decline is due mostly to his age). Interestingly, the aging curves for common sports such as baseball, basketball, and soccer/football tend to show that the average athlete peaks around the age of 26 or 27. So, it should come as no surprise that the most highly represented ages of US Olympic athletes are 26 and 27.
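For anyone who wants to reproduce the age counts, here is a minimal Python sketch of one way to turn birth dates into ages and find the most common one. It is not necessarily how the chart above was built. The birth dates are made up for illustration, the `age_on` helper is hypothetical, and using the opening ceremony date (August 5, 2016) as the reference date is my assumption.

```python
from collections import Counter
from datetime import date


def age_on(birth: date, as_of: date) -> int:
    """Return the age in whole years on the date `as_of`."""
    years = as_of.year - birth.year
    # Subtract one year if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (birth.month, birth.day):
        years -= 1
    return years


# Made-up birth dates for illustration only; these are not real athlete data.
birth_dates = [
    date(1990, 3, 14),
    date(1989, 12, 1),
    date(1990, 8, 20),
    date(1975, 6, 30),
]

# Assumed reference date: the day of the opening ceremony.
OPENING_CEREMONY = date(2016, 8, 5)

ages = [age_on(b, OPENING_CEREMONY) for b in birth_dates]
age_counts = Counter(ages)  # age -> number of athletes of that age
most_common_age, athlete_count = age_counts.most_common(1)[0]
print(most_common_age, athlete_count)
```

The choice of reference date matters for anyone with a birthday during the Games, so it is worth fixing one date and using it for the whole team.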

State Analysis
Next, I decided to analyze the US team based on state (Note: The data set I obtained had birth state, hometown state, and current state. I chose to use hometown state for my analysis).

The bubble chart clearly shows that California is the most heavily represented. This is not surprising, of course, given that California is the most populous state. The rest of the top 5 in population—Texas, Florida, New York, and Illinois—are also quite heavily represented. I decided it’s not really fair to compare raw numbers like this. So, I obtained some 2015 population estimates from census.gov (http://www.census.gov/popest/data/state/totals/2015/) and changed the chart to show athletes per capita. The resulting bubble chart looks quite different.

California is still very well represented, but some very small states (I’m using the term “state”, but also including the District of Columbia) are quite heavily represented. The District of Columbia, ranked 49th in population, is ranked 1st in Olympic athletes per capita. Rhode Island, ranked 43rd in population, is ranked 2nd in athletes per capita.
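The per capita calculation itself is simple: divide each state's athlete count by its population and scale the result to something readable, such as athletes per million residents. Below is a minimal Python sketch of that idea. The state names, counts, and populations are placeholders I made up for illustration (they are not the Team USA or census.gov figures), and the `per_million` helper is hypothetical.

```python
# Placeholder numbers for illustration only; not the real athlete counts
# or the 2015 census.gov population estimates.
athletes_by_state = {"State A": 120, "State B": 4, "State C": 30}
population_by_state = {
    "State A": 39_000_000,
    "State B": 650_000,
    "State C": 20_000_000,
}


def per_million(athletes: dict, population: dict) -> dict:
    """Athletes per one million residents, for states present in both inputs."""
    return {
        state: athletes[state] / population[state] * 1_000_000
        for state in athletes
        if state in population
    }


rates = per_million(athletes_by_state, population_by_state)

# Rank states from the highest to the lowest per capita rate.
ranking = sorted(rates, key=rates.get, reverse=True)

for state in ranking:
    print(f"{state}: {rates[state]:.2f} athletes per million")
```

With these placeholder numbers, the smallest state jumps to the top of the ranking even though it sends the fewest athletes, which is the same effect seen with the District of Columbia and Rhode Island.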

Note: Five states (Wyoming, West Virginia, North Dakota, Montana, and Alaska) are not on either chart because they are not sending any athletes to the Olympics.

State to Sport Analysis
Finally, I thought it would be interesting to see how states relate to sports. For this, I created a Sankey diagram (RAW calls it an Alluvial Diagram).

I should note that this chart was based on raw numbers, so once again, California is the most heavily represented. There are definitely some interesting observations that can be made from this visualization. First, Oregon loves Track and Field! The vast majority of Oregon’s athletes (nearly 80%) are competing in Track and Field. A second interesting observation is that almost every member of the men’s and women’s water polo teams comes from California.
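An observation like the Oregon one comes down to counting athletes per (state, sport) pair and dividing by the state's total. Here is a minimal Python sketch of that calculation, assuming the data is available as one (state, sport) row per athlete. The rows are made up for illustration and the `sport_share` helper is hypothetical.

```python
from collections import Counter

# Made-up (hometown state, sport) rows, one per athlete; not real data.
rows = [
    ("State A", "Track and Field"),
    ("State A", "Track and Field"),
    ("State A", "Track and Field"),
    ("State A", "Swimming"),
    ("State B", "Water Polo"),
    ("State B", "Swimming"),
]

# Number of athletes for each (state, sport) pair. These counts are the
# widths of the flows between the two columns of a Sankey/alluvial diagram.
links = Counter(rows)

# Total number of athletes per state.
state_totals = Counter(state for state, _sport in rows)


def sport_share(state: str, sport: str) -> float:
    """Fraction of a state's athletes who compete in the given sport."""
    return links[(state, sport)] / state_totals[state]


print(f"{sport_share('State A', 'Track and Field'):.0%}")
```

Dividing by the total for the sport instead of the total for the state would answer the reverse question, such as what share of the water polo athletes come from a single state.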

Well, that’s all for now. Feel free to analyze these charts further and let me know if you make any additional interesting observations!

Before I go, I want to take just a moment to talk about the visualization tool used for these diagrams. As noted above, I used a web-based tool called RAW, which was built by Density Design Lab (http://app.raw.densitydesign.org). This is a fantastic tool for quickly creating some very nice visualizations. You can easily create Sankey diagrams (Alluvial Diagrams), bubble charts (Circle Packing), plus 14 other chart types (you can also design your own). And the process is very simple: you just drop your data onto the site, choose your visualization, and select your dimensions and metrics. You can, of course, create these types of visualizations in other tools like Tableau, but some of the more complex visualizations, such as Sankey diagrams, can be quite difficult and time-consuming to create. With RAW, you have your visualization in minutes. RAW does not have all the bells and whistles of Tableau (you can’t interact with the visualization, change filters, etc.), but for quick, simple visualizations, it really is a fantastic tool. Be sure to check it out.

Enjoy the Olympics everyone!!

Ken Flerlage, August 4, 2016
