Elizabeth Richardson, Trinity Communications
For one weekend each spring, Perkins Library turns into an analytical marathon. Laptops are open, whiteboards are filled with ideas and groups of undergraduate students huddle around tables to make sense of a massive mystery dataset.
The event is DataFest, organized by the American Statistical Association (ASA), and it challenges students to do something they rarely get to do in class: Dive into a huge, messy dataset and see what they can uncover.
“It’s a data analysis competition,” said Alexander Fisher, assistant professor of the practice of Statistical Science and one of the faculty organizers. “What makes DataFest unique is that there is a large, surprise data set, unlike what students usually see in the classroom.”
This year, 102 students participated. Teams of up to five students receive the dataset on Friday evening and have roughly 48 hours — closer to 40 in practice — to analyze it and present their findings by Sunday afternoon. The dataset, provided by a different sponsor each year, is often massive.
“If you try to run any models that you’re used to running from class just out of the box on the whole thing, it will probably fail,” Fisher said. Instead, students have to think creatively about how to break a large problem into smaller pieces and use the tools they know in new ways.
The weekend begins with an introduction to the dataset. Then, students move to Perkins Library, where they settle in with their teams and begin analyzing. Some stay late into the night, while others head back to dorms or nearby hotels before returning the next morning.
To make the technical side easier, the statistics department provides a shared computing environment that students can access remotely, giving everyone the data and computing power they need.
“It’s critical that students don’t have to waste time on the technology,” said Joan Combs Durso, research scientist in Statistical Science and another faculty organizer. “When you’re getting students who have a random variety of experiences, it’s important that the technology is accessible to all of them.”
While Duke hosts its own event, DataFest happens at dozens of institutions around the world between March and early May, with all students working with the same dataset on different weekends.
Students from the University of North Carolina at Chapel Hill make the short trip to Duke each year to participate, while others come from farther away. This year, teams came from Furman University and Coastal Carolina University, adding to the collaborative atmosphere. Teams compete for several awards, but organizers say the focus is less on prizes and more on the experience of working together on a real-world problem.
For many students, DataFest is the first time they encounter data that resembles what analysts work with outside academia.
This year’s dataset involved healthcare data and was sponsored by Stormont Vail Health, a healthcare system in Topeka, Kansas. In the past, sponsors have included the American Bar Association, Expedia and Ticketmaster.
“For many students, it’s the first time they’ve been exposed to a large, messy dataset that is not unlike what they’ll see outside of school,” Combs Durso said. “It gives them a chance to try out the skills that they have on real-world data and real-world questions.”
Another part of what makes DataFest distinctive is its openness. Unlike research opportunities or internships, which often have barriers to entry, any undergraduate can participate.
“Anybody can sign up and participate in DataFest,” Fisher said. “We often see students from the intro data science class coming in and winning prizes.”
Participants come from a wide range of majors, including computer science, mathematics, global health and the social sciences.
“Statistics majors do not make up the majority of participants,” Fisher said.
Tien Thai, ’29, a first-year student, and her teammates knew they did not have the skill sets other teams had, but they still came in excited to learn.
“The topic this year was quite relevant to my career path,” she said, “and as a premed student who has a large interest in data science and computational biology, we picked a topic that we wanted to not just address the dataset from a health perspective, but a sociology perspective as well.”
This led to her team’s project: “The Sepsis Burden: Geographic and Social Disparities represented in Stormont Vail Health.” She believed that the team’s diverse interests fostered strong ideas and conversations that went beyond pure data.
Thai was shocked to learn her team made it to the final round of the competition.
“We were not expecting to win at all,” she said. “I remember one of my teammates confirming with me three times that they said our names to make it into the finals.”
The surprise only added to what Thai said was a great experience, and one she is eager to repeat next year.
Throughout the weekend, students also interact with volunteer consultants and judges, many of whom come from industry.
“We get a lot of industry interaction,” Fisher said. “Students are able to get feedback from professionals in ways they might not otherwise experience.”
Mine Çetinkaya-Rundel, professor of the practice of Statistical Science, has been working with DataFest since it came to Duke.
“Over time, DataFest at Duke has become not just a competition, but an important formative experience,” she said. “It’s one that helps students see themselves as capable data practitioners while also fostering a sense of community among participants from different institutions who come together to learn from one another.”
Çetinkaya-Rundel said that DataFest has become more than a competition; it is a space where learning extends beyond the classroom.
Students receive feedback from faculty mentors and industry professionals, work across disciplines and apply the techniques they’ve learned in class to real-world data problems.
The weekend reinforces the skills students build in their coursework while giving them an opportunity to test those skills in a fast-paced, collaborative environment. For organizers, the highlight is seeing students collaborate, experiment and learn from each other over the course of the weekend.
“Being there that weekend and seeing it all come together is a joy,” Fisher said.