Glowing objects near Mount Rainier, crashes in Roswell—for millennia, people have looked up to the skies and witnessed mysterious unidentified flying objects (UFOs). But are these signs of alien civilizations, or the result of earthly phenomena such as weather balloons, rocket launches, or even the release of a popular sci-fi movie?
These are some of the questions that 60 USC computer science master’s students set out to answer last semester in CS599: Content Detection and Analysis for Big Data, an advanced content and data analysis class taught and developed by computer science adjunct professor Chris Mattmann. A lifelong fan of all things space-related, Mattmann is a principal data scientist with NASA’s Jet Propulsion Laboratory (JPL) in Pasadena and the director of USC’s Information Retrieval and Data Science group.
Working in small teams, students in the class sifted through thousands of documented sightings of UFOs using techniques like data mining and feature extraction to discover how factors such as movie releases and weather events influenced sighting patterns. So, what did they discover? While they didn’t find any conclusive evidence of alien life visiting our planet, the students did unearth some interesting patterns hidden within the data.
“We found sci-fi movie releases correlated with an increase in sightings,” said Matheos Asfaw, who took the course last semester. “We also found that events like thunderstorms caused an uptick in reports.”
Asfaw, who graduated in May and has since secured a full-time role with Santa Monica software company Cornerstone OnDemand, was a member of Team 8, along with Eric Hachuel, Pablo Guidice, Bruno Mazetti and Teague Ashcraft. You can see the team’s visualization and analysis of their research here.
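As a rough illustration of how a relationship like the one Asfaw describes might be measured, the sketch below computes a Pearson correlation coefficient between two monthly series. It is not the team's actual method, which the article does not describe. The function name `pearson` and both lists of counts are made-up placeholders, not the students' data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the mean (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Made-up placeholder counts per month, NOT data from the class project.
sci_fi_releases = [0, 1, 0, 3, 2, 0]
sightings = [40, 55, 42, 90, 70, 38]

r = pearson(sci_fi_releases, sightings)
print(f"Pearson r = {r:.2f}")
```

A coefficient near 1 would indicate that the two series rise and fall together. On its own it would not show that movie releases cause sightings.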
The Truth is Out There?
Although the topic of UFOs may raise a smile, make no mistake—exploring a mystery of this magnitude requires rigorous analyses involving massive amounts of unstructured data, including dates, locations and descriptions of sightings.
In fact, each team analyzed a public database of more than 60,000 sightings, which they enriched with data from other sources during the semester. Finally, they used data visualization techniques to evaluate the data analysis tools and communicate their insights. Along the way, they learned important lessons about using data to paint a clearer picture of a complex problem.
“I came to USC for my master’s because I wanted to be a computer scientist,” said Eric Hachuel, who earned his undergraduate degree in industrial engineering and is working at SpaceX this summer as a software flight reliability intern.
“I really enjoyed this class because it taught us the process a data scientist would really follow in real life. It’s a dynamic process—it’s not straightforward and it can be frustrating. You have to really dig to find the answers, but that’s how you learn.”
Close encounters of the data kind
As a PhD student in computer science at USC, Mattmann, who graduated in 2007, relished the opportunity to work on projects with real-world implications. It was during this time that he co-developed the Apache Tika software used to extract data in the Panama Papers, exposing how some wealthy individuals exploited offshore tax regimes.
Now, he vows to bring the same practical experience to his own students by focusing on a different real-world topic with every course. In the past, his students have wrangled data about polar ice and the sale of illegal weapons online, both of which were real-life research projects Mattmann was working on at the time.
Most importantly, no matter their skill level on the first day, Mattmann hopes his students will leave the class prepared to tackle even the most unwieldy data.
“In this class, I’m hoping to expose students to real-world big data end-to-end: how to collect it, what to do with it, and how to infer knowledge ethically and responsibly,” said Mattmann.
In addition to the UFO project, students participated in weekly class presentations addressing diverse topics, from the rise and fall of Bitcoin to the Cambridge Analytica data leak, including the ethical considerations of data collection.
“I want my students to do something new and relevant every time,” said Mattmann. “We have students from a broad range of backgrounds with a variety of different undergraduate degrees and many come in knowing very little about data analysis. But they leave well on their way to becoming masters of the tools.”
To extract as much meaningful information as possible for analysis, students scraped data from ufostalker.com, a UFO sighting website, and converted PDF files of archival sightings into a searchable format that could be added to the database. They also used machine-learning software co-created by Mattmann to analyze images of UFO sightings and automatically group similar ones.
Through this type of project-based learning, students learn to solve problems the same way a computer scientist would in the real world—including facing the fundamental challenges of dealing with large amounts of raw data.
“We learned the importance of cleaning the data early on in the process,” said Pablo Guidice, a member of Team 8.
“For example, we scraped data from an online forum about UFO sightings, but people were also posting ads in the same forum. If we hadn’t cleaned the data first, it could have led us to the wrong conclusions.”
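The kind of cleaning step Guidice describes can be sketched as a simple filter over scraped posts. This is a minimal illustration under stated assumptions, not the team's code. The keyword list in `AD_PATTERN`, the helper names, and the sample posts are all hypothetical, and a real project would likely need a more careful way of identifying ads.

```python
import re

# Hypothetical ad markers, chosen for illustration only; a real project
# would tune these against examples from the actual forum.
AD_PATTERN = re.compile(
    r"\b(buy now|for sale|discount|free shipping|click here)\b",
    re.IGNORECASE,
)

def is_probable_ad(text):
    """Return True if the text contains one of the assumed ad phrases."""
    return bool(AD_PATTERN.search(text))

def clean_posts(posts):
    """Drop empty posts, probable ads, and repeated posts, keeping order."""
    seen = set()
    cleaned = []
    for post in posts:
        text = " ".join(post.split())  # collapse runs of whitespace
        if not text or is_probable_ad(text):
            continue
        key = text.lower()  # treat posts differing only in case as duplicates
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(text)
    return cleaned

# Made-up sample posts, not scraped data.
sample = [
    "Saw a bright  light over the lake",
    "BUY NOW: telescopes, free shipping!",
    "saw a bright light over the lake",
    "   ",
]
print(clean_posts(sample))
```

Running the filter before any counting or correlation keeps ads and repeated posts from inflating the number of apparent sightings.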
The struggle was all part of the plan for Mattmann. By having students confront this real-world data messiness, he aims to give them a taste of life as professional data scientists and prepare them to tackle data projects outside the classroom.
“In their future jobs, they may get asked to find similar images in a huge data set or combine datasets with different values—after this class they’re ready to do that,” said Mattmann, adding with a smile: “And who knows, maybe someday they’ll get asked to investigate UFOs.”