One framework to rule them all

by Dan Carroll

George Amvrosiadis is part of a team from CMU’s PDL and Los Alamos National Lab designing a record-breaking file system framework for the next era of supercomputing.

Trinity occupies a footprint the size of an entire floor of most office buildings, but its silently toiling workers are not flesh and blood. Trinity is a supercomputer at Los Alamos National Laboratory in New Mexico, made up of row upon row of CPUs stacked from the white-tiled floor to the fluorescent ceiling.

The machine is responsible for helping to maintain the United States’ nuclear stockpile, but it is also a valuable tool for researchers from a broad range of fields. The supercomputer can run huge simulations, modeling some of the most complex phenomena known to science.

However, continued advances in computing power have raised new issues for researchers.

“If you find a way to double the number of CPUs that you have,” says George Amvrosiadis, “you still have a problem of building software that will scale to use them efficiently.” He’s an assistant research professor in Carnegie Mellon’s Parallel Data Lab.

Amvrosiadis was part of a team that recently lent a hand to a cosmologist from Los Alamos struggling to simulate complex plasma phenomena. The problem wasn’t that Trinity lacked the power to run the simulations, but rather that it was unable to create and store the massive amounts of data quickly and efficiently. That’s where Amvrosiadis and the DeltaFS team came in.

DeltaFS is a file system designed to alleviate the significant burden placed on supercomputers by data-intensive simulations like the cosmologist’s plasma simulation.

When it comes to supercomputing, efficiency is the name of the game. If a task can’t be completed within the amount of time allotted, then the simulation will go incomplete, and precious time will have been wasted. With researchers vying for limited computing resources, any time wasted is a major loss.

DeltaFS streamlined the plasma simulation, bringing a task that had once been too resource-demanding within the supercomputer’s capabilities by changing two aspects of how Trinity processed and moved the data.

First, DeltaFS changed the size and quantity of files the simulation program created. Rather than taking large snapshots that captured all of the simulation’s more than one trillion particles at once, DeltaFS created a much smaller file for each individual particle. This made it much easier for the scientists to track the activity of individual particles.
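To make the difference between the two layouts concrete, here is a minimal sketch in Python. It is not DeltaFS code or its API; the function name `regroup_by_particle` and the toy data are assumptions made purely for illustration. It regroups per-timestep snapshots into one record per particle, which is the shape of data a scientist wants when following a single particle over time.

```python
from collections import defaultdict

def regroup_by_particle(snapshots):
    """Turn a list of per-timestep snapshots into per-particle trajectories.

    `snapshots` is a list of dicts, one per timestep, each mapping a
    particle id to that particle's state at that step. (Hypothetical
    layout, chosen only for this illustration.)
    """
    trajectories = defaultdict(list)
    for step, snapshot in enumerate(snapshots):
        for particle_id, state in snapshot.items():
            # Each particle accumulates its own small, ordered history.
            trajectories[particle_id].append((step, state))
    return dict(trajectories)

# Two timesteps, two particles: following "p0" no longer requires
# scanning every full snapshot.
snapshots = [{"p0": 1.0, "p1": 2.0}, {"p0": 1.5, "p1": 2.5}]
by_particle = regroup_by_particle(snapshots)
print(by_particle["p0"])
```

In the snapshot layout, reading one particle’s history means opening every snapshot; in the per-particle layout, it means opening one small record.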

Through DeltaFS, Trinity was able to create a record-breaking trillion files in just two minutes.

Additionally, DeltaFS was able to take advantage of the roughly 10% of simulation time that is usually spent storing the data created, during which Trinity’s CPUs are sitting idle. The system tagged data as it flowed to storage and created searchable indices that eliminated hours of time that scientists would have had to spend combing through data manually. This allowed the scientists to retrieve the information they needed 1,000–5,000 times faster than prior methods.
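The general idea of indexing data on its way to storage can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the DeltaFS implementation: the class name `IndexedLog`, its methods, and the in-memory buffer standing in for storage are all hypothetical.

```python
import io

class IndexedLog:
    """Append-only log that records where each key's data lands as it is written.

    Hypothetical stand-in for a storage stream; an in-memory buffer is
    used here so the sketch is self-contained.
    """

    def __init__(self):
        self._log = io.BytesIO()
        self._index = {}  # key -> list of (offset, length) pairs

    def append(self, key, payload):
        # Move to the end of the log and note the write position.
        offset = self._log.seek(0, io.SEEK_END)
        self._log.write(payload)
        # Tag the data on the way out: remember where this key's bytes went.
        self._index.setdefault(key, []).append((offset, len(payload)))

    def read(self, key):
        # Jump straight to the recorded offsets instead of scanning the log.
        chunks = []
        for offset, length in self._index.get(key, []):
            self._log.seek(offset)
            chunks.append(self._log.read(length))
        return chunks

log = IndexedLog()
log.append("p0", b"a")
log.append("p1", b"bb")
log.append("p0", b"ccc")
print(log.read("p0"))
```

The index costs a little extra work at write time, but a later query for one key touches only that key’s bytes rather than the whole log.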

The team could not have been more thrilled with the success of DeltaFS’ first real-world test run and are already looking ahead to the future.

“We’re looking to get it into production and have the cosmologist who originally contacted us use it in his latest experiment,” says Amvrosiadis. “To me that’s more of a success story than anything else. Often a lot of the work ends with just publishing a paper and then you’re done; that’s just anticlimactic.”

But he and the rest of the team aren’t limiting their efforts to cosmological simulations. They’re currently looking at ways to expand DeltaFS for use with everything from earthquake simulations to crystallography. With countries across the globe striving to create machines that can compute at the exascale, meaning 10¹⁸ calculations per second, there’s a growing need to streamline these demanding processes wherever possible.

The trick to finding a one-size-fits-all (or at least most) replacement for the purpose-built systems currently in use is designing the file system to be flexible enough for scientists and researchers to tailor it to their own specific needs.

“What researchers end up doing is stitching a solution together that is customized to exactly what they need, which takes a lot of developer hours,” says Amvrosiadis. “As soon as something changes they have to sit back down to the drawing board and start from scratch and redesign all their code.”

Amvrosiadis and the team have already demonstrated a couple of ways that efficiency can be improved, such as indexing or altering file size and quantity. Now they’re looking into further ways to take advantage of potential inefficiencies, like using in-process analysis to eliminate unneeded data before it ever reaches storage or compressing information in preparation for transfer to other labs.

Solutions like these center around repurposing CPU downtime to perform tasks that will contribute back into the information pipeline and creating smarter ways to organize and store data, increasing overall efficiency.

The idea is to let the expert scientists identify the areas where they have room for improvement or untapped resources, and to take advantage of the toolkit and versatile framework DeltaFS can provide.

As the world moves toward exascale computing, the speed at which software development must move to keep pace with hardware improvements will only increase. Amvrosiadis even hopes that one day more advanced AI techniques could be incorporated to do much of the observational work performed by scientists, cutting down on observation time and freeing them to focus on analysis and study. But for him and the rest of the DeltaFS team, all of that starts with finding little solutions to improve huge processes.

“I don’t know if there’s one framework to rule them all yet—but that’s the goal.”

The DeltaFS project includes Professors George Amvrosiadis, Garth Gibson, and Greg Ganger, Systems Scientist Chuck Cranor, and Ph.D. student Qing Zheng. Also involved were Los Alamos National Lab’s Brad Settlemyer and Gary Grider.