Like the world's best pit crews, groups of highly trained scientists make sure everything works together at the supercomputers available through the Department of Energy's Office of Science. Like their NASCAR counterparts, these crews get the best possible performance from the custom-built supercomputers. Their machines are fast and powerful, but they aren't intuitive. Add to that the fact that researchers using the computers are racing against the clock, too – to get answers before their time on the machine expires.
Unwieldy data files, slow programs and hardware crashes all eat into that time. For example, working with data files 10 times larger than a researcher has used before can slow the code to a crawl or crash it outright. The support team works with scientists to optimize codes and improve workflows long before their time on the supercomputer starts.
"We want them to get the performance they need to get the job done," says Katherine Riley. She's director of science for the Argonne Leadership Computing Facility (ALCF), a DOE Office of Science user facility at Argonne National Laboratory in Illinois.
Supporting some of the world's fastest computers requires more than an encyclopedic understanding of supercomputers. A deep knowledge of mathematics is a must. Add to that an outgoing personality and a desire to help. Top it off with a doctoral degree and extensive experience in a scientific field.
"We're able to speak the computational language and also the science," says Richard Gerber, now a senior science advisor at the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science user facility at Lawrence Berkeley National Laboratory in California. He started 20 years ago by answering questions from users about the center's supercomputer.
Last year, the dozen-strong team at NERSC answered 6,435 enquiries and earned, once again, top marks for their contributions. "Everybody that calls in has a problem that they can't figure it out on their own," says Gerber. "We work with very smart, sophisticated users. We provide that extra bit of knowledge they need."
That extra bit can also take the form of training events. Short, intense programming marathons called hack-a-thons are quite popular. Fernanda Foertter, who leads training at the Oak Ridge Leadership Computing Facility, a DOE Office of Science user facility at Oak Ridge National Laboratory in Tennessee, started these events to increase use of the facility's supercomputer, known as Titan. Foertter saw researchers attend workshops but struggle to find the time and resources to apply what they learned to optimize their codes for Titan.
At a hack-a-thon, Foertter pairs each team of five or so software developers with two mentors from the support team and lots of food. The goal is to get the software to run well on the supercomputer. And that's what's happening for scientists at national labs, universities, and elsewhere. In fact, at the end of one event, researchers got their program, which analyzes how fluid moves in the brain, to run eight months of data in eight hours.
Even with optimized code, crashes happen. At ALCF, the support team analyzes every crash. "It's time consuming, but it is important to us to understand why," says Richard Coffey, ALCF director of user experience. "If it's a system failure, we can give a refund. If it's a bug in the user's software, we work hard to help them understand what they could do differently."
Just as the world's fastest cars require more than a driver and a pit crew, supercomputers rely on talented teams that don't work directly with the scientists driving the research. Each supercomputer has experts handling issues such as file management, maintenance, security, and much more.
While the machines often get the attention at the end of a race, it's the talents of people that win races. In the race to answer tough questions, it's the crews supporting DOE's supercomputers that make it possible to cross the finish line.