What makes a proton spin? That is one of the biggest mysteries in physics. Although researchers do not fully understand the underlying physics of this phenomenon, they do know that proton spin contributes to the stability of the universe and to magnetic interactions, and that it is a vital component of technologies like Magnetic Resonance Imaging (MRI) machines used in hospitals around the globe.
To solve this mystery, researchers are smashing together polarized proton beams in the Relativistic Heavy Ion Collider (RHIC) at Brookhaven National Laboratory (BNL), where an international collaboration that includes scientists from the Lawrence Berkeley National Laboratory (Berkeley Lab) Nuclear Science Division operates the STAR experiment. STAR, which stands for Solenoidal Tracker at RHIC, uses exquisitely sensitive detectors to record data about the subatomic debris these smash-ups leave behind.
With the possibility of an important payoff, scientists want to analyze that data as soon as possible. But after five months of data collection, some researchers must wait another 10 months while off-line processing completes detector calibration, reconstruction, and analysis. So, it could be more than a year before they see the full analysis of these experiments.
According to Jerome Lauret, the software and computing leader for the STAR experiment at BNL, waiting a year and a half for the analysis of an experiment is a huge setback for graduate students working on their theses, not to mention researchers who need the data to further their own research. This is one reason why the collaboration is eager to explore the advantages of new computing paradigms. Lauret and some of his colleagues think a cloud-computing environment is the next model for such large experiments.
So, when the National Energy Research Scientific Computing Center (NERSC) at Berkeley Lab and the Leadership Computing Facility (ALCF) at Argonne National Laboratory (ANL) received Recovery Act funding to set up two joint cloud computing testbeds, STAR researchers were among the first to “test-drive” the systems. The testbeds, dubbed the Magellan project, were built to examine whether virtual clusters could be a cost-effective and energy-efficient paradigm for science. Both are made up of IBM iDataPlex clusters that allow users to install their own operating systems and software stacks as virtual machines (VMs). A virtual machine is a software environment that executes programs just like a physical machine, a critical capability for scientists who want applications fine-tuned for one computing environment to run on a different kind of system. Commercial clouds are often distributed throughout the world to spread load and gain reliability; the Magellan project emulates that idea with its two sites.
The Proton Spin Crisis
All elementary particles have spin, or intrinsic angular momentum. Although protons are a fundamental component of atoms—which comprise nearly all visible matter—they are not true elementary particles because they can be broken down even further. In fact, protons consist of three quarks bound together by gluons. Until 25 years ago, researchers believed that a proton’s spin could be calculated simply by adding up the spin states of its component quarks. But experiments conducted in the late 1980s proved that only a portion of the proton spin comes from quarks. This revelation sparked the “proton spin crisis.”
Jan Balewski, a Massachusetts Institute of Technology (MIT)-based member of the STAR collaboration, is searching for this “missing proton spin” using W-boson events produced in 1 percent of proton-proton collisions recorded by RHIC’s STAR detector this year. Scientists at STAR suspect these events may be key to understanding how much spin is carried by other elementary particles, like sea quarks, which are quark-antiquark pairs that pop into existence and immediately annihilate each other. Although they exist inside a proton only briefly, some believe that sea quarks may also contribute to the proton’s spin.
“The visible matter of the universe consists predominantly of proton-like particles. If the results of our experiment cause a revision of our understanding of the proton makeup, this will impact how we describe visible matter in the universe,” says Balewski.
STAR Experiments in the Magellan Cloud
Balewski notes that, in an ideal world, the STAR experiment would have almost-real-time event processing. For calibration, this would allow the MIT team to spot certain expected characteristics of measured W events and confirm that all the detector components are working well, or identify issues that need to be fixed. However, this type of processing of all STAR data would require continuous access to about 10,000 CPU cores. Given that only 4,400 CPUs at BNL are available to the STAR collaboration, half of which are typically used for data production, this would not be possible. Centers like NERSC and the ALCF have many more cores, but jobs wait in queues to be scheduled.
In addition to providing more computing power, Lauret notes that cloud computing indirectly helps the STAR experiment by motivating students to start work earlier on calibration tasks. “Cloud computing puts our work in a ‘human understandable time-frame,’” he says. “In the case of the W-boson work, I saw that the students were extremely motivated to work hard on calibration tasks, knowing that they could start on their thesis in a few months rather than 1.5 years later. This is a true game changer and a paradigm shift for our scientific community.”
As an example, he cites MIT graduate student Matthew Walker, who was so excited by the idea of using cloud computing resources for his own thesis work that he was willing to do the legwork to build an initial STAR VM for Amazon’s EC2 several years ago. Since then, motivated students have been steering large-scale processing with pre-packaged STAR VMs; in fact, students were very involved in running on Magellan. Lauret also acknowledges Indiana University graduate student Justin Stevens, who “worked day and night” modifying the W analysis code on the STAR VMs to produce sensible results from events taken the previous day, as well as a team of dedicated detector and calibration experts.
“Cloud computing puts our work in a human-understandable time frame,” - Jerome Lauret, software and computing lead for the STAR experiment.
According to Shane Canon, who heads NERSC’s Technology Integration Group, the W-boson analysis and data processing was an ideal project to run on the Magellan cloud computing testbed because, once the software was packaged as a virtual machine instance, it could run on any cloud platform. The jobs within the STAR analysis also require little communication, so they can be spread throughout a distributed cloud infrastructure. NERSC therefore offered the STAR collaboration 20 eight-core nodes on its Magellan testbed to “experiment” in a cloud environment. This offer was eventually supplemented with equivalent resources on the ALCF’s Magellan system.
With this offer, a computing team led by Balewski and Lauret adapted the W-boson workflow for Magellan. Among the collaborators was Berkeley Lab’s Doug Olson, a long-time member of the STAR team, who worked with NERSC’s Iwona Sakrejda to build VM images that mirrored the STAR reconstruction and analysis workflow on the center’s existing PDSF cluster. PDSF is optimized to handle data-intensive science projects that use grid technologies both for remote job submissions and data transfers. In addition to Sakrejda, Berkeley Lab’s Lavanya Ramakrishnan also helped the computing team spawn and monitor their VMs.
This collaboration resulted in a real-time, cloud-based data processing system, so that data processing proceeds during the five months of experiments and finishes at nearly the same time. The team took advantage of on-demand resources with custom scripts to automate and prioritize the workflow. Every 30 minutes, two independent processes at BNL check the STAR collaboration’s compute cluster for new files. A connection between the STAR clusters at BNL and NERSC’s Data Transfer Nodes (DTNs), which are optimized for wide area network transfers, is then established via Globus Online. The Department of Energy’s Energy Sciences Network (ESnet) carries the data to California, and once it arrives at NERSC, the data is “parked” on the center’s global file system until it is used by one of the VMs.
“Parking data on global scratch gives us a 20 terabyte buffer between the network and the VMs. This maximizes the scientific workflow by ensuring that data is available to the VMs at all times and none of the processors sit idle,” says Sakrejda.
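The article does not reproduce STAR’s actual automation scripts, but the polling-and-transfer step described above can be sketched in a few lines of Python. The sketch below is illustrative only and assumes the present-day globus_sdk package; the endpoint IDs, directory paths, and access token are hypothetical placeholders, not STAR’s real configuration.

    import os
    import time
    import globus_sdk

    # Hypothetical placeholders -- not STAR's real endpoints or credentials.
    BNL_ENDPOINT = "uuid-of-bnl-star-endpoint"
    NERSC_ENDPOINT = "uuid-of-nersc-dtn-endpoint"
    WATCH_DIR = "/star/data/new"           # where new run files appear at BNL
    DEST_DIR = "/global/scratch/star/"     # "parking" area on NERSC global scratch
    TRANSFER_TOKEN = os.environ["GLOBUS_TRANSFER_TOKEN"]

    tc = globus_sdk.TransferClient(
        authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN))

    already_sent = set()

    while True:
        # Check the local cluster for files that have not been shipped yet.
        new_files = [f for f in os.listdir(WATCH_DIR) if f not in already_sent]

        if new_files:
            # Bundle all new files into a single Globus transfer task.
            tdata = globus_sdk.TransferData(
                source_endpoint=BNL_ENDPOINT,
                destination_endpoint=NERSC_ENDPOINT,
                label="STAR W-boson raw data")
            for name in new_files:
                tdata.add_item(os.path.join(WATCH_DIR, name),
                               os.path.join(DEST_DIR, name))
            task = tc.submit_transfer(tdata)
            print("submitted transfer", task["task_id"],
                  "for", len(new_files), "files")
            already_sent.update(new_files)

        time.sleep(30 * 60)   # repeat every 30 minutes, as in the STAR workflow

A production version would also implement the prioritization Lauret’s team built into its custom scripts, which this simple loop omits.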
Once the data processing is complete, the results are backed up in NERSC’s mass storage system, where the entire STAR community can instantly access them through the center’s PDSF computing resources. A copy of the analyzed events is also sent back to BNL for permanent archiving. The Magellan testbed at NERSC is based on the popular open source Eucalyptus cloud software, while the ALCF testbed runs OpenStack software and the Nimbus Toolkit. After the W-boson team successfully launched their images on Eucalyptus, Olson worked with staff at Argonne to tweak the images to run on the other platforms. The STAR team currently runs a coherent cluster of over 100 VMs from three Magellan resource pools: Eucalyptus at NERSC, Nimbus at ANL, and OpenStack at ANL. The total number of cores has exceeded 800, and the team expects to cross the threshold of 1,000 parallel jobs soon.
“The STAR collaboration has been using GRID resources for quite a while, and had a vision of running data processing on lots of resources around the country. But one of the biggest challenges has been getting the STAR software running at all of those sites and maintaining it,” says Olson. “In the cloud, we can validate one machine image and with little tweaks have a bunch of sites run it.”
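Part of what makes this portability possible is that the cloud stacks involved expose compatible programmatic interfaces; Eucalyptus, for instance, implements an EC2-style API. As an illustration only, and not the team’s actual tooling, launching instances of a validated STAR image on a Eucalyptus cloud with the boto Python library might look like the sketch below; the endpoint host, image ID, and credentials shown are placeholders.

    import boto
    from boto.ec2.regioninfo import RegionInfo

    # Hypothetical endpoint and credentials -- the real Magellan hosts,
    # image IDs, and keys are not published in this article.
    region = RegionInfo(name="eucalyptus", endpoint="magellan-cloud.example.gov")

    conn = boto.connect_ec2(
        aws_access_key_id="EC2_ACCESS_KEY",
        aws_secret_access_key="EC2_SECRET_KEY",
        is_secure=False,
        region=region,
        port=8773,                        # default Eucalyptus API port
        path="/services/Eucalyptus")      # default Eucalyptus API path

    # Launch ten instances of a previously validated STAR machine image.
    reservation = conn.run_instances(
        "emi-00000000",                   # placeholder image ID
        min_count=10, max_count=10,
        key_name="star-keypair",
        instance_type="m1.large")

    for inst in reservation.instances:
        print(inst.id, inst.state)

The image ID and endpoint change from site to site, but the launch logic stays the same, which is the “validate one image, run it at many sites” advantage Olson describes.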
Aside from the real-time W event analysis, the STAR collaboration also processed a large sample of gold-gold events on the Magellan testbeds. “A few weeks before the end of 2011 data taking, our team was told that if we processed events taken a few weeks earlier and showed preliminary findings, RHIC would forego previously scheduled detector tests to allow us to continue colliding gold beams to study the quark-gluon plasma,” says Lauret. “This was unprecedented. RHIC actually changed the running plan based on offline analysis of data acquired in the same year. We wouldn’t have been able to do this without the cloud resources.”
Science in the Cloud: Lessons Learned
In addition to the scientific successes, one of the main lessons learned from this experiment is that building a VM image is a lot easier in concept than in practice.
“Building a VM is not something I think a typical scientist can do; you need a lot of system admin skills,” says Olson. “With my advanced computing background, it took me about two to three weeks to build one of these systems from scratch, and several days to adapt the image to run on the clouds at the ALCF.”
“All of Doug’s work was well worth the effort, because now we can run at multiple places with confidence in the analysis results,” adds Lauret.
“So far we have used the cloud environment to offload peak computing needs, and in this context it has worked out really well, but running these images is not a turnkey operation,” says Balewski. “Since we are ramping up this effort, there is still a lot of handwork involved and it is not easy. In this computational experiment it was very helpful to work with Iwona because she has been helping us run on PDSF for years; she understood our science and the specific computing requirements of STAR analysis without much explanation.”
According to Sakrejda, two other issues to consider when building VMs are image size and security. She notes that the image creator must be very careful not to compromise the security of the VMs by leaving in personal information like passwords or usernames. The developer also needs to make the image as complete as possible but also small because it resides in memory. “The size of an image will take away from memory available for the application,” she says.
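Neither Sakrejda’s checklist nor NERSC’s actual procedures are spelled out here, but a simple pre-publication audit along the lines she describes could be sketched in Python as below; the flagged file names and the mount point are generic examples, not an official list.

    import os

    # Generic examples of files that commonly leak credentials;
    # this is an illustrative checklist, not an exhaustive or official one.
    SENSITIVE = {"id_rsa", "id_dsa", ".bash_history", ".netrc",
                 "shadow", "authorized_keys", "known_hosts"}

    def audit_image(root):
        """Walk a mounted VM image; report sensitive files and total size."""
        flagged, total_bytes = [], 0
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    total_bytes += os.path.getsize(path)
                except OSError:
                    continue   # skip broken symlinks and unreadable files
                if name in SENSITIVE:
                    flagged.append(path)
        return flagged, total_bytes

    if __name__ == "__main__":
        flagged, size = audit_image("/mnt/star-image")   # hypothetical mount point
        print("image size: %.1f GB" % (size / 1e9))
        for path in flagged:
            print("review before publishing:", path)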
“In this case study, the STAR collaboration really leveraged the existing infrastructure of NERSC, the ALCF, and ESnet to run on the cloud. From this work we learned that scientific cloud computing is about more than just processing data. It’s a whole environment that includes storage, WAN and LAN transfers, security, and scientific software consulting support,” says Canon. “The STAR demonstration really illustrates the power of having extra computing available to bring to bear on a problem, and that there is a whole ecosystem that is needed to make it possible.”
This story was adapted from an ISGTW article written by Miriam Boon.
About NERSC and Berkeley Lab
The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, NERSC serves almost 10,000 scientists at national laboratories and universities researching a wide range of problems in climate, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. Department of Energy. Learn more about computing sciences at Berkeley Lab.