sbxchicago committed Jan 28, 2025
1 parent b2d8716 · commit 6ad2363
Showing 530 changed files with 41,946 additions and 0 deletions.

@@ -0,0 +1,21 @@
---
layout: page
theme: white
permalink: year-in-review/about-alcf
title: About ALCF
hero-img-source: "TCSBuilding.jpg"
hero-img-caption: "The ALCF is a national scientific user facility located at Argonne National Laboratory."
aside: about-numbers.md
---

The Argonne Leadership Computing Facility (ALCF), a U.S. Department of Energy (DOE) Office of Science user facility at Argonne National Laboratory, enables breakthroughs in science and engineering by providing supercomputing and AI resources to the research community.

ALCF computing resources—available to researchers from academia, industry, and government agencies—support large-scale computing projects aimed at solving some of the world’s most complex and challenging scientific problems. Through awards of computing time and support services, the ALCF enables researchers to accelerate the pace of discovery and innovation across a broad range of disciplines.

As a key player in the nation's efforts to provide the most advanced computing resources for science, the ALCF is helping to chart new directions in scientific computing through a convergence of simulation, data science, and AI methods and capabilities.

Supported by the DOE’s Advanced Scientific Computing Research (ASCR) program, the ALCF and its partner organization, the Oak Ridge Leadership Computing Facility, operate leadership-class supercomputing resources that are orders of magnitude more powerful than the systems typically used for open scientific research.

@@ -0,0 +1,28 @@
---
layout: page
theme: white
permalink: year-in-review/directors-letter
title: Director’s Letter
hero-img-source: "papka.png"
hero-img-caption: "Michael E. Papka, ALCF Director"
---

The process of planning for and installing a supercomputer takes years. It includes a critical period of stabilizing the system through validation, verification, and scale-up activities, which can vary for each machine. However, unlike ALCF’s previous or current production machines, Aurora’s long ramp-up journey has also included several configuration changes and COVID-related supply chain issues.

Aurora is a highly advanced system designed for a wide range of AI and scientific computing applications. It will also be used to train a one-trillion-parameter large language model for scientific research. Aurora’s interconnect has more network endpoints than that of any other system, and with over 60,000 GPUs, it has the largest GPU count of any system in the world.

In 2023, ALCF made significant progress toward realizing Aurora’s full capabilities. In June, Aurora completed the installation of its 10,624th and final blade. Shortly after, Argonne submitted benchmarking results for about half of Aurora to the TOP500. These results were used in the November announcement of the world’s fastest supercomputers, where Aurora secured the second position. Once the full system goes online, its theoretical peak performance is expected to be approximately two exaflops.

Some application teams participating in the DOE’s Exascale Computing Project and the ALCF’s Aurora Early Science Program have begun using Aurora to scale and optimize their applications for the system’s initial science campaigns. Soon to follow will be all the early science teams and an additional 24 INCITE research teams in 2024.

This new exascale machine brings with it some more big changes. Theta, one of ALCF’s production systems, was retired on December 31, 2023. ThetaGPU will be decoupled and reconfigured to become a new system named Sophia, which will be used for AI development and as a production resource for visualization and analysis. Meanwhile, the ALCF AI Testbed will continue to make more production systems available to the research community.

For more than three decades, researchers at Argonne have been developing tools and methods that connect powerful computing resources with large-scale experiments, such as the Advanced Photon Source and the DIII-D National Fusion Facility. Their work is shaping the future of inter-facility workflows by automating them and identifying ways to make these workflows reusable and adaptable for different experiments. Argonne’s Nexus effort, in which ALCF plays a key role, offers the framework for a unified platform to manage high-throughput workflows across the HPC landscape.

In the following pages, you will learn more about how Nexus supports the DOE’s goal of building a broadscale Integrated Research Infrastructure (IRI) that leverages supercomputing facilities for experiment-time data analysis. The IRI will accelerate the next generation of data-intensive research by combining scientific facilities, supercomputing resources, and new data technologies like AI, machine learning, and edge computing.

In 2023, we continued our commitment to education and workforce development by organizing a number of informative learning experiences and training events. As part of this effort, ALCF staff members led a pilot program called “Introduction to High-Performance Computing Bootcamp” in collaboration with other DOE labs. This was an immersive program designed for students in STEM to work on energy justice projects using computational and data science tools learned throughout the week. In a separate effort, the ALCF worked on developing the curriculum for its “Intro to AI-Driven Science on Supercomputers” training course, with the aim of adapting the content to introduce undergraduates and graduates to the basics of large language models for future course offerings.

To conclude, I express my sincere gratitude to the exceptional staff, vendor partners, and program office, who have all contributed to making the ALCF one of the leading scientific supercomputing facilities in the world. Each year, we take the time to share our numerous achievements with you in our Annual Report, and while there are many more exciting changes on the horizon, I truly appreciate this opportunity to share our progress with you.

@@ -0,0 +1,76 @@
---
layout: page
title: ALCF Leadership
theme: white
permalink: year-in-review/year-in-review
---

{% include media-img.html
source= "Allcock_1600x900.jpg"
caption= "Bill Allcock"
%}

# Bill Allcock, ALCF Director of Operations

One of the most significant changes of the year was the retirement of Theta, Cooley, and the theta-fs0 storage systems. They were great systems that helped our users accomplish a lot of science. From the operations perspective, there is a silver lining: retiring them reduces the number of systems we run and makes our operational environment more uniform. But it is still sad to see them go.

We made some significant improvements to our systems over the course of the year:
- The ALCF AI Testbed’s Graphcore and Groq systems were made available for use, and all four publicly available testbed systems (Cerebras, SambaNova, Groq, and Graphcore) received significant upgrades.
- Polaris network hardware was upgraded from Slingshot 10 to Slingshot 11, doubling the maximum theoretical bandwidth. We are working on system software upgrades that will include the Slingshot software, programming environment, and NVIDIA drivers.
- The HPSS disk cache was increased from 1 PB to 9 PB, significantly improving the probability of a “cache hit” and speeding up data retrieval.

Operationally, we continue to expand our support for DOE's Integrated Research Infrastructure. Much of our initial work was with Argonne’s Advanced Photon Source, and while we continue to work with them, we are also collaborating with other facilities. From the operations side, we are working to make it faster and easier to create new on-demand endpoints. This includes making the endpoints more robust and easier for scientists to manage.

Last, but certainly not least, the Operations team has been decisively engaged in the Aurora bring-up. We have done extensive work to assist in the stabilization efforts. We continue to work on developing software and processes to manage the gargantuan volume of logs and telemetry that the system produces. We have provided support for scheduling. Our system admins have developed extensive prologue and epilogue hooks to detect and, where possible, automatically remediate known issues on the system while the vendors work on a permanent resolution. We have also assisted in supporting the user community. Because of the NDA (Non-Disclosure Agreement) requirements, we set up a special Slack instance to facilitate discussion and have assisted in conducting training.

We continue to collaborate with Altair Engineering and the OpenPBS community. We found some scale-related bugs that were making administration on Aurora slow and difficult. We worked closely with Altair and they provided patch updates very quickly and integrated those fixes into the production releases. We continued our work on porting PBS to the AI Testbed systems, but their unique hardware architectures and constraints have been challenging. However, later in the year, we were forced to table the AI system work and focus on Aurora.

{% include media-img.html
source= "Kumaran_1600x900.jpg"
caption= "Kalyan Kumaran"
%}

# Kalyan Kumaran, ALCF Director of Technology

Over the past year, we made considerable progress in deploying Aurora, enhancing our AI for Science capabilities, and advancing the development of DOE’s Integrated Research Infrastructure (IRI). On the Aurora front, our team was instrumental in enabling a partial system run that earned the #2 spot on the TOP500 list in November. It was also great to see Aurora’s DAOS storage system place #1 on the IO500 production list. We helped get several early science applications up and running on Aurora – some of which have scaled to 2,000 nodes with very promising performance numbers compared to other GPU-powered systems. Our team also made some notable advances with scientific visualizations, demonstrating interactive visualization capabilities using blood flow simulation data generated with the HARVEY code on Aurora hardware and producing animations from HACC cosmology simulations that ran at scale on the system.

We continued to work closely with Intel to improve and scale oneAPI software, bringing many pieces into production. On Aurora, the AI for Science models driving the deployment of AI frameworks (TensorFlow, PyTorch) have achieved average single-GPU performance more than 2x faster than the NVIDIA A100, driven by close collaboration between Argonne staff and Intel engineers. Other efforts included using the Argonne-developed chipStar HIP implementation for Intel GPUs to get HIP applications running on Aurora. To help support Aurora users and the broader exascale computing community in the future, we played a role in launching the DAOS Foundation, which is working to advance the use of DAOS for next-generation HPC and AI/ML workloads, and the Unified Acceleration (UXL) Foundation, which was formed to drive an open standard accelerator software ecosystem. ALCF team members also continued to contribute to the development of standards for various programming languages and frameworks, including C++, OpenCL, SYCL, and OpenMP.

In the AI for science realm, we enhanced the capabilities of the ALCF AI Testbed with two new system deployments (Groq, Graphcore) and two system upgrades (Cerebras, SambaNova). With a total of four different accelerators available for open science research, we partnered with the vendors to host a series of ALCF training workshops, as well as an SC23 tutorial, that introduced each system’s hardware and software and helped researchers get started. The team published a paper at SC23 on performance portability across the three major GPU vendors’ architectures, demonstrating that all three perform well for AI for science workloads; the Intel GPU on Aurora demonstrated the best performance at the time of the study. Our staff also contributed to the development of MLCommons’ new storage performance benchmark for AI/ML workloads and submitted results using our Polaris supercomputer and Eagle file system, which demonstrated efficient I/O operations for state-of-the-art AI applications at scale. In addition, we deployed a large language model service on Sunspot and demonstrated its capabilities at Intel’s SC23 booth.

Finally, our ongoing efforts to develop IRI tools and capabilities got a boost with Polaris and the launch of Argonne’s Nexus — a coordinated effort that builds on our decades of research to integrate HPC resources with experiments. We currently have workflows from the Advanced Photon Source and the DIII-D National Fusion Facility running on Polaris, as well as workflows prototyped for DOE’s Earth System Grid Federation and Fermilab’s flagship Short Baseline Neutrino Program. Our team also delivered talks to share our IRI research at the Monterey Data Conference, the Smoky Mountains Computational Sciences and Engineering Conference, Confab23, and the DOE booth at SC23. With momentum building for continued advances in our IRI activities, the Aurora deployment, and AI for science, we have a lot to look forward to in 2024.

{% include media-img.html
source= "Ramprakash_1600x900.jpg"
caption= "Jini Ramprakash"
%}

# Jini Ramprakash, ALCF Deputy Director

It was a busy year for the ALCF as we continued to make strides in deploying new systems, tools, and capabilities to support HPC- and AI-driven scientific research, while also broadening our outreach efforts to engage with new communities. In the outreach space, we partnered with colleagues at the Exascale Computing Project, NERSC, OLCF, and the Sustainable Horizons Institute to host DOE’s first “Intro to HPC Bootcamp.” With an emphasis on energy justice and workforce development, the event welcomed around 60 college students (many with little to no background in scientific computing) to use HPC for hands-on projects focused on making positive social impacts. It was very gratifying to see how engaged the students were in this immersive, week-long event. The bootcamp is a great addition to our extensive outreach efforts aimed at cultivating the next-generation computing workforce.

Our ongoing efforts to develop an Integrated Research Infrastructure (IRI) also made considerable progress this year. As a member of DOE’s IRI Task Force and IRI Blueprint Activity over the past few years, I’ve had the opportunity to collaborate with colleagues across the national labs to formulate a long-term strategy for integrating computing facilities like the ALCF with data-intensive experimental and observational facilities. In 2023, we released the IRI Architecture Blueprint Activity Report, which lays out a framework for moving ahead with coordinated implementation efforts across DOE. At the same time, the ALCF continued to develop and demonstrate tools and methods to integrate our supercomputers with experimental facilities, such as Argonne’s Advanced Photon Source and the DIII-D National Fusion Facility. This year, Argonne launched the “Nexus” effort, which brings together all of the lab’s new and ongoing research activities and partnerships in this domain, ensuring they align with DOE’s broader IRI vision.

We also made progress toward launching the Argonne Enterprise Registration System, a new lab-wide registration platform aimed at standardizing data collection and processing for various categories of non-employees, including facility users. In 2023, we defined system requirements and issued a request for proposals for building the platform. Ultimately, the new system will help eliminate redundant data entry, simplify registration processes for both users and staff, and enhance our reporting capabilities.

As a final note on 2023, we kicked off the ALCF-4 project to plan for our next-generation supercomputer, with DOE approving the CD-0 (Critical Decision-0) mission need for the project in April. We also established the leadership team (with myself as the project director and Kevin Harms as technical director) and began conversations with vendors to discuss their technology roadmaps. We look forward to ramping up the ALCF-4 project in 2024.

{% include media-img.html
source= "Riley_1600x900.jpg"
caption= "Katherine Riley"
%}

# Katherine Riley, ALCF Director of Science

Year after year, our user community breaks new ground in using HPC and AI for science. From improving climate modeling capabilities to speeding up the discovery of new materials and advancing our understanding of complex cosmological phenomena, the research generated by ALCF users never ceases to amaze me.

In 2023, we supported 18 INCITE projects and 33 ALCC projects (across two ALCC allocation cycles), as well as numerous Director’s Discretionary projects. Many of these projects were among the last to use Theta, which was retired at the end of the year. Over its 6+ year run as our production supercomputer, Theta delivered 202 million node-hours to 636 projects. The system also played a key role in bolstering our facility’s AI and data science capabilities. Theta was a remarkably productive and reliable machine that will be missed by ALCF users and staff alike.

Research projects supported by ALCF computing resources produced 240 publications in 2023. You can read about several of these efforts in the science highlights section of this report, including a University of Illinois Chicago team that identified the exact reaction coordinates for a key protein mechanism for the first time; a team from the University of Dayton Research Institute and Air Force Research Laboratory that shed light on the complex thermal environments encountered by hypersonic vehicles; and an Argonne team that investigated the impact of disruptions in cancer screening caused by the COVID-19 pandemic.

It was also a very exciting year for Aurora as early science teams began using the exascale system for the first time. After years of diligent work to prepare codes for Aurora’s unique architecture, the teams were able to begin scaling and optimizing their applications on the machine. Their early performance results have been very promising, giving us a glimpse of what will be possible when teams start using the full supercomputer for their research campaigns next year.