Artifact description appendix for the SC 2021 conference article titled “Revealing Power, Energy and Thermal Dynamics of a 200PF Pre-Exascale Supercomputer”.
Paper link: ACM Digital Library
The repository contains:
- Python notebooks: These notebooks generate the plots in the article.
- Python scripts: These scripts generate the processed datasets from the raw log files. Each script has an associated README file that briefly describes the generated dataset and lists each field name with a description. The scripts depend on the following Python libraries (a minimal loading sketch follows the list):
- Dask (https://dask.org/)
- Pandas (https://pandas.pydata.org/)
- PyArrow (https://arrow.apache.org/docs/python/)
- Numpy (https://numpy.org/)
- Scipy (https://www.scipy.org/)
- Matplotlib (https://matplotlib.org/)
- Seaborn (https://seaborn.pydata.org/)
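As a minimal sketch of how these libraries fit together (the file path and column layout here are illustrative assumptions, not the repository's actual layout), a day of raw 1-second parquet telemetry could be loaded and inspected like this:

```python
import dask.dataframe as dd

# Load one day of raw 1-second per-node telemetry (many small parquet files).
df = dd.read_parquet("openbmc/2020-07-01/*.parquet", engine="pyarrow")

# Inspect the schema and a few rows without loading the whole day into memory.
print(df.columns)
print(df.head())
```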
- Name: Summit per-node OpenBMC telemetry. Per-node, per-component power and temperature measurements.
- Files (type and quantity): one tar file per day, each archiving 1,440 parquet files; 399 files in total (2019-12-27 to 2021-01-31), with some dates missing due to maintenance periods
- Memory Footprint: 8.5 TB
- Source: per-node OpenBMC data from Summit, archived via the telemetry system for MTW operations
- Frequency: 1 sec
- Date of Creation / Last Update: one file per day from 2019-12-27 to 2021-01-31
- Name: Central energy plant (CEP) data
- Files (type and quantity): One parquet per month, 12 parquet files
- Memory Footprint: 256 MB
- Source: Control system of Summit's central energy plant via the telemetry system for MTW operations
- Frequency: Approx. 15 second interval
- Date of Creation / Last Update: one file per month, 2020-01-31 to 2021-01-31
- Name: Job Scheduler allocation history
- Files (type and quantity): Single csv file
- Memory Footprint: 285 MB
- Source: IBM CSM system via telemetry data store for Summit
- Frequency: At occurrence
- Date of Creation / Last Update: 2021-02-28
- Name: Per node job scheduler allocation history
- Files (type and quantity): Single csv file
- Memory Footprint: 14 GB
- Source: IBM CSM system via telemetry data store for Summit
- Frequency: At occurrence
- Date of Creation / Last Update: 2021-02-27
- Name: Nvidia GPU XID error log
- Files (type and quantity): Single csv file
- Memory Footprint: 50 MB
- Source: Per-node syslog data via telemetry data store for Summit
- Frequency: At occurrence
- Date of Creation / Last Update: 2021-02-12
- Name: Summit per-node OpenBMC telemetry 10-second aggregates. 10-second aggregation (count, min, max, mean, std) of the per-node OpenBMC telemetry data that measures node-wise, component-wise power and temperature (a sketch of the aggregation follows this entry).
- Script: andes-load-summit-power-temp-openbmc-init10s-agg.py
- Input: 1-second interval Summit per-node OpenBMC telemetry data
- Output: 10-second aggregates of the Summit per-node OpenBMC telemetry data; one parquet file per day
- Memory Footprint: 5.5 TB
- Index used: timestamp
- Key Columns: timestamp, input_power.[count, min, max, mean, std], p[0,1]_power.[count, min, max, mean, std], p[0,1]_gpu[0,1,2]_power.[count, min, max, mean, std], gpu[0,1,2,3,5]_[core,mem]_temp.[count, min, max, mean, std]
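A minimal pandas sketch of the 10-second aggregation described in this entry; the authoritative implementation is andes-load-summit-power-temp-openbmc-init10s-agg.py, and the single `input_power` column shown here stands in for the full column set:

```python
import pandas as pd

def aggregate_10s(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate one node's 1-second telemetry (DatetimeIndex) to 10 seconds."""
    agg = df[["input_power"]].resample("10s").agg(
        ["count", "min", "max", "mean", "std"]
    )
    # Flatten the MultiIndex columns to names like 'input_power.mean'.
    agg.columns = [f"{col}.{stat}" for col, stat in agg.columns]
    return agg
```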
- Name: Cluster-level power time-series. The cluster-level power time-series data contains aggregated power values for the whole cluster at every 10 seconds. For each timestamp, the power value is calculated by summing the input power of all nodes at that instant (a sketch follows this entry).
- Script: power_ts_job_ignorant.py
- Files (type and quantity): power time-series dataset at 10-second frequency; one parquet file per day with 1-minute partitions
- Memory Footprint: 1.5 GB
- Index used: timestamp
- Key Columns: timestamp, count_inp, sum_inp, mean_inp, max_inp
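A sketch of the cluster-level summation; power_ts_job_ignorant.py is the actual script, and the input path and `input_power.mean` column name are assumptions based on the 10-second aggregates above:

```python
import dask.dataframe as dd

# Hypothetical path to the 10-second per-node aggregates.
df = dd.read_parquet("openbmc-10s/2020-07-01.parquet")

# Sum input power across all nodes reporting at each 10-second timestamp.
cluster = df.groupby("timestamp")["input_power.mean"].agg(
    ["count", "sum", "mean", "max"]
)
# Rename to match the key columns listed above.
cluster = cluster.rename(
    columns={"count": "count_inp", "sum": "sum_inp",
             "mean": "mean_inp", "max": "max_inp"}
).compute()
```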
- Name: Cluster-level CPU and GPU component power time-series. Cluster-level CPU and GPU component power values are calculated by aggregating the power of every CPU and GPU component across all nodes.
- Script: power_ts_job_ignorant_component.py
- Files (type and quantity): CPU and GPU component power time-series dataset at 10-second frequency; one parquet file per day with 1-minute partitions
- Memory Footprint: 0.5 GB
- Index used: timestamp
- Key Columns: timestamp, mean_cpu_power, std_cpu_power, min_cpu_power, max_cpu_power, mean_gpu_power, std_gpu_power, max_gpu_power
- Name: Job-wise power time-series. The dataset contains a time series of power values for every job. It is generated by combining the node-level power consumption data with the job scheduler data, which lists the nodes on which each job ran (a sketch follows this entry).
- Script: power_ts_job_aware.py
- Files (type and quantity): power time-series and job scheduler time-series datasets, each with one parquet file per day
- Memory Footprint: 49 GB
- Index used: allocation_id, timestamp
- Key Columns: allocation_id, timestamp, count_hostname, sum_inp, max_inp, mean_inp
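A sketch of how node-level power samples are combined with the scheduler's node list; the actual logic is in power_ts_job_aware.py, and the file names and `hostname` join key are assumptions (pandas is shown for clarity, while the actual scripts use dask for out-of-core processing):

```python
import pandas as pd

power = pd.read_parquet("node-power-10s.parquet")       # hypothetical path
jobs = pd.read_csv("per_node_allocation_history.csv")   # hypothetical path
# jobs is assumed to hold one row per (allocation_id, hostname) pair.

# Attach each node's power samples to the job that owned the node.
merged = power.merge(jobs[["allocation_id", "hostname"]], on="hostname")

# Aggregate per job per 10-second timestamp.
per_job = merged.groupby(["allocation_id", "timestamp"]).agg(
    count_hostname=("hostname", "count"),
    sum_inp=("input_power", "sum"),
    max_inp=("input_power", "max"),
    mean_inp=("input_power", "mean"),
)
```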
- Name: Job-wise CPU and GPU component power time-series. The dataset contains time series of CPU and GPU power consumption for every job. It is generated by combining the node-level power consumption data with the job scheduler data, which lists the nodes on which each job ran.
- Script: power_ts_job_aware_component.py
- Files (type and quantity): CPU and GPU component power time-series and job scheduler time-series datasets, each with one parquet file per day
- Memory Footprint: 45 GB
- Index used: allocation_id
- Key Columns: allocation_id, timestamp, count_hostname, mean_cpu_power, std_cpu_power, max_cpu_power, cpu_nans, mean_gpu_power, std_gpu_power, max_gpu_power, gpu_nans
- Name: Job-level power data. The per-node job-level power data contains power values aggregated over each job's run time (a sketch follows this entry).
- Script: power_job_aware.py
- Files (type and quantity): one csv file per day, produced by aggregating the power time-series data over each job's run; the input dataset also has one csv file per day
- Memory Footprint: 14 GB
- Index used: allocation_id
- Key Columns: allocation_id, max_sum_inp, mean_sum_inp, begin_time, end_time
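A sketch of collapsing the job-wise time series into one row per job, mirroring the key columns above; power_job_aware.py holds the actual implementation, and the input file name is hypothetical:

```python
import pandas as pd

# One day of job-wise power time series (hypothetical file name).
ts = pd.read_csv("power_ts_job_aware_2020-07-01.csv")

# Reduce each job's time series to one summary row.
job_level = ts.groupby("allocation_id").agg(
    max_sum_inp=("sum_inp", "max"),
    mean_sum_inp=("sum_inp", "mean"),
    begin_time=("timestamp", "min"),
    end_time=("timestamp", "max"),
)
```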
- Name: Job-level CPU and GPU component power data. Job-level power values for the per-node CPU and GPU components, aggregated over each job's run time.
- Script: power_job_aware_component.py
- Files (type and quantity): one csv file per day, produced by aggregating the CPU and GPU component power time-series data over each job's run; the input dataset also has one csv file per day
- Memory Footprint: 200 MB
- Index used: allocation_id
- Key Columns: allocation_id, mean_mean_cpu_pwr, max_cpu_pwr, mean_mean_gpu_pwr, max_gpu_pwr, begin_time, end_time
- Name: Job-level energy data. The job-level energy data is calculated by aggregating the energy values consumed by each node of a job (a sketch follows this entry).
- Script: job_energy.py
- Files (type and quantity): one parquet file per day; energy values are summed across the nodes on which each job ran
- Memory Footprint: 100 MB
- Index used: allocation_id
- Key Columns: allocation_id, energy, gpu_energy, num_nodes, num_gpus, begin_time, end_time, job_domain, account
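A sketch of the energy summation; job_energy.py defines the actual method, and the rectangle-rule approximation (10-second mean power × 10 s) shown here is an assumption for illustration:

```python
import pandas as pd

# Job-wise power time series at 10-second intervals (hypothetical path).
ts = pd.read_parquet("power_ts_job_aware.parquet")

# Approximate per-sample energy: watts x 10 seconds = joules.
ts["energy_j"] = ts["sum_inp"] * 10.0

# Sum energy across a job's samples (and hence across its nodes and run time).
job_energy = ts.groupby("allocation_id").agg(
    energy=("energy_j", "sum"),
    begin_time=("timestamp", "min"),
    end_time=("timestamp", "max"),
)
```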
- Name: Thermal cluster-level time-series. Each row corresponds to a 10-second time interval and contains the number of nodes with thermal measurements, the list of nodes and their GPUs that were hot, and the number of nodes in each temperature band, together with telemetry for the cooling plant (a sketch of the banding follows this entry).
- Script: andes-thermal-cluster.py
- Files (type and quantity): one csv file per day
- Memory Footprint: 1 GB
- Index used: timestamp
- Key Columns: hostname, any_nan
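A sketch of counting nodes per temperature band at each 10-second interval; the band edges and column names are illustrative assumptions, and andes-thermal-cluster.py defines the real thresholds:

```python
import pandas as pd

# Per-node 10-second thermal data (hypothetical path and columns).
df = pd.read_parquet("node-thermal-10s.parquet")

# Bucket each node's GPU core temperature into bands (illustrative edges).
bands = pd.cut(df["gpu_core_temp"], bins=[0, 50, 60, 70, 100],
               labels=["cool", "warm", "hot", "very_hot"])

# Count nodes per band per timestamp.
counts = (
    df.assign(band=bands)
      .groupby(["timestamp", "band"], observed=False)["hostname"]
      .count()
      .unstack("band")
)
```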
- Name: Thermal cluster-level time series for component types. Each row corresponds to a 10-second time interval and contains information about the component temperature distribution across Summit, together with telemetry for the cooling plant.
- Script: thermal-cluster-comptype.py
- Files (type and quantity): one csv file per day
- Memory Footprint: 2 GB
- Index used: timestamp
- Key Columns: gpu_core.mean
- Name: Thermal per-node job-level time series. Each row corresponds to a 10-second time interval in a job and contains the job's number of nodes with thermal measurements, the list of nodes and their GPUs that were hot, and the number of nodes in each temperature band, together with telemetry for the cooling plant.
- Script: andes-thermal-perjob-time.py
- Files (type and quantity): one csv file per day
- Memory Footprint: 167 GB
- Index used: timestamp, allocation_id
- Key Columns: hostname, any_nan
- Name: Thermal job-level time series for component types. Each row corresponds to a 10-second time interval in a job and contains information about the component temperature distribution across the job at that time, together with telemetry for the cooling plant.
- Script: thermal-perjob-comptype.py
- Files (type and quantity): one csv file per day
- Memory Footprint: 268 GB
- Index used: timestamp, allocation_id
- Key Columns: gpu_core.mean
- Name: Summit cooling system and weather time-series. Each row corresponds to a 10-second time interval and contains telemetry for the cooling plant.
- Files (type and quantity): single parquet file
- Memory Footprint: 350 MB
- Index used: timestamp
- Key Columns: mtwst, mtwrt
- Name: Main switch board meter data. Power measurements at the main switch boards depicted in Figure 1(c), covering the period 2021-01-14 to 2021-01-15.
- Files (type and quantity): 5 csv files in total, one per main switch board
- Dimension: 172,800 x 2
- Memory Footprint: 7.2 MB x 4
- Index used: timestamp
- Key Columns: B5600_MSB{MSB_ID}_MTR\1s\KW\Total
- Filename: validation.ipynb
- Datasets: Main switch board meter data (Dataset 12), Per-node 10-second time series data
- Tools Used: pandas, dask, matplotlib, seaborn
- Primary Calculations Performed: The per-node 10-second time series data is joined with a node-to-MSB mapping that was manually created from the floor map. A group-by summation per MSB then produces 10-second mean power time-series data, which is compared with 10-second averages of the MSB-level measurements (a sketch follows this entry).
- Other Complementary Calculations: N/A
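A sketch of the validation join described above; the node-to-MSB mapping format, file names, and column names are assumptions, and the comparison assumes the two timestamp types align:

```python
import pandas as pd

power = pd.read_parquet("node-power-10s.parquet")   # hypothetical path
node_to_msb = pd.read_csv("node_msb_map.csv")       # hostname -> msb, manual map

# Sum node input power per main switch board at each 10-second timestamp.
per_msb = (
    power.merge(node_to_msb, on="hostname")
         .groupby(["timestamp", "msb"])["input_power"].sum()
         .unstack("msb")
)

# 10-second means of the meter readings, for comparison.
meters = pd.read_csv("msb_meter.csv", index_col="timestamp", parse_dates=True)
meters_10s = meters.resample("10s").mean()

# Residual should be small if the node-level telemetry is sound.
residual = per_msb["MSB1"] - meters_10s["MSB1"]     # hypothetical column names
```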
- Filename: summit-pue-plot-clean.ipynb
- Datasets: Summit cooling system and weather time-series (Dataset 11)
- Tools Used: pandas, numpy, matplotlib, seaborn
- Primary Calculations Performed: 5 columns of the Summit cooling system and weather time-series data are summarized into weekly box plots over the year 2020. For the weekly power summaries, we also plot the maximum cluster-level power seen that week.
- Other Complementary Calculations: We calculate the average PUE of 2020 and the average PUE during just the summer, when Summit's chillers are active (a sketch follows this entry).
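A sketch of the PUE averaging, under the standard definition PUE = total facility power / IT power; the file name, column names, and the split of facility power into IT and cooling terms are hypothetical assumptions:

```python
import pandas as pd

# Cooling/weather time series, assumed indexed by a DatetimeIndex.
ts = pd.read_parquet("cooling_weather.parquet")  # hypothetical path
ts = ts.loc["2020-01-01":"2020-12-31"]

# PUE per sample: (IT + cooling) power over IT power (hypothetical columns).
pue = (ts["it_power_kw"] + ts["cooling_power_kw"]) / ts["it_power_kw"]
print("Average PUE 2020:", pue.mean())

# Same, restricted to the summer months when the chillers are active.
summer = ts.loc["2020-06-01":"2020-08-31"]
pue_summer = (summer["it_power_kw"] + summer["cooling_power_kw"]) / summer["it_power_kw"]
print("Average summer PUE:", pue_summer.mean())
```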
- Filename: input_power_total_energy.ipynb
- Dataset: Job-level power data (Dataset 5)
- Tools Used: pandas, numpy, matplotlib, seaborn
- Primary Calculations Performed: The energy consumption and maximum input power of jobs are obtained by profiling the jobs in the job-level power data. Gaussian kernel density plots show the distribution of input power and total energy across the five node-count classes.
- Other Complementary Calculations: N/A
- Filename: boxplot_input_power_total_energy.ipynb
- Datasets: Job-level power data (Dataset 5), Job-level energy data (Dataset 7)
- Tools Used: dask, pandas, numpy, matplotlib, seaborn
- Primary Calculations Performed: The two leadership node-count classes are compared over a variety of metrics: Number of Nodes in Job, Walltime of Job, Mean Power, Max Power, and (Mean - Max) Power Difference. Each of these is shown as a cumulative distribution function with the 80% level marked in red (a sketch follows this entry).
- Other Complementary Calculations: N/A
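A sketch of an empirical CDF with the 80% level marked in red, as in the notebook's figures; the data here is a synthetic placeholder standing in for a job metric:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data standing in for a job metric such as mean power.
values = np.sort(np.random.default_rng(0).gamma(2.0, 50.0, size=1000))
cdf = np.arange(1, len(values) + 1) / len(values)

plt.plot(values, cdf)
plt.axhline(0.8, color="red")                        # the 80% level in red
plt.axvline(np.quantile(values, 0.8), color="red", linestyle="--")
plt.xlabel("Mean job power (kW)")
plt.ylabel("CDF")
plt.show()
```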
- Filename: boxplot_input_power_total_energy.ipynb
- Datasets: Job-level power data (Dataset 5), Job-level energy data (Dataset 7)
- Tools Used: dask, pandas, numpy, matplotlib, seaborn
- Primary Calculations Performed: The two leadership node count classes are compared in both energy and max power. Results are further divided by OLCF project science domains and presented as boxplot distributions.
- Other Complementary Calculations: N/A
- Filename: cpu-gpu.ipynb
- Datasets: Job wise CPU and GPU components per-node power (Dataset 6)
- Tools Used: dask, pandas, numpy, matplotlib, seaborn
- Primary Calculations Performed: Jobs are partitioned into the five node-count classes; four 2-dimensional KDE plots are then produced, with mean or maximum CPU power on one dimension and GPU power on the other, for the two leadership classes and the three smaller classes.
- Other Complementary Calculations: N/A
- Filename: summit-edges-plot-clean.ipynb
- Datasets: Job wise power time-series (Dataset 3)
- Tools Used: pandas, numpy, matplotlib, seaborn
- Primary Calculations Performed: We identify one key 4,608-node job that lasts 7 minutes. We show play-by-play snapshots presenting boxplots of the individual GPU powers and temperatures along with their maximums. Six instants are further examined to visualize the distribution of GPU power versus temperature for all GPUs participating in the job. Lastly, GPU core temperatures at the six instants are aggregated into racks and displayed as a heatmap looking down on the Summit floor layout; both mean and maximum GPU temperatures are plotted. Missing racks are shown in grey and non-participating racks in bright green.
- Other Complementary Calculations: The spread of the GPU core temperatures is 15.8 °C and the spread of the GPU power is 62.2 W.