Artifact description appendix for the SC 2021 conference article titled “Revealing Power, Energy and Thermal Dynamics of a 200PF Pre-Exascale Supercomputer”.
Paper link: ACM Digital Library
The repository contains:
- Python notebooks: These notebooks generate the plots in the article.
- Python scripts: These scripts generate the processed datasets from the raw log files. Each script has an associated README file that briefly describes the generated dataset and lists each field name with a description. The scripts depend on the following Python libraries (a minimal loading sketch follows the list):
- Dask (https://dask.org/)
- Pandas (https://pandas.pydata.org/)
- PyArrow (https://arrow.apache.org/docs/python/)
- Numpy (https://numpy.org/)
- Scipy (https://www.scipy.org/)
- Matplotlib (https://matplotlib.org/)
- Seaborn (https://seaborn.pydata.org/)
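As a minimal sketch of how these libraries fit together (the file path and column layout here are illustrative assumptions, not the repository's actual layout), a day of raw 1-second parquet telemetry could be loaded and inspected like this:

```python
import dask.dataframe as dd

# Load one day of raw 1-second per-node telemetry (many small parquet files).
df = dd.read_parquet("openbmc/2020-07-01/*.parquet", engine="pyarrow")

# Inspect the schema and a few rows without loading the whole day into memory.
print(df.columns)
print(df.head())
```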
- Name: Summit per-node OpenBMC telemetry. Per-node, per-component power and temperature measurements.
- Files (type and quantity): one tar file per day, each archiving 1,440 parquet files; 399 files in total (2019-12-27 to 2021-01-31), with some dates missing due to maintenance periods
- Memory Footprint: 8.5 TB
- Source: per-node OpenBMC data from Summit, archived via the telemetry system for MTW operations
- Frequency: 1 sec
- Date of Creation / Last Update: one file per day from 2019-12-27 to 2021-01-31
- Name: Central energy plant (CEP) data
- Files (type and quantity): One parquet per month, 12 parquet files
- Memory Footprint: 256 MB
- Source: Control system of Summit's central energy plant via the telemetry system for MTW operations
- Frequency: Approx. 15 second interval
- Date of Creation / Last Update: one file per month, 2020-01-31 to 2021-01-31
- Name: Job Scheduler allocation history
- Files (type and quantity): Single csv file
- Memory Footprint: 285 MB
- Source: IBM CSM system via telemetry data store for Summit
- Frequency: At occurrence
- Date of Creation / Last Update: 2021-02-28
- Name: Per node job scheduler allocation history
- Files (type and quantity): Single csv file
- Memory Footprint: 14 GB
- Source: IBM CSM system via telemetry data store for Summit
- Frequency: At occurrence
- Date of Creation / Last Update: 2021-02-27
- Name: Nvidia GPU XID error log
- Files (type and quantity): Single csv file
- Memory Footprint: 50 MB
- Source: Per-node syslog data via telemetry data store for Summit
- Frequency: At occurrence
- Date of Creation / Last Update: 2021-02-12
- Name: Summit per-node OpenBMC telemetry 10-second aggregates. 10-second aggregation (count, min, max, mean, std) of the per-node OpenBMC telemetry data that measures node-wise, component-wise power and temperature (a sketch of the aggregation follows this entry).
- Script: andes-load-summit-power-temp-openbmc-init10s-agg.py
- Input: 1-second interval Summit per-node OpenBMC telemetry data
- Output: 10-second aggregates of the Summit per-node OpenBMC telemetry data; one parquet file per day
- Memory Footprint: 5.5 TB
- Index used: timestamp
- Key Columns: timestamp, input_power.[count, min, max, mean, std], p[0,1]_power.[count, min, max, mean, std], p[0,1]_gpu[0,1,2]_power.[count, min, max, mean, std], gpu[0,1,2,3,5]_[core,mem]_temp.[count, min, max, mean, std]
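A minimal pandas sketch of the 10-second aggregation described in this entry; the authoritative implementation is andes-load-summit-power-temp-openbmc-init10s-agg.py, and the single `input_power` column shown here stands in for the full column set:

```python
import pandas as pd

def aggregate_10s(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate one node's 1-second telemetry (DatetimeIndex) to 10 seconds."""
    agg = df[["input_power"]].resample("10s").agg(
        ["count", "min", "max", "mean", "std"]
    )
    # Flatten the MultiIndex columns to names like 'input_power.mean'.
    agg.columns = [f"{col}.{stat}" for col, stat in agg.columns]
    return agg
```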
- Name: Cluster-level power time-series. The cluster-level power time-series data contains aggregated power values for the whole cluster at every 10 seconds. For each timestamp, the power value is calculated by summing the input power of all nodes at that instant (a sketch follows this entry).
- Script: power_ts_job_ignorant.py
- Files (type and quantity): power time-series dataset at 10-second frequency; one parquet file per day with 1-minute partitions
- Memory Footprint: 1.5 GB
- Index used: timestamp
- Key Columns: timestamp, count_inp, sum_inp, mean_inp, max_inp
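A sketch of the cluster-level summation; power_ts_job_ignorant.py is the actual script, and the input path and `input_power.mean` column name are assumptions based on the 10-second aggregates above:

```python
import dask.dataframe as dd

# Hypothetical path to the 10-second per-node aggregates.
df = dd.read_parquet("openbmc-10s/2020-07-01.parquet")

# Sum input power across all nodes reporting at each 10-second timestamp.
cluster = df.groupby("timestamp")["input_power.mean"].agg(
    ["count", "sum", "mean", "max"]
)
# Rename to match the key columns listed above.
cluster = cluster.rename(
    columns={"count": "count_inp", "sum": "sum_inp",
             "mean": "mean_inp", "max": "max_inp"}
).compute()
```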
- Name: Cluster-level CPU and GPU component power time-series. Cluster-level CPU and GPU component power values are calculated by aggregating the power of every CPU and GPU component across all nodes.
- Script: power_ts_job_ignorant_component.py
- Files (type and quantity): CPU and GPU component power time-series dataset at 10-second frequency; one parquet file per day with 1-minute partitions
- Memory Footprint: 0.5 GB
- Index used: timestamp
- Key Columns: timestamp, mean_cpu_power, std_cpu_power, min_cpu_power, max_cpu_power, mean_gpu_power, std_gpu_power, max_gpu_power
- Name: Job-wise power time-series. The dataset contains a time series of power values for every job. It is generated by combining the node-level power consumption data with the job scheduler data, which lists the nodes on which each job ran (a sketch follows this entry).
- Script: power_ts_job_aware.py
- Files (type and quantity): power time-series and job scheduler time-series datasets, each with one parquet file per day
- Memory Footprint: 49 GB
- Index used: allocation_id, timestamp
- Key Columns: allocation_id, timestamp, count_hostname, sum_inp, max_inp, mean_inp
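A sketch of how node-level power samples are combined with the scheduler's node list; the actual logic is in power_ts_job_aware.py, and the file names and `hostname` join key are assumptions (pandas is shown for clarity, while the actual scripts use dask for out-of-core processing):

```python
import pandas as pd

power = pd.read_parquet("node-power-10s.parquet")       # hypothetical path
jobs = pd.read_csv("per_node_allocation_history.csv")   # hypothetical path
# jobs is assumed to hold one row per (allocation_id, hostname) pair.

# Attach each node's power samples to the job that owned the node.
merged = power.merge(jobs[["allocation_id", "hostname"]], on="hostname")

# Aggregate per job per 10-second timestamp.
per_job = merged.groupby(["allocation_id", "timestamp"]).agg(
    count_hostname=("hostname", "count"),
    sum_inp=("input_power", "sum"),
    max_inp=("input_power", "max"),
    mean_inp=("input_power", "mean"),
)
```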
- Name: Job-wise CPU and GPU component power time-series. The dataset contains time series of CPU and GPU power consumption for every job. It is generated by combining the node-level power consumption data with the job scheduler data, which lists the nodes on which each job ran.
- Script: power_ts_job_aware_component.py
- Files (type and quantity): CPU and GPU component power time-series and job scheduler time-series datasets, each with one parquet file per day
- Memory Footprint: 45 GB
- Index used: allocation_id
- Key Columns: allocation_id, timestamp, count_hostname, mean_cpu_power, std_cpu_power, max_cpu_power, cpu_nans, mean_gpu_power, std_gpu_power, max_gpu_power, gpu_nans
- Name: Job-level power data. The per-node job-level power data contains power values aggregated over each job's run time (a sketch follows this entry).
- Script: power_job_aware.py
- Files (type and quantity): one csv file per day, produced by aggregating the power time-series data over each job's run; the input dataset also has one csv file per day
- Memory Footprint: 14 GB
- Index used: allocation_id
- Key Columns: allocation_id, max_sum_inp, mean_sum_inp, begin_time, end_time
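A sketch of collapsing the job-wise time series into one row per job, mirroring the key columns above; power_job_aware.py holds the actual implementation, and the input file name is hypothetical:

```python
import pandas as pd

# One day of job-wise power time series (hypothetical file name).
ts = pd.read_csv("power_ts_job_aware_2020-07-01.csv")

# Reduce each job's time series to one summary row.
job_level = ts.groupby("allocation_id").agg(
    max_sum_inp=("sum_inp", "max"),
    mean_sum_inp=("sum_inp", "mean"),
    begin_time=("timestamp", "min"),
    end_time=("timestamp", "max"),
)
```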
- Name: Job-level CPU and GPU component power data. Job-level power values for the per-node CPU and GPU components, aggregated over each job's run time.
- Script: power_job_aware_component.py
- Files (type and quantity): one csv file per day, produced by aggregating the CPU and GPU component power time-series data over each job's run; the input dataset also has one csv file per day
- Memory Footprint: 200 MB
- Index used: allocation_id
- Key Columns: allocation_id, mean_mean_cpu_pwr, max_cpu_pwr, mean_mean_gpu_pwr, max_gpu_pwr, begin_time, end_time
- Name: Job-level energy data. The job-level energy data is calculated by aggregating the energy values consumed by each node of a job (a sketch follows this entry).
- Script: job_energy.py
- Files (type and quantity): one parquet file per day; energy values are summed across the nodes on which each job ran
- Memory Footprint: 100 MB
- Index used: allocation_id
- Key Columns: allocation_id, energy, gpu_energy, num_nodes, num_gpus, begin_time, end_time, job_domain, account
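A sketch of the energy summation; job_energy.py defines the actual method, and the rectangle-rule approximation (10-second mean power × 10 s) shown here is an assumption for illustration:

```python
import pandas as pd

# Job-wise power time series at 10-second intervals (hypothetical path).
ts = pd.read_parquet("power_ts_job_aware.parquet")

# Approximate per-sample energy: watts x 10 seconds = joules.
ts["energy_j"] = ts["sum_inp"] * 10.0

# Sum energy across a job's samples (and hence across its nodes and run time).
job_energy = ts.groupby("allocation_id").agg(
    energy=("energy_j", "sum"),
    begin_time=("timestamp", "min"),
    end_time=("timestamp", "max"),
)
```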
- Name: Thermal cluster-level time-series. Each row corresponds to a 10-second time interval and contains the number of nodes with thermal measurements, the list of nodes and their GPUs that were hot, and the number of nodes in each temperature band, together with telemetry for the cooling plant (a sketch of the banding follows this entry).
- Script: andes-thermal-cluster.py
- Files (type and quantity): one csv file per day
- Memory Footprint: 1 GB
- Index used: timestamp
- Key Columns: hostname, any_nan
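A sketch of counting nodes per temperature band at each 10-second interval; the band edges and column names are illustrative assumptions, and andes-thermal-cluster.py defines the real thresholds:

```python
import pandas as pd

# Per-node 10-second thermal data (hypothetical path and columns).
df = pd.read_parquet("node-thermal-10s.parquet")

# Bucket each node's GPU core temperature into bands (illustrative edges).
bands = pd.cut(df["gpu_core_temp"], bins=[0, 50, 60, 70, 100],
               labels=["cool", "warm", "hot", "very_hot"])

# Count nodes per band per timestamp.
counts = (
    df.assign(band=bands)
      .groupby(["timestamp", "band"], observed=False)["hostname"]
      .count()
      .unstack("band")
)
```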
- Name: Thermal cluster-level time series for component types. Each row corresponds to a 10-second time interval and contains information about the component temperature distribution across Summit, together with telemetry for the cooling plant.
- Script: thermal-cluster-comptype.py
- Files (type and quantity): one csv file per day
- Memory Footprint: 2 GB
- Index used: timestamp
- Key Columns: gpu_core.mean
- Name: Thermal per-node job-level time series. Each row corresponds to a 10-second time interval in a job and contains the job's number of nodes with thermal measurements, the list of nodes and their GPUs that were hot, and the number of nodes in each temperature band, together with telemetry for the cooling plant.
- Script: andes-thermal-perjob-time.py
- Files (type and quantity): one csv file per day
- Memory Footprint: 167 GB
- Index used: timestamp, allocation_id
- Key Columns: hostname, any_nan
- Name: Thermal job-level time series for component types. Each row corresponds to a 10-second time interval in a job and contains information about the component temperature distribution across the job at that time, together with telemetry for the cooling plant.
- Script: thermal-perjob-comptype.py
- Files (type and quantity): one csv file per day
- Memory Footprint: 268 GB
- Index used: timestamp, allocation_id
- Key Columns: gpu_core.mean
- Name: Summit cooling system and weather time-series. Each row corresponds to a 10-second time interval and contains telemetry for the cooling plant.
- Files (type and quantity): single parquet file
- Memory Footprint: 350 MB
- Index used: timestamp
- Key Columns: mtwst, mtwrt
- Name: Main switch board meter data. Power measurements at the main switch boards depicted in Figure 1(c), covering the period 2021-01-14 to 2021-01-15.
- Files (type and quantity): 5 csv files in total, one per main switch board
- Dimension: 172,800 x 2
- Memory Footprint: 7.2 MB x 4
- Index used: timestamp
- Key Columns: B5600_MSB{MSB_ID}_MTR\1s\KW\Total
- Filename: validation.ipynb
- Datasets: Main switch board meter data (Dataset 12), Per-node 10-second time series data
- Tools Used: pandas, dask, matplotlib, seaborn
- Primary Calculations Performed: The per-node 10-second time series data is joined with a node-to-MSB mapping that was manually created from the floor map. A group-by summation per MSB then produces 10-second mean power time-series data, which is compared with 10-second averages of the MSB-level measurements (a sketch follows this entry).
- Other Complementary Calculations: N/A
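A sketch of the validation join described above; the node-to-MSB mapping format, file names, and column names are assumptions, and the comparison assumes the two timestamp types align:

```python
import pandas as pd

power = pd.read_parquet("node-power-10s.parquet")   # hypothetical path
node_to_msb = pd.read_csv("node_msb_map.csv")       # hostname -> msb, manual map

# Sum node input power per main switch board at each 10-second timestamp.
per_msb = (
    power.merge(node_to_msb, on="hostname")
         .groupby(["timestamp", "msb"])["input_power"].sum()
         .unstack("msb")
)

# 10-second means of the meter readings, for comparison.
meters = pd.read_csv("msb_meter.csv", index_col="timestamp", parse_dates=True)
meters_10s = meters.resample("10s").mean()

# Residual should be small if the node-level telemetry is sound.
residual = per_msb["MSB1"] - meters_10s["MSB1"]     # hypothetical column names
```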
- Filename: summit-pue-plot-clean.ipynb
- Datasets: Summit cooling system and weather time-series (Dataset 11)
- Tools Used: pandas, numpy, matplotlib, seaborn
- Primary Calculations Performed: 5 columns of the Summit cooling system and weather time-series data are summarized into weekly box plots over the year 2020. For the weekly power summaries, we also plot the maximum cluster-level power seen that week.
- Other Complementary Calculations: We calculate the average PUE of 2020 and the average PUE during just the summer, when Summit's chillers are active (a sketch follows this entry).
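A sketch of the PUE averaging, under the standard definition PUE = total facility power / IT power; the file name, column names, and the split of facility power into IT and cooling terms are hypothetical assumptions:

```python
import pandas as pd

# Cooling/weather time series, assumed indexed by a DatetimeIndex.
ts = pd.read_parquet("cooling_weather.parquet")  # hypothetical path
ts = ts.loc["2020-01-01":"2020-12-31"]

# PUE per sample: (IT + cooling) power over IT power (hypothetical columns).
pue = (ts["it_power_kw"] + ts["cooling_power_kw"]) / ts["it_power_kw"]
print("Average PUE 2020:", pue.mean())

# Same, restricted to the summer months when the chillers are active.
summer = ts.loc["2020-06-01":"2020-08-31"]
pue_summer = (summer["it_power_kw"] + summer["cooling_power_kw"]) / summer["it_power_kw"]
print("Average summer PUE:", pue_summer.mean())
```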
- Filename: input_power_total_energy.ipynb
- Dataset: Job-level power data (Dataset 5)
- Tools Used: pandas, numpy, matplotlib, seaborn
- Primary Calculations Performed: The energy consumption and maximum input power of jobs are obtained by profiling the jobs in the job-level power data. Gaussian kernel density plots show the distribution of input power and total energy across the five node-count classes.
- Other Complementary Calculations: N/A
- Filename: boxplot_input_power_total_energy.ipynb
- Datasets: Job-level power data (Dataset 5), Job-level energy data (Dataset 7)
- Tools Used: dask, pandas, numpy, matplotlib, seaborn
- Primary Calculations Performed: The two leadership node-count classes are compared over a variety of metrics: Number of Nodes in Job, Walltime of Job, Mean Power, Max Power, and (Mean - Max) Power Difference. Each of these is shown as a cumulative distribution function with the 80% level marked in red (a sketch follows this entry).
- Other Complementary Calculations: N/A
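A sketch of an empirical CDF with the 80% level marked in red, as in the notebook's figures; the data here is a synthetic placeholder standing in for a job metric:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data standing in for a job metric such as mean power.
values = np.sort(np.random.default_rng(0).gamma(2.0, 50.0, size=1000))
cdf = np.arange(1, len(values) + 1) / len(values)

plt.plot(values, cdf)
plt.axhline(0.8, color="red")                        # the 80% level in red
plt.axvline(np.quantile(values, 0.8), color="red", linestyle="--")
plt.xlabel("Mean job power (kW)")
plt.ylabel("CDF")
plt.show()
```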
- Filename: boxplot_input_power_total_energy.ipynb
- Datasets: Job-level power data (Dataset 5), Job-level energy data (Dataset 7)
- Tools Used: dask, pandas, numpy, matplotlib, seaborn
- Primary Calculations Performed: The two leadership node count classes are compared in both energy and max power. Results are further divided by OLCF project science domains and presented as boxplot distributions.
- Other Complementary Calculations: N/A
- Filename: cpu-gpu.ipynb
- Datasets: Job wise CPU and GPU components per-node power (Dataset 6)
- Tools Used: dask, pandas, numpy, matplotlib, seaborn
- Primary Calculations Performed: Jobs are partitioned into the five node-count classes; four 2-dimensional KDE plots are then produced, with mean or maximum CPU power on one dimension and GPU power on the other, for the two leadership classes and the three smaller classes.
- Other Complementary Calculations: N/A
- Filename: summit-edges-plot-clean.ipynb
- Datasets: Job wise power time-series (Dataset 3)
- Tools Used: pandas, numpy, matplotlib, seaborn
- Primary Calculations Performed: We identify one key 4,608-node job that lasts 7 minutes. We show play-by-play snapshots presenting boxplots of the individual GPU powers and temperatures along with their maximums. Six instants are further examined to visualize the distribution of GPU power versus temperature for all GPUs participating in the job. Lastly, GPU core temperatures at the six instants are aggregated into racks and displayed as a heatmap looking down on the Summit floor layout; both mean and maximum GPU temperatures are plotted. Missing racks are shown in grey and non-participating racks in bright green.
- Other Complementary Calculations: The spread of the GPU core temperatures is 15.8 °C and the spread of the GPU power is 62.2 W.