The examples here reproduce the data in our paper.

To prepare input data for the experiments, run

```bash
python json_gen.py
```

to create the `experiments` directory with the following input files inside:
```
experiments
├── DGX2
│   └── 2_chassis
│       ├── AllGather
│       │   ├── Fast
│       │   │   ├── 1KB.json
│       │   │   └── <up to 1GB.json, 11 files in total>
│       │   ├── Fast_Early_Stop
│       │   └── Slow
│       └── AlltoAll
├── NDv2
│   ├── 2_chassis
│   │   ├── AllGather
│   │   └── AlltoAll
│   └── 4_chassis
│       ├── AllGather
│       └── AlltoAll
└── output
    ├── DGX2_output
    │   └── 2_chassis
    │       ├── AllGather
    │       └── AlltoAll
    └── NDv2_output
        ├── 2_chassis
        │   ├── AllGather
        │   └── AlltoAll
        └── 4_chassis
            ├── AllGather
            └── AlltoAll
```
Each `AllGather` directory contains 3 link types: `Fast`, `Fast_Early_Stop`, and `Slow`; each `AlltoAll` directory contains `Fast` and `Slow`. Each link type directory contains 11 input config files for data sizes ranging from 1KB to 1GB. This totals 11 × 3 (topologies, i.e. DGX2 2 chassis, NDv2 2 chassis, NDv2 4 chassis) × 3 (AllGather link types) + 11 × 3 × 2 (AlltoAll link types) = 165 experiments to run. A quick sanity check is sketched below.
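As a sanity check, you can count the generated config files. This helper is not part of the repo; it assumes `json_gen.py` was run from the repository root.

```python
# Sanity check (not part of the repo): count the generated input configs.
from pathlib import Path

configs = [p for p in Path("experiments").rglob("*.json")
           if "output" not in p.parts]  # ignore the (initially empty) output tree
print(len(configs))  # expect 165 = 11 sizes x (3x3 AllGather + 3x2 AlltoAll)
```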
To run TE-CCL on all the input files we just created, run

```bash
./run_experiments.sh
```

Outputs are written to the `experiments/output` directory. Running everything takes approximately 200 hours (about 8 days).
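Because the full run takes days, you may want to monitor progress. A minimal sketch, assuming each finished experiment writes its results as files under `experiments/output`:

```python
# Hypothetical progress check (not part of the repo): count the result files
# written so far while run_experiments.sh is running.
from pathlib import Path

done = sum(1 for p in Path("experiments/output").rglob("*") if p.is_file())
print(f"{done} output files written so far")
```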
After getting all outputs from TE-CCL, run

```bash
python generate_tables.py
```

to generate tables from the raw data. Tables are organized in the `experiments_output` directory as follows:
```
experiments_output
└── tables
    ├── comparison_tables
    └── individual_tables
```
Tables in `individual_tables` contain important metrics, such as algorithm bandwidth, for each experiment setup. Tables in `comparison_tables` compare TE-CCL against TACCL [1]. These correspond to Table 8 in the Appendix (see Data Differences).
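To inspect a generated table programmatically, here is a minimal sketch (assuming the tables are CSV files; list the directory first to confirm the actual filenames, which are not specified here):

```python
# Hypothetical inspection snippet (not part of the repo).
import pandas as pd
from pathlib import Path

tables_dir = Path("experiments_output/tables/individual_tables")
print(sorted(p.name for p in tables_dir.iterdir()))  # discover the real filenames

df = pd.read_csv(next(tables_dir.glob("*.csv")))  # load the first CSV table
print(df.head())
```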
After getting all the tables, run

```bash
python generate_figures.py
```

to generate the figures in the paper:
```
experiments_output
└── figures
    ├── fig_10_a.pdf
    ├── fig_10_b.pdf
    ├── fig_5_a.pdf
    ├── fig_5_b.pdf
    ├── fig_6_a.pdf
    └── fig_6_b.pdf
```
Note that we use the solver time data from our own experiments to reproduce the figures, because solver time varies considerably with the physical capacity of the machine it runs on.
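To confirm that all six figures were generated, a minimal check (not part of the repo):

```python
# Hypothetical check: verify all six figure PDFs listed above exist.
from pathlib import Path

expected = {f"fig_{n}_{s}.pdf" for n in (5, 6, 10) for s in "ab"}
produced = {p.name for p in Path("experiments_output/figures").glob("*.pdf")}
print("missing:", sorted(expected - produced))
```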
We ran TE-CCL and provide its output in `experiments/output_provided`, which should have the same content as `experiments/output` after all experiments complete.
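To compare your run against the provided reference, here is a minimal sketch (not part of the repo). It only compares the sets of file paths, since file contents such as solver times depend on your machine:

```python
# Hypothetical comparison: check that your run produced the same set of files
# as the provided reference output. Contents may still differ (e.g., solver times).
from pathlib import Path

mine = {p.relative_to("experiments/output")
        for p in Path("experiments/output").rglob("*") if p.is_file()}
ref = {p.relative_to("experiments/output_provided")
       for p in Path("experiments/output_provided").rglob("*") if p.is_file()}
print("missing:", sorted(str(p) for p in ref - mine))
print("extra:", sorted(str(p) for p in mine - ref))
```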
We ran TACCL [1] and provide its results in `individual_tables`. To run TACCL, users should refer to the TACCL repo.
[1] Shah, A., Chidambaram, V., Cowan, M., Maleki, S., Musuvathi, M., Mytkowicz, T., Nelson, J., and Saarikivi, O., 2023. TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23) (pp. 593-612).
We have been improving TE-CCL since publication, so the results here are similar to or better than those in the Appendix. Notably, for the NDv2 2 chassis AllGather with optimal epoch duration, TE-CCL took 7201.05s on the 1GB case, as shown in the Appendix; it now takes around just one minute for the same setup.
All the data and figures that involve our proprietary internal topologies (the int1 and int2 topologies in the paper) have been removed from the codebase to protect proprietary information.