GoodPut Guide Added #7

b0noI · 2025-05-27T17:39:44Z

No description provided.

…ective as Jules: I've updated the GoodPut guide based on your feedback. This update significantly enhances GOODPUT_GUIDE.md to make it more comprehensive and address your specific requests: - **Generalization:** I've reframed the guide as a general document for GoodPut optimization, using the LLaMA3-1-70B recipe as a case study. - **Sources of BadPut:** I added a new section detailing common sources of inefficiency, including quantitative examples and references from the Google Cloud blog post. - **Enhanced Technical Details:** - The "Elastic Training Solutions" section now includes more in-depth explanations of the Supervisor components (Sensor, Controller, Actuator), Host Monitors, and their interactions. It also provides more specific links to configuration parameters in `values-supervisor.yaml`, `ksa-setup.yaml`, the supervisor Helm chart, and the NVIDIA Resiliency Extension. - The "Optimized Checkpointing Solutions" section offers more detailed descriptions of asynchronous and distributed checkpointing, and the role of GCS FUSE. It includes more specific links to flags in `values.yaml`, GCS configuration in `values-gcs.yaml`, the gcs-fuse Helm chart, and external PyTorch/GCS FUSE documentation. - **Link Precision:** I've made the links to files within the repository (e.g., `values.yaml`, `README.md`) in the "DIY" and "Goodput Analysis" sections more explicit and accurate. The guide now provides a more thorough overview of GoodPut concepts, practical implementation details based on the recipe, and pointers to relevant code, configurations, and external resources.

This commit addresses several improvements to the GOODPUT_GUIDE.md: - Downloads and embeds the previously externally linked image. - Converts various plain text key references (e.g., `heartbeat_polling_period_s`) and file paths (e.g., `values-supervisor.yaml`, `ksa-setup.yaml`, Helm chart paths) into relative markdown links. - Integrates the "DIY/Lego blocks" concept into existing sections rather than having a standalone section. - Adds a new subsection explaining how to use the Supervisor system with a custom model, detailing deployment, configuration, Actuator integration, and checkpointing/resumption needs.

…and add a checkpointing preamble. This change addresses your request to: - Extend the detailed, link-rich style of the Supervisor section to other relevant sections, particularly Remediation Strategies and Goodput Analysis. - Add a preamble to the Optimized Checkpointing section to guide you on choosing between asynchronous, distributed checkpointing, and how these relate to the overall resiliency managed by the Supervisor system. - Enhance descriptions of various checkpointing parameters to clarify their use and tunability. - Remove all 'State of Support' sections for brevity. - Remove the 'Conclusion' section as requested.

This commit addresses several improvements for the GoodPut guide: 1. **Adds a TLDR Section:** A new "TLDR: Recommended Lego Blocks for Your Deployment" section has been added after the introduction. This section provides a concise, actionable summary of key strategies (Supervisor System, Asynchronous Checkpointing, Distributed Checkpointing, GCS with FUSE, and Checkpoint Frequency Tuning) to help you quickly understand how to improve GoodPut for your own workloads. 2. **Updates File References to Clickable Links:** All plain text references to internal files (e.g., `values.yaml`, `values-supervisor.yaml`, `README.md`, Helm chart paths) and relevant external documentation URLs within the guide have been converted into clickable Markdown links. This improves navigability and makes it easier for you to access referenced resources and code. Relative paths for internal files have been carefully constructed. These changes aim to make the guide more user-friendly, easier to navigate, and more actionable for customers looking to apply these resiliency and efficiency techniques to their ML training deployments.

This commit significantly refactors the GOODPUT_GUIDE.md to improve its clarity, provide more concrete examples, and enhance its overall impact based on your feedback. Key changes include: - Enhanced Introduction: Added a concrete example with calculations to quantify the cost of wasted GPU hours due to low GoodPut. - Added BadPut Summary Table: Introduced a table at the beginning of the guide, listing common sources of BadPut, their descriptions, and estimated potential GoodPut loss percentages. This provides an upfront overview of challenges. - Restructured "TLDR: Recommended Lego Blocks": Reordered the "Lego Blocks" to prioritize checkpointing strategies, followed by the supervisor system, as checkpointing often offers significant initial GoodPut improvements. Added quantitative justifications (potential GoodPut recovery) to these key blocks. - Streamlined Content: Removed the redundant "Understanding Sources of BadPut (Lost Efficiency)" section, as its content was integrated into the new summary table. - Added Concluding Section: A new section "Tying It All Together: A Holistic Approach" was added to summarize the guide's message and emphasize the synergistic application of the discussed strategies. - Comprehensive Review: Ensured all examples, references, and internal links remain accurate and contextually appropriate after the restructuring. These changes aim to make the guide more actionable and persuasive for you when looking to improve your ML training efficiency.

This commit adds a Table of Contents (TOC) to the `GOODPUT_GUIDE.md` document to improve navigation. The TOC is placed after the initial introductory paragraph and includes links to all major H2 and H3 sections. Output:

Copilot

Pull Request Overview

This PR introduces a comprehensive GoodPut guide for large-scale ML training resilience along with significant enhancements and new tests in the ML GoodPut Measurement package.

Added a detailed documentation file (GOODPUT_GUIDE.md) explaining strategies for improving ML GoodPut.
Introduced and updated source code, tests, and configuration files for the ML GoodPut Measurement package covering caching, metrics, and utility functions.

Reviewed Changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
training/a3mega/llama3-1-70b/nemo-pretraining-gke-resiliency/GOODPUT_GUIDE.md	New documentation guide for improving ML GoodPut.
ml-goodput-measurement/pyproject.toml	New package configuration and dependency specifications.
ml-goodput-measurement/ml_goodput_measurement/tests/goodput_cache_test.py	Unit tests for the GoodputCache implementation.
ml-goodput-measurement/ml_goodput_measurement/tests/gcp_metrics_test.py	Unit tests for GCP metrics reporting functionality.
ml-goodput-measurement/ml_goodput_measurement/src/goodput_utils.py	Utility functions supporting Goodput/Badput calculations and metadata retrieval.
ml-goodput-measurement/ml_goodput_measurement/src/goodput_cache.py	Implementation of the GoodputCache with update and retrieval methods.
ml-goodput-measurement/ml_goodput_measurement/src/gcp_metrics.py	GCP metrics reporting functionality with batched time series sending.
Other files (LICENSE, CONTRIBUTING.md, CHANGELOG.md, init.py)	Packaging, licensing, and contribution guidelines updates.

Comments suppressed due to low confidence (1)

ml-goodput-measurement/ml_goodput_measurement/src/goodput_utils.py:167

The function uses a generic substring check ('time') to extract a timestamp from keys, which may cause unintended matches. It is advisable to use a more explicit key name (e.g., 'timestamp') to improve clarity and reduce potential errors.

def get_timestamp_from_log_entry(entry: dict[str, Any]) -> Optional[datetime.datetime]:

ml-goodput-measurement/ml_goodput_measurement/src/goodput_cache.py

This reverts commit 7934365.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Slava Kovalevskyi and others added 9 commits May 19, 2025 10:38

adding good put lib code

7934365

Update GOODPUT_GUIDE.md

09d314e

Docs: Add Table of Contents to GOODPUT_GUIDE.md

825da18

This commit adds a Table of Contents (TOC) to the `GOODPUT_GUIDE.md` document to improve navigation. The TOC is placed after the initial introductory paragraph and includes links to all major H2 and H3 sections. Output:

Update GOODPUT_GUIDE.md

9149a5f

b0noI requested a review from Copilot May 27, 2025 17:39

Copilot AI reviewed May 27, 2025

View reviewed changes

ml-goodput-measurement/ml_goodput_measurement/src/goodput_cache.py Outdated Show resolved Hide resolved

b0noI marked this pull request as draft May 27, 2025 17:42

Revert "adding good put lib code"

289a263

This reverts commit 7934365.

b0noI requested a review from Copilot May 27, 2025 17:44

Copilot AI reviewed May 27, 2025

View reviewed changes

b0noI requested a review from Copilot May 27, 2025 17:46

Copilot AI reviewed May 27, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

GoodPut Guide Added #7

GoodPut Guide Added #7

Uh oh!

b0noI commented May 27, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

GoodPut Guide Added #7

Are you sure you want to change the base?

GoodPut Guide Added #7

Uh oh!

Conversation

b0noI commented May 27, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!