Commit 59c9588

enh(doc): Add ci-overview in docs/source/reference/ (#5137)
Signed-off-by: Venky Ganesh <[email protected]>
1 parent 88cba5f commit 59c9588

3 files changed: +115 -0 lines changed


.github/pull_request_template.md

Lines changed: 2 additions & 0 deletions
@@ -53,6 +53,8 @@ Launch build/test pipelines. All previously running jobs will be killed.
 
 `--extra-stage "H100_PCIe-[Post-Merge]-1, xxx"` *(OPTIONAL)* : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".
 
+For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md`.
+
 ### kill
 
 `kill `

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -132,6 +132,7 @@ Welcome to TensorRT-LLM's Documentation!
 
 reference/precision.md
 reference/memory.md
+reference/ci-overview.md
 
 
 .. toctree::

docs/source/reference/ci-overview.md

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
# Continuous Integration Overview

This page explains how TensorRT-LLM's CI is organized and how individual tests map to Jenkins stages. Most stages execute integration tests defined in YAML files, while unit tests run as part of a merge-request pipeline. The sections below describe how to locate a test and trigger the stage that runs it.

## Table of Contents
1. [CI pipelines](#ci-pipelines)
2. [Test definitions](#test-definitions)
3. [Unit tests](#unit-tests)
4. [Jenkins stage names](#jenkins-stage-names)
5. [Finding the stage for a test](#finding-the-stage-for-a-test)
6. [Waiving tests](#waiving-tests)
7. [Triggering CI Best Practices](#triggering-ci-best-practices)

## CI pipelines

Opening a pull request does not start CI automatically. Developers trigger it by commenting `/bot run` (optionally with arguments) on the pull request (see [Pull Request Template](../../../.github/pull_request_template.md) for more details). That kicks off the **merge-request pipeline** (defined in `jenkins/L0_MergeRequest.groovy`), which runs unit tests and the integration tests whose YAML entries specify `stage: pre_merge`. Once a pull request is merged, a separate **post-merge pipeline** (defined in `jenkins/L0_Test.groovy`) runs every test marked `post_merge` across all supported GPU configurations.

`stage` tags live in the YAML files under `tests/integration/test_lists/test-db/`. Searching those files for `stage: pre_merge` shows exactly which tests the merge-request pipeline covers.

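The search described above can be scripted. The following is a minimal sketch, assuming the `stage` tag always appears on its own line as `stage: <value>`; the helper names `text_has_stage` and `files_with_stage` are hypothetical and not part of the repository.

```python
import re
from pathlib import Path


def text_has_stage(yaml_text: str, stage: str) -> bool:
    """Return True if the text contains a line of the form `stage: <stage>`."""
    pattern = rf"^\s*stage:\s*{re.escape(stage)}\s*$"
    return re.search(pattern, yaml_text, flags=re.MULTILINE) is not None


def files_with_stage(test_db_dir: str, stage: str) -> list[str]:
    """List the *.yml files directly under test_db_dir that mention the stage."""
    return sorted(
        path.name
        for path in Path(test_db_dir).glob("*.yml")
        if text_has_stage(path.read_text(encoding="utf-8"), stage)
    )
```

Calling `files_with_stage("tests/integration/test_lists/test-db", "pre_merge")` from the repository root would list the files that contribute tests to the merge-request pipeline. It is a plain text match, so it does not evaluate the other conditions in each entry.
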
## Test definitions

Integration tests are listed under `tests/integration/test_lists/test-db/`. Most YAML files are named after the GPU or configuration they run on (for example `l0_a100.yml`). Some files, like `l0_sanity_check.yml`, use wildcards and can run on multiple hardware types. Each entry contains conditions and a list of tests. Two important terms in each entry are:

- `stage`: either `pre_merge` or `post_merge`.
- `backend`: `pytorch`, `tensorrt` or `triton`.

Example from `l0_a100.yml`:

```yaml
terms:
  stage: post_merge
  backend: triton
tests:
- triton_server/test_triton.py::test_gpt_ib_ptuning[gpt-ib-ptuning]
```

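To make the role of the two terms concrete, here is a small sketch of filtering entries by `stage` and `backend`. It uses plain Python dicts as a stand-in for parsed YAML and considers only these two terms; real entries carry further conditions that this sketch ignores. The function name `select_tests` and the second entry are hypothetical.

```python
def select_tests(entries, stage, backend):
    """Collect tests from entries whose terms match both stage and backend."""
    selected = []
    for entry in entries:
        terms = entry.get("terms", {})
        if terms.get("stage") == stage and terms.get("backend") == backend:
            selected.extend(entry.get("tests", []))
    return selected


# Plain dicts standing in for parsed YAML entries.
entries = [
    {
        "terms": {"stage": "post_merge", "backend": "triton"},
        "tests": ["triton_server/test_triton.py::test_gpt_ib_ptuning[gpt-ib-ptuning]"],
    },
    {
        # Hypothetical entry, for illustration only.
        "terms": {"stage": "pre_merge", "backend": "pytorch"},
        "tests": ["unittest/example.py::test_placeholder"],
    },
]

print(select_tests(entries, "post_merge", "triton"))
```
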
## Unit tests

Unit tests live under `tests/unittest/` and run during the merge-request pipeline. They are invoked from `jenkins/L0_MergeRequest.groovy` and do not require mapping to specific hardware stages.

## Jenkins stage names

`jenkins/L0_Test.groovy` maps stage names to these YAML files. For A100 the mapping includes:

```groovy
"A100X-Triton-Python-[Post-Merge]-1": ["a100x", "l0_a100", 1, 2],
"A100X-Triton-Python-[Post-Merge]-2": ["a100x", "l0_a100", 2, 2],
```

The array elements are: GPU type, YAML file (without extension), shard index, and total number of shards. Only tests with `stage: post_merge` from that YAML file are selected when a `Post-Merge` stage runs.

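As an illustration of how the four array elements can be read, the sketch below pulls entries of exactly this shape out of a text snippet with a regular expression. It assumes every entry has the four-element form shown above and silently skips anything else; it is not how Jenkins itself reads the file. `StageEntry` and `parse_stage_entries` are hypothetical names.

```python
import re
from typing import NamedTuple


class StageEntry(NamedTuple):
    name: str
    gpu_type: str
    yaml_file: str
    shard_index: int
    shard_count: int


# Matches: "<stage name>": ["<gpu>", "<yaml>", <shard index>, <shard count>]
ENTRY_RE = re.compile(
    r'"(?P<name>[^"]+)"\s*:\s*\[\s*"(?P<gpu>[^"]+)"\s*,\s*"(?P<yaml>[^"]+)"\s*,'
    r"\s*(?P<index>\d+)\s*,\s*(?P<count>\d+)\s*\]"
)


def parse_stage_entries(groovy_text: str) -> list[StageEntry]:
    """Extract every four-element stage entry found in the text."""
    return [
        StageEntry(m["name"], m["gpu"], m["yaml"], int(m["index"]), int(m["count"]))
        for m in ENTRY_RE.finditer(groovy_text)
    ]


SAMPLE = '''
"A100X-Triton-Python-[Post-Merge]-1": ["a100x", "l0_a100", 1, 2],
"A100X-Triton-Python-[Post-Merge]-2": ["a100x", "l0_a100", 2, 2],
'''

for entry in parse_stage_entries(SAMPLE):
    print(entry.name, entry.yaml_file, f"shard {entry.shard_index}/{entry.shard_count}")
```
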
## Finding the stage for a test

1. Locate the test in the appropriate YAML file under `tests/integration/test_lists/test-db/` and note its `stage` and `backend` values.
2. Search `jenkins/L0_Test.groovy` for a stage whose YAML file matches (for example `l0_a100`) and whose name contains `[Post-Merge]` if the YAML entry uses `stage: post_merge`.
3. The resulting stage name(s) are what you pass to Jenkins via the `stage_list` parameter when triggering a job.

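Step 2 can be sketched as a lookup over a stage map. The example below assumes the map has already been loaded into a Python dict of the same shape as the Groovy entries, and that pre-merge stages are exactly those whose names lack the `[Post-Merge]` tag; both are assumptions, and the third entry is hypothetical. The lookup matches on YAML file and stage only, so picking the stage for the right backend is still a manual step.

```python
POST_MERGE_TAG = "[Post-Merge]"


def stages_for(stage_map, yaml_file, stage):
    """Return stage names mapped to yaml_file that match the given stage kind."""
    want_post_merge = stage == "post_merge"
    return sorted(
        name
        for name, (_gpu, mapped_yaml, _index, _count) in stage_map.items()
        if mapped_yaml == yaml_file and (POST_MERGE_TAG in name) == want_post_merge
    )


stage_map = {
    "A100X-Triton-Python-[Post-Merge]-1": ["a100x", "l0_a100", 1, 2],
    "A100X-Triton-Python-[Post-Merge]-2": ["a100x", "l0_a100", 2, 2],
    # Hypothetical pre-merge stage, for illustration only.
    "Example-PreMerge-Stage-1": ["a100x", "l0_a100", 1, 1],
}

print(stages_for(stage_map, "l0_a100", "post_merge"))
```
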
### Example

`triton_server/test_triton.py::test_gpt_ib_ptuning[gpt-ib-ptuning]` appears in `l0_a100.yml` under `stage: post_merge` and `backend: triton`. The corresponding Jenkins stages are `A100X-Triton-Python-[Post-Merge]-1` and `A100X-Triton-Python-[Post-Merge]-2` (two shards).

To run the same tests on your pull request, comment:

```bash
/bot run --stage-list "A100X-Triton-Python-[Post-Merge]-1,A100X-Triton-Python-[Post-Merge]-2"
```

This executes the same tests that run post-merge for this hardware/backend.

## Waiving tests

Sometimes a test is known to fail due to a bug or unsupported feature. Instead
of removing it from the YAML test lists, add the test name to
`tests/integration/test_lists/waives.txt`. Every CI run passes this file to
pytest via `--waives-file`, so the listed tests are skipped automatically.

Each line contains the fully qualified test name followed by an optional
`SKIP (reason)` marker. A `full:GPU_TYPE/` prefix restricts the waive to a
specific hardware family. Example:

```text
examples/test_openai.py::test_llm_openai_triton_1gpu SKIP (https://nvbugspro.nvidia.com/bug/4963654)
full:GH200/examples/test_qwen2audio.py::test_llm_qwen2audio_single_gpu[qwen2_audio_7b_instruct] SKIP (arm is not supported)
```

Changes to `waives.txt` should include a bug link or brief explanation so other
developers understand why the test is disabled.

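The line format described above can be checked with a small parser. This is a sketch under two assumptions: test names contain no whitespace, and the reason is everything between the outermost parentheses after `SKIP`. It is not the parser the CI uses, and `Waive` and `parse_waive_line` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class Waive(NamedTuple):
    gpu_type: Optional[str]  # set only when the line has a full:GPU_TYPE/ prefix
    test_name: str
    reason: Optional[str]    # text inside SKIP (...), if present


WAIVE_RE = re.compile(
    r"(?:full:(?P<gpu>[^/\s]+)/)?(?P<test>\S+)(?:\s+SKIP\s+\((?P<reason>.*)\))?"
)


def parse_waive_line(line: str) -> Optional[Waive]:
    """Parse one waives.txt line; return None if it does not fit the format."""
    match = WAIVE_RE.fullmatch(line.strip())
    if match is None:
        return None
    return Waive(match["gpu"], match["test"], match["reason"])


line = (
    "full:GH200/examples/test_qwen2audio.py::"
    "test_llm_qwen2audio_single_gpu[qwen2_audio_7b_instruct] SKIP (arm is not supported)"
)
print(parse_waive_line(line))
```

A check like this could flag entries that lack a reason before a change to `waives.txt` is submitted.
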
## Triggering CI Best Practices

### Triggering Post-merge tests

When you only need to verify a handful of post-merge tests, avoid the heavy
`/bot run --post-merge` command. Instead, specify exactly which stages to run:

```bash
/bot run --stage-list "stage-A,stage-B"
```

This runs **only** the stages listed. You can also add stages on top of the
default pre-merge set:

```bash
/bot run --extra-stage "stage-A,stage-B"
```

Both options accept any stage name defined in `jenkins/L0_Test.groovy`. Being
selective keeps CI turnaround fast and conserves hardware resources.

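When the stage names come from a script rather than being typed by hand, the comment text can be assembled as in the sketch below. The helper `bot_run_comment` is hypothetical; it only formats the two flags shown above and joins stage names with commas, as in the examples on this page.

```python
def bot_run_comment(stages, extra=False):
    """Build a /bot run comment using --stage-list, or --extra-stage if extra."""
    if not stages:
        raise ValueError("at least one stage name is required")
    flag = "--extra-stage" if extra else "--stage-list"
    joined = ",".join(stages)
    return f'/bot run {flag} "{joined}"'


print(bot_run_comment(["stage-A", "stage-B"]))
print(bot_run_comment(["stage-A", "stage-B"], extra=True))
```
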
### Avoiding unnecessary `--disable-fail-fast` usage

Avoid using `--disable-fail-fast` by habit, because it wastes scarce hardware resources. The CI system automatically reuses successful test stages when commits remain unchanged, and subsequent `/bot run` commands only retry failed stages. Overusing `--disable-fail-fast` keeps already-failed pipelines running and consuming resources (like DGX-H100s), which increases queue backlogs and reduces team efficiency.
