
Conversation

@asyachmenevflant (Contributor) commented Oct 15, 2025

Description

Implement isolated DVP-over-DVP e2e testing with parallel matrix execution across storage profiles (sds, cephrbd). Includes a modular workflow architecture, a CLI wrapper, and automatic cleanup for nightly runs.

Why do we need it, and what problem does it solve?

What is the expected result?

Checklist

  • The code is covered by unit tests.
  • e2e tests passed.
  • Documentation updated according to the changes.
  • Changes were tested in the Kubernetes cluster manually.

Changelog entries

section: ci
type: chore
summary: add DVP-over-DVP matrix e2e testing

@sourcery-ai bot left a comment:

Sorry @asyachmenevflant, your pull request is larger than the review limit of 150000 diff characters

z9r5 previously approved these changes Oct 15, 2025
@asyachmenevflant added this to the v1.2.0 milestone Oct 15, 2025
Antony added 14 commits October 23, 2025 11:22
- Change from dvp.deckhouse.io/node-group=worker to hostname-based selection
- Use jq to filter VMs whose hostname contains 'worker'
- Fixes an issue where worker VMs exist but have a different label structure
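A minimal sketch of that jq-based selection, assuming the VMs live in the per-run test namespace and that matching on the VM name (which carries the hostname) is enough; the namespace value and resource name are placeholders:

```shell
# Select worker VMs by name instead of the dvp.deckhouse.io/node-group=worker label.
# NS is a hypothetical namespace; "virtualmachines" is assumed to resolve to the DVP VM resource.
NS="nightly-nested-e2e-sds-0001"
WORKER_VMS=$(kubectl -n "$NS" get virtualmachines -o json \
  | jq -r '.items[] | select(.metadata.name | test("worker")) | .metadata.name')
echo "Found worker VMs: ${WORKER_VMS}"
```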
- Fix VirtualDisk API structure (use size + storageClassName instead of resources/accessModes)
- Change SDS device from /dev/sdd to /dev/vdb
- Fix TARGET_STORAGE_CLASS for Ceph bootstrap (use linstor-thin-r2 instead of ceph SC)
- Fix Secret name generation to lowercase only
- Revert worker VM selector to use node-group label
- Note: Taskfile YAML syntax needs final heredoc fix
- Fix VirtualDisk to use size + storageClassName (not resources/accessModes); see the sketch below
- Revert worker VM selector to use node-group label
- Fix YAML syntax issues with heredoc
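A rough sketch of the corrected VirtualDisk shape; the apiVersion, metadata, and exact field nesting are assumptions about the DVP virtualization API, while size + storageClassName and the linstor-thin-r2 class come from the commits above:

```shell
# Hypothetical names; only the size/storageClassName shape is the point of the sketch.
kubectl apply -f - <<'EOF'
apiVersion: virtualization.deckhouse.io/v1alpha2
kind: VirtualDisk
metadata:
  name: worker-0-data-disk
  namespace: nightly-nested-e2e-sds-0001
spec:
  persistentVolumeClaim:
    size: 50Gi
    storageClassName: linstor-thin-r2
EOF
```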
- Move 'Attach data disks to worker VMs' before 'Bootstrap nested cluster'
- Use base StorageClass 'linstor-thin-r2' for disk attachment (available in parent cluster)
- This prevents bootstrap hanging on non-existent Ceph StorageClass
- Disks are attached early, storage backend configured later
…ction

- Worker VMs don't have the dvp.deckhouse.io/node-group=worker label
- Use grep for 'worker' in VM names instead of the label selector
- This fixes the 'No worker VMs found' error in the attach-worker-disks task
- Move 'Attach data disks to worker VMs' after 'Bootstrap nested cluster'
- Worker VMs are created during bootstrap, not before
- This fixes the 'No worker VMs found' error when trying to attach disks too early
- Maintains base StorageClass 'linstor-thin-r2' for disk attachment
- Add multiple disk attachment (2 disks per VM) to avoid conflicts with system disks
- Add missing sds-local-volume module for SDS CRD creation
- Fix device mapping: SDS uses /dev/sdd, Ceph uses /dev/sde
- Add missing Ceph modules: csi-ceph and snapshot-controller
- Improve CephCluster configuration with health checks
- Update workflow to use DISK_COUNT parameter

Fixes issues with missing CRDs and namespace creation timeouts.
- Add SCSI bus rescan command to master VM debug step
- Add separate step to activate storage disks on all worker VMs
- This should make hotplugged disks visible in lsblk output
- Fixes an issue where disks are attached but not visible in the guest OS
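The rescan injected into the guests is presumably something along these lines (a sketch; the exact command, and how the workflow gets a shell inside the worker VM, are assumptions):

```shell
# Rescan every SCSI host so hotplugged disks become visible to the guest kernel.
# Run inside the worker VM (e.g. over SSH); the /sys paths are standard Linux.
for host in /sys/class/scsi_host/host*; do
  echo "- - -" > "${host}/scan"
done
lsblk   # hotplugged disks should now appear
```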
- Replace hotplug with blockDeviceRefs to avoid ExpandDisks featuregate issues
- Add new task infra:attach-storage-disks-via-blockdevicerefs
- Remove SCSI rescan steps as they are no longer needed
- Clean up comments and deprecated messages
- Remove namespace from VirtualDisk metadata (not needed)
- Use JSON file for patch instead of inline JSON to avoid escaping issues
- Add proper VM state checking before stopping
- Add error handling for VM stop failures
- Improve logging for better debugging
- Replace label selector dvp.deckhouse.io/node-group=worker with name-based grep
- Worker VMs don't have the expected label, causing the task to find 0 VMs
- This fixes blockDeviceRefs attachment for storage disks
- Add debug output to show block_device_refs content
- Add validation to skip VMs with empty blockDeviceRefs
- Add patch file content logging for troubleshooting
- This should fix the 'must specify --patch' error
…Refs

- Replace file-based kubectl patch with inline patch
- Remove temporary file creation and heredoc issues
- Use escaped JSON in kubectl patch -p parameter
- This should fix the 'must specify --patch' error completely
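The inline form of the patch might look roughly like this; the VM and disk names are placeholders, and the blockDeviceRefs schema is an assumption based on how the commits describe it. Note that a JSON merge patch replaces the whole list, so the existing references have to be repeated:

```shell
# Inline JSON merge patch instead of a patch file / heredoc (names are placeholders).
NS="nightly-nested-e2e-sds-0001"
VM="nested-worker-0"
kubectl -n "$NS" patch virtualmachine "$VM" --type merge \
  -p "{\"spec\":{\"blockDeviceRefs\":[{\"kind\":\"VirtualDisk\",\"name\":\"${VM}-root\"},{\"kind\":\"VirtualDisk\",\"name\":\"${VM}-data-1\"}]}}"
```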
Antony added 2 commits October 24, 2025 09:54
- Change the RUN_ID prefix from 'e2e-' to 'nightly-nested-e2e-'
- This ensures the namespace cleanup can find and delete test namespaces
- Namespaces will now be created as 'nightly-nested-e2e-sds-XXXX' and 'nightly-nested-e2e-cephrbd-XXXX'
- Matches the FILTER_PREFIX='nightly-nested-e2e-' used by the cleanup job
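A sketch of the naming scheme this implies; the variable names and the random-suffix source are placeholders, only the prefix and the cleanup-by-prefix idea come from the commit:

```shell
# Namespaces end up as e.g. nightly-nested-e2e-sds-1234 / nightly-nested-e2e-cephrbd-1234.
STORAGE_PROFILE="sds"   # or "cephrbd", supplied by the workflow matrix
RUN_ID="nightly-nested-e2e-${STORAGE_PROFILE}-${GITHUB_RUN_ID:-$RANDOM}"
kubectl create namespace "$RUN_ID"

# The nightly cleanup job can then find everything by prefix:
kubectl get namespaces -o name \
  | grep '^namespace/nightly-nested-e2e-' \
  | xargs -r kubectl delete
```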
- Add verification that VM successfully starts after adding blockDeviceRefs
- Wait up to 10 minutes for VM to reach Running state
- Fail fast if VM fails to start after blockDeviceRefs modification
- Prevents subsequent workflow steps from running on non-functional VMs
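The verification could be a simple polling loop along these lines; the 10-minute budget is from the commit message, while the .status.phase field on the DVP VirtualMachine and the variable names are assumptions:

```shell
# Wait up to 10 minutes (60 attempts x 10s) for the VM to reach Running after the patch.
NS="nightly-nested-e2e-sds-0001"
VM="nested-worker-0"
PHASE=""
for attempt in $(seq 1 60); do
  PHASE=$(kubectl -n "$NS" get virtualmachine "$VM" -o jsonpath='{.status.phase}')
  [ "$PHASE" = "Running" ] && break
  echo "VM ${VM} is in phase '${PHASE}' (attempt ${attempt}/60), waiting..."
  sleep 10
done
if [ "$PHASE" != "Running" ]; then
  echo "VM ${VM} failed to start after blockDeviceRefs modification" >&2
  exit 1
fi
```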
Antony added 12 commits October 24, 2025 13:08
- Replace the blockDeviceRefs approach with VirtualMachineBlockDeviceAttachment (see the sketch below)
- Hotplug attaches disks to running VMs without stopping them
- This should resolve VM crashes during the disk attachment stage
- Add worker nodes configuration in cluster-config.yaml
- Fix storage class selection logic in workflow
- Add namespace to VirtualDisk and VMBDA manifests
- Remove excessive debug logs and checks
- Fix YAML syntax in storage manifests
- Use ceph-pool-r2-csi-rbd-immediate for Ceph profile

Fixes: bootstrap and disk attachment failures in E2E tests
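A rough sketch of such a hotplug attachment; the apiVersion and spec field names follow my reading of the DVP virtualization API and should be treated as assumptions, and all names are placeholders:

```shell
# Hotplug a data disk into a running worker VM without stopping it.
kubectl apply -f - <<'EOF'
apiVersion: virtualization.deckhouse.io/v1alpha2
kind: VirtualMachineBlockDeviceAttachment
metadata:
  name: nested-worker-0-data-1
  namespace: nightly-nested-e2e-sds-0001
spec:
  virtualMachineName: nested-worker-0
  blockDeviceRef:
    kind: VirtualDisk
    name: nested-worker-0-data-1
EOF
```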
- Remove dataDisk from nodeGroups instanceClass (not supported in DVPClusterConfiguration)
- Fix YAML structure with proper indentation for nodeGroups
- Remove unused dataDiskSize and data fields from values.yaml
- Data disks will be attached via hotplug mechanism instead

Fixes: bootstrap error 'dataDisk is a forbidden property'
- Remove nodeGroups section from DVPClusterConfiguration as it's not supported
- Worker nodes are created via separate DVPInstanceClass and NodeGroup resources
- This matches the architecture from backup branch where only masterNodeGroup is defined
- Fixes bootstrap error: nodeGroups.instanceClass.dataDisk is a forbidden property
- Add descriptive comment to cluster-config.yaml template
- Clarify purpose of the configuration file for DVP-over-DVP E2E testing
- Add push event trigger for main and feat/ci-e2e-matrix branches
- Workflow was only running on PR events, schedule, and manual dispatch
- Now it will also trigger on direct pushes to these branches
- Remove main branch from push trigger to avoid unnecessary runs
- Keep pull_request trigger for both main and feat/ci-e2e-matrix
- Now workflow will only trigger on push to our feature branch
- Always use ceph-pool-r2-csi-rbd-immediate as default storage class
- Remove conditional logic for storage profile selection
- Always enable Ceph as base storage backend for all profiles
- Enable SDS modules only for 'sds' profile as additional layer
- Simplify bootstrap and disk attachment storage class configuration
- Reduce complexity and improve reliability of E2E tests
- Set defaultClusterStorageClass to ceph-pool-r2-csi-rbd-immediate in mc.yaml
- Improve Ceph deviceFilter to support various disk types (sd, vd, xvd)
- Exclude system disks (sda/sdb/sdc) from Ceph OSD usage
- Align with simplified storage logic using Ceph everywhere
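The mc.yaml change presumably boils down to something like the following, assuming defaultClusterStorageClass lives in the global ModuleConfig settings; the storage class name is from the commit, everything else is a sketch:

```shell
# Make the Ceph RBD class the cluster-wide default; the Ceph deviceFilter regex referenced
# above could look like ^(sd[d-z]|vd[b-z]|xvd[b-z])$ to skip sda/sdb/sdc and vda.
kubectl apply -f - <<'EOF'
apiVersion: deckhouse.io/v1alpha1
kind: ModuleConfig
metadata:
  name: global
spec:
  version: 2          # settings schema version; depends on the Deckhouse release
  settings:
    defaultClusterStorageClass: ceph-pool-r2-csi-rbd-immediate
EOF
```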
- Reduce retry attempts from 120 to 30 (2.5 minutes instead of 10 minutes)
- Add comprehensive debugging for disk attachment failures:
  * PVC status and describe output
  * StorageClass availability check
  * Ceph CSI pods status
  * Ceph cluster status
  * Enhanced periodic debug snapshots every 10 retries
- Improve error reporting with detailed resource descriptions
- Add debugging to both disk attachment tasks for consistency
Taskfile.yaml changes:
- Replace kubectl wait with custom polling loop for better control
- Add detailed PVC phase tracking with retry counter
- Extract and display StorageClass, PV name, and VolumeMode separately
- Add comprehensive error reporting with describe and events output
- Adjust the PVC wait timeout from 300s to 240s (120 retries * 2s)
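The polling loop described above might look roughly like this; the 2-second interval and 120-retry budget are from the commit, while the PVC name, namespace, and error handling are placeholders:

```shell
# Poll the PVC phase instead of relying on `kubectl wait`, printing context on every retry.
NS="nightly-nested-e2e-sds-0001"
PVC="nested-worker-0-data-1"
PHASE=""
for retry in $(seq 1 120); do            # 120 retries * 2s = 240s budget
  PHASE=$(kubectl -n "$NS" get pvc "$PVC" -o jsonpath='{.status.phase}')
  SC=$(kubectl -n "$NS" get pvc "$PVC" -o jsonpath='{.spec.storageClassName}')
  PV=$(kubectl -n "$NS" get pvc "$PVC" -o jsonpath='{.spec.volumeName}')
  MODE=$(kubectl -n "$NS" get pvc "$PVC" -o jsonpath='{.spec.volumeMode}')
  echo "retry ${retry}/120: phase=${PHASE} sc=${SC} pv=${PV} volumeMode=${MODE}"
  [ "$PHASE" = "Bound" ] && break
  sleep 2
done
if [ "$PHASE" != "Bound" ]; then
  kubectl -n "$NS" describe pvc "$PVC"
  kubectl -n "$NS" get events --sort-by=.lastTimestamp | tail -n 20
  exit 1
fi
```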

Workflow changes:
- Add DEBUG_HOTPLUG environment variable for enhanced debugging output
- Reduce log verbosity: print status every 30s instead of every 5s
- Reduce debug snapshot frequency: every 60s instead of every 50s
- Add fallback mechanism: check VM events for successful hotplug completion
- Filter controller/handler logs by namespace/VM/VD to reduce noise
- Improve error reporting by focusing on relevant logs only
- This helps diagnose race conditions where the disk is attached but the VMBDA status is not updated
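The namespace/VM/VD-scoped log filtering might reduce to something like this; the d8-virtualization namespace, the virtualization-controller deployment name, and the variable names are assumptions:

```shell
# Keep only controller log lines that mention our test namespace, VM, or VirtualDisk.
NS="nightly-nested-e2e-sds-0001"; VM="nested-worker-0"; VD="${VM}-data-1"
kubectl -n d8-virtualization logs deploy/virtualization-controller --since=10m \
  | grep -E "${NS}|${VM}|${VD}" || true
```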