ci(e2e): add DVP-over-DVP matrix testing with parallel execution #1577
Draft: asyachmenevflant wants to merge 163 commits into main from feat/ci-e2e-matrix
+3,449 −6
Conversation
Sorry @asyachmenevflant, your pull request is larger than the review limit of 150000 diff characters
Force-pushed from 496232b to bd142a1
z9r5 previously approved these changes on Oct 15, 2025
Force-pushed from ae5f35d to 9658db7
Signed-off-by: Yachmenev Anton <[email protected]>
Signed-off-by: Yachmenev Anton <[email protected]>
Force-pushed from 9658db7 to 7c7689c
Signed-off-by: Yachmenev Anton <[email protected]>
Signed-off-by: Yachmenev Anton <[email protected]>
Force-pushed from ebf4c5c to 09542e3
…Config/Taskfile
Signed-off-by: Yachmenev Anton <[email protected]>
- Change from dvp.deckhouse.io/node-group=worker to hostname-based selection
- Use jq to filter VMs with hostname containing 'worker'
- Fixes issue where worker VMs exist but have different label structure
- Fix VirtualDisk API structure (use size + storageClassName instead of resources/accessModes)
- Change SDS device from /dev/sdd to /dev/vdb
- Fix TARGET_STORAGE_CLASS for Ceph bootstrap (use linstor-thin-r2 instead of ceph SC)
- Fix Secret name generation to lowercase only
- Revert worker VM selector to use node-group label
- Note: Taskfile YAML syntax needs final heredoc fix
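For reference, a minimal sketch of the corrected VirtualDisk shape this commit describes, using size + storageClassName instead of resources/accessModes. The field nesting under spec.persistentVolumeClaim follows the DVP virtualization API, but the disk name, size, and namespace are illustrative and not taken from the PR diff.

```yaml
# Sketch only: names and size are placeholders; field nesting assumed from the DVP API.
apiVersion: virtualization.deckhouse.io/v1alpha2
kind: VirtualDisk
metadata:
  name: worker-data-disk-0
spec:
  persistentVolumeClaim:
    size: 20Gi
    storageClassName: linstor-thin-r2
```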
- Fix VirtualDisk to use size + storageClassName (not resources/accessModes)
- Revert worker VM selector to use node-group label
- Fix YAML syntax issues with heredoc
- Move 'Attach data disks to worker VMs' before 'Bootstrap nested cluster'
- Use base StorageClass 'linstor-thin-r2' for disk attachment (available in parent cluster)
- This prevents bootstrap hanging on non-existent Ceph StorageClass
- Disks are attached early, storage backend configured later
…ction
- Worker VMs don't have dvp.deckhouse.io/node-group=worker label
- Use grep worker on VM names instead of label selector
- This fixes 'No worker VMs found' error in attach-worker-disks task
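A minimal Taskfile-style sketch of the name-based selection described in this commit; the task name, namespace, and output handling are illustrative rather than the actual task from this PR.

```yaml
version: "3"
tasks:
  infra:list-worker-vms:
    desc: Select worker VMs by name instead of the node-group label
    env:
      NAMESPACE: e2e-test   # illustrative namespace
    cmds:
      - |
        # List VM names in the test namespace and keep only those containing "worker"
        kubectl -n "$NAMESPACE" get virtualmachines -o json \
          | jq -r '.items[].metadata.name | select(contains("worker"))'
```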
- Move 'Attach data disks to worker VMs' after 'Bootstrap nested cluster'
- Worker VMs are created during bootstrap, not before
- This fixes 'No worker VMs found' error when trying to attach disks too early
- Maintains base StorageClass 'linstor-thin-r2' for disk attachment
- Add multiple disk attachment (2 disks per VM) to avoid conflicts with system disks
- Add missing sds-local-volume module for SDS CRD creation
- Fix device mapping: SDS uses /dev/sdd, Ceph uses /dev/sde
- Add missing Ceph modules: csi-ceph and snapshot-controller
- Improve CephCluster configuration with health checks
- Update workflow to use DISK_COUNT parameter

Fixes issues with missing CRDs and namespace creation timeouts.
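As a sketch of how the missing modules named above could be enabled via Deckhouse ModuleConfig resources; any module settings beyond `enabled` are omitted and not taken from the PR diff.

```yaml
# Sketch: enables the modules named in the commit; module settings omitted.
apiVersion: deckhouse.io/v1alpha1
kind: ModuleConfig
metadata:
  name: csi-ceph
spec:
  enabled: true
---
apiVersion: deckhouse.io/v1alpha1
kind: ModuleConfig
metadata:
  name: snapshot-controller
spec:
  enabled: true
---
apiVersion: deckhouse.io/v1alpha1
kind: ModuleConfig
metadata:
  name: sds-local-volume
spec:
  enabled: true
```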
- Add SCSI bus rescan command to master VM debug step
- Add separate step to activate storage disks on all worker VMs
- This should make hotplugged disks visible in lsblk output
- Fixes issue where disks are attached but not visible in guest OS
- Replace hotplug with blockDeviceRefs to avoid ExpandDisks featuregate issues
- Add new task infra:attach-storage-disks-via-blockdevicerefs
- Remove SCSI rescan steps as they are no longer needed
- Clean up comments and deprecated messages
- Remove namespace from VirtualDisk metadata (not needed)
- Use JSON file for patch instead of inline JSON to avoid escaping issues
- Add proper VM state checking before stopping
- Add error handling for VM stop failures
- Improve logging for better debugging
- Replace label selector dvp.deckhouse.io/node-group=worker with name-based grep
- Worker VMs don't have the expected label, causing task to find 0 VMs
- This fixes blockDeviceRefs attachment for storage disks
- Add debug output to show block_device_refs content
- Add validation to skip VMs with empty blockDeviceRefs
- Add patch file content logging for troubleshooting
- This should fix the 'must specify --patch' error
…Refs
- Replace file-based kubectl patch with inline patch
- Remove temporary file creation and heredoc issues
- Use escaped JSON in kubectl patch -p parameter
- This should fix the 'must specify --patch' error completely
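A sketch of the inline-patch form referred to above: the blockDeviceRefs list is passed directly to `kubectl patch -p` instead of via a temporary file. The VM, namespace, and disk names are placeholders; note that a merge patch replaces the entire blockDeviceRefs list, so the full desired set must be passed.

```yaml
version: "3"
tasks:
  infra:attach-storage-disks-via-blockdevicerefs:
    cmds:
      - |
        # Merge-patch the VM spec with the desired blockDeviceRefs list
        # (a merge patch replaces the whole list, not appends to it).
        kubectl -n e2e-test patch virtualmachine worker-0 \
          --type merge \
          -p '{"spec":{"blockDeviceRefs":[{"kind":"VirtualDisk","name":"worker-data-disk-0"}]}}'
```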
Force-pushed from b3ed99b to a8cbe11
- Change RUN_ID from 'e2e-' to 'nightly-nested-e2e-' prefix
- This ensures namespace cleanup can find and delete test namespaces
- Namespaces will now be created as 'nightly-nested-e2e-sds-XXXX' and 'nightly-nested-e2e-cephrbd-XXXX'
- Matches the FILTER_PREFIX='nightly-nested-e2e-' used in cleanup job
- Add verification that VM successfully starts after adding blockDeviceRefs
- Wait up to 10 minutes for VM to reach Running state
- Fail fast if VM fails to start after blockDeviceRefs modification
- Prevents subsequent workflow steps from running on non-functional VMs
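A sketch of the start-verification loop described above: poll the VM phase for up to 10 minutes and fail fast otherwise. The task name, VM name, namespace, and 10-second poll interval are illustrative.

```yaml
version: "3"
tasks:
  infra:wait-vm-running:
    cmds:
      - |
        # 60 attempts x 10s = 10 minutes
        for i in $(seq 1 60); do
          phase=$(kubectl -n e2e-test get virtualmachine worker-0 -o jsonpath='{.status.phase}')
          if [ "$phase" = "Running" ]; then
            echo "VM is Running"
            exit 0
          fi
          echo "attempt $i/60: phase=${phase:-<unknown>}"
          sleep 10
        done
        echo "VM did not reach Running within 10 minutes" >&2
        exit 1
```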
Force-pushed from fdb445e to a964de5
- Replace blockDeviceRefs approach with VirtualMachineBlockDeviceAttachment
- Hotplug attaches disks to running VMs without stopping them
- This should resolve VM crashes during disk attachment stage
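A minimal sketch of the hotplug resource this commit switches to; the names and namespace are placeholders, and the spec layout (virtualMachineName plus a blockDeviceRef) is an assumption based on the DVP virtualization API rather than copied from the PR diff.

```yaml
apiVersion: virtualization.deckhouse.io/v1alpha2
kind: VirtualMachineBlockDeviceAttachment
metadata:
  name: worker-0-data-disk-0
  namespace: e2e-test
spec:
  virtualMachineName: worker-0
  blockDeviceRef:
    kind: VirtualDisk
    name: worker-data-disk-0
```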
- Add worker nodes configuration in cluster-config.yaml
- Fix storage class selection logic in workflow
- Add namespace to VirtualDisk and VMBDA manifests
- Remove excessive debug logs and checks
- Fix YAML syntax in storage manifests
- Use ceph-pool-r2-csi-rbd-immediate for Ceph profile

Fixes: bootstrap and disk attachment failures in E2E tests
- Remove dataDisk from nodeGroups instanceClass (not supported in DVPClusterConfiguration)
- Fix YAML structure with proper indentation for nodeGroups
- Remove unused dataDiskSize and data fields from values.yaml
- Data disks will be attached via hotplug mechanism instead

Fixes: bootstrap error 'dataDisk is a forbidden property'
- Remove nodeGroups section from DVPClusterConfiguration as it's not supported
- Worker nodes are created via separate DVPInstanceClass and NodeGroup resources
- This matches the architecture from backup branch where only masterNodeGroup is defined
- Fixes bootstrap error: nodeGroups.instanceClass.dataDisk is a forbidden property
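A sketch of a separate worker NodeGroup referencing a DVPInstanceClass, as described above; the replica counts and names are illustrative, and the DVPInstanceClass fields themselves are omitted because they are not shown in this PR.

```yaml
apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: worker
spec:
  nodeType: CloudEphemeral
  cloudInstances:
    classReference:
      kind: DVPInstanceClass
      name: worker
    minPerZone: 2
    maxPerZone: 2
```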
- Add descriptive comment to cluster-config.yaml template
- Clarify purpose of the configuration file for DVP-over-DVP E2E testing
- Add push event trigger for main and feat/ci-e2e-matrix branches
- Workflow was only running on PR events, schedule, and manual dispatch
- Now it will also trigger on direct pushes to these branches
- Remove main branch from push trigger to avoid unnecessary runs
- Keep pull_request trigger for both main and feat/ci-e2e-matrix
- Now workflow will only trigger on push to our feature branch
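The resulting trigger set, as a sketch; the nightly cron expression is an assumption, since the commits only mention schedule and manual dispatch.

```yaml
on:
  push:
    branches:
      - feat/ci-e2e-matrix          # feature branch only; main removed above
  pull_request:
    branches:
      - main
      - feat/ci-e2e-matrix
  schedule:
    - cron: "0 2 * * *"             # nightly run; exact time is illustrative
  workflow_dispatch: {}
```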
- Always use ceph-pool-r2-csi-rbd-immediate as default storage class
- Remove conditional logic for storage profile selection
- Always enable Ceph as base storage backend for all profiles
- Enable SDS modules only for 'sds' profile as additional layer
- Simplify bootstrap and disk attachment storage class configuration
- Reduce complexity and improve reliability of E2E tests
- Set defaultClusterStorageClass to ceph-pool-r2-csi-rbd-immediate in mc.yaml
- Improve Ceph deviceFilter to support various disk types (sd, vd, xvd)
- Exclude system disks (sda/sdb/sdc) from Ceph OSD usage
- Align with simplified storage logic using Ceph everywhere
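Sketches of the two settings mentioned here. The ModuleConfig settings version and the CephCluster field path (spec.storage.deviceFilter, as in Rook-style clusters) are assumptions, and the regex is only one possible way to match sd/vd/xvd data disks while skipping sda/sdb/sdc.

```yaml
apiVersion: deckhouse.io/v1alpha1
kind: ModuleConfig
metadata:
  name: global
spec:
  version: 1                        # settings version may differ
  settings:
    defaultClusterStorageClass: ceph-pool-r2-csi-rbd-immediate
---
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  storage:
    useAllNodes: true
    useAllDevices: false
    # Match sd/vd/xvd data disks while excluding system disks sda/sdb/sdc
    deviceFilter: "^(sd[d-z]|vd[b-z]|xvd[b-z])$"
```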
- Reduce retry attempts from 120 to 30 (2.5 minutes instead of 10 minutes)
- Add comprehensive debugging for disk attachment failures:
  * PVC status and describe output
  * StorageClass availability check
  * Ceph CSI pods status
  * Ceph cluster status
  * Enhanced periodic debug snapshots every 10 retries
- Improve error reporting with detailed resource descriptions
- Add debugging to both disk attachment tasks for consistency
Taskfile.yaml changes:
- Replace kubectl wait with custom polling loop for better control
- Add detailed PVC phase tracking with retry counter
- Extract and display StorageClass, PV name, and VolumeMode separately
- Add comprehensive error reporting with describe and events output
- Adjust PVC wait timeout from 300s to 240s (120 retries * 2s)

Workflow changes:
- Add DEBUG_HOTPLUG environment variable for enhanced debugging output
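A sketch of the polling loop that replaces `kubectl wait` (120 retries x 2s = 240s, matching the numbers above); the PVC name, namespace, and exact debug commands are illustrative.

```yaml
version: "3"
tasks:
  infra:wait-pvc-bound:
    cmds:
      - |
        pvc=data-disk-0; ns=e2e-test
        for i in $(seq 1 120); do
          phase=$(kubectl -n "$ns" get pvc "$pvc" -o jsonpath='{.status.phase}')
          if [ "$phase" = "Bound" ]; then
            sc=$(kubectl -n "$ns" get pvc "$pvc" -o jsonpath='{.spec.storageClassName}')
            pv=$(kubectl -n "$ns" get pvc "$pvc" -o jsonpath='{.spec.volumeName}')
            mode=$(kubectl -n "$ns" get pvc "$pvc" -o jsonpath='{.spec.volumeMode}')
            echo "Bound: storageClass=$sc pv=$pv volumeMode=$mode"
            exit 0
          fi
          echo "retry $i/120: phase=${phase:-Pending}"
          sleep 2
        done
        # On timeout, dump details for debugging before failing
        kubectl -n "$ns" describe pvc "$pvc"
        kubectl -n "$ns" get events --sort-by=.lastTimestamp | tail -n 20
        exit 1
```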
- Reduce log verbosity: print status every 30s instead of every 5s
- Reduce debug snapshots frequency: every 60s instead of every 50s
- Add fallback mechanism: check VM events for successful hotplug completion
- Filter controller/handler logs by namespace/VM/VD to reduce noise
- Improve error reporting by focusing on relevant logs only
- This helps diagnose race conditions where disk is attached but VMBDA status not updated
Description
Implement isolated DVP-over-DVP testing with parallel matrix execution for storage profiles (sds, cephrbd). Includes modular workflow architecture, CLI wrapper, and automatic cleanup for nightly runs.
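A sketch of the parallel matrix layout; the profile names come from this description, while the job name, runner label, and Taskfile entry point are illustrative.

```yaml
jobs:
  nested-e2e:
    strategy:
      fail-fast: false              # let sds and cephrbd runs finish independently
      matrix:
        storage_profile: [sds, cephrbd]
    runs-on: ubuntu-latest          # illustrative runner label
    steps:
      - uses: actions/checkout@v4
      - name: Run DVP-over-DVP E2E (${{ matrix.storage_profile }})
        run: task e2e:nested STORAGE_PROFILE=${{ matrix.storage_profile }}
```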
Why do we need it, and what problem does it solve?
What is the expected result?
Checklist
Changelog entries