
Conversation

@asyachmenevflant (Contributor) commented Oct 15, 2025

Description

Implement isolated DVP-over-DVP e2e testing with parallel matrix execution across storage profiles (sds, cephrbd). Includes a modular workflow architecture, a CLI wrapper, and automatic cleanup for nightly runs.

Why do we need it, and what problem does it solve?

What is the expected result?

Checklist

  • The code is covered by unit tests.
  • e2e tests passed.
  • Documentation updated according to the changes.
  • Changes were tested in the Kubernetes cluster manually.

Changelog entries

section: ci
type: chore
summary: add DVP-over-DVP matrix e2e testing

@sourcery-ai bot left a comment:

Sorry @asyachmenevflant, your pull request is larger than the review limit of 150000 diff characters

z9r5 previously approved these changes Oct 15, 2025
@asyachmenevflant added this to the v1.2.0 milestone Oct 15, 2025
Antony added 14 commits October 23, 2025 11:22
- Change from dvp.deckhouse.io/node-group=worker to hostname-based selection
- Use jq to filter VMs whose hostname contains 'worker'
- Fixes an issue where worker VMs exist but have a different label structure
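A minimal sketch of that jq-based selection, assuming the VMs live in the per-run test namespace and that matching on the VM name (which carries the hostname) is enough; the namespace value and resource name are placeholders:

```shell
# Select worker VMs by name instead of the dvp.deckhouse.io/node-group=worker label.
# NS is a hypothetical namespace; "virtualmachines" is assumed to resolve to the DVP VM resource.
NS="nightly-nested-e2e-sds-0001"
WORKER_VMS=$(kubectl -n "$NS" get virtualmachines -o json \
  | jq -r '.items[] | select(.metadata.name | test("worker")) | .metadata.name')
echo "Found worker VMs: ${WORKER_VMS}"
```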
- Fix VirtualDisk API structure (use size + storageClassName instead of resources/accessModes)
- Change SDS device from /dev/sdd to /dev/vdb
- Fix TARGET_STORAGE_CLASS for Ceph bootstrap (use linstor-thin-r2 instead of ceph SC)
- Fix Secret name generation to lowercase only
- Revert worker VM selector to use node-group label
- Note: Taskfile YAML syntax needs final heredoc fix
- Fix VirtualDisk to use size + storageClassName (not resources/accessModes); see the sketch below
- Revert worker VM selector to use node-group label
- Fix YAML syntax issues with heredoc
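A rough sketch of the corrected VirtualDisk shape; the apiVersion, metadata, and exact field nesting are assumptions about the DVP virtualization API, while size + storageClassName and the linstor-thin-r2 class come from the commits above:

```shell
# Hypothetical names; only the size/storageClassName shape is the point of the sketch.
kubectl apply -f - <<'EOF'
apiVersion: virtualization.deckhouse.io/v1alpha2
kind: VirtualDisk
metadata:
  name: worker-0-data-disk
  namespace: nightly-nested-e2e-sds-0001
spec:
  persistentVolumeClaim:
    size: 50Gi
    storageClassName: linstor-thin-r2
EOF
```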
- Move 'Attach data disks to worker VMs' before 'Bootstrap nested cluster'
- Use base StorageClass 'linstor-thin-r2' for disk attachment (available in parent cluster)
- This prevents bootstrap hanging on non-existent Ceph StorageClass
- Disks are attached early, storage backend configured later
…ction

- Worker VMs don't have the dvp.deckhouse.io/node-group=worker label
- Use grep for 'worker' in VM names instead of the label selector
- This fixes the 'No worker VMs found' error in the attach-worker-disks task
- Move 'Attach data disks to worker VMs' after 'Bootstrap nested cluster'
- Worker VMs are created during bootstrap, not before
- This fixes the 'No worker VMs found' error when trying to attach disks too early
- Maintains base StorageClass 'linstor-thin-r2' for disk attachment
- Add multiple disk attachment (2 disks per VM) to avoid conflicts with system disks
- Add missing sds-local-volume module for SDS CRD creation
- Fix device mapping: SDS uses /dev/sdd, Ceph uses /dev/sde
- Add missing Ceph modules: csi-ceph and snapshot-controller
- Improve CephCluster configuration with health checks
- Update workflow to use DISK_COUNT parameter

Fixes issues with missing CRDs and namespace creation timeouts.
- Add SCSI bus rescan command to master VM debug step
- Add separate step to activate storage disks on all worker VMs
- This should make hotplugged disks visible in lsblk output
- Fixes an issue where disks are attached but not visible in the guest OS
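The rescan injected into the guests is presumably something along these lines (a sketch; the exact command, and how the workflow gets a shell inside the worker VM, are assumptions):

```shell
# Rescan every SCSI host so hotplugged disks become visible to the guest kernel.
# Run inside the worker VM (e.g. over SSH); the /sys paths are standard Linux.
for host in /sys/class/scsi_host/host*; do
  echo "- - -" > "${host}/scan"
done
lsblk   # hotplugged disks should now appear
```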
- Replace hotplug with blockDeviceRefs to avoid ExpandDisks featuregate issues
- Add new task infra:attach-storage-disks-via-blockdevicerefs
- Remove SCSI rescan steps as they are no longer needed
- Clean up comments and deprecated messages
- Remove namespace from VirtualDisk metadata (not needed)
- Use JSON file for patch instead of inline JSON to avoid escaping issues
- Add proper VM state checking before stopping
- Add error handling for VM stop failures
- Improve logging for better debugging
- Replace label selector dvp.deckhouse.io/node-group=worker with name-based grep
- Worker VMs don't have the expected label, causing the task to find 0 VMs
- This fixes blockDeviceRefs attachment for storage disks
- Add debug output to show block_device_refs content
- Add validation to skip VMs with empty blockDeviceRefs
- Add patch file content logging for troubleshooting
- This should fix the 'must specify --patch' error
…Refs

- Replace file-based kubectl patch with inline patch
- Remove temporary file creation and heredoc issues
- Use escaped JSON in kubectl patch -p parameter
- This should fix the 'must specify --patch' error completely
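The inline form of the patch might look roughly like this; the VM and disk names are placeholders, and the blockDeviceRefs schema is an assumption based on how the commits describe it. Note that a JSON merge patch replaces the whole list, so the existing references have to be repeated:

```shell
# Inline JSON merge patch instead of a patch file / heredoc (names are placeholders).
NS="nightly-nested-e2e-sds-0001"
VM="nested-worker-0"
kubectl -n "$NS" patch virtualmachine "$VM" --type merge \
  -p "{\"spec\":{\"blockDeviceRefs\":[{\"kind\":\"VirtualDisk\",\"name\":\"${VM}-root\"},{\"kind\":\"VirtualDisk\",\"name\":\"${VM}-data-1\"}]}}"
```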
Antony added 2 commits October 24, 2025 09:54
- Change the RUN_ID prefix from 'e2e-' to 'nightly-nested-e2e-'
- This ensures the namespace cleanup can find and delete test namespaces
- Namespaces will now be created as 'nightly-nested-e2e-sds-XXXX' and 'nightly-nested-e2e-cephrbd-XXXX'
- Matches the FILTER_PREFIX='nightly-nested-e2e-' used by the cleanup job
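A sketch of the naming scheme this implies; the variable names and the random-suffix source are placeholders, only the prefix and the cleanup-by-prefix idea come from the commit:

```shell
# Namespaces end up as e.g. nightly-nested-e2e-sds-1234 / nightly-nested-e2e-cephrbd-1234.
STORAGE_PROFILE="sds"   # or "cephrbd", supplied by the workflow matrix
RUN_ID="nightly-nested-e2e-${STORAGE_PROFILE}-${GITHUB_RUN_ID:-$RANDOM}"
kubectl create namespace "$RUN_ID"

# The nightly cleanup job can then find everything by prefix:
kubectl get namespaces -o name \
  | grep '^namespace/nightly-nested-e2e-' \
  | xargs -r kubectl delete
```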
- Add verification that VM successfully starts after adding blockDeviceRefs
- Wait up to 10 minutes for VM to reach Running state
- Fail fast if VM fails to start after blockDeviceRefs modification
- Prevents subsequent workflow steps from running on non-functional VMs
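The verification could be a simple polling loop along these lines; the 10-minute budget is from the commit message, while the .status.phase field on the DVP VirtualMachine and the variable names are assumptions:

```shell
# Wait up to 10 minutes (60 attempts x 10s) for the VM to reach Running after the patch.
NS="nightly-nested-e2e-sds-0001"
VM="nested-worker-0"
PHASE=""
for attempt in $(seq 1 60); do
  PHASE=$(kubectl -n "$NS" get virtualmachine "$VM" -o jsonpath='{.status.phase}')
  [ "$PHASE" = "Running" ] && break
  echo "VM ${VM} is in phase '${PHASE}' (attempt ${attempt}/60), waiting..."
  sleep 10
done
if [ "$PHASE" != "Running" ]; then
  echo "VM ${VM} failed to start after blockDeviceRefs modification" >&2
  exit 1
fi
```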
Antony added 12 commits October 24, 2025 13:08
- Replace the blockDeviceRefs approach with VirtualMachineBlockDeviceAttachment (see the sketch below)
- Hotplug attaches disks to running VMs without stopping them
- This should resolve VM crashes during the disk attachment stage
- Add worker nodes configuration in cluster-config.yaml
- Fix storage class selection logic in workflow
- Add namespace to VirtualDisk and VMBDA manifests
- Remove excessive debug logs and checks
- Fix YAML syntax in storage manifests
- Use ceph-pool-r2-csi-rbd-immediate for Ceph profile

Fixes: bootstrap and disk attachment failures in E2E tests
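A rough sketch of such a hotplug attachment; the apiVersion and spec field names follow my reading of the DVP virtualization API and should be treated as assumptions, and all names are placeholders:

```shell
# Hotplug a data disk into a running worker VM without stopping it.
kubectl apply -f - <<'EOF'
apiVersion: virtualization.deckhouse.io/v1alpha2
kind: VirtualMachineBlockDeviceAttachment
metadata:
  name: nested-worker-0-data-1
  namespace: nightly-nested-e2e-sds-0001
spec:
  virtualMachineName: nested-worker-0
  blockDeviceRef:
    kind: VirtualDisk
    name: nested-worker-0-data-1
EOF
```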
- Remove dataDisk from nodeGroups instanceClass (not supported in DVPClusterConfiguration)
- Fix YAML structure with proper indentation for nodeGroups
- Remove unused dataDiskSize and data fields from values.yaml
- Data disks will be attached via hotplug mechanism instead

Fixes: bootstrap error 'dataDisk is a forbidden property'
- Remove nodeGroups section from DVPClusterConfiguration as it's not supported
- Worker nodes are created via separate DVPInstanceClass and NodeGroup resources
- This matches the architecture from backup branch where only masterNodeGroup is defined
- Fixes bootstrap error: nodeGroups.instanceClass.dataDisk is a forbidden property
- Add descriptive comment to cluster-config.yaml template
- Clarify purpose of the configuration file for DVP-over-DVP E2E testing
- Add push event trigger for main and feat/ci-e2e-matrix branches
- Workflow was only running on PR events, schedule, and manual dispatch
- Now it will also trigger on direct pushes to these branches
- Remove main branch from push trigger to avoid unnecessary runs
- Keep pull_request trigger for both main and feat/ci-e2e-matrix
- Now workflow will only trigger on push to our feature branch
- Always use ceph-pool-r2-csi-rbd-immediate as default storage class
- Remove conditional logic for storage profile selection
- Always enable Ceph as base storage backend for all profiles
- Enable SDS modules only for 'sds' profile as additional layer
- Simplify bootstrap and disk attachment storage class configuration
- Reduce complexity and improve reliability of E2E tests
- Set defaultClusterStorageClass to ceph-pool-r2-csi-rbd-immediate in mc.yaml
- Improve Ceph deviceFilter to support various disk types (sd, vd, xvd)
- Exclude system disks (sda/sdb/sdc) from Ceph OSD usage
- Align with simplified storage logic using Ceph everywhere
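The mc.yaml change presumably boils down to something like the following, assuming defaultClusterStorageClass lives in the global ModuleConfig settings; the storage class name is from the commit, everything else is a sketch:

```shell
# Make the Ceph RBD class the cluster-wide default; the Ceph deviceFilter regex referenced
# above could look like ^(sd[d-z]|vd[b-z]|xvd[b-z])$ to skip sda/sdb/sdc and vda.
kubectl apply -f - <<'EOF'
apiVersion: deckhouse.io/v1alpha1
kind: ModuleConfig
metadata:
  name: global
spec:
  version: 2          # settings schema version; depends on the Deckhouse release
  settings:
    defaultClusterStorageClass: ceph-pool-r2-csi-rbd-immediate
EOF
```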
- Reduce retry attempts from 120 to 30 (2.5 minutes instead of 10 minutes)
- Add comprehensive debugging for disk attachment failures:
  * PVC status and describe output
  * StorageClass availability check
  * Ceph CSI pods status
  * Ceph cluster status
  * Enhanced periodic debug snapshots every 10 retries
- Improve error reporting with detailed resource descriptions
- Add debugging to both disk attachment tasks for consistency
Taskfile.yaml changes:
- Replace kubectl wait with custom polling loop for better control
- Add detailed PVC phase tracking with retry counter
- Extract and display StorageClass, PV name, and VolumeMode separately
- Add comprehensive error reporting with describe and events output
- Adjust the PVC wait timeout from 300s to 240s (120 retries * 2s)
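The polling loop described above might look roughly like this; the 2-second interval and 120-retry budget are from the commit, while the PVC name, namespace, and error handling are placeholders:

```shell
# Poll the PVC phase instead of relying on `kubectl wait`, printing context on every retry.
NS="nightly-nested-e2e-sds-0001"
PVC="nested-worker-0-data-1"
PHASE=""
for retry in $(seq 1 120); do            # 120 retries * 2s = 240s budget
  PHASE=$(kubectl -n "$NS" get pvc "$PVC" -o jsonpath='{.status.phase}')
  SC=$(kubectl -n "$NS" get pvc "$PVC" -o jsonpath='{.spec.storageClassName}')
  PV=$(kubectl -n "$NS" get pvc "$PVC" -o jsonpath='{.spec.volumeName}')
  MODE=$(kubectl -n "$NS" get pvc "$PVC" -o jsonpath='{.spec.volumeMode}')
  echo "retry ${retry}/120: phase=${PHASE} sc=${SC} pv=${PV} volumeMode=${MODE}"
  [ "$PHASE" = "Bound" ] && break
  sleep 2
done
if [ "$PHASE" != "Bound" ]; then
  kubectl -n "$NS" describe pvc "$PVC"
  kubectl -n "$NS" get events --sort-by=.lastTimestamp | tail -n 20
  exit 1
fi
```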

Workflow changes:
- Add DEBUG_HOTPLUG environment variable for enhanced debugging output
- Reduce log verbosity: print status every 30s instead of every 5s
- Reduce debug snapshot frequency: every 60s instead of every 50s
- Add fallback mechanism: check VM events for successful hotplug completion
- Filter controller/handler logs by namespace/VM/VD to reduce noise
- Improve error reporting by focusing on relevant logs only
- This helps diagnose race conditions where the disk is attached but the VMBDA status is not updated
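The namespace/VM/VD-scoped log filtering might reduce to something like this; the d8-virtualization namespace, the virtualization-controller deployment name, and the variable names are assumptions:

```shell
# Keep only controller log lines that mention our test namespace, VM, or VirtualDisk.
NS="nightly-nested-e2e-sds-0001"; VM="nested-worker-0"; VD="${VM}-data-1"
kubectl -n d8-virtualization logs deploy/virtualization-controller --since=10m \
  | grep -E "${NS}|${VM}|${VD}" || true
```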