Enable test parallelism to identify and correct flaky tests #27

Open
akutz opened this issue Dec 4, 2022 · 1 comment
Labels
enhancement New feature or request

Comments

@akutz
Collaborator

akutz commented Dec 4, 2022

I've noticed a number of new, flaky tests in the integration tests for the content source and vmpubreq controllers, and they seem to stem from poorly implemented Ginkgo tests. For example, from the first run of the integration test job for PR #26:

------------------------------
• Failure [10.293 seconds]
Integration tests
/home/runner/work/vm-operator/vm-operator/test/builder/test_suite.go:251
  Reconcile ContentSource
  /home/runner/work/vm-operator/vm-operator/controllers/contentlibrary/contentsource/contentsource_controller_intg_test.go:165
    when ContentSource and ContentLibraryProvider exists
    /home/runner/work/vm-operator/vm-operator/controllers/contentlibrary/contentsource/contentsource_controller_intg_test.go:166
      when a new ContentSource with duplicate vm images is created
      /home/runner/work/vm-operator/vm-operator/controllers/contentlibrary/contentsource/contentsource_controller_intg_test.go:279
        should reconcile and generate a new VirtualMachineImage object [It]
        /home/runner/work/vm-operator/vm-operator/controllers/contentlibrary/contentsource/contentsource_controller_intg_test.go:319

        Timed out after 10.001s.
        Expected
            <int>: 3
        to equal
            <int>: 2

        /home/runner/work/vm-operator/vm-operator/controllers/contentlibrary/contentsource/contentsource_controller_intg_test.go:326

Re-running just the integration test (IT) job usually clears up flakes like the one above. I believe these occur because our tests are racy, and once people start creating PRs in this project, we will see these errors more frequently. When that happens:

  1. Click on the job that failed, e.g. integration-test.
  2. Click the icon to re-run just the failed job.
  3. Click on the Re-run jobs button to re-run the failed job and its dependents.

This is usually enough to fix things. However, I want to set a goal of running Ginkgo with the -p flag, which runs a suite's specs in parallel. That would quickly surface the issues caused by the way we've constructed our tests.

This issue tracks the need to enable parallelism for our test suites.
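
To illustrate the kind of problem that parallel runs tend to expose, here is a minimal, self-contained Go sketch. It does not use Ginkgo or any vm-operator code: `fakeCluster`, `runSpecs`, and the scope names are hypothetical stand-ins, and the assumption that our flakes come from specs seeing each other's objects is just that, an assumption. The sketch only shows how a spec that counts objects in a shared scope can see more than it created, while a spec with its own scope cannot.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// fakeCluster is a hypothetical stand-in for the API server shared by all
// specs. Objects are keyed by "scope/name".
type fakeCluster struct {
	mu      sync.Mutex
	objects map[string]struct{}
}

func newFakeCluster() *fakeCluster {
	return &fakeCluster{objects: map[string]struct{}{}}
}

// create records an object with the given name in the given scope.
func (c *fakeCluster) create(scope, name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.objects[scope+"/"+name] = struct{}{}
}

// count returns how many objects exist in the given scope.
func (c *fakeCluster) count(scope string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	n := 0
	for key := range c.objects {
		if strings.HasPrefix(key, scope+"/") {
			n++
		}
	}
	return n
}

// runSpecs runs n simulated specs concurrently. Each spec creates two objects
// and then counts what it can see. When isolate is false, every spec uses the
// same scope, so it also sees the objects created by the other specs.
func runSpecs(n int, isolate bool) []int {
	c := newFakeCluster()
	counts := make([]int, n)
	var created, done sync.WaitGroup
	created.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func(i int) {
			defer done.Done()
			scope := "shared"
			if isolate {
				scope = fmt.Sprintf("spec-%d", i)
			}
			c.create(scope, fmt.Sprintf("image-%d-a", i))
			c.create(scope, fmt.Sprintf("image-%d-b", i))
			// Barrier: wait until every spec has created its objects so the
			// overlap is deterministic in this demo.
			created.Done()
			created.Wait()
			counts[i] = c.count(scope)
		}(i)
	}
	done.Wait()
	return counts
}

func main() {
	fmt.Println("shared scope:", runSpecs(2, false))  // prints "shared scope: [4 4]"
	fmt.Println("isolated scope:", runSpecs(2, true)) // prints "isolated scope: [2 2]"
}
```

In a real suite the equivalent of the isolated scope would be something like a per-spec namespace, label, or name suffix; which of those fits our cluster-scoped resources is still to be decided.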

@akutz akutz added the enhancement New feature or request label Dec 4, 2022
@akutz
Collaborator Author

akutz commented Dec 4, 2022

Hi @yi0909 and @dilyar85,

Maybe we should file one or two more issues to at least try addressing the two flakes of which we are already aware? It's been four runs, and this job keeps hitting these flakes:

  • Attempt #1 failed with:
    E1204 20:57:56.551859   10512 contentsource_controller.go:394] controllers/ContentSource "msg"="error in reconciling the provider ref" "error"="ContentLibraryProvider.vmoperator.vmware.com \"dummy-cl\" not found" "name"="dummy-cs" 
    E1204 20:57:56.552069   10512 controller.go:317] controller/contentsource "msg"="Reconciler error" "error"="ContentLibraryProvider.vmoperator.vmware.com \"dummy-cl\" not found" "name"="dummy-cs" "namespace"="" "reconciler group"="vmoperator.vmware.com" "reconciler kind"="ContentSource" 
    I1204 20:57:56.552223   10512 contentsource_controller.go:444] controllers/ContentSource "msg"="Received reconcile request"  "name"="dummy-cs-new"
    I1204 20:57:56.552328   10512 contentsource_controller.go:418] controllers/ContentSource "msg"="Reconciling ContentSource deletion" "name"="dummy-cs-new" 
    I1204 20:57:56.560937   10512 contentsource_controller.go:415] controllers/ContentSource "msg"="Finished Reconciling ContentSource Deletion" "name"="dummy-cs-new" 
    I1204 20:57:56.561124   10512 contentsource_controller.go:444] controllers/ContentSource "msg"="Received reconcile request"  "name"="dummy-cs"
    I1204 20:57:56.561238   10512 contentsource_controller.go:418] controllers/ContentSource "msg"="Reconciling ContentSource deletion" "name"="dummy-cs" 
    I1204 20:57:56.567905   10512 contentsource_controller.go:415] controllers/ContentSource "msg"="Finished Reconciling ContentSource Deletion" "name"="dummy-cs" 
    I1204 20:57:56.569568   10512 contentsource_controller.go:444] controllers/ContentSource "msg"="Received reconcile request"  "name"="dummy-cs-new"
    I1204 20:57:56.569769   10512 contentsource_controller.go:444] controllers/ContentSource "msg"="Received reconcile request"  "name"="dummy-cs"
    STEP: Creating a temporary namespace
    STEP: Destroying temporary namespace
    
    ------------------------------
    • Failure [10.302 seconds]
    Integration tests
    /home/runner/work/vm-operator/vm-operator/test/builder/test_suite.go:251
      Reconcile ContentSource
      /home/runner/work/vm-operator/vm-operator/controllers/contentlibrary/contentsource/contentsource_controller_intg_test.go:165
        when ContentSource and ContentLibraryProvider exists
        /home/runner/work/vm-operator/vm-operator/controllers/contentlibrary/contentsource/contentsource_controller_intg_test.go:166
          when a new ContentSource with duplicate vm images is created
          /home/runner/work/vm-operator/vm-operator/controllers/contentlibrary/contentsource/contentsource_controller_intg_test.go:279
            should reconcile and generate a new VirtualMachineImage object [It]
            /home/runner/work/vm-operator/vm-operator/controllers/contentlibrary/contentsource/contentsource_controller_intg_test.go:319
    
            Timed out after 10.001s.
            Expected
                <int>: 3
            to equal
                <int>: 2
    
            /home/runner/work/vm-operator/vm-operator/controllers/contentlibrary/contentsource/contentsource_controller_intg_test.go:326
    ------------------------------
    I1204 20:57:56.676854   10512 logr.go:249]  "msg"="Stopping and waiting for non leader election runnables"  
    I1204 20:57:56.676931   10512 logr.go:249]  "msg"="Stopping and waiting for leader election runnables"  
    I1204 20:57:56.677043   10512 controller.go:240] controller/contentsource "msg"="Shutdown signal received, waiting for all workers to finish" "reconciler group"="vmoperator.vmware.com" "reconciler kind"="ContentSource" 
    I1204 20:57:56.677134   10512 controller.go:242] controller/contentsource "msg"="All workers finished" "reconciler group"="vmoperator.vmware.com" "reconciler kind"="ContentSource" 
    I1204 20:57:56.677180   10512 logr.go:249]  "msg"="Stopping and waiting for caches"  
    I1204 20:57:56.677823   10512 logr.go:249]  "msg"="Stopping and waiting for webhooks"  
    I1204 20:57:56.677917   10512 logr.go:249]  "msg"="Wait completed, proceeding to shutdown the manager"  
    
    
    Summarizing 1 Failure:
    
    [Fail] Integration tests Reconcile ContentSource when ContentSource and ContentLibraryProvider exists when a new ContentSource with duplicate vm images is created [It] should reconcile and generate a new VirtualMachineImage object 
    /home/runner/work/vm-operator/vm-operator/controllers/contentlibrary/contentsource/contentsource_controller_intg_test.go:326
    
    Ran 3 of 3 Specs in 21.390 seconds
    FAIL! -- 2 Passed | 1 Failed | 0 Pending | 0 Skipped
    --- FAIL: TestContentSource (21.41s)
    FAIL
    coverage: 3.3% of statements in ./controllers/..., ./pkg/..., ./webhooks/...
    FAIL	github.com/vmware-tanzu/vm-operator/controllers/contentlibrary/contentsource	21.653s
    
  • Attempt #2 failed with:
    ------------------------------
    • Failure [10.330 seconds]
    Integration tests
    /home/runner/work/vm-operator/vm-operator/test/builder/test_suite.go:251
      Reconcile ContentSource
      /home/runner/work/vm-operator/vm-operator/controllers/contentlibrary/contentsource/contentsource_controller_intg_test.go:165
        when ContentSource and ContentLibraryProvider exists
        /home/runner/work/vm-operator/vm-operator/controllers/contentlibrary/contentsource/contentsource_controller_intg_test.go:166
          when a new ContentSource with duplicate vm images is created
          /home/runner/work/vm-operator/vm-operator/controllers/contentlibrary/contentsource/contentsource_controller_intg_test.go:279
            should reconcile and generate a new VirtualMachineImage object [It]
            /home/runner/work/vm-operator/vm-operator/controllers/contentlibrary/contentsource/contentsource_controller_intg_test.go:319
    
            Timed out after 10.000s.
            Expected
                <int>: 3
            to equal
                <int>: 2
    
            /home/runner/work/vm-operator/vm-operator/controllers/contentlibrary/contentsource/contentsource_controller_intg_test.go:326
    ------------------------------
    I1204 21:28:11.588278   10617 logr.go:249]  "msg"="Stopping and waiting for non leader election runnables"  
    I1204 21:28:11.588433   10617 logr.go:249]  "msg"="Stopping and waiting for leader election runnables"  
    I1204 21:28:11.588729   10617 controller.go:240] controller/contentsource "msg"="Shutdown signal received, waiting for all workers to finish" "reconciler group"="vmoperator.vmware.com" "reconciler kind"="ContentSource" 
    I1204 21:28:11.588917   10617 controller.go:242] controller/contentsource "msg"="All workers finished" "reconciler group"="vmoperator.vmware.com" "reconciler kind"="ContentSource" 
    I1204 21:28:11.589065   10617 logr.go:249]  "msg"="Stopping and waiting for caches"  
    I1204 21:28:11.590610   10617 logr.go:249]  "msg"="Stopping and waiting for webhooks"  
    I1204 21:28:11.591016   10617 logr.go:249]  "msg"="Wait completed, proceeding to shutdown the manager"  
    
    
    Summarizing 1 Failure:
    
    [Fail] Integration tests Reconcile ContentSource when ContentSource and ContentLibraryProvider exists when a new ContentSource with duplicate vm images is created [It] should reconcile and generate a new VirtualMachineImage object 
    /home/runner/work/vm-operator/vm-operator/controllers/contentlibrary/contentsource/contentsource_controller_intg_test.go:326
    
    Ran 3 of 3 Specs in 21.310 seconds
    FAIL! -- 2 Passed | 1 Failed | 0 Pending | 0 Skipped
    --- FAIL: TestContentSource (21.34s)
    FAIL
    coverage: 3.3% of statements in ./controllers/..., ./pkg/..., ./webhooks/...
    FAIL	github.com/vmware-tanzu/vm-operator/controllers/contentlibrary/contentsource	21.568s
    
  • Attempt #3 failed with:
     •I1204 21:45:02.627894   14199 response.go:42] vmoperator-controller-manager/default-validate-vmoperator-vmware-com-v1alpha1-virtualmachinepublishrequest/8466f652-e290-4531-ba96-7585709a161b/dummy-vmpub "msg"="validation denied"  "code"=422 "reason"="spec.target: Invalid value: v1alpha1.VirtualMachinePublishRequestTarget{Item:v1alpha1.VirtualMachinePublishRequestTargetItem{Name:\"dummy-item\", Description:\"\"}, Location:v1alpha1.VirtualMachinePublishRequestTargetLocation{Name:\"alternate-cl\", APIVersion:\"imageregistry.vmware.com/v1alpha1\", Kind:\"ContentLibrary\"}}: field is immutable"
    •I1204 21:45:02.646661   14199 response.go:42] vmoperator-controller-manager/default-validate-vmoperator-vmware-com-v1alpha1-virtualmachinepublishrequest/86b9dd67-a2e8-461b-8459-4fb6f464ff5c/dummy-vmpub "msg"="validation denied"  "code"=422 "reason"="spec.source.name: Not found: \"dummy-vm\""
    STEP: Creating a temporary namespace
    
    ------------------------------
    • Failure in Spec Setup (BeforeEach) [0.020 seconds]
    Integration tests
    /home/runner/work/vm-operator/vm-operator/test/builder/test_suite.go:251
      Invoking Delete
      /home/runner/work/vm-operator/vm-operator/webhooks/virtualmachinepublishrequest/validation/virtualmachinepublishrequest_validator_intg_test.go:21
        when delete is performed [BeforeEach]
        /home/runner/work/vm-operator/vm-operator/webhooks/virtualmachinepublishrequest/validation/virtualmachinepublishrequest_validator_intg_test.go:174
          should allow the request
          /home/runner/work/vm-operator/vm-operator/webhooks/virtualmachinepublishrequest/validation/virtualmachinepublishrequest_validator_intg_test.go:175
    
          Unexpected error:
              <*errors.StatusError | 0xc000b53680>: {
                  ErrStatus: {
                      TypeMeta: {Kind: "", APIVersion: ""},
                      ListMeta: {
                          SelfLink: "",
                          ResourceVersion: "",
                          Continue: "",
                          RemainingItemCount: nil,
                      },
                      Status: "Failure",
                      Message: "admission webhook \"default.validating.virtualmachinepublishrequest.vmoperator.vmware.com\" denied the request: spec.source.name: Not found: \"dummy-vm\"",
                      Reason: "spec.source.name: Not found: \"dummy-vm\"",
                      Details: nil,
                      Code: 422,
                  },
              }
              admission webhook "default.validating.virtualmachinepublishrequest.vmoperator.vmware.com" denied the request: spec.source.name: Not found: "dummy-vm"
          occurred
    
          /home/runner/work/vm-operator/vm-operator/webhooks/virtualmachinepublishrequest/validation/virtualmachinepublishrequest_validator_intg_test.go:162
    ------------------------------
    I1204 21:45:02.651259   14199 logr.go:249]  "msg"="Stopping and waiting for non leader election runnables"  
    I1204 21:45:02.651369   14199 logr.go:249]  "msg"="Stopping and waiting for leader election runnables"  
    I1204 21:45:02.651467   14199 logr.go:249]  "msg"="Stopping and waiting for caches"  
    I1204 21:45:02.651751   14199 logr.go:249]  "msg"="Stopping and waiting for webhooks"  
    I1204 21:45:02.652300   14199 logr.go:249] controller-runtime/webhook "msg"="shutting down webhook server"  
    I1204 21:45:02.654116   14199 logr.go:249]  "msg"="Wait completed, proceeding to shutdown the manager"  
    
    
    Summarizing 1 Failure:
    
    [Fail] Integration tests Invoking Delete [BeforeEach] when delete is performed should allow the request 
    /home/runner/work/vm-operator/vm-operator/webhooks/virtualmachinepublishrequest/validation/virtualmachinepublishrequest_validator_intg_test.go:162
    
    Ran 5 of 5 Specs in 13.425 seconds
    FAIL! -- 4 Passed | 1 Failed | 0 Pending | 0 Skipped
    

I am currently on the fourth attempt; fingers crossed!
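
For the second flake (attempt #3), one possible reading is that the publish request is created before `dummy-vm` is visible to the validating webhook. The logs above do not confirm that, so treat it as an assumption. Under that assumption, a test would wait for its prerequisite before issuing the dependent request. The sketch below is plain Go with no Ginkgo, Gomega, or vm-operator code; `store`, `createPublishRequest`, and `waitFor` are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// store is a hypothetical stand-in for whatever the webhook consults when it
// looks up the source VM.
type store struct {
	mu  sync.Mutex
	vms map[string]bool
}

func newStore() *store {
	return &store{vms: map[string]bool{}}
}

func (s *store) addVM(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.vms[name] = true
}

func (s *store) hasVM(name string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.vms[name]
}

// createPublishRequest mimics the webhook's check: the request is denied
// unless the source VM is already visible.
func (s *store) createPublishRequest(sourceVM string) error {
	if !s.hasVM(sourceVM) {
		return fmt.Errorf("spec.source.name: Not found: %q", sourceVM)
	}
	return nil
}

// waitFor polls cond every interval until it returns true or the timeout
// elapses. cond is always evaluated at least once.
func waitFor(timeout, interval time.Duration, cond func() bool) error {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return nil
		}
		if !time.Now().Before(deadline) {
			return errors.New("timed out waiting for condition")
		}
		time.Sleep(interval)
	}
}

func main() {
	s := newStore()

	// The VM becomes visible slightly later, as it might when a cache lags
	// behind the API server.
	go func() {
		time.Sleep(50 * time.Millisecond)
		s.addVM("dummy-vm")
	}()

	// Racy: issue the request immediately, before the VM is likely visible.
	fmt.Println("without waiting:", s.createPublishRequest("dummy-vm"))

	// Safer: wait for the prerequisite, then issue the request.
	vmVisible := func() bool { return s.hasVM("dummy-vm") }
	if err := waitFor(2*time.Second, 10*time.Millisecond, vmVisible); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("after waiting:", s.createPublishRequest("dummy-vm"))
}
```

The trade-off is the usual one for polling: a generous timeout hides slow environments but makes genuine failures slower to report.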

sreyasn pushed a commit to sreyasn/vm-operator that referenced this issue Dec 5, 2022

This change introduces a flag to enable Change Block Tracking as an
optional boolean as part of the virtual machine advanced options.

Testing done: `make all`