[Discussion] Future of collection CI #1746
@jillr Thanks for summarizing the current state of the CI system and explaining the problems you are facing with the integration tests for VMware vSphere (both for this collection and for vmware.vmware_rest). This sounds like a tricky situation to me. You don't have subject matter experts with VMware knowledge on the team. I, on the other hand, know VMware vSphere quite well, but not OpenStack/libvirt or Zuul.
It's OK to keep using the 7.0.3 images for now, but sooner or later we will have to start testing against a more current version. It's a pity that your ability to help troubleshoot is limited, but we just have to live with it. If something does break and it looks like a vSphere problem, hopefully there will be enough information in the CI logs to enable me to help.
OK, this sounds like we shouldn't focus on your current Zuul environment. Maybe we'll find something else / better. I don't know GitHub Actions very well. Could we use it the way we use Zuul now? That is, to spin up a vCenter, two ESXi hosts and, I think, a Linux VM acting as an NFS server, and then run the integration tests against this setup? Is one of the other collections you are in the process of migrating to GitHub Actions already doing something similar, or does one have similar requirements?
It's not only that we had problems with missing functionality from time to time. It's also that vcsim is stateless, so we couldn't do any idempotency tests. I don't think this has changed so far.
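For context, this is roughly what running a module against the simulator looks like. The sketch below assumes a local vcsim instance (for example started with `vcsim -l 127.0.0.1:8989` from the govmomi project); hostname, port and credentials are illustrative, and vcsim accepts arbitrary credentials. As noted above, state-changing modules can't really be checked for idempotency this way.

```yaml
# Illustrative only: point a community.vmware module at a local vcsim simulator
# instead of a real vCenter.
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Query the simulator's "about" information
      community.vmware.vmware_about_info:
        hostname: 127.0.0.1
        port: 8989
        username: user          # vcsim accepts arbitrary credentials
        password: pass
        validate_certs: false
      register: about

    - name: Show what the simulator reports
      ansible.builtin.debug:
        var: about
```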
Well, we could run the sanity tests using the GitHub Actions workflow file from the Collection Template repository. The sanity tests are not the problem here, but since you want to reduce the amount that you depend on the current Zuul platform, it might be a good idea anyway. This doesn't help with the integration tests, but it's something that might help you get away from Zuul. What do you think?

To keep the collection healthy, I think the integration tests are extremely valuable, so I more or less need a way to replace what we have now: CI jobs that spin up a test environment (vCenter, ESXi hosts, a Linux VM as NFS server) and run the integration tests against it. Ideally, we could use the existing images. Alternatively, we should find a way to update them more or less automatically, so I don't have to bother you to do it when the eval license times out.

BTW, here is quite an interesting collection of blog posts about running nested ESXi. Maybe it could help with running it on other hypervisors, too. If it doesn't help anyone else, it will at least be a reminder to myself to have a closer look at it ;-) Maybe we could even make use of the Nested ESXi Virtual Appliances there. They are OVAs, but it might be possible to import them into other hypervisors as well.

Thanks again for starting this discussion @jillr!
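To make the sanity-test idea concrete, here is a minimal sketch of what such a GitHub Actions workflow could look like. This is not the actual file from the Collection Template; the file name, matrix entries and Python version are assumptions for illustration.

```yaml
# .github/workflows/sanity.yml (illustrative name and contents)
name: Sanity tests
on:
  push:
  pull_request:

jobs:
  sanity:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        ansible:
          - stable-2.14
          - stable-2.15
    steps:
      # ansible-test expects the collection to be checked out under
      # ansible_collections/<namespace>/<name>
      - uses: actions/checkout@v3
        with:
          path: ansible_collections/community/vmware

      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install ansible-core from the matching stable branch
        run: pip install https://github.com/ansible/ansible/archive/${{ matrix.ansible }}.tar.gz --disable-pip-version-check

      - name: Run sanity tests in the default test container
        run: ansible-test sanity --docker -v --color
        working-directory: ansible_collections/community/vmware
```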
It should be possible, at least with self-hosted runners, as long as someone can provide the VM infrastructure to run the VMs for the integration tests on. It might be possible with some of the GitHub-managed runners, but I don't know if they would have enough resources.
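To illustrate the self-hosted option: a job can be routed to specific runners via labels. The sketch below is an assumption-heavy illustration; the vmware-lab label is made up, and it presumes that ansible-core and the vCenter credentials/configuration needed by ansible-test are already provisioned on the runner.

```yaml
jobs:
  vmware-integration:
    # 'self-hosted' plus a custom label (here 'vmware-lab', purely illustrative)
    # routes the job to runners with network access to the nested vSphere lab.
    runs-on: [self-hosted, vmware-lab]
    steps:
      - uses: actions/checkout@v3
        with:
          path: ansible_collections/community/vmware

      - name: Run the integration tests against the lab environment
        run: ansible-test integration --color -v
        working-directory: ansible_collections/community/vmware
```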
None of our other collections require this kind of infrastructure that we have to provide ourselves. The OpenShift and Kubernetes collections use the Red Hat OpenShift test infra, and all of our other cloud collections only talk to APIs (like for AWS), so we can test on GitHub's standard managed Ubuntu instances.
That's unfortunate. :(
Definitely, sanity and unit tests can all be run with the standard, managed GHA resources. The Content teams are working on a collection of reusable actions and workflows (in addition to what is in the collection template) for the community, in anticipation of migrating from Zuul, to help with testing, releasing, and generally managing collections. They are here for now, though this repo might migrate to another GitHub org later: https://github.com/ansible-network/github_actions
community.vmware: Remove sanity and unit tests. Depends-On: ansible-collections/community.vmware#1747. Remove sanity and unit tests from community.vmware. ansible-collections/community.vmware#1746. Reviewed-by: Jill R
For the record: #1747, #1748 and ansible/ansible-zuul-jobs#1803
I don't have a good idea how or where to run the integration tests yet. But while there's no chance to test vSphere 8, we should at least start testing with ansible-core 2.15 now that it's out: ansible/ansible-zuul-jobs#1805. What do you think, @jillr?
@mariolenz and @jillr, if moving away from the above-mentioned OpenStack/Zuul environment is an option, I'd suggest you take a look at the SDDC.Lab (dev-v6) GitHub project that Rutger Blom and I have created. It deploys a fully nested and configured SDDC.Lab environment (vCenter Server, ESXi hosts, NSX, vRLI, etc.) in a little over an hour. It's entirely Ansible based, and supports deploying vSphere 6.7 through 8, as well as NSX in both stand-alone and Federation mode. Additionally, we only leverage shipping OVAs and ISOs rather than custom OVAs, allowing the quick deployment of new versions as they are released by VMware. Finally, SDDC.Lab (dev-v6) also supports multiple Pods, which could aid in your testing as each Pod could be deployed in a different configuration (e.g. Pod 10 = vSphere 6.7, Pod 20 = vSphere 7.0u3, Pod 30 = vSphere 8, Pod 40 = vSphere 7.0u3 with NSX stand-alone, Pods 50/60/70 = vSphere 7.0u3 with 3-site Federation). NSX overlay networking also supports dual-stack (IPv4/IPv6). Even though the above SDDC.Lab links reference a dev-v6 branch, it's very stable. We should probably roll it into the master branch, but just haven't. Besides the README.md, we also maintain a CHANGELOG.md file to track all of our changes.
@luischanu Thanks, this looks really interesting! I'll try to find the time to have a look at it!
We are still investigating options internally at Red Hat, but I don't have any new updates yet |
Hi, not sure if I'm helping by piling on here, but I've noticed (as part of PR #1756) that I am now seeing the following error for the build:
https://ansible.softwarefactory-project.io/zuul/build/392c226f672447da99bf819896c5f841
@nikatbu I've already pinged people about the |
The VCSA image build takes a very long time; it completed after I logged off yesterday. It will take a while to upload the ~5GB image to the image store, but I will get the uploads done today.
@jillr No problem. Thanks for working on fixing this!
New images are uploaded. I've reached out to the Software Factory support folks about the node failures affecting @nikatbu's PR.
Thanks very much @jillr!
We are also having issues manually spawning instances in both regions of the underlying OpenStack environment. I've opened a support case with the hosting provider (Vexxhost) and will update when I hear back. |
Thanks @jillr. Nothing is ever easy apparently! Have a great weekend. |
@jillr Now I'm seeing a lot of |
@mariolenz Following up on my IRC reply with more detail: I've been going back and forth with the hosting provider. They fixed some things (it's not clear to me what was done), but routes on the public subnet are incomplete, which is causing problems for network traffic out of the instances. I'm not aware of anyone on the Ansible side modifying these networks, so I believe it's on the provider side, and I am waiting for a reply to my latest question to them (sent yesterday).
The hosting provider has resolved the underlying issues that were blocking all CI jobs for all collections, but we're still getting timeouts on vmware nodes (these time out after 5 minutes waiting for the vSphere API to become available). I will continue investigating this; I am so sorry for the ongoing problems.
Thanks for the update Jill. Still seeing the node failures (on PR #1793) this morning. I'll do a few rechecks, as in the past some of these node failures have been temporary and occasionally a recheck works.
It seems like the vmware images (and only the vmware images) are still booting more slowly, causing Zuul timeouts. What I don't understand is why. The timeout is set for all Ansible jobs; I don't know of a way to increase it just for vmware jobs. I'm debating rebuilding the images again, since it seems like this all started when I refreshed the images, though I don't know why something in the images would be causing slow instance creation.
Hard to know, I think, without looking at the boot logs to see where it's spending its time and getting delayed. Potentially there could be timeouts trying to reach some service (DNS?) that doesn't exist or is blocked by a firewall; it goes through a timeout and then falls back to something that does exist. Just a theory, though. It's possible a refresh of the image picks up a current and working external service reference, whereas an older image (or an image built at a certain point in time) may have older external service dependencies that we take the timeout/delay hit on. If it's straightforward enough to refresh the images, that might be a good quick/easy thing to try; but if it's still slow, it may mean having to look at the VMware logs to see why it's taking so long to start services. Just my 2 cents. I'm new here, so I could be way off base. :) But thanks again @jillr for your continued support pushing this forward!
@jillr Has this always been 5 minutes? If so, I wonder how it worked in the past. vCenter is quite a complex beast, and I'd say 5 minutes isn't that much time for it to come online. OK, maybe the API alone might be faster... When we reboot a vCenter (due to updates or any other reason), I'm interested in the time until it's completely back online, and that's usually more than 5 minutes; I haven't checked the time until the API alone is available. On the other hand, our vCenters are managing quite a few ESXi hosts and VMs, and I guess this makes a difference. In this case, the vCenter should be "empty". BTW: What does this test look like? I wasn't able to find it. I agree with @nikatbu: if it isn't too much work for you, it would be worth a try to rebuild the images. If it is too much work, or if this doesn't fix the CI, it would be really helpful to get more information, like the logs of the machine that causes the timeout. Note to myself: Have a look at https://docs.ansible.com/ansible/latest/dev_guide/testing_integration.html and see if this helps in finding an alternative way to run the integration tests.
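Just to make this more tangible: I couldn't find the actual check, but an API readiness probe with a 5-minute budget could look roughly like the task below. This is purely illustrative, not what Zuul/nodepool actually does; the vcenter_* variables are assumptions, and it polls the classic vSphere Automation session endpoint.

```yaml
# Illustrative only: NOT the check the CI performs, just a sketch of a
# 5-minute (30 x 10s) wait for the vCenter REST API to accept a session.
- name: Wait for the vCenter API to become available
  ansible.builtin.uri:
    url: "https://{{ vcenter_hostname }}/rest/com/vmware/cis/session"  # hypothetical variable
    method: POST
    url_username: "{{ vcenter_username }}"   # hypothetical variable
    url_password: "{{ vcenter_password }}"   # hypothetical variable
    force_basic_auth: true
    validate_certs: false
    status_code: 200
  register: session
  until: session.status == 200
  retries: 30
  delay: 10
```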
The timeout has been 300s since Jan 2022: https://softwarefactory-project.io/r/c/config/+/23741/1/nodepool/providers/mkProviderForAnsible.dhall It's not really clear to me how Zuul decides whether the boot was successful when declaring a timeout. The log, I guess, is here: https://softwarefactory-project.io/nodepool-launcher/zs.softwarefactory-project.io/launcher.log Here is a snippet in case the log rolls and there are no errors present when someone looks at it:
Later, Zuul should run setup tasks to prepare the node before running the integration tests, based on this config: https://github.com/ansible/ansible-zuul-jobs/blob/master/zuul.d/ansible-cloud-jobs.yaml#L3-L45 My understanding of node failures, though, is that we don't even get that far. I should now have a permission that will let me hold the VMs after a build fails for troubleshooting, but I'm going to have to experiment with it a bit to see how to get it working. I'm rebuilding the images now. I can leave the build process running while I'm afk, so it's not too bad. :)
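For anyone following along who doesn't know Zuul: a job definition in that repository has roughly the shape below. This is an illustrative sketch only, not the actual contents of ansible-zuul-jobs; the job, nodeset and playbook names are made up.

```yaml
# Illustrative sketch of a Zuul job definition; names and paths are hypothetical.
- job:
    name: ansible-test-cloud-integration-vcenter      # hypothetical job name
    parent: ansible-core-ci-base                      # hypothetical parent job
    # The nodeset references nodepool labels; for these jobs it would include
    # the vCenter and ESXi instances that nodepool boots on the OpenStack provider.
    nodeset: vmware-vcenter-7.0.3                     # hypothetical nodeset/label
    # pre-run playbooks prepare the node(s); node failures happen before this stage.
    pre-run: playbooks/vmware/pre.yaml                # hypothetical path
    run: playbooks/vmware/run-integration-tests.yaml  # hypothetical path
    timeout: 10800
    vars:
      ansible_test_command: integration               # hypothetical variable
```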
It looks like the VCSA image that's currently uploaded has disk corruption. And the build process is failing on my system when I try to rebuild the images. I'll see if someone else on the team can try refreshing the images while I troubleshoot virt-lightning on my end. |
I've created #1807 to test the CI. It doesn't change any code, but should trigger the integration tests. Maybe it's easier to troubleshoot using this in comparison to a "real" PR. |
@alinabuzachis @jillr I'm still seeing NODE_FAILURES.
Regarding the ongoing NODE_FAILURES, I'm inclined to think it's a problem on the Vexxhost side, but I don't have any hard evidence of that. Several of us have now looked at what logs and data we have available and are frankly stumped. We have extremely limited visibility into what happens on the OpenStack hypervisor, but we also don't see these failures when we manually boot VMs on Vexxhost. And there are no reported cluster-wide or tenant-wide issues on the Zuul/Software Factory cluster (and there are other tenants using other Vexxhost accounts, so it seems at least unlikely that it's a systemic failure between the Vexxhost and Zuul APIs or similar). It's quite challenging because we're not sure how to proceed toward a solution without significant dedicated effort. :(
Thanks for the update Jill. Do we have a support contact at Vexxhost who could help diagnose things a bit more, open a ticket, etc.? Bigger picture, where does this leave us with regard to updates to community.vmware? Are we halted because there is no way to do validation testing? I've started to look at the VMware REST modules, but unfortunately our vCenter is older, which means an older VMware REST module version, and at that level it's not very functional (especially when compared to community.vmware).
Sorry to interject, but I have a question: if Vexxhost is an OpenStack provider and we have problems with CI on it, maybe it makes sense to move the CI to a hosted VMware cloud provider? And move the CI process to GitHub Actions?
We've just gotten an escalation contact on the Software Factory team to see if they can help us diagnose the issue better (we have no access to any diagnostics in the SF/Zuul cluster or the underlying Vexxhost hypervisor that Zuul boots instances on, which makes this especially difficult to troubleshoot).

@ihumster No worries, it's a reasonable question. We tried using worldstream.nl in the past as a hosted VMware provider, to avoid the "booting VMware on top of KVM/another virtualization platform" complexity that we have today. These platforms, when we last investigated (around 2019/2020), were not designed to be used for things like CI with frequent and rapid deployments and teardowns. We put a lot of engineering effort into trying to make it work by building custom tooling around provisioning, configuring, and deprovisioning VMware clusters, but did not reach a solution that we felt was adequate and maintainable. When we've talked to folks at VMware unofficially through the community about this use case in the past, they shared that this was not an area that VMware was interested in helping solve (or especially, licensing for). The landscape of options may have changed, and there may be completely viable hosts that could support this use case, but I don't know of any offhand.
Thanks Jill, much appreciated for the status update and your continued work chasing this down. Hopefully the escalation contact has deeper access that will show better logging or some better visibility on the underlying components to see why things aren't starting up correctly. Just curious, mainly because I'm feeling a bit guilty continuing to push on this issue (because I want to get my feature into community.vmware), but what's your "official" involvement here? Are you a community volunteer, or do you work for Red Hat providing misc. support and assistance for various ansible community collections to make sure that the ecosystem is robust, viable, etc.? Regardless, thanks again for helping escalate this to various other SMEs to help get issues resolved. -nik |
@nikatbu I'm a Red Hat Ansible employee. My team manages the Red Hat maintained cloud collections (including vmware.vmware_rest) and we help support and enable community collections in the cloud domain. |
Ah, cool, thanks for the info. And thanks very much for what you're doing in the Ansible space, especially helping with the community ecosystem. We find that adds a lot of value to the product above and beyond the direct work you and your team put into the Red Hat maintained collections. In terms of what the community is using for CI versus what Red Hat might be using internally to validate the cloud collections: is there a disparity such that the community might need to consider rearchitecting to align more with Red Hat best practices? Or is it more a case that, since the community might not have the same resources, it needs to leverage different technology for CI, hence the challenges we are running into that you may not be running into at Red Hat?
Historically, we've been able to use the exact same CI system for both VMware-related collections; the requirements are very similar. That system was previously maintained by engineers on my team and provided to the community, but the team has lost its VMware subject matter experts.
We're still looking for ways to address this loss of domain expertise, but this CI system does require ongoing maintenance, and the situation has been aggravated by a series of outages and problems on the hosted CI platform and its underlying infrastructure (Software Factory is the CI tool, which runs on the hosting provider Vexxhost).
Thanks Jill. Hopefully your management chain can appreciate the risk (especially if the same CI system is used for both VMware-related collections) and is willing to dedicate more resources to the situation. If you need customer "business needs": we've been using Ansible for a while in the Linux space and have been very happy, and we are just starting to leverage it for other things such as VMware, F5, NetApp, etc.
Thanks @nikatbu. If you have an AAP subscription you can always submit an RFE (request for feature) through the access.redhat.com portal. We do look at every one of those that comes in with product management! |
We are in the process of setting up a new vSphere environment, and we need to ascertain whether our current automation is compatible with vSphere 8. Is there any update regarding the inclusion of vSphere 8 in the CI pipeline? @jillr, we will ask RH support to submit an RFE (but I'm not sure if they will help, because it's a community collection).
@keilr I do not have any new information, my apologies. Yes, please: if you are an AAP customer who is using VMware, please submit an RFE with your use case.
I found an excellent description of the internals of the VMware CI environment: https://goneri.lebouder.net/2020/08/21/vmware-ci-of-ansible-modules-a-retrospective/
Hi all, just a quick update: a recent recheck on PR #1793 now shows that the Zuul validation works!!! #1793 (comment) IDK if I just got lucky. :)
Following up on the conversation in #1723.
I'll summarize the current state of the CI system to make sure everyone has the same information. And as a general disclaimer, none of this constitutes a change in support statement about any supported or certified collection on behalf of Red Hat. :)
The Ansible Cloud Content engineering team is maintaining the images and tooling that deploy a nested-virtualization vSphere 7.x cluster on top of OpenStack/libvirt using a Zuul cluster (https://softwarefactory-project.io/) hosted inside Red Hat. This infrastructure is used by both the vmware.vmware_rest and community.vmware collections and requires routine (minimum bi-monthly) maintenance for the images.
In the past, the Ansible team had two subject matter experts with VMware knowledge, one of whom was also an expert on Zuul. While both of those developers still work on Ansible, neither of them is involved with collection development any more. We have no one on the team today who has any significant VMware knowledge or experience. In part because of this, and also generally given the team's current roadmap priorities, we've paused feature development on the vmware.vmware_rest collection.
In the very immediate term we can continue to maintain the 7.0.3 images, but we're not able to do updates for vSphere 8, and if anything breaks, our ability to help troubleshoot will be limited. We are also in the process of migrating the other collections the team manages to GitHub Actions, and while we don't have a migration plan for vmware.vmware_rest's CI yet, we will need to evaluate this in the long term. The complexity of the Zuul platform is not meeting our needs for other collections, and we'd like to reduce the amount that we depend on this system and have to maintain expertise in it.
We are currently trying to find engineers with VMware knowledge elsewhere in Red Hat who might be able to maintain vmware.vmware_rest going forward. I definitely won't have any information about that until at least after Red Hat Summit this week, and probably for a little while after that. If we can find engineers with VMware expertise that can help we will certainly let them know that the CI is currently shared with the community in case they can take it over. If we can't find additional VMware expertise to support the collections though we will need to reduce the CI infrastructure support that we provide the community.
In the past, the team did investigate other options for integration testing VMware, without success (such as using hosted VMware providers like worldstream.nl). While it's possible the situation has changed, we don't have a low-effort / low-cost alternative to the custom Zuul deployment that we have today that we can readily suggest if we can't find someone to better support the CI. My understanding is that vcsim/govcsim do not provide the full functionality of the vSphere API, but maybe that has changed?
@mariolenz I think you know the most about the history of the collection and what it needs to be successful. If we couldn't find a team to take over providing CI with ongoing maintenance and updates, what would you need to keep the collection healthy?
cc @p-fruck @Nina2244 who expressed interest in the CI