Flink Operator Loses Job Manager Contact during EKS upgrade #683
Comments
Hi @guruguha Also, the serviceAccount with which the …
@live-wire thanks for responding! Yes, we have enabled HA for all our Flink clusters. All of them are Application Clusters that require a job submitter to run/get deployed. We also see the ConfigMaps created in the k8s namespace.
@guruguha would you be able to try this version: https://github.com/spotify/flink-on-k8s-operator/releases/tag/v0.5.1-alpha.2? We recently addressed HA-related issues that might be impacting you.
@regadas We recently upgraded our operator to the v0.5.0 release, and the issue still persists. Let me check with the v0.5.1-alpha.2 tag.
@regadas We tried the above release: our application-specific ConfigMap got deleted and all the pods were terminated during a node roll on EKS.
Hey @guruguha, the ConfigMap is what allows the job to recover; it shouldn't be getting deleted.
@live-wire Thanks for responding. We have a brief write-up of this issue here: #690. At a high level, these are our HA configs:
Other configs:
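For reference, Flink's Kubernetes HA service is usually enabled with settings along these lines in `flink-conf.yaml`. This is an illustrative sketch, not the reporter's actual configuration; the storage path and cluster id are placeholders:

```yaml
# flink-conf.yaml (illustrative; values below are placeholders)
high-availability: kubernetes
# Durable storage for JobGraph/checkpoint metadata; must survive pod restarts
high-availability.storageDir: s3://my-flink-bucket/recovery
# Unique per cluster; used as the prefix for the HA ConfigMaps
kubernetes.cluster-id: my-flink-cluster
```

With these settings, Flink stores leader and job metadata in ConfigMaps named after the cluster id, which is why those ConfigMaps must survive a JobManager restart for recovery to work.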
Also, there should be no job-submitter pod when you use Application mode. I noticed you mentioned you need a job submitter?
@live-wire I might have confused application mode with per-job mode. I'll share the FlinkCluster YAML shortly. We do have this in our HA settings:
We have the Spotify operator v0.4.2 deployed. Our Flink pipelines are also rack-aware, meaning each Flink cluster is deployed to only one AZ, mainly to reduce inter-AZ data-transfer costs. Although this helped us reduce data-transfer cost, HA doesn't seem to work at all!
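Pinning a cluster to a single AZ is typically done with node affinity on the standard zone label. A minimal sketch of the pod-template fragment (the zone value is a placeholder, not the reporter's actual setup):

```yaml
# Pod-template snippet (illustrative): constrain pods to one availability zone
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - us-east-1a   # placeholder AZ
```

Note that a hard zone constraint like this means that if the zone's nodes are being rolled, replacement pods can only land on new nodes in that same zone.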
When an EKS node group upgrade happens and a node hosting one or more JobManagers (for different clusters) goes down, the Flink operator is not even aware of it. The JobManager goes down and the entire cluster is out.
Can someone help us understand this? I'm unable to provide any logs because the operator seems to think the JobManager is running normally, and no error is logged anywhere. All the JobManager and TaskManager logs are gone too.
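One mitigation worth considering for node-group rolls is a PodDisruptionBudget over the JobManager pods, so a voluntary drain cannot evict the JobManager until the budget allows it. A sketch, assuming the JobManager pods carry a label like `component: jobmanager` (the label is an assumption, not confirmed from this cluster's manifests):

```yaml
# PodDisruptionBudget sketch (hypothetical labels): blocks voluntary eviction
# of the JobManager during a node drain
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: flink-jobmanager-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      component: jobmanager   # assumed label on JobManager pods
```

With a single JobManager replica, `minAvailable: 1` makes the drain wait indefinitely, which protects against exactly this failure but can stall the node upgrade; it is a trade-off, not a fix for the operator losing track of the JobManager.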