Flink Operator Loses Job Manager Contact during EKS upgrade #683
Comments
Hi @guruguha Also, the serviceAccount with which the …
@live-wire thanks for responding! Yes, we have enabled HA for all our Flink clusters. All of them are Application Clusters that require a job submitter to run/get deployed. We also see the ConfigMaps created in the k8s namespace.
@guruguha would you be able to try this version: https://github.com/spotify/flink-on-k8s-operator/releases/tag/v0.5.1-alpha.2? We recently addressed HA-related issues that might be impacting you.
@regadas We recently upgraded our operator to the v0.5.0 release, and the issue still persists. Let me check with the v0.5.1-alpha.2 tag.
@regadas We tried the above release: our application-specific ConfigMap got deleted and all the pods were terminated during a node roll on EKS.
Hey @guruguha, the ConfigMap is what allows the job to recover; it shouldn't be getting deleted.
@live-wire Thanks for responding. We have a brief write-up of this issue here: #690. At a high level, these are our HA configs:
Other configs:
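For reference, Flink's Kubernetes HA service is usually enabled with settings along these lines in `flink-conf.yaml`. This is an illustrative sketch, not the reporter's actual configuration; the storage path and cluster id are placeholders:

```yaml
# flink-conf.yaml (illustrative; values below are placeholders)
high-availability: kubernetes
# Durable storage for JobGraph/checkpoint metadata; must survive pod restarts
high-availability.storageDir: s3://my-flink-bucket/recovery
# Unique per cluster; used as the prefix for the HA ConfigMaps
kubernetes.cluster-id: my-flink-cluster
```

With these settings, Flink stores leader and job metadata in ConfigMaps named after the cluster id, which is why those ConfigMaps must survive a JobManager restart for recovery to work.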
Also, there should be no job-submitter pod when you use Application mode. I noticed you mentioned you need a job submitter?
@live-wire I might have confused application mode with per-job mode. I'll share the FlinkCluster YAML shortly. We do have this in our HA settings:
We have the Spotify operator v0.4.2 deployed. Our Flink pipelines are also rack-aware, meaning each Flink cluster is deployed to only one AZ, mainly to reduce inter-AZ data-transfer costs. Although this helped us reduce data-transfer cost, HA doesn't seem to work at all!
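Pinning a cluster to a single AZ is typically done with node affinity on the standard zone label. A minimal sketch of the pod-template fragment (the zone value is a placeholder, not the reporter's actual setup):

```yaml
# Pod-template snippet (illustrative): constrain pods to one availability zone
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - us-east-1a   # placeholder AZ
```

Note that a hard zone constraint like this means that if the zone's nodes are being rolled, replacement pods can only land on new nodes in that same zone.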
When an EKS node group upgrade happens and a node hosting one or more JobManagers (for different clusters) goes down, the Flink operator is not even aware of it. The JobManager goes down and the entire cluster is out.
Can someone help us understand this? I'm unable to provide any logs because the operator seems to think the JobManager is running normally, and no error is logged anywhere. All the JobManager and TaskManager logs are gone too.
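One mitigation worth considering for node-group rolls is a PodDisruptionBudget over the JobManager pods, so a voluntary drain cannot evict the JobManager until the budget allows it. A sketch, assuming the JobManager pods carry a label like `component: jobmanager` (the label is an assumption, not confirmed from this cluster's manifests):

```yaml
# PodDisruptionBudget sketch (hypothetical labels): blocks voluntary eviction
# of the JobManager during a node drain
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: flink-jobmanager-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      component: jobmanager   # assumed label on JobManager pods
```

With a single JobManager replica, `minAvailable: 1` makes the drain wait indefinitely, which protects against exactly this failure but can stall the node upgrade; it is a trade-off, not a fix for the operator losing track of the JobManager.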