EKS kubeflow cannot run fairing. #770
Comments
This guide has been deprecated and archived, so some of the resources and links might not work. Would you mind checking the new Amazon EKS Workshop, which is now available at www.eksworkshop.com? |
Did you visit that site? There are no Kubeflow examples there, and the guide is working. The point is that when using the fairing library, the push to ECR works but the pull does not, so the Kubernetes Job does not work. I know that EKS 1.25 uses containerd, not Docker, so I'm not sure whether the problem is the fairing library or the EKS node. Sure thing is, |
It looks like you have permissions to pull and are getting "not found" responses from ECR. Double check that the region you pushed to matches the region you're pulling from. Also check that the name is correct. Can you paste the log messages where the successful ECR push is logged? I think that's all in the 02_01_fairing_introduction notebook. |
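The region and name check suggested above can be scripted. The following is only a minimal sketch: the helper names `parse_ecr_ref` and `diff_refs` are hypothetical (they are not part of fairing or the AWS SDK), and it assumes private ECR image references of the form `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed shape of a private ECR image reference:
#   <account>.dkr.ecr.<region>.amazonaws.com/<repository>[:<tag>]
ECR_REF = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com"
    r"/(?P<repository>[^:@]+)(?::(?P<tag>[^@]+))?$"
)


class EcrRef(NamedTuple):
    account: str
    region: str
    repository: str
    tag: Optional[str]


def parse_ecr_ref(ref: str) -> EcrRef:
    """Split an ECR image reference into account, region, repository and tag."""
    match = ECR_REF.match(ref)
    if match is None:
        raise ValueError(f"not an ECR image reference: {ref!r}")
    return EcrRef(**match.groupdict())


def diff_refs(pushed: str, pulled: str) -> List[str]:
    """Return the names of the fields that differ between two references."""
    a, b = parse_ecr_ref(pushed), parse_ecr_ref(pulled)
    return [field for field in EcrRef._fields if getattr(a, field) != getattr(b, field)]
```

Passing the image reference from the push log and the one shown in the pod description to `diff_refs` returns an empty list when they are identical, or the names of the mismatching parts (for example `region` or `tag`) otherwise.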
@raykrueger
Remote training
# Authenticate ECR
# This command retrieves a token that is valid for a specified registry for 12 hours,
# and then it prints a docker login command with that authorization token.
# Then we execute this command to log in to ECR
REGION='ap-northeast-3'
!eval $(aws ecr get-login --no-include-email --region=$REGION)
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
WARNING! Your password will be stored unencrypted in /home/jovyan/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store
Login Succeeded
# Create an ECR repository in the same region
# If you receive "RepositoryAlreadyExistsException" error, it means the repository already
# exists. You can move to the next step
!aws ecr create-repository --repository-name fairing-job --region=$REGION
{
    "repository": {
        "repositoryArn": "arn:aws:ecr:ap-northeast-3:468063208806:repository/fairing-job",
        "registryId": "468063208806",
        "repositoryName": "fairing-job",
        "repositoryUri": "468063208806.dkr.ecr.ap-northeast-3.amazonaws.com/fairing-job",
        "createdAt": 1689651359.0,
        "imageTagMutability": "MUTABLE",
        "imageScanningConfiguration": {
            "scanOnPush": false
        }
    }
}
# Setting up AWS Elastic Container Registry (ECR) for storing output containers
# You can use any docker container registry instead of ECR
AWS_ACCOUNT_ID=fairing.cloud.aws.guess_account_id()
AWS_REGION='ap-northeast-3'
DOCKER_REGISTRY = '{}.dkr.ecr.{}.amazonaws.com'.format(AWS_ACCOUNT_ID, AWS_REGION)
fairing.config.set_builder('append', base_image='tensorflow/tensorflow:1.15.0-py3', registry=DOCKER_REGISTRY, push=True)
fairing.config.set_deployer('job')
if __name__ == '__main__':
    remote_train = fairing.config.fn(train)
    remote_train()
[I 230718 03:36:02 config:125] Using preprocessor: <kubeflow.fairing.preprocessors.function.FunctionPreProcessor object at 0x7f40fd46e198>
[I 230718 03:36:02 config:127] Using builder: <kubeflow.fairing.builders.append.append.AppendBuilder object at 0x7f40abb7eb00>
[I 230718 03:36:02 config:129] Using deployer: <kubeflow.fairing.deployers.job.job.Job object at 0x7f40abb7eb70>
[W 230718 03:36:02 append:50] Building image using Append builder...
[I 230718 03:36:02 base:107] Creating docker context: /tmp/fairing_context_n5goe_j3
[W 230718 03:36:02 base:94] /usr/local/lib/python3.6/dist-packages/kubeflow/fairing/__init__.py already exists in Fairing context, skipping...
[I 230718 03:36:02 docker_creds_:234] Loading Docker credentials for repository 'tensorflow/tensorflow:1.15.0-py3'
[W 230718 03:36:04 append:54] Image successfully built in 1.585705624995171s.
[W 230718 03:36:04 append:94] Pushing image 468063208806.dkr.ecr.ap-northeast-3.amazonaws.com/fairing-job:DD0A7D6E...
[I 230718 03:36:04 docker_creds_:234] Loading Docker credentials for repository '468063208806.dkr.ecr.ap-northeast-3.amazonaws.com/fairing-job:DD0A7D6E'
[W 230718 03:36:04 append:81] Uploading 468063208806.dkr.ecr.ap-northeast-3.amazonaws.com/fairing-job:DD0A7D6E
[I 230718 03:36:04 docker_session_:284] Layer sha256:295d41931b472fa7d61e363497149a301fea37ab5ec9f0ea9916c86791c70b9c pushed.
[I 230718 03:36:04 docker_session_:284] Layer sha256:6fdd1bedaf2e49c66538fcc4e18b1f91d6fd4ba6e09886d242f4a217299f9e7a pushed.
[I 230718 03:36:05 docker_session_:284] Layer sha256:c58094023a2e61ef9388e283026c5d6a4b6ff6d10d4f626e866d38f061e79bb9 pushed.
[I 230718 03:36:05 docker_session_:284] Layer sha256:ac66bd508effe7f728663c81ae23e8a4f34ba7f707cea469e5c242ca544fe464 pushed.
[I 230718 03:36:05 docker_session_:284] Layer sha256:079b6d2a1e53c648abc48222c63809de745146c2ee8322a1b9e93703318290d6 pushed.
[I 230718 03:36:05 docker_session_:284] Layer sha256:11048ebae90883c19c9b20f003d5dd2f5bbf5b48556dabf06c8ea5c871c8debe pushed.
[I 230718 03:36:06 docker_session_:284] Layer sha256:094a8f5dd2cbe7e1bb8e970b4cf475516e3ecdbbdf673aeb454c6db226971e10 pushed.
[I 230718 03:36:06 docker_session_:284] Layer sha256:f5de9bda32bda66c3c4e1bef463925c0f649f1f8d9b20fdce4fcc4b761c50fab pushed.
[I 230718 03:36:06 docker_session_:284] Layer sha256:138c908b7d99825147bf3df37d6bca03cf4e0a48aded6b1dc14708adaa110f35 pushed.
[I 230718 03:36:07 docker_session_:284] Layer sha256:22e816666fd6516bccd19765947232debc14a5baf2418b2202fd67b3807b6b91 pushed.
[I 230718 03:36:08 docker_session_:284] Layer sha256:fb153ade6d147fb3ecf01f9cc24b489e684885946290c54503fa9667e0b587ac pushed.
[I 230718 03:36:12 docker_session_:284] Layer sha256:0db1490606495fd4e18934a9a6a645048f0e39869c1cb1c9e3a70141cb981878 pushed.
[I 230718 03:36:46 docker_session_:284] Layer sha256:354ee6535f236e958ab05585cfb532b2c20a9f18e2b45148896b0fcf77b819b5 pushed.
[I 230718 03:36:46 docker_session_:334] Finished upload of: 468063208806.dkr.ecr.ap-northeast-3.amazonaws.com/fairing-job:DD0A7D6E
[W 230718 03:36:46 append:99] Pushed image 468063208806.dkr.ecr.ap-northeast-3.amazonaws.com/fairing-job:DD0A7D6E in 42.38909610599512s.
[W 230718 03:36:46 job:90] The job fairing-job-spqxz launched.
[W 230718 03:36:48 manager:255] Waiting for fairing-job-spqxz-swmk8 to start...
[W 230718 03:36:48 manager:255] Waiting for fairing-job-spqxz-swmk8 to start...
[W 230718 03:36:48 manager:255] Waiting for fairing-job-spqxz-swmk8 to start...
[W 230718 03:36:50 manager:255] Waiting for fairing-job-spqxz-swmk8 to start...
[W 230718 03:37:01 manager:255] Waiting for fairing-job-spqxz-swmk8 to start...
[W 230718 03:37:12 manager:255] Waiting for fairing-job-spqxz-swmk8 to start...
[W 230718 03:37:24 manager:255] Waiting for fairing-job-spqxz-swmk8 to start...
[W 230718 03:37:36 manager:255] Waiting for fairing-job-spqxz-swmk8 to start...
[W 230718 03:37:50 manager:255] Waiting for fairing-job-spqxz-swmk8 to start...
[W 230718 03:38:31 manager:255] Waiting for fairing-job-spqxz-swmk8 to start...
[W 230718 03:38:45 manager:255] Waiting for fairing-job-spqxz-swmk8 to start...
The pod description is the same as the additional information above. |
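To compare what was pushed with what the pod is trying to pull, it can help to extract the exact image reference and pod name from a log like the one above. This is a rough sketch only: `summarize_fairing_log` is a hypothetical helper, and the two patterns are assumptions based purely on the "Pushed image ... in ...s." and "Waiting for ... to start..." lines shown in this log.

```python
import re

# Patterns inferred from the log lines in this thread (an assumption,
# not a documented fairing log format).
PUSHED = re.compile(r"Pushed image (?P<image>\S+) in [\d.]+s\.")
WAITING = re.compile(r"Waiting for (?P<pod>\S+) to start\.\.\.")


def summarize_fairing_log(log_text):
    """Return (pushed_image, pod_name, wait_count) found in fairing log text."""
    pushed_image = None
    pod_name = None
    wait_count = 0
    for line in log_text.splitlines():
        pushed = PUSHED.search(line)
        if pushed:
            pushed_image = pushed.group("image")
        waiting = WAITING.search(line)
        if waiting:
            pod_name = waiting.group("pod")
            wait_count += 1
    return pushed_image, pod_name, wait_count
```

The returned image reference can then be compared with the image that `kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].image}'` reports for the waiting pod.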
It looks like you've successfully pushed the image to … Update your notebook to pull from … |
I'm not sure if this is what you want. |
https://github.com/kubeflow/fairing/releases has not been updated in 3 years. Is it a good use of time to try using it? FYI: we don't test fairing as part of the release. |
I think there is a problem with the current EKS node. As mentioned above, I pushed the image to ECR using fairing.
[ec2-user@ip-99-0-2-107 ~]$ ECR_PW=$(aws ecr get-login-password --region $ECR_REGION)
So I don't know the exact cause.
In fact, the current problem has not been resolved at all, so I am using EKS 1.23 and Kubeflow 1.7. However, when the Kubernetes version goes up, we have to upgrade as well, so I think it's a pretty serious problem. |
Looking at a similar case: using microk8s > 1.13 will hit this error, since it uses microk8s.ctr and dockerd is replaced with containerd. The 'Append builder' calls a Layer class method originally from containerregistry, but fairing bundles an older version. See the append_.py code difference: I've tried changing 'mediaType' to 'docker_http.LAYER_MIME' in the fairing code, but it still does not work. The image manifest or digest seems to be incompatible. We need to check with containerregistry whether containerd-style images are supported and can be built with the Layer class method. |
Unfortunately, there are some limitations on AWS Kubeflow, as below: |
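One way to look into the manifest compatibility question is to check whether the media types inside the pushed manifest are consistent with each other. The sketch below is an assumption-laden illustration, not a confirmed diagnosis: `find_media_type_problems` is a hypothetical helper, the media type strings are the standard Docker schema 2 and OCI v1 values, and the idea that a missing or mixed `mediaType` is what containerd rejects here has not been verified in this thread. The manifest itself could be fetched, for example, with `aws ecr batch-get-image` and parsed as JSON.

```python
# Standard media types for Docker image manifest schema 2 and OCI image manifest v1.
DOCKER_V2 = {
    "manifest": "application/vnd.docker.distribution.manifest.v2+json",
    "config": "application/vnd.docker.container.image.v1+json",
    "layers": {
        "application/vnd.docker.image.rootfs.diff.tar.gzip",
        "application/vnd.docker.image.rootfs.foreign.diff.tar.gzip",
    },
}
OCI_V1 = {
    "manifest": "application/vnd.oci.image.manifest.v1+json",
    "config": "application/vnd.oci.image.config.v1+json",
    "layers": {
        "application/vnd.oci.image.layer.v1.tar",
        "application/vnd.oci.image.layer.v1.tar+gzip",
        "application/vnd.oci.image.layer.v1.tar+zstd",
    },
}


def find_media_type_problems(manifest):
    """Return a list of human-readable inconsistencies in a parsed manifest dict."""
    top = manifest.get("mediaType")
    if top == DOCKER_V2["manifest"]:
        family = DOCKER_V2
    elif top == OCI_V1["manifest"]:
        family = OCI_V1
    else:
        return [f"unrecognized manifest mediaType: {top!r}"]

    problems = []
    config_type = manifest.get("config", {}).get("mediaType")
    if config_type != family["config"]:
        problems.append(f"config has unexpected mediaType: {config_type!r}")
    for index, layer in enumerate(manifest.get("layers", [])):
        layer_type = layer.get("mediaType")
        if layer_type not in family["layers"]:
            problems.append(f"layer {index} has unexpected mediaType: {layer_type!r}")
    return problems
```

An empty result only means the media types are self-consistent; it does not prove that the image is pullable by containerd.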
Describe the bug
I tried the "kubeflow fairing" example provided by eksworkshop.
Creating the ECR repository and pushing succeed, but pulling the image from ECR fails (NOT FOUND).
So, the fairing job fails.
The point is, the login and push to ECR succeeded, but the pull failed.
So, it doesn't appear to be an authentication or permission issue.
EKS and Kubeflow were created by following the guide linked below.
https://awslabs.github.io/kubeflow-manifests/release-v1.7.0-aws-b1.0.2/docs/deployment/rds-s3/guide-terraform/
Steps To Reproduce
The steps can be found at the link below.
https://archive.eksworkshop.com/advanced/420_kubeflow/fairing/
jupyter notebook image: 527798164940.dkr.ecr.us-west-2.amazonaws.com/tensorflow-1.15.2-notebook-cpu:1.0.0
Environment
kubernetes version: 1.25
Using EKS: YES, 1.25
Kubeflow version: v1.7.0-aws-b1.0.2
Additional Information