This repository was archived by the owner on Jan 29, 2026. It is now read-only.

Add ppc64le & multiarch support #136

Open
sdmonov wants to merge 75 commits into IBM:master from sdmonov:add_ppc64le_multiarch_support

sdmonov commented Sep 13, 2018

This code adds ppc64le and multiarch support for FfDL.

  • Adds a new make target to the main Makefile, docker-create-manifest,
    which creates the multiarch manifests for the services.

  • Adds an arch (architecture) parameter to values.yaml and
    storage-plugin/values.yaml.

make docker-build builds all the service images with a -${ARCHITECTURE}
suffix. During the build, an ARCH argument is passed to the Dockerfiles so
that they can contain architecture-specific steps.

Building docker images for ppc64le should look like this:

make docker-build
make docker-push
make docker-create-manifest
.....
helm install storage-plugin --set arch=ppc64le,namespace=$NAMESPACE
helm install . --set arch=ppc64le,lcm.shared_volume_storage_class=$SHARED_VOLUME_STORAGE_CLASS,namespace=$NAMESPACE
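
As a rough illustration of the naming scheme described above, the sketch below composes an arch-suffixed image name and prints the docker build command that would pass ARCH as a build argument. It is a minimal sketch, not the FfDL Makefile logic: the helper names arch_image_name and build_command are hypothetical, the ffdl/ prefix and v0.1.1 tag are borrowed from the example later in this conversation, and deriving the architecture from uname -m is an assumption.

```shell
#!/bin/sh
# Hypothetical helpers (not from the FfDL Makefile). They only print
# commands, so nothing is built when this script runs.

# Compose the arch-suffixed image name, e.g. ffdl/<service>-<arch>:<tag>.
arch_image_name() {
    service="$1"; arch="$2"; tag="$3"
    printf 'ffdl/%s-%s:%s\n' "$service" "$arch" "$tag"
}

# Print the docker build command that passes ARCH to the Dockerfile.
build_command() {
    service="$1"; arch="$2"; tag="$3"
    printf 'docker build --build-arg ARCH=%s -t %s .\n' \
        "$arch" "$(arch_image_name "$service" "$arch" "$tag")"
}

# Assumption: the architecture suffix matches the output of uname -m.
build_command ffdl-ui "$(uname -m)" v0.1.1
```

Inside a Dockerfile, a build argument like this is typically received with an ARG ARCH instruction; how the FfDL Dockerfiles actually consume it is not shown in this conversation.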

Fixes #115

Developer's Certificate of Origin 1.1

   By making a contribution to this project, I certify that:

   (a) The contribution was created in whole or in part by me and I
       have the right to submit it under the Apache License 2.0; or

   (b) The contribution is based upon previous work that, to the best
       of my knowledge, is covered under an appropriate open source
       license and I have the right under that license to submit that
       work with modifications, whether created in whole or in part
       by me, under the same open source license (unless I am
       permitted to submit under a different license), as indicated
       in the file; or

   (c) The contribution was provided directly to me by some other
       person who certified (a), (b) or (c) and I have not modified
       it.

   (d) I understand and agree that this project and the contribution
       are public and that a record of the contribution (including all
       personal information I submit with it, including my sign-off) is
       maintained indefinitely and may be redistributed consistent with
       this project or the open source license(s) involved.

whummer and others added 30 commits February 9, 2018 19:41
[don't merge yet] Fix package names and travis config
Update instructions with Cloud Object Storage
* arch diag

* arch diag

* Update README.md

* adding specs

* adding specs
* update none VM_TYPE

* polish export commands
* Extent make test-submit waiting time
…BM#22)

* assign default edit role to lcm

* add helm value options for 1.7 and below
* adding prereqs and bumping user guide in front

* adding prereqs and bumping user guide in front
…ors) (IBM#24)

* add caffe2 and pytorch cpu support

* update LCM, learner config file, and example jobs

* fix pytorch example bug

* Update gpu-guide.md

* Update gpu-guide.md

* merge CPU and GPU examples into a single example

* add more tf framework versions

* fix typo

* add S3 prereq
* adding contributors

* Update README.md
* Updating maintainers file

* Update MAINTAINERS.md
* add converting script

* update converter readme and update tensorflow version

* update troubleshooting

* Update README.md

* Update gpu-guide.md

* Update README.md

* Update README.md

* Update README.md
* Adding references to Watson Studio

* Update README.md

* Rename README.md to ffdl-wml.md

* Update README.md

* Create train-deploy-wml.md

* Update train-deploy-wml.md

* Update README.md

* Update ffdl-wml.md

* Update ffdl-wml.md

* Update ffdl-wml.md

* Update README.md

* Update README.md

* update WML instructions

* revert tf example

* update caffe manifest
* Update feature-gates for k8s 1.9.4 and above

* Update troubleshooting

* Update README.md
)

* update learner entrypoint command

* update troubleshooting

* Update train.sh path to match with DLaaS.
…ild (IBM#51)

* * Add codebase configuration for device plugin and custom learner images
* Add developer guide for those who want to do a custom FfDL build

* update developer-guide

* fix declare type
* Creating CLA

* Update CLA.md

* Update CONTRIBUTING.md
Tomcli and others added 22 commits June 20, 2018 14:33
* add detailed H2O instructions

* add detailed H2O instructions
* h20 arch image

* h20 arch image

* h20 arch image
* upload chinese readme file .

upload chinese readme file .

* add chinese readme file hyperlinks on readme file.

add chinese readme file hyperlinks on readme file.

* add chinese readme file hyperlinks on readme file.

add chinese readme file hyperlinks on readme file.

* modify

modify
Pre-0.1 release: Add Object Storage mount and other enhancements.
* Architecture Details

* Architecture Details

* Architecture Details
* init commit for horovod patch

* update examples and docs

* update docs and converter script

* update example readme

* update example readme

* modify horovod examples with real workload

* modify horovod examples

* update sed syntax to be more visual friendly

* add troubleshooting for dind cluster

* remove deprecated instructions
* horovod

* horovod

* horovod

* horovod

* horovod

* horovod

* horovod
* Remove 4 minute timeout for log follow process (IBM#106)

The process that follows the training logs of an ongoing training job
should not timeout after 4 minutes. Instead the log follow process
should complete after the training job itself is finished.

This behavior is necessary to enable chaining up commands to create
machine learning pipelines, where subsequent commands require the output
data of the training job whose logs are being "followed" like in our
ART notebook.

This commit reinstates the log follow behavior prior merge of PR IBM#79

* Updates suggested by sboagibm

Intention was to not rely on a long term stream being held open, 
but to be able to re-open a new stream starting from where the 
old left off, if the connection terminates.
* Update ART Notebook after PR IBM#79

- Load cluster configuration from environment variables
- Require PUBLIC_IP and KUBECONFIG instead of CLUSTER_NAME and VM_TYPE
- Use storage type "mount_cos" (s3fs) instead of "s3_datastore"

* Update ART demo notebook after PR IBM#79

- Load cluster configuration from environment variables
- Require PUBLIC_IP and KUBECONFIG instead of CLUSTER_NAME and VM_TYPE
- Use storage type "mount_cos" (s3fs) instead of "s3_datastore"
* update dl framework versions

* update examples with new framework tags
* update fashion mnist example with seldon 0.2

* fix readme
* Pointed travis testing to do hostmount minikube

* Debugging permissions error.

* Fix to mkdir problems.

* Fixed Makefile syntax.

* Printing debugging information about pods.

* Printing debugging information about pods.

* Printing debugging information about pods.

* Printing debugging information incl kubectl get pod.

* Enabled debug mode.

* Again.

* Set debug as default.

* tracing from the trainer to lcm

* more debugging

* added lower level logging

* dist: xenial

* Update .travis.yml

* fix typo

* Trying to fix Travis issue.

* Fixed Travis issue.

* Followed Tommy's request and increased resource limits to values from before. Might break CI.

* Parameterized memory values like Tommy requested.

* Attempt to fix CI.

* Removed excessive debug statements and cleaned comments. Probably breaks code.

* DLaaS pull june 14, with security mods

* fixed glide problem

* Added Image.go etc. files, deleted learner_test.go

* temporarily disable framework validation

* FIXME: Disable validation check for bucket until conditionalize for s3fs vs.  option.

* fixed two bugs related to volume mounting

* I think mostly just logging changes

* basic success

* Add FfDL.iml to .gitignore

* removed docker ref to csf_env.properties

* Test for mount_cos before attempting s3 validation

* fixed hostmount by pre-setup of model code in Makefile

* fixed missing import

* log HELM_DEPLOY_DIR, add a bunch of logging for the ci test

* Added create-volumes to jenkins file, more verbose docker build for ui

* Wound back Angular to 6.0.8

* Quiet docker-build-ui docker build

* merged bin/create_static_volumes_config2.sh into bin/create_static_volumes_config.sh
* update prebuild image version, update helm chart to 0.1.1

* fix make deploy bug
…ce (IBM#110)

* make helm charts and scripts compatible to deploy FfDL on any namespace

* allow users to export all the enviornment variables in a txt file

* Update readme with new notice

* Fix typo

* Update static volumes config v2 namespace parameter

* capitalize NAMESPACE, update Makefile, developer guide, and trobleshooting.
LGTM. Ran fine / fixed statsd issue on Ubuntu 18.04 Vagrant VM.
* Simplifying README

* Simplifying README

* Create detailed-installation-instructions.md

* Update README.md

* Update detailed-installation-instructions.md

* Update and rename detailed-installation-instructions.md to detailed-installation-guide.md

* Update detailed-installation-guide.md

* Update detailed-installation-guide.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md
This code is adding ppc64le and multiarch support for FfDL.

- Adds new make target to the main Makefile: **docker-create-manifest**
  that creates the multiarch manifests for the services.

- arch (architecture) parameter was added to values.yaml and
  storage-plugin/values.yaml

**make docker-build** will generate all the services with a -${ARCHITECTURE}
suffix. During build an ARCH argument is sent to the Dockerfiles in order
to have architecture specific implementation inside it.

Building docker images should look like this:

make docker-build
make docker-push
make docker-create-manifest

sdmonov commented Sep 13, 2018

Hi All,

This code adds support for ppc64le in FfDL. It adds an architecture suffix to the built services and also adds a target in the main Makefile to create/amend the manifest lists needed to support multiple architectures.

The code needs two services, prom/pushgateway and localstack/localstack (optional), that do not have ppc64le versions in the public registries. I built custom ppc64le versions and pushed them to my own registry, smonov. They will be pulled from there for now, until they are published on a more official registry.

values.yaml Outdated
 expose_node_port: true
 docker:
-  registry: docker.io
+  registry: ffdl.ibm.com
Contributor

This changes our official Docker image location.

Author

Sorry, my mistake. This was not supposed to be there. I will fix it.


sdmonov commented Sep 14, 2018

Found a few issues. Fixing and testing them now:

  • Fixed the issue with the registry in values.yaml reported by @Tomcli.
  • Added a new target to the Makefile: docker-tag-local. It tags the arch-dependent image name with the arch-independent name. For example, ffdl/ffdl-ui-ppc64le:v0.1.1 will be tagged as ffdl/ffdl-ui:v0.1.1. This is used only for local deployments that do not require pushing to a registry. make docker-push pushes only the arch-dependent images (those with the -${ARCHITECTURE} suffix); manifest lists are then created for these images to support multiarch (by invoking make docker-create-manifest).
  • Updated .travis.yml to invoke docker-tag-local.
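
The tagging and manifest steps described in the list above could look roughly like the following. This is a sketch under stated assumptions, not the actual Makefile targets: the helper names are hypothetical, the ffdl/ prefix follows the example image name above, and the x86_64 suffix for the second architecture is an assumption. The helpers print the docker tag, docker manifest create, and docker manifest push commands instead of running them.

```shell
#!/bin/sh
# Illustrative helpers only; they print docker commands rather than run them.

# Derive the arch-independent name from an arch-suffixed one, e.g.
# ffdl/ffdl-ui-ppc64le:v0.1.1 -> ffdl/ffdl-ui:v0.1.1
generic_image_name() {
    image="$1"; arch="$2"
    repo="${image%:*}"     # everything before the last colon
    tag="${image##*:}"     # everything after the last colon
    printf '%s:%s\n' "${repo%-"$arch"}" "$tag"
}

# Local deployments: tag the arch-dependent image with the generic name.
local_tag_command() {
    printf 'docker tag %s %s\n' "$1" "$(generic_image_name "$1" "$2")"
}

# Registry deployments: build a manifest list from the per-arch images.
# Usage: manifest_commands <service> <tag> <arch> [<arch>...]
manifest_commands() {
    service="$1"; tag="$2"; shift 2
    list="ffdl/${service}:${tag}"
    members=""
    for arch in "$@"; do
        members="${members} ffdl/${service}-${arch}:${tag}"
    done
    printf 'docker manifest create %s%s\n' "$list" "$members"
    printf 'docker manifest push %s\n' "$list"
}

local_tag_command ffdl/ffdl-ui-ppc64le:v0.1.1 ppc64le
manifest_commands ffdl-ui v0.1.1 ppc64le x86_64
```

With a manifest list in place, a pull of the generic name resolves to the image matching the node's architecture, which is why only the suffixed images need to be pushed individually.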

@animeshsingh
Contributor

Thanks @sdmonov. What is the target for travis if any?

seelam commented Sep 14, 2018

helm install storage-plugin --set arch=ppc64le,namespace=$NAMESPACE
What happens if you have a mixed x86/POWER cluster?

Could we not put the multi-arch storage plugin in a private repo and use it?

The public Travis CI is now multi-arch, so you should do builds with it. I would like to see the changes in the .travis.yml file.

We have to deal with learner images... is that addressed in a different PR?

- added docker-tag-local target in Makefile
- fixed few issues in docker-create-manifest target in Makefile
- small fix in docker-push to not have duplicate code

sdmonov commented Sep 17, 2018

@seelam:

  • Regarding the storage plugin, I totally agree: we need to push multi-arch images and create a manifest list. I do not have access to push images at the moment. If I can be given access (@animeshsingh), I can push both the ppc64le and x86_64 images and create the manifest lists to support multi-arch.
  • At the moment, .travis.yml (as far as I can tell from looking at it) only runs tests. It does not push images to the repos.
  • Learner images will go in a separate PR, which I will commit soon.

Development

Successfully merging this pull request may close these issues.

Power support in FfDL