[Feature]: Support detection, allocation and resetting of GPU partitions in CDNA cards #54
Comments
We have a couple of AS-8125GS-TNMR2 machines with MI300X GPUs and suffer from this as well. The only major thing lacking in NVIDIA's implementation is on-demand allocation of MIG instances: they are all statically allocated, which is a serious PITA and not elastic at all. Instances should be created when requested (e.g. `nvidia.com/mig-1g.5gb: 1`) and destroyed when the pod is done; when `nvidia.com/gpu: 1` is requested, a full GPU should be attached to the pod, and both should be possible at the same time (of course, `nvidia.com/mig-1g.5gb` and `nvidia.com/gpu` should land on completely different physical GPUs if requested simultaneously). This would create potential scheduling issues (fragmentation), but it should nevertheless be available as an option: it could better utilize the available resources and doesn't require the administrator to be omniscient when statically allocating MIGs.
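To make the request style above concrete, here is a sketch of two pods mixing the two kinds of request. The resource names follow NVIDIA's existing device-plugin conventions; the image name and the on-demand lifecycle comments reflect the proposal in this thread, not current behavior:

```yaml
# Illustrative only: one pod asking for a MIG slice, one for a full GPU.
# Under the proposal, these would be expected to land on different physical GPUs.
apiVersion: v1
kind: Pod
metadata:
  name: mig-slice-consumer
spec:
  containers:
  - name: worker
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1   # proposed: slice created on demand, destroyed with the pod
---
apiVersion: v1
kind: Pod
metadata:
  name: full-gpu-consumer
spec:
  containers:
  - name: worker
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: 1          # a whole physical GPU
```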
Very interesting deep dive about what is, and what is going to be, possible on NVIDIA hardware with Kubernetes:
Suggestion Description
This is more of a question at this point. The CDNA3 MI300X supports up to 8 partitions per card via SR-IOV. Can
k8s-device-plugin
support detecting, allocating, and resetting these partitions?
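For reference, outside Kubernetes the MI300X compute-partition mode can be switched with ROCm's SMI tooling. The flag names below are my understanding of recent `rocm-smi` releases and may differ by ROCm version, so treat this as a sketch and verify against `rocm-smi --help`:

```
rocm-smi --showcomputepartition          # inspect the current mode (e.g. SPX)
sudo rocm-smi --setcomputepartition CPX  # split the card into 8 compute partitions
sudo rocm-smi --setcomputepartition SPX  # return to a single partition
```

Presumably the device plugin would need to drive equivalent operations when a partition-sized resource is requested and reset the card when the pod terminates.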
Operating System
No response
GPU
CDNA, MI300x
ROCm Component
k8s-device-plugin