Support for provisioning nvidia gpu agents #211
base: master
Conversation
```diff
@@ -42,7 +42,7 @@
   $defaults = {
     'ensure' => 'present',
   }
-  create_resources(ssh_authorized_key, hiera('ssh_keys'), $defaults)
+  # create_resources(ssh_authorized_key, hiera('ssh_keys'), $defaults)
```
This needs to be reverted; it's only here to be able to launch the code without keys.
Force-pushed from cb9aaa9 to 2a29851.
* Use a local copy of the nvidia sources file. Since http/https sources weren't introduced until Puppet 4.4, we can't use them. The wget approach is a decent fallback, but I'd rather just vendor the sources file and deal with linkrot of the apt repositories rather than linkrot of both the sources and the apt repositories.
* Notify Exec['apt_update'] from the apt source after it has been added. This should resolve race conditions between the apt source, apt key, and apt update execution, and means that the nvidia-docker2 package need only depend on the proper source being configured.
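A minimal sketch of that pattern (file names and paths are placeholders, not lifted from this PR; `Exec['apt_update']` is assumed to be declared by the puppetlabs-apt module):

```puppet
# Ship the vendored apt sources file, refresh apt when it changes, and
# let the package depend solely on the source being configured.
file { '/etc/apt/sources.list.d/nvidia-docker.list':
  ensure => file,
  source => 'puppet:///modules/agent_files/nvidia-docker.list',
  notify => Exec['apt_update'],  # assumed to be declared by the apt module
}

package { 'nvidia-docker2':
  ensure  => present,
  require => File['/etc/apt/sources.list.d/nvidia-docker.list'],
}
```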
Going forward, we should store files in the module where they're used.
These ordering arrows are redundant with the explicit `require` directive in each resource. In general I prefer to use ordering arrows only in one-liners rather than in verbose resources, as they get quite hard to follow, especially when comments are interspersed.
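For illustration, the style preference reads roughly like this (resource names are placeholders):

```puppet
# As a one-liner, arrow chaining reads fine:
Package['xserver-xorg'] -> Service['lightdm']

# ...but in a verbose resource, an explicit require is easier to follow:
service { 'lightdm':
  ensure  => running,
  require => Package['xserver-xorg'],
}
```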
Notifying a service triggers a refresh or restart depending on the service's capabilities. This should be sufficient to trigger a restart.
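As a minimal sketch (the template path matches the one used elsewhere in this PR; the rest is assumed):

```puppet
# A change to the rendered xorg.conf refreshes lightdm, which for most
# service providers means a restart.
file { '/etc/X11/xorg.conf':
  ensure  => file,
  content => template('agent_files/xorg.conf.erb'),
  notify  => Service['lightdm'],
}
```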
The linux-aws kernel is only necessary when using an EC2 instance; on other systems that kernel may not be appropriate. I'm not sure whether other virtualized systems have specific kernel requirements, as I don't have any nvidia GPU + virtualizable systems to test on. But that can be work for future PRs.
This dependency is established using both `before` and `require` on the respective resources. Where both resources are unconditionally declared, `require` is preferred.
When two resources are both installed unconditionally, I arbitrarily prefer declaring the dependency with `require` rather than `before`. Using `before` is still needed in cases where a dependency must be declared on a resource that is only installed conditionally. An example is the linux-aws kernel needing to be available before nvidia-375 when on an EC2 instance.
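A sketch of the two cases (package names come from the comment above; `$is_ec2` is a placeholder for whatever EC2 detection is actually used):

```puppet
# Unconditional resource: the dependency can live on the requiring side.
package { 'nvidia-375':
  ensure => present,
}

# Conditional resource: nvidia-375 cannot unconditionally require a
# package that may never be declared, so the conditional side uses before.
if $is_ec2 {
  package { 'linux-aws':
    ensure => present,
    before => Package['nvidia-375'],
  }
}
```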
Force-pushed from 2a29851 to d17584c.
I've made a few changes that constitute my base feedback. The commit messages include the rationale, and I'm happy to explain any of them further or take countervailing feedback.
> for some reason, after running the whole puppet code, the X server does not seem to have started with the right configuration loaded
@j-rivero what's the best way to check that the right config is loaded? I can spin up test instances with the latest state but don't have a good way to test that the started service is in the expected state.
I usually install …
Please note that, given the permission setup by …
When trying to run the code today in a new instance, the inclusion of …
I found more problems using the system Puppet 3.x on Xenial. @nuclearsandwich, are we targeting Puppet 4.x or Puppet 3.x in general?

```diff
diff --git a/modules/agent_files/templates/xorg.conf.erb b/modules/agent_files/templates/xorg.conf.erb
index a53865d..8daaac3 100644
--- a/modules/agent_files/templates/xorg.conf.erb
+++ b/modules/agent_files/templates/xorg.conf.erb
@@ -37,7 +37,7 @@ Section "Device"
     Driver         "nvidia"
     VendorName     "NVIDIA Corporation"
     BoardName      "GRID K520"
-    BusID          "<%= @facts['gpu_device_bus_id'] %>"
+    BusID          "<%= @gpu_device_bus_id %>"
 EndSection

 Section "Screen"
diff --git a/modules/facts/lib/facter/busid.rb b/modules/facts/lib/facter/busid.rb
deleted file mode 100644
index e05ca3a..0000000
--- a/modules/facts/lib/facter/busid.rb
+++ /dev/null
@@ -1,5 +0,0 @@
-Facter.add(:gpu_device_bus_id) do
-  setcode do
-    Facter::Core::Execution.execute('nvidia-xconfig --query-gpu-info | grep BusID | sed "s/.*PCI:/PCI:/g"')
-  end
-end
diff --git a/modules/facts/lib/facter/gpu_device_bus_id.rb b/modules/facts/lib/facter/gpu_device_bus_id.rb
new file mode 100644
index 0000000..e05ca3a
--- /dev/null
+++ b/modules/facts/lib/facter/gpu_device_bus_id.rb
@@ -0,0 +1,5 @@
+Facter.add(:gpu_device_bus_id) do
+  setcode do
+    Facter::Core::Execution.execute('nvidia-xconfig --query-gpu-info | grep BusID | sed "s/.*PCI:/PCI:/g"')
+  end
+end
```
Puppet 3.8 using the …
Facts are collected before puppet runs, which means that I don't think we can use a fact that comes from a command installed by puppet. I can think of a couple of really dirty solutions (like shelling out in the embedded Ruby template 😱), but it may be that we need to make it part of the hiera configuration.
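If we go the hiera route, a minimal sketch might look like this (the key name and PCI value are invented for illustration, not taken from this PR):

```puppet
# In hiera data (e.g. common.yaml), a hand-maintained entry such as:
#   gpu_device_bus_id: 'PCI:0:3:0'
$gpu_device_bus_id = hiera('gpu_device_bus_id')

# The template then reads the manifest-scope variable instead of a fact.
file { '/etc/X11/xorg.conf':
  ensure  => file,
  content => template('agent_files/xorg.conf.erb'),
}
```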
ERB files are evaluated before puppet resolves the dependency graph of the manifests, so no nvidia-xconfig is available when the ERB file is being parsed.
I've come back to test this PR today and everything seems to work fine, except that a …
This is usually a resource dependency issue, but it may be that we need to notify from the xhost.sh script directly (rather than relying solely on the …
I moved the notify clause from … The only workaround I can see is to run the restart command in …:

```diff
+if [[ $buildfarm_role == 'agent_gpu' ]]; then
+  service lightdm restart
+fi
```
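A possible Puppet-side alternative, as a sketch only (the resource title comes from the existing `service_lightdm_restart` reference; the command path and subscribe target are assumptions):

```puppet
# refreshonly exec: lightdm is restarted only when the X configuration
# actually changes, instead of unconditionally on every run.
exec { 'service_lightdm_restart':
  command     => '/usr/sbin/service lightdm restart',
  refreshonly => true,
  subscribe   => File['/etc/X11/xorg.conf'],
}
```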
This PR implements support for provisioning Nvidia GPU agents, particularly AWS nodes. Some details:

- New `agent_gpu` role based on `agent`
- Use of the lightdm `display-setup-script` feature to run a script that sets the proper permissions (see the sketch after this list)
- `xorg.conf` customized to run on GRID K520 cards in headless mode (more generic configurations did not work for me during my tests)
- `nvidia-docker2`
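For context, wiring the lightdm hook up with Puppet could look roughly like this; the conf path and script location are assumptions, not taken from this PR (the script name follows the `xhost.sh` mentioned above):

```puppet
# Install a lightdm drop-in that runs the permission-setup script each
# time the X server starts.
file { '/etc/lightdm/lightdm.conf.d/50-display-setup.conf':
  ensure  => file,
  content => "[Seat:*]\ndisplay-setup-script=/usr/local/bin/xhost.sh\n",
  notify  => Service['lightdm'],
}
```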
Steps to test it on the AWS cloud:

```sh
sudo apt-get update && sudo apt-get install -y librarian-puppet git mesa-utils && git clone https://github.com/j-rivero/buildfarm_deployment -b agent_gpu && git clone https://github.com/j-rivero/buildfarm_deployment_config -b agent_gpu && sed -i '51d' buildfarm_deployment_config/reconfigure.bash && sudo mv buildfarm_* /root/ && sudo su -
cd buildfarm_deployment_config
./reconfigure.bash agent-gpu
```
TODO: for some reason, after running the whole puppet code, the X server does not seem to have started with the right configuration loaded. It is necessary to restart the service to get it working correctly. I've tried to define the `require` clause explicitly to end with `service_lightdm_restart`, but no luck. Need your help here @nuclearsandwich.