Support for provisioning nvidia gpu agents #211

j-rivero · 2019-03-14T19:55:22Z

The PR implements the support for provisioning Nvidia GPU agents particularly the AWS nodes. Some details:

Add a new role called agent_gpu based on agent
Use lightdm to manage the launch of X server
Use lightdm display-setup-script feature to run an script to set proper permissions
Use custom xorg.conf customized to run on GRID K520 cards in a headless mode (more generic configurations did not work for me during my tests)
Install nvidia-docker2

Step to test it at AWS cloud:

Launch AMI 16.04 >> GPU instances g2.2xlarge
ssh -X ...
sudo apt-get update && sudo apt-get install -y librarian-puppet git mesa-utils && git clone https://github.com/j-rivero/buildfarm_deployment -b agent_gpu && git clone https://github.com/j-rivero/buildfarm_deployment_config -b agent_gpu && sed -i '51d' buildfarm_deployment_config/reconfigure.bash && sudo mv buildfarm_* /root/ && sudo su -
cd buildfarm_deployment_config
./reconfigure.bash agent-gpu

TODO: for some reason after running the whole puppet code the X server started does not seem to have started with the right configuration loaded. It is necessary to restart the service to get it working correctly. I've tried to define the require clause explicitly to end with the service_lightdm_restart but no luck. Need your help here @nuclearsandwich

modules/agent_files/files/etc/lightdm/xhost.sh

j-rivero · 2019-03-14T19:56:25Z

modules/profile/manifests/ros/base.pp

@@ -42,7 +42,7 @@
    $defaults = {
      'ensure' => 'present',
    }
-    create_resources(ssh_authorized_key, hiera('ssh_keys'), $defaults)
+    # create_resources(ssh_authorized_key, hiera('ssh_keys'), $defaults)


This needs to be reverted, only here to be able to launch the code without keys.

* Use a local copy of the nvidia sources file. Since http/https sources weren't introduced until Puppet 4.4, we can't use them. And the wget approach is a decent fallback but I'd rather just vendor the sources file and deal with linkrot of the apt repositories rather than linkrot of the sources and the apt repositories. * Notify Exec['apt_update'] from the apt package after the source has been added. This should resolve race conditions between the apt source, apt key, and apt update execution and means that the nvidia-docker2 package need only depend on the proper source being configured.

Going forward we should store files in the module they're used.

These ordering arrows are redundant with the explicit `require` directive in each resource. In general I prefer to use ordering arrows only in one-liners rather than verbose resources as they get quite hard to follow, especially when comments are interspersed.

Notifying a service triggers a refresh or restart depending on the services capabilities. This should be sufficient to trigger a restart.

The linux-aws kernel is only necessary when using an EC2 instance. In other systems that kernel may not be reasonable. I'm not sure if other virtual systems will have specific kernel requirements as I don't have any nvidia GPU + virtualizable systems to test on. But that can be work for future PRs.

This dependency is established using both before and require on the respective resources. Where both resources are unconditionally declared require is preferred.

When two resources are both installed unconditionally. I arbitrarily prefer declaring the dependency with `require` rather than `before`. Using `before` is still needed in cases where a dependency needs to be declared on a resource that is only installed conditionally. An example is the linux-aws kernel needing to be available before nvidia-375 when on an EC2 instance.

nuclearsandwich

I've made a few changes that constitute base feedback. The commit messages include rational and I'm happy to explain any of them further or get countervailing feedback.

for some reason after running the whole puppet code the X server started does not seem to have started with the right configuration loaded

@j-rivero what's the best way to check that the right config is loaded? I can spin up test instances with the latest state but don't have a good way to test that the started service is in the expected state.

modules/profile/files/jenkins/agent_gpu/etc/X11/xorg.conf

modules/profile/files/jenkins/agent_gpu/etc/lightdm/xhost.sh

modules/profile/manifests/jenkins/agent_gpu.pp

j-rivero · 2019-04-04T22:22:08Z

@j-rivero what's the best way to check that the right config is loaded? I can spin up test instances with the latest state but don't have a good way to test that the started service is in the expected state.

I usually install mesa-utils and run glxinfo . If it runs normally displaying the expected information this means to me that the right config is in place. If it displays clear error messages, bad luck.

…onfig

j-rivero · 2019-04-08T16:15:05Z

I usually install mesa-utils and run glxinfo . If it runs normally displaying the expected information this means to me that the right config is in place. If it displays clear error messages, bad luck.

Please note that given the permission setup by xhost.sh (called by lightdm) you need to run glxinfo from the jenkins-agent account.

modules/facts/lib/facter/busid.rb

modules/profile/manifests/jenkins/agent_gpu.pp

j-rivero · 2019-04-29T12:23:59Z

When trying to run the code today in a new instance, the inclusion of nvidia-xconfig inside the xorg template is making the installation files like seems to be evaluated before nvidia packages are installed (even if the template option requires nvidia package):

Could not retrieve fact='gpu_device_bus_id', resolution='<anonymous>': Could not execute 'nvidia-xconfig --query-gpu-info  | grep BusID | sed "s/.*PCI:/PCI:/g"': command not found

modules/profile/manifests/jenkins/agent_gpu.pp

j-rivero · 2019-04-29T13:04:37Z

I found more problems using puppet 3.x from system on Xenial. @nuclearsandwich are we targetting Puppet 4.x or Puppet 3.x in general?

diff --git a/modules/agent_files/templates/xorg.conf.erb b/modules/agent_files/templates/xorg.conf.erb
index a53865d..8daaac3 100644
--- a/modules/agent_files/templates/xorg.conf.erb
+++ b/modules/agent_files/templates/xorg.conf.erb
@@ -37,7 +37,7 @@ Section "Device"
     Driver         "nvidia"
     VendorName     "NVIDIA Corporation"
     BoardName      "GRID K520"
-    BusID          "<%= @facts['gpu_device_bus_id'] %>"
+    BusID          "<%= @gpu_device_bus_id %>"
 EndSection
 
 Section "Screen"
diff --git a/modules/facts/lib/facter/busid.rb b/modules/facts/lib/facter/busid.rb
deleted file mode 100644
index e05ca3a..0000000
--- a/modules/facts/lib/facter/busid.rb
+++ /dev/null
@@ -1,5 +0,0 @@
-Facter.add(:gpu_device_bus_id) do
-  setcode do
-    Facter::Core::Execution.execute('nvidia-xconfig --query-gpu-info  | grep BusID | sed "s/.*PCI:/PCI:/g"')
-  end
-end
diff --git a/modules/facts/lib/facter/gpu_device_bus_id.rb b/modules/facts/lib/facter/gpu_device_bus_id.rb
new file mode 100644
index 0000000..e05ca3a
--- /dev/null
+++ b/modules/facts/lib/facter/gpu_device_bus_id.rb
@@ -0,0 +1,5 @@
+Facter.add(:gpu_device_bus_id) do
+  setcode do
+    Facter::Core::Execution.execute('nvidia-xconfig --query-gpu-info  | grep BusID | sed "s/.*PCI:/PCI:/g"')
+  end
+end

nuclearsandwich · 2019-04-29T14:14:08Z

@nuclearsandwich are we targetting Puppet 4.x or Puppet 3.x in general?

Puppet 3.8 using the --future parser so that we're using the updated puppet language parser.

nuclearsandwich · 2019-04-29T14:55:10Z

the inclusion of nvidia-xconfig inside the xorg template is making the installation files like seems to be evaluated before nvidia packages are installed (even if the template option requires nvidia package):

Facts are collected before puppet runs. Which means that I don't think we can use a fact that comes from a command installed by puppet.

I can think of a couple of really dirty solutions (like shelling out in the embedded Ruby template 😱) but it may be that we need to make it a part of the hiera configuration.

modules/agent_files/templates/xorg.conf.erb

ERB files are evaluated before puppet manages the dependency resolution of the manifests so no nvidia-xconfig is available when the ERB file is being parsed

j-rivero · 2019-05-16T14:58:55Z

I've back to test this PR today and everything seems to work fine except the fact that a service lightdm restartis needed after provisioning. I think that the xhost.sh script is being executed (the touch command in there leaves a file in tmp) but for some reason the X server does not have the right permissions.

nuclearsandwich · 2019-05-16T15:35:18Z

service lightdm restart

This is usually a resource dependency issue but it may be that we need to notify from the xhost.sh script directly (rather than relying solely on the notify => Service[lightdm] in the /etc/lightdm/lightdm.conf resource. Especially on repeated runs as the notification won't be triggered if the lightdm.conf file is unchanged.

j-rivero · 2019-05-24T16:15:46Z

service lightdm restart

This is usually a resource dependency issue but it may be that we need to notify from the xhost.sh script directly (rather than relying solely on the notify => Service[lightdm] in the /etc/lightdm/lightdm.conf resource. Especially on repeated runs as the notification won't be triggered if the lightdm.conf file is unchanged.

I moved the notify clause from lightdm.conf to xhost.sh file and make this last one to depend on the first one. No good result.

The only workaround I can see is to run the restart command in reconfigure.sh to restart lightdm after provisioning:

+if [[ $buildfarm_role == 'agent_gpu' ]]; then
+  service lightdm restart
+fi

j-rivero commented Mar 14, 2019

View reviewed changes

modules/agent_files/files/etc/lightdm/xhost.sh Outdated Show resolved Hide resolved

j-rivero commented Mar 14, 2019

View reviewed changes

nuclearsandwich self-requested a review March 14, 2019 20:05

j-rivero added 10 commits April 3, 2019 08:41

Implement agent_gpu profile extending agent

be3ad7c

Fix jenkins username

780f9c4

Fixes to get real support on AWS machines

7908d42

Be sure that xorg.conf is not overwritten

d81a613

Fix syntax typo

f49be15

Do not install ubuntu-drivers-common

95add6e

Use require in lightdm service

5840a7a

Need to restart the service

b8a27f6

Explicit order for lightdm

cf8f873

Add debug to the script

d092226

nuclearsandwich force-pushed the agent_gpu branch from cb9aaa9 to 2a29851 Compare April 3, 2019 18:41

nuclearsandwich added 7 commits April 3, 2019 12:06

Avoid use of legacy 'agent_files' module.

cc7d29e

Going forward we should store files in the module they're used.

Notify service directly rather than via exec.

1ce7cb7

Notifying a service triggers a refresh or restart depending on the services capabilities. This should be sufficient to trigger a restart.

Remove double-declared dependency.

03d45d9

This dependency is established using both before and require on the respective resources. Where both resources are unconditionally declared require is preferred.

nuclearsandwich force-pushed the agent_gpu branch from 2a29851 to d17584c Compare April 3, 2019 20:20

nuclearsandwich reviewed Apr 3, 2019

View reviewed changes

modules/profile/files/jenkins/agent_gpu/etc/X11/xorg.conf Outdated Show resolved Hide resolved

modules/profile/files/jenkins/agent_gpu/etc/lightdm/xhost.sh Outdated Show resolved Hide resolved

nuclearsandwich reviewed Apr 4, 2019

View reviewed changes

modules/profile/manifests/jenkins/agent_gpu.pp Show resolved Hide resolved

j-rivero added 3 commits April 5, 2019 18:13

Transform xorg.conf into a template to get busid value from nvidia-xc…

8724994

…onfig

Fix facter implementation

18e0640

value needs quotes

927037d

nuclearsandwich reviewed Apr 8, 2019

View reviewed changes

modules/facts/lib/facter/busid.rb Outdated Show resolved Hide resolved

modules/profile/manifests/jenkins/agent_gpu.pp Show resolved Hide resolved

Change busid by gpu_device_bus_id

3f32d61

j-rivero commented Apr 29, 2019

View reviewed changes

modules/profile/manifests/jenkins/agent_gpu.pp Show resolved Hide resolved

This was referenced Apr 29, 2019

Add trusted_node_data = true to puppet config during reconfigure. ros-infrastructure/buildfarm_deployment_config#49

Open

Fix 'master' proxy vhost config (avoid: "proxy setup broken") #215

Merged

nuclearsandwich reviewed Apr 30, 2019

View reviewed changes

modules/agent_files/templates/xorg.conf.erb Outdated Show resolved Hide resolved

Revert the use of facter to get the nvidia-xconfig value

b92729a

ERB files are evaluated before puppet manages the dependency resolution of the manifests so no nvidia-xconfig is available when the ERB file is being parsed

remove debug touch command

220a30f

j-rivero requested a review from nuclearsandwich January 2, 2020 17:57

j-rivero mentioned this pull request Jan 2, 2020

Add support for agent_gpu in reconfigure.bash ros-infrastructure/buildfarm_deployment_config#51

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for provisioning nvidia gpu agents #211

Support for provisioning nvidia gpu agents #211

j-rivero commented Mar 14, 2019 •

edited

Loading

j-rivero Mar 14, 2019

nuclearsandwich left a comment

j-rivero commented Apr 4, 2019

j-rivero commented Apr 8, 2019

j-rivero commented Apr 29, 2019

j-rivero commented Apr 29, 2019

nuclearsandwich commented Apr 29, 2019

nuclearsandwich commented Apr 29, 2019

j-rivero commented May 16, 2019

nuclearsandwich commented May 16, 2019

j-rivero commented May 24, 2019

Support for provisioning nvidia gpu agents #211

Are you sure you want to change the base?

Support for provisioning nvidia gpu agents #211

Conversation

j-rivero commented Mar 14, 2019 • edited Loading

j-rivero Mar 14, 2019

Choose a reason for hiding this comment

nuclearsandwich left a comment

Choose a reason for hiding this comment

j-rivero commented Apr 4, 2019

j-rivero commented Apr 8, 2019

j-rivero commented Apr 29, 2019

j-rivero commented Apr 29, 2019

nuclearsandwich commented Apr 29, 2019

nuclearsandwich commented Apr 29, 2019

j-rivero commented May 16, 2019

nuclearsandwich commented May 16, 2019

j-rivero commented May 24, 2019

j-rivero commented Mar 14, 2019 •

edited

Loading