
[FEAT] Add documentation on /etc/machine-id volume mapping to support Folding At Home client v8.x #23

Closed
tylerbrockett opened this issue Aug 3, 2024 · 17 comments · Fixed by #24
Labels: enhancement (New feature or request)

Comments

tylerbrockett commented Aug 3, 2024

Is this a new feature request?

  • I have searched the existing issues

Wanted change

The documentation should include a blurb on mapping /etc/machine-id from the Docker host into the container so that the Folding At Home client v8.x works properly (it could instead be copied in the Dockerfile, but that is less flexible if we ever wanted to mock a different machine ID for whatever reason).
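
For illustration, the blurb could show a read-only bind mount along these lines (the image name, tag, and config path here are placeholders for whatever the README already uses):

docker run -d \
  --name=foldingathome \
  -v /path/to/config:/config \
  -v /etc/machine-id:/etc/machine-id:ro \
  lscr.io/linuxserver/foldingathome:latest

Mounting the file read-only keeps the container from ever touching the host's machine-id.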

Reason for change

Folding at Home client v8.x requires creating an account and associating your machine with the account in order to view folding progress. This is because the FaH team changed the web client control to be a public server that folders/contributors/machines report to, instead of each machine running its own web client control.

The machine ID is indirectly linked to each work unit in such a way that if the machine ID changes, it breaks all associations. This is a problem because each new docker container gets a randomly generated /etc/machine-id. Each time the container is recreated, the web client control would no longer see my machine (it would show "Disconnected"). See the following log entries:

E :Machine ID changed, generating new client ID
Generating RSA key.....
F@H ID = <new-client-id>
I3:WUXXXX:Loading work unit XXXX with ID <work-unit-id>
E :WU with client ID <old-client-id> does not belong client <new-client-id>

I traced the machine ID back to a line of code in the FaH client, which calls into the cbang/os/SystemInfo library to read the machine ID from /etc/machine-id on Linux systems.

By mapping the /etc/machine-id into the container, I am able to destroy and recreate the container as needed, and it continues to work with the web client control as expected.

Proposed code change

No response

tylerbrockett added the enhancement (New feature or request) label Aug 3, 2024

github-actions bot commented Aug 3, 2024

Thanks for opening your first issue here! Be sure to follow the relevant issue templates, or risk having this issue marked as invalid.

@kbernhagen

What you are calling “web client” is and always has been called “web control”. The name of the repository is a historical issue and is more of a tag soup. There was an experimental “web client” that ran in Chrome using Google Native Client code.

@kbernhagen

It is true that fah-client acts as both client and server. It is also true that the web app connects to a fah-client and a couple of servers and is therefore a client of sorts. However, the user guide and other docs call it web control, and I hope to reduce confusion among users by restricting use of “client” to mean fah-client.

@tylerbrockett (Author)

@kbernhagen understood, I am still fairly new to FaH, so I appreciate the clarification.

kbernhagen commented Aug 3, 2024

Note that units being dumped after a machine-id change is not a problem if one always sets Finish on work.

I realize there is no easy way to do this yet. (See: Better container support.)

Ironically, I believe the behavior was requested by someone using cloned containers. (See: Cloned machines have same F@H ID which leads to conflicts.)

Roxedus (Member) commented Aug 3, 2024

Mapping it wouldn't be the correct way to handle it. If a machine-id is required, we can generate one and store it in /config for persistence.
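
A rough sketch of what such an init step could do, purely as an illustration (the paths and the UUID-based generation are assumptions, not the actual implementation):

#!/bin/bash
# Generate a machine-id once and persist it under /config,
# then copy it into place on every container start.
if [ ! -f /config/machine-id ]; then
    tr -d '-' < /proc/sys/kernel/random/uuid > /config/machine-id
fi
cp /config/machine-id /etc/machine-id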

aptalca self-assigned this Aug 3, 2024
@tylerbrockett (Author)

Mapping it wouldn't be the correct way to handle it. If a machine-id is required, we can generate one and store it in /config for persistence.

That sounds much better, thanks!

aptalca (Member) commented Aug 3, 2024

Each time the container is recreated, the web client control would no longer see my machine (it would show "Disconnected").

I can't reproduce the issue as described.

When I recreate the container with the same env vars and config folder, it is detected as the same machine by the web app.

This is a problem because each new docker container gets a randomly generated /etc/machine-id.

My containers contain a blank /etc/machine-id, which never changes.

$ docker exec test cat /etc/machine-id
$
$ docker exec test ls -al /etc/machine-id
-rw-r--r-- 1 root root 0 Jun  5 02:05 /etc/machine-id

Is this issue only affecting resuming partial jobs?

aptalca (Member) commented Aug 3, 2024

I can't reproduce it with resuming either. I recreate the container and it picks up from where it left off.

Completed 4252 out of 500000 steps

I let it run for a bit, and after a recreation of the container it started from step 4252 as expected. Of course the percentage meter resets in the web app because the total number of steps is recalculated to show the remaining steps, but that's the same behavior as after doing a pause/resume in the web app.

tylerbrockett (Author) commented Aug 3, 2024

TL;DR: it's a "me issue"; the root cause is some form of missing dependency (on my side?) for Core 23 WUs. I apologize for the false alarm.

My containers contain a blank /etc/machine-id, which never changes.

My FaH containers have been generating new machine-ids, but your comment there helped me dig more into why, thank you @aptalca.

When I was running the old containers prior to the major v8 changes, I was getting stuck in a WU_STALLED loop, where it would download the WU but never actually start folding.

I remember doing a lot of digging at the time; tons of forum threads mentioned the issue, but no solutions worked for me. I somehow found that installing systemd via a custom-cont-init.d init script fixed a missing dependency or something. I wish I could remember the exact logic, but I was desperate to get this working, and it has been running fine since then. Anyway, systemd must be what generates a different machine-id each time I recreate the container.

Removing that init script today fixes the machine-id and "Disconnected" issues I was getting, but I was back to the WU_STALLED errors for Core 23 WUs (it did eventually start running successfully on a Core 22 WU, after 25 dumped Core 23 WUs in a row). Replacing the systemd install with ocl-icd-opencl-dev in the same kind of init script (sketch below) also "solves" the problem, but I have to imagine it's a missing dependency on my host machine, given that everyone else seems to run this image fine with Nvidia GPUs. I'll try to dig into the issue some more.
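
For reference, the workaround is just a script dropped into the container's custom init directory, roughly like this (the script name and host mount path are whatever you prefer, e.g. -v /path/on/host:/custom-cont-init.d:ro; this is a stopgap based on my testing, not a proposed fix):

#!/bin/bash
# custom-cont-init.d script: install the OpenCL ICD loader dev package at container start
apt-get update
apt-get install -y --no-install-recommends ocl-icd-opencl-dev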

aptalca (Member) commented Aug 3, 2024

Nvidia just needs the container toolkit and the Nvidia drivers on the host. All the OpenCL bits should be injected via the Nvidia Docker runtime.
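
Roughly, the run flags look like this (just a sketch; the container name and tag are placeholders, and --gpus all can be used instead of the runtime flag on newer setups):

docker run -d \
  --name=foldingathome \
  --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=all \
  lscr.io/linuxserver/foldingathome:latest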

tylerbrockett (Author) commented Aug 3, 2024

I have the nvidia-driver-XXX, nvidia-container-toolkit (replaces nvidia-container-runtime), and nvidia-cuda-toolkit installed on the host machine.

Inside the containers I can get the following output:

# nvidia-smi
Sat Aug  3 16:14:45 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.28.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060 Ti     Off |   00000000:01:00.0 Off |                  N/A |
| 74%   78C    P0            199W /  200W |     992MiB /   8192MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Is it possible the x64 Dockerfile in this repo should be installing ocl-icd-opencl-dev instead of intel-opencl-icd?

I notice this was changed from ocl-icd-libopencl1 to intel-opencl-icd as part of the PR to support FaH v8; the ARM64 version still lists ocl-icd-libopencl1. Neither of those packages on its own works on my system for GPU folding (tested by modifying the Dockerfile and building locally). However, ocl-icd-opencl-dev works for both CPU and GPU folding for me.

I am not familiar with OpenCL/GPU libs, but it seems like ocl-icd-opencl-dev is geared towards multiple device types (CPUs, GPUs, etc.), whereas the Intel package is geared just towards Intel GPUs, and it might be that ocl-icd-libopencl1 isn't sufficient for Nvidia GPUs for whatever reason.

I have seen ocl-icd-opencl-dev mentioned in several places now in discussions of OpenCL runtimes, libOpenCL.so, and FaH specifically. It also provides the libOpenCL.so file exactly where it's needed, so the ln -s libOpenCL.so.1 /usr/lib/x86_64-linux-gnu/libOpenCL.so step would no longer be needed either.

aptalca (Member) commented Aug 4, 2024

The dev package has the headers needed to build other packages against it; it shouldn't be needed at runtime.

When I tested Nvidia with this image a while back, all that was needed inside the image was this pointer file:
https://github.com/linuxserver/docker-foldingathome/blob/master/root/etc/OpenCL/vendors/nvidia.icd
The file listed in there should be injected by the Nvidia runtime.
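
A quick way to sanity-check that inside a running container (container name assumed):

$ docker exec foldingathome cat /etc/OpenCL/vendors/nvidia.icd
$ docker exec foldingathome sh -c 'ldconfig -p | grep -i opencl'

The first command shows which library the ICD points at; the second shows whether the OpenCL libraries are actually visible to the linker inside the container.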

@tylerbrockett (Author)

I thought the same, that -dev packages wouldn't be needed at runtime, but even the official FaH container Dockerfiles (I didn't know these existed) install that package. They use a different base image and look outdated, but it still seemed noteworthy.

https://github.com/FoldingAtHome/containers/blob/80d15f5870df9e45f62d5dfe3ac4d9ed82544992/fah-gpu/Dockerfile#L9

Could Core 23 be doing some build/compilation step that Core 22 wasn't? I tried running the debug FAHClient, but it didn't provide anything useful.

Everything I see points me at that package for some reason, though. It still doesn't explain why installing systemd would have also solved my problem; maybe some kind of shared dependency?

Any idea what Nvidia driver / CUDA version combo you were using in your testing? I can try to give those a shot.

tylerbrockett (Author) commented Aug 6, 2024

I think I've narrowed it down further to libexpat1, the XML parsing library for C. It was being installed as part of both the systemd install and the ocl-icd-opencl-dev install, which explains the correlation. It's also something I remember coming up while researching, though I don't recall the context.

When I install just the libexpat1 package in the Dockerfile, everything works as expected. I'm still trying to figure out where/how/why it's being used.
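
A quick way to check whether the library is present in a given container (container name assumed):

$ docker exec foldingathome sh -c 'ldconfig -p | grep libexpat'

If that comes back empty, the library isn't there; after adding libexpat1 it should list libexpat.so.1.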

aptalca (Member) commented Aug 10, 2024

@tylerbrockett
Can you test this image and confirm it works?
lspipepr/foldingathome:amd64-8.3.18-pkg-0a5de763-dev-6303e76f273ada8ef6a0d2c335daea1d9b6d29e5-pr-24

built from PR #24

tylerbrockett (Author) commented Aug 10, 2024

@aptalca - I tried on that branch as well as the specific image version mentioned, and both are working as expected now for Core 23 WUs. Thank you so much!

(PPD is gradually increasing and should stabilize around 4.3m or so)


LinuxServer-CI moved this from Issues to Done in Issue & PR Tracker Aug 11, 2024