
[FEAT] Add documentation on /etc/machine-id volume mapping to support Folding At Home client v8.x #23

Closed
tylerbrockett opened this issue Aug 3, 2024 · 17 comments · Fixed by #24
Labels: enhancement (New feature or request)

Comments

tylerbrockett commented Aug 3, 2024

Is this a new feature request?

  • I have searched the existing issues

Wanted change

The documentation should include a blurb on mapping /etc/machine-id from the Docker host into the container so that the Folding At Home client v8.x works properly (it could instead be copied in the Dockerfile, but that is less flexible if we ever wanted to mock a different machine ID for whatever reason).
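
For illustration, the blurb could show a read-only bind mount along these lines (the image name, tag, and config path here are placeholders for whatever the README already uses):

docker run -d \
  --name=foldingathome \
  -v /path/to/config:/config \
  -v /etc/machine-id:/etc/machine-id:ro \
  lscr.io/linuxserver/foldingathome:latest

Mounting the file read-only keeps the container from ever touching the host's machine-id.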

Reason for change

Folding at Home client v8.x requires creating an account and associating your machine with the account in order to view folding progress. This is because the FaH team changed the web client control to be a public server that folders/contributors/machines report to, instead of each machine running its own web client control.

The machine ID is indirectly linked to each work unit in such a way that if the machine ID changes, it breaks all associations. This is a problem because each new docker container gets a randomly generated /etc/machine-id. Each time the container is recreated, the web client control would no longer see my machine (it would show "Disconnected"). See the following log entries:

E :Machine ID changed, generating new client ID
Generating RSA key.....
F@H ID = <new-client-id>
I3:WUXXXX:Loading work unit XXXX with ID <work-unit-id>
E :WU with client ID <old-client-id> does not belong client <new-client-id>

I traced the machine ID back to a line of code in the FaH client, which calls into the cbang/os/SystemInfo library to read the machine ID from /etc/machine-id on Linux systems.

By mapping the /etc/machine-id into the container, I am able to destroy and recreate the container as needed, and it continues to work with the web client control as expected.

Proposed code change

No response

tylerbrockett added the enhancement (New feature or request) label Aug 3, 2024

github-actions bot commented Aug 3, 2024

Thanks for opening your first issue here! Be sure to follow the relevant issue templates, or risk having this issue marked as invalid.

@kbernhagen

What you are calling “web client” is and always has been called “web control”. The name of the repository is a historical issue and is more of a tag soup. There was an experimental “web client” that ran in Chrome using Google Native Client code.

@kbernhagen

It is true that fah-client acts as both client and server. It is also true that the web app connects to a fah-client and a couple of servers and is therefore a client of sorts. However, the user guide and other docs call it web control, and I hope to reduce confusion among users by restricting use of “client” to mean fah-client.

@tylerbrockett (Author)

@kbernhagen understood, I am still fairly new to FaH, so I appreciate the clarification.

kbernhagen commented Aug 3, 2024

Note that units being dumped after a machine-id change is not a problem if one always sets Finish on work.

I realize there is no easy way to do this yet. (See: Better container support.)

Ironically, I believe the behavior was requested by someone using cloned containers. (See: Cloned machines have same F@H ID which leads to conflicts.)

Roxedus (Member) commented Aug 3, 2024

Mapping it wouldn't be the correct way to handle it. If a machine-id is required, we can generate one and store it in /config for persistence.
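
A rough sketch of what such an init step could do, purely as an illustration (the paths and the UUID-based generation are assumptions, not the actual implementation):

#!/bin/bash
# Generate a machine-id once and persist it under /config,
# then copy it into place on every container start.
if [ ! -f /config/machine-id ]; then
    tr -d '-' < /proc/sys/kernel/random/uuid > /config/machine-id
fi
cp /config/machine-id /etc/machine-id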

aptalca self-assigned this Aug 3, 2024
@tylerbrockett (Author)

Mapping it wouldn't be the correct way to handle it. If a machine-id is required, we can generate one and store it in /config for persistence.

That sounds much better, thanks!

aptalca (Member) commented Aug 3, 2024

Each time the container is recreated, the web client control would no longer see my machine (it would show "Disconnected").

I can't reproduce the issue as described.

When I recreate the container with the same env vars and config folder, it is detected as the same machine by the web app.

This is a problem because each new docker container gets a randomly generated /etc/machine-id.

My containers contain a blank /etc/machine-id, which never changes.

$ docker exec test cat /etc/machine-id
$
$ docker exec test ls -al /etc/machine-id
-rw-r--r-- 1 root root 0 Jun  5 02:05 /etc/machine-id

Is this issue only affecting resuming partial jobs?

aptalca (Member) commented Aug 3, 2024

I can't reproduce it with resuming either. I recreate the container and it picks up from where it left off.

Completed 4252 out of 500000 steps

I let it run for a bit, and after a recreation of the container it started from step 4252 as expected. Of course the percentage meter resets in the web app because the total number of steps is recalculated to show the remaining steps, but that's the same behavior as after doing a pause/resume in the web app.

tylerbrockett (Author) commented Aug 3, 2024

TL;DR: it's a "me issue"; the root cause is some form of missing dependency (on my side?) for Core 23 WUs. I apologize for the false alarm.

My containers contain a blank /etc/machine-id, which never changes.

My FaH containers have been generating new machine-ids, but your comment there helped me dig more into why, thank you @aptalca.

When I was running the old containers prior to the major v8 changes, I was getting stuck in a WU_STALLED loop, where it would download the WU but never actually start folding.

I remember doing a lot of digging at the time; tons of forum threads mentioned the issue, but no solutions worked for me. I somehow found that installing systemd via a custom-cont-init.d init script fixed a missing dependency or something. I wish I could remember the exact logic, but I was desperate to get this working, and it has been running fine since then. Anyway, systemd must be what generates a different machine-id each time I recreate the container.

Removing that init script today fixes the machine-id and "Disconnected" issues I was getting, but I was back to the WU_STALLED errors for Core 23 WUs (it did eventually start running successfully on a Core 22 WU, after 25 dumped Core 23 WUs in a row). Replacing the systemd install with ocl-icd-opencl-dev in the same kind of init script (sketch below) also "solves" the problem, but I have to imagine it's a missing dependency on my host machine, given that everyone else seems to run this image fine with Nvidia GPUs. I'll try to dig into the issue some more.
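
For reference, the workaround is just a script dropped into the container's custom init directory, roughly like this (the script name and host mount path are whatever you prefer, e.g. -v /path/on/host:/custom-cont-init.d:ro; this is a stopgap based on my testing, not a proposed fix):

#!/bin/bash
# custom-cont-init.d script: install the OpenCL ICD loader dev package at container start
apt-get update
apt-get install -y --no-install-recommends ocl-icd-opencl-dev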

aptalca (Member) commented Aug 3, 2024

Nvidia just needs the container toolkit and the Nvidia drivers on the host. All the OpenCL bits should be injected via the Nvidia Docker runtime.
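
Roughly, the run flags look like this (just a sketch; the container name and tag are placeholders, and --gpus all can be used instead of the runtime flag on newer setups):

docker run -d \
  --name=foldingathome \
  --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=all \
  lscr.io/linuxserver/foldingathome:latest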

tylerbrockett (Author) commented Aug 3, 2024

I have the nvidia-driver-XXX, nvidia-container-toolkit (replaces nvidia-container-runtime), and nvidia-cuda-toolkit installed on the host machine.

Inside the containers I can get the following output:

# nvidia-smi
Sat Aug  3 16:14:45 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.28.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060 Ti     Off |   00000000:01:00.0 Off |                  N/A |
| 74%   78C    P0            199W /  200W |     992MiB /   8192MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Is it possible the x64 Dockerfile in this repo should be installing ocl-icd-opencl-dev instead of intel-opencl-icd?

I notice this was changed from ocl-icd-libopencl1 to intel-opencl-icd as part of the PR to support FaH v8; the ARM64 version still lists ocl-icd-libopencl1. Neither of those packages on its own works on my system for GPU folding (tested by modifying the Dockerfile and building locally). However, ocl-icd-opencl-dev works for both CPU and GPU folding for me.

I am not familiar with OpenCL/GPU libs, but it seems like ocl-icd-opencl-dev is geared towards multiple device types (CPUs, GPUs, etc.), whereas the Intel package is geared just towards Intel GPUs, and it might be that ocl-icd-libopencl1 isn't sufficient for Nvidia GPUs for whatever reason.

I have seen ocl-icd-opencl-dev mentioned in several places now in discussions of OpenCL runtimes, libOpenCL.so, and FaH specifically. It also provides the libOpenCL.so file exactly where it's needed, so the ln -s libOpenCL.so.1 /usr/lib/x86_64-linux-gnu/libOpenCL.so step would no longer be needed either.

aptalca (Member) commented Aug 4, 2024

The dev package has the headers needed to build other packages against it; it shouldn't be needed at runtime.

When I tested Nvidia with this image a while back, all that was needed inside the image was this pointer file:
https://github.com/linuxserver/docker-foldingathome/blob/master/root/etc/OpenCL/vendors/nvidia.icd
The file listed in there should be injected by the Nvidia runtime.
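
A quick way to sanity-check that inside a running container (container name assumed):

$ docker exec foldingathome cat /etc/OpenCL/vendors/nvidia.icd
$ docker exec foldingathome sh -c 'ldconfig -p | grep -i opencl'

The first command shows which library the ICD points at; the second shows whether the OpenCL libraries are actually visible to the linker inside the container.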

@tylerbrockett (Author)

I thought the same, that -dev packages wouldn't be needed at runtime, but even the official FaH container Dockerfiles (I didn't know these existed) install that package. They use a different base image and look outdated, but it still seemed noteworthy.

https://github.com/FoldingAtHome/containers/blob/80d15f5870df9e45f62d5dfe3ac4d9ed82544992/fah-gpu/Dockerfile#L9

Could Core 23 be doing some build/compilation step that Core 22 wasn't? I tried running the debug FAHClient, but it didn't provide anything useful.

Everything I see points me at that package for some reason, though. It still doesn't explain why installing systemd would have also solved my problem; maybe some kind of shared dependency?

Any idea what Nvidia driver / CUDA version combo you were using in your testing? I can try to give those a shot.

tylerbrockett (Author) commented Aug 6, 2024

I think I've narrowed it down further to libexpat1, the XML parsing library for C. It was being installed as part of both the systemd install and the ocl-icd-opencl-dev install, which explains the correlation. It's also something I remember coming up while researching, though I don't recall the context.

When I install just the libexpat1 package in the Dockerfile, everything works as expected. I'm still trying to figure out where/how/why it's being used.
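
A quick way to check whether the library is present in a given container (container name assumed):

$ docker exec foldingathome sh -c 'ldconfig -p | grep libexpat'

If that comes back empty, the library isn't there; after adding libexpat1 it should list libexpat.so.1.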

aptalca (Member) commented Aug 10, 2024

@tylerbrockett
Can you test this image and confirm it works?
lspipepr/foldingathome:amd64-8.3.18-pkg-0a5de763-dev-6303e76f273ada8ef6a0d2c335daea1d9b6d29e5-pr-24

built from PR #24

tylerbrockett (Author) commented Aug 10, 2024

@aptalca - I tried on that branch as well as the specific image version mentioned, and both are working as expected now for Core 23 WUs. Thank you so much!

(PPD is gradually increasing and should stabilize around 4.3m or so)


LinuxServer-CI moved this from Issues to Done in Issue & PR Tracker Aug 11, 2024