
Patch cuda dev tools (#736) #743

Merged · 11 commits · Jan 23, 2024
Changes from 7 commits
5 changes: 4 additions & 1 deletion scripts/install_R_source.sh
@@ -11,6 +11,7 @@
set -e

R_VERSION=${1:-${R_VERSION:-"latest"}}
+PURGE_BUILDDEPS=${PURGE_BUILDDEPS:-"TRUE"}
Member:
I think bools in env vars should be "true" or "false", not "TRUE" or "FALSE".

Member:
I think env vars are untyped, so 1 and "1" and "TRUE" and "True" and "true" are all the same. Unless, of course, the receiving end treats them differently when it parses them. Otherwise the check is commonly 'set or not set', as sketched below.
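
For illustration, the 'set or not set' convention usually looks like this in POSIX shell (a sketch, not code from this PR):

# Enable the behavior if the variable is set to any non-empty value,
# regardless of spelling ("1", "TRUE", "true", ...).
if [ -n "${PURGE_BUILDDEPS:-}" ]; then
    echo "PURGE_BUILDDEPS is set to '${PURGE_BUILDDEPS}'"
else
    echo "PURGE_BUILDDEPS is unset or empty"
fi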

Member:
Is that a common practice in shell scripts?
Here is what I saw in bash:

$ if "true"; then echo "true"; fi
true

$ if "false"; then echo "true"; fi

$ if "1"; then echo "true"; fi
bash: 1: command not found

$ if "TRUE"; then echo "true"; fi
bash: TRUE: command not found

$ if "True"; then echo "true"; fi
bash: True: command not found
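
(For context on the session above: `if CMD` runs CMD and branches on its exit status; `true` and `false` are real shell builtins, while `1`, `TRUE`, and `True` are not commands, hence the "command not found" errors. A quick way to verify:)

type true false   # both report "... is a shell builtin" in bash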

Member (Author):
Right, but as you saw, I did this as a string comparison, since booleans in bash always confuse me. Anyway, I've changed these to lower-case now. Thanks for catching this! Let me know if you'd recommend further changes here.
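
(The lower-cased form presumably looks something like this — a sketch of the described change, not the exact follow-up diff:)

PURGE_BUILDDEPS=${PURGE_BUILDDEPS:-"true"}
# ... later, during cleanup ...
if [ "${PURGE_BUILDDEPS}" = "true" ]; then
    apt-get remove --purge -y ${BUILDDEPS}
fi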


# shellcheck source=/dev/null
source /etc/os-release
@@ -157,7 +158,9 @@ rm -rf "R.tar.gz"
cp /usr/bin/checkbashisms /usr/local/bin/checkbashisms

# shellcheck disable=SC2086
-apt-get remove --purge -y ${BUILDDEPS}
+if [ "${PURGE_BUILDDEPS}" == "TRUE" ]; then
Member (Author):
@eitsupi you said:

> it would be better to use the same script with the flag added to install_R_source.sh and skipped the "build R" steps so that we don't have to have almost identical copies in two files.

I am adding a flag to install_R_source.sh right here, though. What two files are you referring to as almost identical copies?

@eitsupi (Member), Jan 17, 2024:
Perhaps our assumptions didn't line up. I had imagined something like the following:

FROM rocker/r-ver:4.3.2 as builder

FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
COPY --from=builder /foo/bar /foo/bar
COPY scripts/install_R_source.sh /rocker_scripts/install_R_source.sh

# should install the necessary packages, but should skip building R
ENV SKIP_BUILD_R="true"
RUN /rocker_scripts/install_R_source.sh

Member (Author):
Thanks, I see what you mean. That looks rather more complicated, though:

  • It requires a multi-stage build, and I'm not sure how easy that is to add to our existing build stack.
  • Copying /foo/bar isn't so clean, because our standard build installs R_HOME into /usr/local, with binaries in /usr/local/bin, etc., rather than freestanding. Yes, I proposed a way around this in my previous proposal, but it is a significant change that could break some user expectations.
  • SKIP_BUILD_R would have to wrap around the install-time deps, the building of R, and the cleanup and purge of dependencies. That's fine, but it feels more natural to me to have an option that just avoids the apt remove step. After all, given that apt remove has unexpected consequences, it seems reasonable to be able to opt out of that without opting out of the rest. And intuitively, I think it's easier to guess how install_R_source.sh behaves in response to an option of PURGE_BUILDDEPS=FALSE than to guess what it does in response to an option of SKIP_BUILD_R.

What's the downside of the approach here? I still don't see what's been duplicated.

+    apt-get remove --purge -y ${BUILDDEPS}
+fi
apt-get autoremove -y
apt-get autoclean -y
rm -rf /var/lib/apt/lists/*
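
(For context, the opt-out discussed in the comment above could be used from a downstream Dockerfile roughly like this — a hypothetical sketch, using the commit-7 spelling "FALSE":)

FROM ubuntu:22.04
COPY scripts/install_R_source.sh /rocker_scripts/install_R_source.sh
# Keep the R build dependencies installed instead of purging them
ENV PURGE_BUILDDEPS="FALSE"
RUN /rocker_scripts/install_R_source.sh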
2 changes: 1 addition & 1 deletion scripts/install_python.sh
@@ -41,7 +41,7 @@ python3 -m pip --no-cache-dir install --upgrade \
# Make the venv owned by the staff group, so users can install packages
# without having to be root
chown -R root:staff "${VIRTUAL_ENV}"
-chmod g+ws "${VIRTUAL_ENV}"
+chmod -R g+ws "${VIRTUAL_ENV}"

install2.r --error --skipmissing --skipinstalled -n "$NCPUS" reticulate

5 changes: 4 additions & 1 deletion scripts/setup_R.sh
@@ -10,6 +10,7 @@
set -e

CRAN=${1:-${CRAN:-"https://cran.r-project.org"}}
+PURGE_BUILDDEPS=${PURGE_BUILDDEPS:-"TRUE"}

ARCH=$(uname -m)

@@ -68,7 +69,9 @@ if [ ! -x "$(command -v r)" ]; then

# Clean up
# shellcheck disable=SC2086
-apt-get remove --purge -y ${BUILDDEPS}
+if [ "${PURGE_BUILDDEPS}" == "TRUE" ]; then
+    apt-get remove --purge -y ${BUILDDEPS}
+fi
apt-get autoremove -y
apt-get autoclean -y
fi
4 changes: 3 additions & 1 deletion stacks/4.3.2.json
@@ -257,7 +257,9 @@
"NVBLAS_CONFIG_FILE": "/etc/nvblas.conf",
"PYTHON_CONFIGURE_OPTS": "--enable-shared",
"RETICULATE_AUTOCONFIGURE": "0",
"PATH": "${PATH}:${CUDA_HOME}/bin"
"PURGE_BUILDDEPS": "FALSE",
"VIRTUAL_ENV": "/opt/venv",
Member:
Hah!

Member (Author):
Indeed! It will be nice to have /opt/venv work on these images.

"PATH": "${PATH}:${VIRTUAL_ENV}/bin:${CUDA_HOME}/bin"
},
"COPY_a_script": "scripts/install_R_source.sh /rocker_scripts/install_R_source.sh",
"RUN_a_script": "/rocker_scripts/install_R_source.sh",
6 changes: 4 additions & 2 deletions stacks/devel.json
@@ -148,15 +148,17 @@
"org.opencontainers.image.title": "rocker/cuda",
"org.opencontainers.image.description": "NVIDIA CUDA libraries added to Rocker image."
},
"FROM": "nvidia/cuda:11.8.0-cudnn8-devel-ubuntu24.04",
"FROM": "nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04",
Member:
If I remember correctly, this value is updated by the script, so don't update it here manually.

Member:
I wondered about this when the script added it. Are we sure (a) the production cycle for Ubuntu 24.04 has started and (b) that NVIDIA already publishes to it? I was under the impression NVIDIA was particularly 'laggy', i.e. the gcc version implied by nvcc was always ages behind.

Contributor:
There is no nvidia/cuda:11.8.0-cudnn8-devel-ubuntu24.04. There are

  • nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
  • nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
  • nvidia/cuda:11.8.0-cudnn8-devel-ubuntu18.04 (EOL)

See https://gitlab.com/nvidia/container-images/cuda/-/tree/master/dist/11.8.0

Member (Author):
Thanks all. @eitsupi, can you clarify? If this field is not read because it is provided by the script, then maybe we can drop it or stick in a dummy value? It seems needlessly confusing to reference a value that isn't going to be used and isn't actually a valid image name, but just happens to look almost like one, right?

@eitsupi (Member), Jan 19, 2024:
What I meant is not "not read", but "don't set it manually, because if the script is executed by CI, it will be updated".

The priority here is to use the same Ubuntu version as rocker/r-ver. I haven't found an easy way to check whether that image actually exists (this is definitely possible using buildx, but I couldn't figure out how to check without logging into the container registry), so no such check is implemented.
If you try to build it, it will simply fail.

Currently pointing to Ubuntu 24.04 is not as intended; I really wanted it to be 22.04, the same as the current ubuntu:latest. In other words, this is a bug in the script. (The 24.04 release date is later than today, so it should have been filtered out.)

"FROM": "ubuntu:latest",

# Update the cuda base image
template$stack[[9]]$FROM <- .cuda_baseimage_tag(dplyr::last(df_ubuntu_lts$ubuntu_series))
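
(On checking whether a tag exists without logging in: for public Docker Hub images, `docker buildx imagetools inspect` should work anonymously — a possible approach, not something this PR implements:)

# Exits non-zero if the tag cannot be resolved in the registry
docker buildx imagetools inspect \
    nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 > /dev/null 2>&1 \
    && echo "tag exists" || echo "tag missing"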

Member (Author):
Right, so it now reads 22.04, and I think it should work as intended. Let me know if anything needs to be changed?

Member:
> Right, so it now reads 22.04, and I think it should work as intended. Let me know if anything needs to be changed?

Run the scripts to update the files; the value should come back.

## Automatic update of container definition files
The Dockerfiles for pre-build images will be generated by executing the scripts in the following order. It will also generate the `docker-bake.json` files that can be used to build the container images. `make-matrix.R` will generate files containing the matrix used in GitHub Actions.
1. `./build/make-stacks.R`
2. `./build/make-dockerfiles.R`
3. `./build/make-bakejson.R`
4. `./build/make-matrix.R`
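
(Concretely, regenerating the files means running those four scripts in order from the repository root — a sketch, assuming they run under Rscript:)

# Rerunning these overwrites manual edits to stacks/devel.json,
# including the cuda FROM tag discussed above.
for f in make-stacks.R make-dockerfiles.R make-bakejson.R make-matrix.R; do
    Rscript "./build/$f"
done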

Member (Author):
@eitsupi ok, I see what you mean: make-stacks.R tries to figure this out automatically, and it's coming up with the nonsense answer of 24.04 on the devel image. I still think that's confusing (doesn't that mean we're just failing to build a devel flavor for the whole cuda stack?), but I agree it's a separate issue and not one that needs to be solved in this PR.

I've just pushed an edit to roll that back.

I thought I fixed the whitespace but am still struggling with the linter :-(

"ENV": {
"R_VERSION": "devel",
"R_HOME": "/usr/local/lib/R",
"TZ": "Etc/UTC",
"NVBLAS_CONFIG_FILE": "/etc/nvblas.conf",
"PYTHON_CONFIGURE_OPTS": "--enable-shared",
"RETICULATE_AUTOCONFIGURE": "0",
"PATH": "${PATH}:${CUDA_HOME}/bin"
"PURGE_BUILDDEPS": "FALSE",
"VIRTUAL_ENV": "/opt/venv",
"PATH": "${PATH}:${VIRTUAL_ENV}/bin:${CUDA_HOME}/bin"
},
"COPY_a_script": "scripts/install_R_source.sh /rocker_scripts/install_R_source.sh",
"RUN_a_script": "/rocker_scripts/install_R_source.sh",