-
Notifications
You must be signed in to change notification settings - Fork 512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[rapids] removed spark tests, updated to a more recent rapids release #1219
Conversation
I prefer this to #1218 |
/gcbrun |
/gcbrun |
2 similar comments
/gcbrun |
/gcbrun |
Should we increase the machine type from n1-standard-4 to n1-standard-16 |
cuda11 has been manually tested with all versions. |
/gcbrun |
1 similar comment
/gcbrun |
tests are failing for
|
/gcbrun |
1 similar comment
/gcbrun |
[edit: this was a misconfiguration in the systemd unit] It looks like the dask infrastructure is out of date and I'll have to target 2023.12 instead.
|
I also need to reduce the python abi to 3.10 |
/gcbrun |
6 similar comments
/gcbrun |
/gcbrun |
/gcbrun |
/gcbrun |
/gcbrun |
/gcbrun |
All versions aside from 2.2-debian12 succeeded. Build ID: 36ac2daa-1c2d-483c-8f0b-b3359fee06b9 Commit: e8a44fe 2.2-debian12 failed because of a conda mirror failure. I've opened conda/infrastructure#1051 to track. It was running a dask install, which it shouldn't have been doing. So I've removed the delta from master of the dask/ directory |
/gcbrun |
/gcbrun |
/gcbrun |
/gcbrun |
Ach! Finally! |
Hooray! 🥳 Thank you for your persistence here CJ! 🙏 👏 |
Thanks for noticing! This reminds me of a tool song. I feel like an excess of optimism is the only thing that keeps me going sometimes ;-) |
okay, result of trying dask>=2024.8 instead of dask>=2024.5:
|
I'll use "rapids>=${RAPIDS_VERSION}" instead of "rapids=${RAPIDS_VERSION}" as the rapids_spec ; that way we'll pick up a newer version if it can be solved for. Nothing newer than 24.08 for rapids, but dask picked up 2024.7 rather than 2024.5, so I'll encode that into the current state of the patch. |
… dask>=2024.7 ; correctly capturing retval of installer program
/gcbrun |
Two in a row! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not able to fully understand the PR tbh, there are too many changes clubbed together without reasoning why we changed the behavior, PR description is not helping either. We should have split this up.
@@ -70,6 +70,7 @@ determine_tests_to_run() { | |||
changed_dir="${changed_dir%%/*}/" | |||
# Run all tests if common directories modified | |||
if [[ ${changed_dir} =~ ^(integration_tests|util|cloudbuild)/$ ]]; then | |||
continue # remove this before squash/merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL
boot_disk_size="200GB", | ||
timeout_in_minutes=30) | ||
boot_disk_size="50GB", | ||
timeout_in_minutes=60) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is too long timeout, is 30 mins not enough for a cluster to be up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it takes much time as rapids requires gpu driver as well. So, in order to get rapids up and running we first need to install gpu drivers and that takes up quite some time.
("STANDARD", ["m"], GPU_T4, "standalone")) | ||
|
||
@parameterized.parameters( | ||
# If a new version of dask-yarn is released, add this test back in. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these not supported anymore? Also, let's update the init actions README as well.
if [[ -z "${IMAGE_VERSION}" ]] ; then | ||
IMAGE_VERSION="$(jq -r .IMAGE_VERSION env.json)" ; fi ; export IMAGE_VERSION | ||
|
||
#declare -a TESTS_TO_RUN=('dask:test_dask' 'rapids:test_rapids') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dead code, let's remove this
if [[ "${ROLE}" == "Master" ]]; then | ||
systemctl restart hadoop-yarn-resourcemanager.service | ||
# Restart NodeManager on Master as well if this is a single-node-cluster. | ||
if systemctl status hadoop-yarn-nodemanager; then | ||
if systemctl list-units | grep hadoop-yarn-nodemanager; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious: Why was this change needed?
@@ -0,0 +1,17 @@ | |||
# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these files committed by mistake?
local python_ver="3.10" | ||
else | ||
local python_ver="3.9" | ||
function configure_knox_for_dask() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to configure knox now, earlier it was not needed?
Tested with CUDA=11 and CUDA=12