[gcp-dataproc] Dataproc open source component integration tests flakiness - regional mirroring? #1051

Open
2 tasks done
cjac opened this issue Oct 25, 2024 · 19 comments
Labels
type::feature request for a new feature or capability

Comments

@cjac

cjac commented Oct 25, 2024

Checklist

  • I added a descriptive title
  • I searched open requests and couldn't find a duplicate

What is the idea?

Hello folks,

I've been maintaining the github.com/GoogleCloudDataproc/initialization-actions repository for a while now, and I'm seeing some flaky tests. The tests install dask from conda.anaconda.org. Could we avoid this by using a regional GCP mirror of the conda packages? How complex is it to maintain a mirror with CVE updates?

+ /opt/conda/default/bin/mamba create -m -n dask -y --no-channel-priority -c conda-forge -c nvidia 'cuda-version>=12,<=12.5' 'dask>=2024.5' dask-bigquery dask-ml dask-sql python=3.10
Download error (28) Timeout was reached [https://conda.anaconda.org/conda-forge/noarch/repodata.json.zst]
Failed to connect to conda.anaconda.org port 443 after 262119 ms: Couldn't connect to server

Why is this needed?

To reduce load on the global mirrors and keep the installer's resources local to GCP.

What should happen?

A mirror with CVE updates would be created for each GCP region.

Additional Context

Tests were run during work on this pull request.

GoogleCloudDataproc/initialization-actions#1219

@jakirkham
Member

Both conda-forge and nvidia channels should be available by CDN via Cloudflare. Am curious why in this case it appears to be going to Anaconda.org directly?

@cjac
Author

cjac commented Oct 26, 2024 via email

@jakirkham
Member

What I mean is this should already be happening by default. For example, note the last line in the output below:

$ curl -I https://conda.anaconda.org/conda-forge 
HTTP/2 302 
date: Sat, 26 Oct 2024 01:49:17 GMT
content-type: text/html; charset=utf-8
location: https://anaconda.org/conda-forge/repo?type=conda&label=main
…
server: cloudflare

The fact that the query above is not getting through suggests there is some other kind of network issue. Not sure if that is somewhere within CI or some other infrastructure between that build and the CDN (like some security protocol?)

It might be worth trying some simple network diagnostics at this point outside of Conda to isolate issues like this
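
One way to run such a check outside of conda is to time a plain HTTPS request against the URL that failed. The sketch below is only an illustration: `probe_endpoint` is a hypothetical helper name, and it assumes `curl` is installed on the node being tested. It prints the HTTP status plus the connect and total times, which helps separate a slow connection from a refused one.

```shell
# Hypothetical helper: time a single HEAD request with curl, outside of conda.
# Usage: probe_endpoint URL [MAX_SECONDS]
probe_endpoint() (
    url=$1
    max_time=${2:-30}
    # --write-out prints selected transfer variables once the request finishes.
    curl --silent --show-error --head --output /dev/null \
         --max-time "$max_time" \
         --write-out 'status=%{http_code} connect=%{time_connect}s total=%{time_total}s\n' \
         "$url"
)

# Example (requires network access):
# probe_endpoint https://conda.anaconda.org/conda-forge/noarch/repodata.json.zst
```

Running it a few times from the affected node and from a machine outside the CI environment would show whether the timeouts are specific to that network path.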

@cjac
Author

cjac commented Oct 31, 2024

Looking for: ["cuda-version[version='>=12,<13']", "rapids[version='>=24.08']", "dask[version='>=2024.7']", 'dask-bigquery', 'dask-ml', 'dask-sql', 'cudf', 'numba', "python[version='>=3.11']"]

conda-forge/linux-64      
+ sync
+ [[ 1 == \0 ]]
+ test -d /opt/conda/miniconda3/envs/dask-rapids
+ /opt/conda/miniconda3/bin/conda config --set channel_priority flexible
+ for installer in "${mamba}" "${conda}"
+ /opt/conda/miniconda3/bin/conda create -m -n dask-rapids -y --no-channel-priority -c conda-forge -c nvidia -c rapidsai 'cuda-version>=12,<13' 'rapids>=24.08' 'dask>=2024.7' dask-bigquery dask-ml dask-sql cudf numba 'python>=3.11'

real    1m19.604s
user    0m0.326s
sys     0m0.048s
+ retval=1
+ cat /mnt/shm/install.log
Collecting package metadata (current_repodata.json): ...working... failed

CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/linux-64/current_repodata.json>
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
'https://conda.anaconda.org/conda-forge/linux-64'
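
The log above suggests that a simple retry often gets past intermittent HTTP errors. A minimal sketch of that idea follows, assuming a POSIX shell; `retry_with_backoff` is a hypothetical helper, not something conda or the initialization-actions repository provides, and the attempt count and delays are arbitrary.

```shell
# Hypothetical helper: run a command up to MAX_ATTEMPTS times, doubling the
# sleep between attempts.
# Usage: retry_with_backoff MAX_ATTEMPTS FIRST_DELAY_SECONDS CMD [ARGS...]
retry_with_backoff() (
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        # Stop as soon as the command succeeds.
        "$@" && exit 0
        status=$?
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts (last exit status $status)" >&2
            exit "$status"
        fi
        echo "attempt $attempt failed (exit status $status); retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))
    done
)
```

For example, prefixing the `conda create` command above with `retry_with_backoff 5 10` would try it up to five times, waiting 10, 20, 40, and 80 seconds between attempts. A retry only masks transient failures; it would not help if the endpoint is unreachable for the whole window.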

@cjac
Author

cjac commented Oct 31, 2024

Hello folks, it looks like this is becoming a problem. I'm sorry for swamping your service. Let's get a regional conda mirror set up as part of the product I'm producing. Can you please direct me to the best instructions on mirroring the full conda archive? I will work on bringing up a load balancer to direct the traffic to our local mirror and take that load off of your infrastructure.

@jakirkham
Member

Were you able to run the command suggested above ( #1051 (comment) )?

It would be good to know if Cloudflare (the CDN provider used for conda-forge) is actually used in your case or not

@cjac
Author

cjac commented Oct 31, 2024

oops! Sorry, I think I missed that.

@cjac
Author

cjac commented Oct 31, 2024

curl -I https://conda.anaconda.org/conda-forge

Oh, sorry! I didn't know you were asking me to run that command from the context of one of the cluster nodes being installed to. Here is that output now.

cjac@cluster-1718310842-m:~$ curl -I https://conda.anaconda.org/conda-forge 
HTTP/2 302 
date: Thu, 31 Oct 2024 23:09:00 GMT
content-type: text/html; charset=utf-8
location: https://anaconda.org/conda-forge/repo?type=conda&label=main
cf-ray: 8db74f6c1bcb3101-LAX
cf-cache-status: DYNAMIC
strict-transport-security: max-age=15552000
content-security-policy: frame-ancestors 'self';
referrer-policy: no-referrer
x-content-type-options: nosniff
x-download-options: noopen
set-cookie: __cf_bm=.Is3CsF554BOaHnScWmISSkVQpl6Bnrsas5J5UFGXA0-1730416140-1.0.1.1-frOx3IudLF.K9RCGwdQgrurX.DlFsI1LpQNoPNEVzapNXoP9UU6rFC_QbyLo8sSWoJo_WsjrXuKfy9c8eZNFr2JQAS9.bH7bdHdxG0ZAoGw; path=/; expires=Thu, 31-Oct-24 23:39:00 GMT; domain=.anaconda.org; HttpOnly; Secure; SameSite=None
server: cloudflare
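
The headers above can also be checked mechanically, which is handy when the same test has to run on many nodes. The following is a sketch only: `cdn_summary` is a hypothetical helper that reads response headers on standard input, looks at the `server` header, and takes the suffix of `cf-ray` as the Cloudflare data center code.

```shell
# Hypothetical helper: read HTTP response headers on stdin and report whether
# they look like a Cloudflare response, plus the data center code from cf-ray.
cdn_summary() {
    awk '
        { sub(/\r$/, "") }   # header lines from curl -I end with a carriage return
        tolower($1) == "server:" { server = tolower($2) }
        tolower($1) == "cf-ray:" { n = split($2, part, "-"); pop = part[n] }
        END {
            if (server == "cloudflare")
                printf "cloudflare edge=%s\n", (pop == "" ? "unknown" : pop)
            else
                printf "not-cloudflare server=%s\n", (server == "" ? "unknown" : server)
        }
    '
}

# Example (requires network access):
# curl -sI https://conda.anaconda.org/conda-forge | cdn_summary
```

For the response shown above, this would report `cloudflare edge=LAX`.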

@cjac
Author

cjac commented Oct 31, 2024

Do these channels make a difference? Are those mirrored as well?

-c conda-forge -c nvidia -c rapidsai

@cjac
Author

cjac commented Oct 31, 2024

This looks like it might be what I need:

https://pypi.org/project/conda-mirror/

@jakirkham
Member

Sorry for being unclear. Thanks for the info! 🙏

Ok so you are able to reach the CDN through curl. Would think conda should as well. IOW it doesn't look like a networking issue

Both conda-forge and nvidia are on the CDN

Currently rapidsai is not, but we plan to fix that: #1055

Let's see if someone can help before going down the mirroring route

@jezdez could you please help us look into this?

@cjac
Author

cjac commented Nov 1, 2024

okay. I started the mirroring route because it might be faster to have a local copy. Let me compare and let you know whether it's too much effort to maintain a mirror for use with my reproduction environment.

I've got a couple of files in my example. sync-mirror.sh is run on an instance created using create-conda-mirror.sh.

Please pardon the mess. I re-used some code I was using for a different purpose. The docs that I read about mirrors suggested that attaching GPUs to the mirror host might help accelerate things, too, so I used the latest rapids image and attached 4x T4s.

@cjac
Author

cjac commented Nov 1, 2024

wow. It looks like I got cut off.

Image

root@dpgce-conda-mirror-us-west4:~# links https://conda.anaconda.org/defaults/linux-64/repodata.json

+ /opt/conda/miniconda3/bin/conda-mirror -v --upstream-channel=conda-forge --upstream-channel=rapidsai --upstream-channel=nvidia --upstream-channel=defaults --platform=linux-64 --temp-directory=/mnt/shm --target-directory=/var/www/html --num-threads=7
Log level set to WARNING
Traceback (most recent call last):
  File "/opt/conda/miniconda3/lib/python3.11/site-packages/conda_mirror/conda_mirror.py", line 635, in get_repodata
    resp.raise_for_status()
  File "/opt/conda/miniconda3/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://conda.anaconda.org/defaults/linux-64/repodata.json

@cjac
Author

cjac commented Nov 1, 2024

It looks like I was attempting to mirror portions of the repo that I don't need and won't help our cache.

The current implementation looks promising. The first one resulted in a mirror of about 120 GB. I think it may have been the nvidia channel alone: I attempted to pass multiple instances of the --upstream-channel argument, and conda-mirror used only the last one.

After learning from this mistake, I have bifurcated the previous, simple, and incorrect single conda-mirror call into concurrent conda-mirror calls in their own screen tabs. Since this is a long-running process, it's probably best not to have it fail when a terminal is detached. And once all of the tabs have completed, the screen session will terminate and return control to the sync-mirror.sh shell process.

I am about 20 minutes into this latest run. It picked up in the mirroring where it had left off despite the deletion of the previous VM that had been running it. I increased the memory and CPU count so that it can accommodate three concurrent conda-mirror processes. Here's a snapshot of disk usage.

root@dpgce-conda-mirror-us-west4:~# df -h /var/www/html
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb         15T  130G   15T   1% /var/www/html
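
The one-process-per-channel approach described above can be sketched roughly as follows. This is not the sync-mirror.sh script itself: it uses plain background jobs and `wait` instead of screen tabs, and `MIRROR_CMD`, `TARGET_ROOT`, and `TEMP_DIR` are illustrative variable names rather than conda-mirror settings. The flags are the ones that appear in the command earlier in this thread. Giving each channel its own target and temp directory is an assumption made here to keep concurrent runs from writing over each other.

```shell
# Sketch: one conda-mirror process per channel, run concurrently.
# MIRROR_CMD, TARGET_ROOT and TEMP_DIR are illustrative names (assumptions).
MIRROR_CMD=${MIRROR_CMD:-conda-mirror}
TARGET_ROOT=${TARGET_ROOT:-/var/www/html}
TEMP_DIR=${TEMP_DIR:-/mnt/shm}

# Usage: mirror_channels PLATFORM CHANNEL [CHANNEL...]
mirror_channels() (
    platform=$1
    shift
    pids=
    for channel in "$@"; do
        mkdir -p "$TARGET_ROOT/$channel" "$TEMP_DIR/$channel"
        # Exactly one --upstream-channel per process, started in the background.
        "$MIRROR_CMD" \
            --upstream-channel="$channel" \
            --platform="$platform" \
            --temp-directory="$TEMP_DIR/$channel" \
            --target-directory="$TARGET_ROOT/$channel" &
        pids="$pids $!"
    done
    # Wait for every process and report failure if any of them failed.
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=1
    done
    exit "$failed"
)

# Example: mirror_channels linux-64 conda-forge rapidsai nvidia
```

Unlike a screen session, background jobs started this way do not survive a dropped terminal on their own, so a long run would still need `nohup`, `screen`, or a similar wrapper around the whole call.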

@cjac
Author

cjac commented Nov 1, 2024

This question moved to a different forum

@jezdez
Member

jezdez commented Nov 1, 2024

wow. It looks like I got cut off.

Image

root@dpgce-conda-mirror-us-west4:~# links https://conda.anaconda.org/defaults/linux-64/repodata.json

+ /opt/conda/miniconda3/bin/conda-mirror -v --upstream-channel=conda-forge --upstream-channel=rapidsai --upstream-channel=nvidia --upstream-channel=defaults --platform=linux-64 --temp-directory=/mnt/shm --target-directory=/var/www/html --num-threads=7
Log level set to WARNING
Traceback (most recent call last):
  File "/opt/conda/miniconda3/lib/python3.11/site-packages/conda_mirror/conda_mirror.py", line 635, in get_repodata
    resp.raise_for_status()
  File "/opt/conda/miniconda3/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://conda.anaconda.org/defaults/linux-64/repodata.json

https://conda.anaconda.org/main/linux-64/repodata.json is the correct repodata URL for Anaconda Distribution
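
A 404 like the one quoted above can be caught before a long mirroring run starts by checking each channel's repodata URL first. The sketch below assumes `curl` is available and that the URL pattern is the one seen in this thread; `repodata_url` and `check_channel` are hypothetical helper names.

```shell
# Hypothetical helpers: build the repodata URL for a channel and platform,
# then confirm it exists before committing to a long mirroring run.
# Usage: repodata_url CHANNEL PLATFORM
repodata_url() {
    printf 'https://conda.anaconda.org/%s/%s/repodata.json\n' "$1" "$2"
}

# Usage: check_channel CHANNEL PLATFORM
check_channel() {
    # --fail makes curl exit non-zero on HTTP errors such as a 404;
    # --location follows any redirect to the CDN.
    curl --silent --fail --location --head --output /dev/null \
        "$(repodata_url "$1" "$2")"
}

# Example (requires network access):
# check_channel main linux-64 || echo "main/linux-64 has no repodata.json" >&2
```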

@jezdez
Member

jezdez commented Nov 1, 2024

@cjac I'm not aware of any throttling from GCP. The original issue seems to have been a transient connection error, is this really still happening from GCP? The channels are hosted on Cloudflare CDN.

For the other questions, if this relates to commercial support for GCP related services, this isn't the right repo to raise an issue, please reach out through your Anaconda support channels instead.

@cjac
Author

cjac commented Nov 1, 2024

I have not tried to reproduce the issue yet. I'm going to finish building a mirror and use a locally mounted filesystem with the packages on it to provide the conda-forge, rapidsai and nvidia channels.

Once the mirror is up, probably by Monday, I will try the build of the rapids image again, this time using file:///var/www/html/«channel» instead of https://conda.anaconda.org/«channel»

I can then share the example instruction on how to build and utilize a conda mirror, and close this issue.
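
As a rough illustration of the switch from https://conda.anaconda.org/«channel» to file:///var/www/html/«channel», the channel arguments can be generated from the mirror root. `local_channel_args` is a hypothetical helper, and the layout (one directory per channel under the mirror root) is an assumption based on the paths mentioned above.

```shell
# Hypothetical helper: print "-c" and "file://ROOT/CHANNEL" arguments,
# one per line, for each channel.
# Usage: local_channel_args MIRROR_ROOT CHANNEL [CHANNEL...]
local_channel_args() (
    root=$1
    shift
    for channel in "$@"; do
        printf '%s\n' -c "file://$root/$channel"
    done
)

# Example (the unquoted substitution is split into separate arguments):
# conda create -n dask-rapids -y --override-channels \
#     $(local_channel_args /var/www/html conda-forge nvidia rapidsai) \
#     'rapids>=24.08' 'dask>=2024.7'
```

Passing `--override-channels` keeps conda from falling back to any channels configured in .condarc, so a missing package shows up as a solver error rather than a silent download from the network.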

@cjac
Author

cjac commented Nov 12, 2024

The mirror has been built, but it seems conda does an extra write of ~15G to the temp directory, much of which could be skipped when the source is on a file:// path.

In any case, the code which I used to build the anaconda mirror can be found here:

https://github.com/cjac/dataproc-repro/blob/conda-mirror-20241031/lib/mirror/sync-conda.pl

On a 96 core machine, I believe that it could mirror the channels we use in about 8 hours.

Projects
Status: 🆕 New

3 participants