Skip to content

FIX Propagate random state to numpy rng in make_classification #6518

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

betatim
Copy link
Member

@betatim betatim commented Apr 8, 2025

When generating a classification problem a mix of cupy and numpy random generators is used. The random state needs to be propagated to the numpy generator which was not happening.

I'm not sure what the best way of doing the "propagation" is. There is prior art for using randint and then using that as a seed. We don't need the sequences to be the same, we just need a way to initialise the Numpy random generator.

An alternative would be to use the cupy rng throughout and combine and it with moving the data to a numpy array.

closes #6510

@betatim betatim requested a review from a team as a code owner April 8, 2025 07:24
@betatim betatim requested review from vyasr and jcrist April 8, 2025 07:24
Copy link

copy-pr-bot bot commented Apr 8, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the Cython / Python Cython or Python issue label Apr 8, 2025
@betatim betatim added the bug Something isn't working label Apr 8, 2025
@betatim betatim force-pushed the fix-make_classification-random_state branch 2 times, most recently from ff50d69 to 00eb330 Compare April 8, 2025 07:36
@betatim
Copy link
Member Author

betatim commented Apr 8, 2025

/ok to test

When generating a classification problem a mix of cupy and numpy random
generators is used. The random state needs to be propagated to the numpy
generator.
@betatim betatim force-pushed the fix-make_classification-random_state branch from 00eb330 to 94bfe15 Compare April 8, 2025 07:48
@betatim
Copy link
Member Author

betatim commented Apr 8, 2025

(Force pushed a few times until I figure out why my commits weren't signed)

Copy link
Contributor

@viclafargue viclafargue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just small comment

@betatim betatim added the breaking Breaking change label Apr 8, 2025
@betatim
Copy link
Member Author

betatim commented Apr 9, 2025

Ok, I'll leave it like this for now.

This is ready for re-running the CI (when the cudf problems have been solved) and then merge.

@betatim betatim force-pushed the fix-make_classification-random_state branch from 360d603 to dccf46b Compare April 28, 2025 07:47
Copy link
Contributor

@csadorf csadorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any hypothesis as to why we are using a mix of numpy and cupy in the first place? Why not use cupy also to generate the hypercube?

rapids-bot bot pushed a commit that referenced this pull request Apr 28, 2025
Mark the following tests as flaky in cuml.accel test suite since they are showing flaky behavior:

- sklearn.cluster.tests.test_k_means::test_score_max_iter[42-KMeans]
- sklearn.manifold.tests.test_t_sne::test_optimization_minimizes_kl_divergence

Spun off from #6518 for immediate merge.

Authors:
  - Simon Adorf (https://github.com/csadorf)
  - Tim Head (https://github.com/betatim)

Approvers:
  - Jim Crist-Harif (https://github.com/jcrist)

URL: #6598
Copy link
Contributor

@csadorf csadorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one question, but could be addressed in a follow-up.

@betatim
Copy link
Member Author

betatim commented Apr 29, 2025

/merge

@rapids-bot rapids-bot bot merged commit 5add36e into rapidsai:branch-25.06 Apr 29, 2025
74 checks passed
@betatim betatim deleted the fix-make_classification-random_state branch April 29, 2025 08:44
@betatim
Copy link
Member Author

betatim commented Apr 29, 2025

Do you have any hypothesis as to why we are using a mix of numpy and cupy in the first place? Why not use cupy also to generate the hypercube?

Good question, from looking at increasingly old "git blame"s I landed at d0a0e72 which seems to be when this was added. It already used Numpy back then. I couldn't work out how to find the PR related to this commit (I totally rely on the PR number being in the commit :(), but maybe if we can find the PR we can see if there was any discussion?

My suspicion is that something that _generate_hypercube needs is missing from cupy or would be tricky to implement (maybe sample_without_replacement?)

@csadorf
Copy link
Contributor

csadorf commented Apr 29, 2025

Ok, I'm happy to let that be for now. Once we start more rigorous profiling we can identify the low hanging fruit for optimization like that.

Ofek-Haim pushed a commit to Ofek-Haim/cuml that referenced this pull request May 13, 2025
Mark the following tests as flaky in cuml.accel test suite since they are showing flaky behavior:

- sklearn.cluster.tests.test_k_means::test_score_max_iter[42-KMeans]
- sklearn.manifold.tests.test_t_sne::test_optimization_minimizes_kl_divergence

Spun off from rapidsai#6518 for immediate merge.

Authors:
  - Simon Adorf (https://github.com/csadorf)
  - Tim Head (https://github.com/betatim)

Approvers:
  - Jim Crist-Harif (https://github.com/jcrist)

URL: rapidsai#6598
Ofek-Haim pushed a commit to Ofek-Haim/cuml that referenced this pull request May 13, 2025
…idsai#6518)

When generating a classification problem a mix of cupy and numpy random generators is used. The random state needs to be propagated to the numpy generator which was not happening.

I'm not sure what the best way of doing the "propagation" is. There is prior art for using `randint` and then using that as a seed. We don't need the sequences to be the same, we just need a way to initialise the Numpy random generator. 

An alternative would be to use the cupy rng throughout and combine and it with moving the data to a numpy array.

closes rapidsai#6510

Authors:
  - Tim Head (https://github.com/betatim)
  - Simon Adorf (https://github.com/csadorf)

Approvers:
  - Simon Adorf (https://github.com/csadorf)

URL: rapidsai#6518
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Breaking change bug Something isn't working Cython / Python Cython or Python issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Non deterministic make_classification.
3 participants