FIX Propagate random state to numpy rng in make_classification
#6518
Conversation
Force-pushed from ff50d69 to 00eb330
/ok to test
When generating a classification problem, a mix of cupy and numpy random generators is used. The random state needs to be propagated to the numpy generator as well.
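A minimal sketch of what this kind of propagation can look like, assuming a `random_state` seed and a hypothetical helper name (this is not the actual cuml implementation): seed the cupy generator first, then draw a value from it to seed the numpy generator.

```python
import cupy as cp
import numpy as np

def make_generators(random_state=None):
    """Hypothetical helper: derive a numpy RNG from the cupy RNG's state."""
    # cupy generator used for the bulk of the (GPU-side) data generation
    cp_rng = cp.random.RandomState(seed=random_state)
    # draw a seed from the cupy generator so the numpy generator is also
    # controlled by `random_state` instead of being left unseeded
    np_seed = int(cp_rng.randint(0, 2**31 - 1))
    np_rng = np.random.RandomState(seed=np_seed)
    return cp_rng, np_rng

# With the same random_state, both generators now produce reproducible draws.
rng_gpu, rng_cpu = make_generators(random_state=42)
```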
Force-pushed from 00eb330 to 94bfe15
(Force pushed a few times until I figured out why my commits weren't signed)
LGTM, just a small comment.
Ok, I'll leave it like this for now. This is ready for re-running the CI (once the cudf problems have been solved) and then merging.
Force-pushed from 360d603 to dccf46b
Do you have any hypothesis as to why we are using a mix of numpy and cupy in the first place? Why not also use cupy to generate the hypercube?
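For context, a hedged sketch of what generating the hypercube centroids directly with cupy could look like (names are illustrative, not the code under review, and unlike a sample-without-replacement approach this simple version does not guarantee distinct vertices):

```python
import cupy as cp

def random_hypercube_vertices(n_clusters, n_informative, rng, class_sep=1.0):
    # pick each centroid as a random vertex of the {-1, +1}^n_informative
    # hypercube, scaled by class_sep; everything stays on the GPU
    signs = rng.randint(0, 2, size=(n_clusters, n_informative))
    return (2 * signs - 1).astype(cp.float64) * class_sep

centroids = random_hypercube_vertices(4, 3, cp.random.RandomState(seed=0))
```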
Mark the following tests as flaky in the cuml.accel test suite since they are showing flaky behavior:
- sklearn.cluster.tests.test_k_means::test_score_max_iter[42-KMeans]
- sklearn.manifold.tests.test_t_sne::test_optimization_minimizes_kl_divergence

Spun off from #6518 for immediate merge.

Authors:
- Simon Adorf (https://github.com/csadorf)
- Tim Head (https://github.com/betatim)

Approvers:
- Jim Crist-Harif (https://github.com/jcrist)

URL: #6598
I have one question, but it could be addressed in a follow-up.
/merge
Good question. From looking at increasingly old `git blame`s I landed at d0a0e72, which seems to be when this was added. It already used Numpy back then. I couldn't work out how to find the PR related to this commit (I totally rely on the PR number being in the commit :(), but maybe if we can find the PR we can see if there was any discussion? My suspicion is that something that
Ok, I'm happy to let that be for now. Once we start more rigorous profiling, we can identify low-hanging fruit for optimizations like that.
FIX Propagate random state to numpy rng in make_classification (rapidsai#6518)

closes rapidsai#6510

Authors:
- Tim Head (https://github.com/betatim)
- Simon Adorf (https://github.com/csadorf)

Approvers:
- Simon Adorf (https://github.com/csadorf)

URL: rapidsai#6518
When generating a classification problem, a mix of cupy and numpy random generators is used. The random state needs to be propagated to the numpy generator, which was not happening.

I'm not sure what the best way of doing the "propagation" is. There is prior art for using `randint` and then using that as a seed. We don't need the sequences to be the same, we just need a way to initialise the Numpy random generator. An alternative would be to use the cupy rng throughout and combine it with moving the data to a numpy array.
closes #6510
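A hedged sketch of that alternative, with an illustrative helper name: generate everything with a single cupy generator and convert to a numpy array only where a host-side array is needed.

```python
import cupy as cp

def draw_host_normal(shape, random_state=None):
    # one cupy generator controls all randomness
    rng = cp.random.RandomState(seed=random_state)
    device_values = rng.standard_normal(size=shape)  # generated on the GPU
    return cp.asnumpy(device_values)                 # copied to host as a numpy array
```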