Number of chunks is the same for different number of threads. #16512

Open
adallak opened this issue Feb 3, 2025 · 1 comment

adallak commented Feb 3, 2025

I am using H2O version 3.46.0.6 in Python. According to the H2O reproducibility page (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm-faq/reproducibility.html), the parallelization level (number of cores, nthreads) is supposed to control how the dataset is partitioned in memory (into "chunks"). However, the number of chunks stays the same regardless of the number of threads I specify when initializing the cluster. The parse log reports the following every time:

ParseSetup heuristic: cloudSize: 1, cores: 28, numCols: 10, maxLineLength: 42, totalSize: 8758450, localParseSize: 8758450, chunkSize: 78201, numChunks: 111, numChunks * cols: 1110
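
As a sanity check on the log line above, the reported figures can be compared against each other. The sketch below is plain Python with no H2O calls; `parse_heuristic` is a hypothetical helper written for this illustration. It shows only that, in this one line, `numChunks` equals the integer quotient of `totalSize` by `chunkSize`. Whether H2O actually derives the chunk count that way is an assumption, not something the log confirms.

```python
LOG_LINE = (
    "ParseSetup heuristic: cloudSize: 1, cores: 28, numCols: 10, "
    "maxLineLength: 42, totalSize: 8758450, localParseSize: 8758450, "
    "chunkSize: 78201, numChunks: 111, numChunks * cols: 1110"
)


def parse_heuristic(line):
    """Turn the 'name: value' pairs of a ParseSetup heuristic line into a dict.

    Hypothetical helper for this illustration; it is not part of H2O.
    """
    # Drop the "ParseSetup heuristic: " prefix, then split the remaining fields.
    _, _, body = line.partition(": ")
    fields = {}
    for field in body.split(", "):
        name, value = field.rsplit(": ", 1)
        fields[name] = int(value)
    return fields


fields = parse_heuristic(LOG_LINE)

# Integer quotient of the file size by the chunk size (assumed relation).
quotient = fields["totalSize"] // fields["chunkSize"]

# Chunk count times column count, to compare with the last field of the log line.
product = fields["numChunks"] * fields["numCols"]

print("cores reported:", fields["cores"])
print("totalSize // chunkSize:", quotient, "vs numChunks:", fields["numChunks"])
print("numChunks * numCols:", product, "vs logged:", fields["numChunks * cols"])
```

Note that the line itself reports `cores: 28`, so comparing this field across runs started with different `nthreads` values would show whether the setting reaches the parser at all.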

This behavior seems inconsistent with the documentation. I am aware that the number of chunks can affect reproducibility. Does this mean that even if I explicitly set nthreads, reproducibility is not guaranteed, because machines with different numbers of cores may produce different results?

adallak added the bug label Feb 3, 2025

maurever commented Feb 3, 2025

Hi @adallak. Thanks for reporting this behavior. Please provide a minimal working example of how you set up a cluster. Thank you.
