Number of chunks is the same for different number of threads. #16512

adallak · 2025-02-03T00:02:37Z

I am using H2O version 3.46.0.6 in Python. According to the H2O reproducibility page (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm-faq/reproducibility.html) the parallelization level (number of cores, nthreads) is supposed to control how the dataset is partitioned in memory (into "chunks"). However, I noticed that regardless of the number of threads I specify when initializing the cluster, the number of chunks remains the same.

ParseSetup heuristic: cloudSize: 1, cores: 28, numCols: 10, maxLineLength: 42, totalSize: 8758450, localParseSize: 8758450, chunkSize: 78201, numChunks: 111, numChunks * cols: 1110
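To make the numbers in that log line easier to compare across runs with different thread settings, here is a minimal sketch that parses it into a dictionary and checks the arithmetic. The helper name `parse_setup_fields` is hypothetical (it is not part of H2O), and the integer division is only an observation that happens to match the reported `numChunks` for this line, not a claim about how H2O computes it.

```python
# The log line exactly as reported above.
LOG_LINE = (
    "ParseSetup heuristic: cloudSize: 1, cores: 28, numCols: 10, "
    "maxLineLength: 42, totalSize: 8758450, localParseSize: 8758450, "
    "chunkSize: 78201, numChunks: 111, numChunks * cols: 1110"
)


def parse_setup_fields(line):
    """Hypothetical helper: turn the 'key: value' pairs of the log line into a dict."""
    # Drop the "ParseSetup heuristic:" prefix and keep the comma-separated pairs.
    _, _, body = line.partition("heuristic:")
    fields = {}
    for part in body.split(","):
        key, _, value = part.partition(":")
        fields[key.strip()] = int(value)
    return fields


fields = parse_setup_fields(LOG_LINE)

# Core count reported by the heuristic for this run.
print(fields["cores"])

# For this particular line, totalSize // chunkSize equals the reported numChunks.
print(fields["totalSize"] // fields["chunkSize"], fields["numChunks"])

# The last field is simply numChunks multiplied by numCols.
print(fields["numChunks"] * fields["numCols"], fields["numChunks * cols"])
```

Running the same parse on the log line from each cluster configuration and comparing `cores`, `chunkSize`, and `numChunks` side by side would show which fields, if any, change with the thread setting.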

This behavior seems inconsistent with the documentation. I am aware that the number of chunks can affect reproducibility. Does this mean that even if I explicitly set nthreads, reproducibility is not guaranteed, because machines with different numbers of cores may produce different results?

maurever · 2025-02-03T07:42:52Z

Hi @adallak. Thanks for reporting this behavior. Please provide a minimal working example showing how you set up the cluster. Thank you.

adallak added the bug label Feb 3, 2025
