The random number generator is slightly biased toward returning 0 #1434

mrcslws · 2018-08-02T05:58:12Z

This code looks wrong.

https://github.com/numenta/nupic.core/blob/69b9c79cf34a9c7ab90c8c0f4616504da40edee7/src/nupic/utils/Random.cpp#L192

I think this line should be while (sample >= smax);.

Here's a quick explanation of this code. It generates a random number between 0 and n - 1 by generating a random integer mod n. The sequence 0 mod n, 1 mod n, 2 mod n, ... will cycle through the ring of integers mod n forever, so each return value is almost equally likely. But this sequence will generally end in the middle of this ring of return values, except when 2^32 - 1 happens to be -1 mod n. So this code correctly finds the highest integer that is divisible by n. At this point, it should subtract 1 from this number, or it should do the >= check that I suggested above. Instead, it does a > check.

So, for example, with the default max 2^32 - 1, if the underlying random number generator does return 2^32 - 1, the function will return 0, when it ought to discard this return value and try again. That means that when you call getUInt32() with no max, it's twice as likely to return 0 than any other number. The problem is much less severe for other use cases like getUInt32(1024), but the issue is still there.

This code would be correct if the underlying RandomImpl never returns zero. But I ran a quick test, and it does in fact return zero. For example, with seed 19, it returns 0 on the 1009865459th call.

The text was updated successfully, but these errors were encountered:

breznak · 2018-08-02T10:12:06Z

Thank you @mrcslws ! Now I understand your motivation, although I still don't think the logic is correct.
Some notes:

I don't like the do { ...} while (sample > smax) for it adds non-determinism (in terms of RT-execution, not random values).
- so I removed the loop and all tests still work https://github.com/htm-community/nupic.cpp/pull/41/files#diff-737a75596163e5aa8085fd74441d2bd9L190
  - did the debug and currently the loop is always called exactly once!
    - Because: the undocumented function RandomImpl->getUInt32(void) https://github.com/htm-community/nupic.cpp/pull/41/files#diff-737a75596163e5aa8085fd74441d2bd9L227 does not do what one'd expect (uniform random numbers 0..UInt32:max), the samples just a HALF https://github.com/htm-community/nupic.cpp/pull/41/files#diff-737a75596163e5aa8085fd74441d2bd9L236
      - therefore the loop runs exactly once all the time.
could we use a standard uniform random distribution sampler from c++ instead? What is specific about its (impl_->getUInt32(void)) cryptic code?
there should be a test for the phenomenon you describe "random biased towards zero"
I did one:

/* when getUInt32() called without arg,
* the RNG is slightly biased towards zero (0) (2x likely than
* any other value). This is most visible with the empty
* (UInt32::MAX) arg, less with any other (1024, etc),
* Unlikely to run in CI, as current 3bil runs take ~6mins
*/
TEST(RandomTest, testGetUIntBiasToZero) {
Random r1(421);
UInt32 rnd;
const UInt64 RUNS = 3* 10000000000;  //30 bil calls
UInt32 zeroCount = 1; //both counts set to 1 to avoid division-by-zero
UInt32 otherCount = 1; //say other =1

for(UInt64 i=0; i< RUNS; i++) {
  rnd = r1.getUInt32();
  if(rnd==0) zeroCount++;
  else if(rnd==1) otherCount++;
}
NTA_WARN << "Random's bias towards zero: "
        << "zeros:" << zeroCount-1 << " ones: " << otherCount-1;
ASSERT_NEAR(zeroCount/(Real)otherCount, 1.0f, 0.01f);
}

even though, is it worth it?

For example, with seed 19, it returns 0 on the 1009865459th call.

so zero is 2* 1/billion more likely?

mrcslws · 2018-08-02T14:17:15Z

RandomImpl->getUInt32(void) does not do what one'd expect (uniform random numbers 0..UInt32:max), the samples just a HALF

Nice catch! This means that the random number generator is biased in the same way as simply using rand() % n. The code that tries to eliminate that bias isn't doing anything.

So, to summarize the bugs:

Either RandomImpl::getUInt32 needs to return numbers up to ~4 billion, or Random::getUInt32 needs to do all of its current logic on ~2 billion, not ~4 billion. Until then, our logic that tries to remove bias isn't doing anything.
After fixing that, there will be a bias toward 0, as described in my first comment. That's also easy to fix.

Other replies:

(For these replies, assume RandomImpl::getUInt32 returns numbers up to ~4 billion.)

The do...while nondeterminism isn't so bad. smax is usually near to MAX32 (i.e. 2^32 - 1), and so the probability of needing to retry is low. In the worst case, smax is MAX32 / 2, and the probability is 0.5. So in the worst case, the expected number of tries is 2, and it's usually much closer to 1.
We have the underlying random number generator RandomImpl because we want one that is consistent across compilers / platforms. For a given seed, it returns the same sequence of numbers on all platforms.
The experiment I ran with RandomImpl and seed 19 doesn't tell us anything about probability. To quantify the bias, you just need to think about the outer Random::getUInt32(max). As I mentioned in my first comment, if you call it with max=MAX32, the default value, 0 is twice as likely. If you call it with max=1024, then 0 is about 1.000001 times as likely, i.e. 1 millionth more likely.

breznak · 2018-08-02T15:49:14Z

Until then, our logic that tries to remove bias isn't doing anything.

Can you verify my test is correct? I have the "half" RandomImpl.getUInt() and "biased" Random.getUInt and I cannot reproduce the bias in the testcase.

nondeterminism isn't so bad.

yes, it's statistically 1 (maybe 2) passes in the loop, but for compiler optimization it's an unknown-length loop. (I don't know how performance-critical this codepath is)

RandomImpl because we want one that is consistent across compilers / platforms. For a given seed, it returns the same sequence of numbers on all platforms.

Wouldn't all (serious) offer that? https://en.cppreference.com/w/cpp/numeric/random/mersenne_twister_engine
At the very bottom it says which numbers you can expect! I think the std:: version would be much safer/faster to use, compared to home-baked. (ie the "half" problem).

I guess you wouldn't have to do the "mitigate bias towards 0 after modulo" with the uniform distribution which offers given range! (though I cannot reproduce bias even now)

breznak · 2018-11-29T10:00:40Z

Update, in the community repo, we've replaced Random with std:: version, you might like it.
https://github.com/htm-community/nupic.cpp/

mrcslws mentioned this issue Aug 2, 2018

arithmetic exception at nupic::Random::getUInt32(unsigned int) #1410

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The random number generator is slightly biased toward returning 0 #1434

The random number generator is slightly biased toward returning 0 #1434

mrcslws commented Aug 2, 2018 •

edited

breznak commented Aug 2, 2018

mrcslws commented Aug 2, 2018

breznak commented Aug 2, 2018

breznak commented Nov 29, 2018

The random number generator is slightly biased toward returning 0 #1434

The random number generator is slightly biased toward returning 0 #1434

Comments

mrcslws commented Aug 2, 2018 • edited

breznak commented Aug 2, 2018

mrcslws commented Aug 2, 2018

breznak commented Aug 2, 2018

breznak commented Nov 29, 2018

mrcslws commented Aug 2, 2018 •

edited