Skip to content
This repository has been archived by the owner on Sep 1, 2023. It is now read-only.

The random number generator is slightly biased toward returning 0 #1434

Open
mrcslws opened this issue Aug 2, 2018 · 4 comments
Open

The random number generator is slightly biased toward returning 0 #1434

mrcslws opened this issue Aug 2, 2018 · 4 comments

Comments

@mrcslws
Copy link
Contributor

mrcslws commented Aug 2, 2018

This code looks wrong.

https://github.com/numenta/nupic.core/blob/69b9c79cf34a9c7ab90c8c0f4616504da40edee7/src/nupic/utils/Random.cpp#L192

I think this line should be while (sample >= smax);.

Here's a quick explanation of this code. It generates a random number between 0 and n - 1 by generating a random integer mod n. The sequence 0 mod n, 1 mod n, 2 mod n, ... will cycle through the ring of integers mod n forever, so each return value is almost equally likely. But this sequence will generally end in the middle of this ring of return values, except when 2^32 - 1 happens to be -1 mod n. So this code correctly finds the highest integer that is divisible by n. At this point, it should subtract 1 from this number, or it should do the >= check that I suggested above. Instead, it does a > check.

So, for example, with the default max 2^32 - 1, if the underlying random number generator does return 2^32 - 1, the function will return 0, when it ought to discard this return value and try again. That means that when you call getUInt32() with no max, it's twice as likely to return 0 than any other number. The problem is much less severe for other use cases like getUInt32(1024), but the issue is still there.

This code would be correct if the underlying RandomImpl never returns zero. But I ran a quick test, and it does in fact return zero. For example, with seed 19, it returns 0 on the 1009865459th call.

@breznak
Copy link
Member

breznak commented Aug 2, 2018

Thank you @mrcslws ! Now I understand your motivation, although I still don't think the logic is correct.
Some notes:

/* when getUInt32() called without arg,
* the RNG is slightly biased towards zero (0) (2x likely than
* any other value). This is most visible with the empty
* (UInt32::MAX) arg, less with any other (1024, etc),
* Unlikely to run in CI, as current 3bil runs take ~6mins
*/
TEST(RandomTest, testGetUIntBiasToZero) {
Random r1(421);
UInt32 rnd;
const UInt64 RUNS = 3* 10000000000;  //30 bil calls
UInt32 zeroCount = 1; //both counts set to 1 to avoid division-by-zero
UInt32 otherCount = 1; //say other =1

for(UInt64 i=0; i< RUNS; i++) {
  rnd = r1.getUInt32();
  if(rnd==0) zeroCount++;
  else if(rnd==1) otherCount++;
}
NTA_WARN << "Random's bias towards zero: "
        << "zeros:" << zeroCount-1 << " ones: " << otherCount-1;
ASSERT_NEAR(zeroCount/(Real)otherCount, 1.0f, 0.01f);
}
  • even though, is it worth it?

    For example, with seed 19, it returns 0 on the 1009865459th call.

    so zero is 2* 1/billion more likely?

@mrcslws
Copy link
Contributor Author

mrcslws commented Aug 2, 2018

RandomImpl->getUInt32(void) does not do what one'd expect (uniform random numbers 0..UInt32:max), the samples just a HALF

Nice catch! This means that the random number generator is biased in the same way as simply using rand() % n. The code that tries to eliminate that bias isn't doing anything.

So, to summarize the bugs:

  • Either RandomImpl::getUInt32 needs to return numbers up to ~4 billion, or Random::getUInt32 needs to do all of its current logic on ~2 billion, not ~4 billion. Until then, our logic that tries to remove bias isn't doing anything.
  • After fixing that, there will be a bias toward 0, as described in my first comment. That's also easy to fix.

Other replies:

(For these replies, assume RandomImpl::getUInt32 returns numbers up to ~4 billion.)

  • The do...while nondeterminism isn't so bad. smax is usually near to MAX32 (i.e. 2^32 - 1), and so the probability of needing to retry is low. In the worst case, smax is MAX32 / 2, and the probability is 0.5. So in the worst case, the expected number of tries is 2, and it's usually much closer to 1.
  • We have the underlying random number generator RandomImpl because we want one that is consistent across compilers / platforms. For a given seed, it returns the same sequence of numbers on all platforms.
  • The experiment I ran with RandomImpl and seed 19 doesn't tell us anything about probability. To quantify the bias, you just need to think about the outer Random::getUInt32(max). As I mentioned in my first comment, if you call it with max=MAX32, the default value, 0 is twice as likely. If you call it with max=1024, then 0 is about 1.000001 times as likely, i.e. 1 millionth more likely.

@breznak
Copy link
Member

breznak commented Aug 2, 2018

Until then, our logic that tries to remove bias isn't doing anything.

Can you verify my test is correct? I have the "half" RandomImpl.getUInt() and "biased" Random.getUInt and I cannot reproduce the bias in the testcase.

nondeterminism isn't so bad.

yes, it's statistically 1 (maybe 2) passes in the loop, but for compiler optimization it's an unknown-length loop. (I don't know how performance-critical this codepath is)

RandomImpl because we want one that is consistent across compilers / platforms. For a given seed, it returns the same sequence of numbers on all platforms.

Wouldn't all (serious) offer that? https://en.cppreference.com/w/cpp/numeric/random/mersenne_twister_engine
At the very bottom it says which numbers you can expect! I think the std:: version would be much safer/faster to use, compared to home-baked. (ie the "half" problem).

  • I guess you wouldn't have to do the "mitigate bias towards 0 after modulo" with the uniform distribution which offers given range! (though I cannot reproduce bias even now)

@breznak
Copy link
Member

breznak commented Nov 29, 2018

Update, in the community repo, we've replaced Random with std:: version, you might like it.
https://github.com/htm-community/nupic.cpp/

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants