Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

choose_multiple_weighted returns unexpect probability of result #1476

Open
heavycharged opened this issue Jul 23, 2024 · 5 comments
Open

choose_multiple_weighted returns unexpect probability of result #1476

heavycharged opened this issue Jul 23, 2024 · 5 comments
Labels
A-degenerate Algorithm fails on some input X-bug Type: bug report

Comments

@heavycharged
Copy link

Summary

I am trying to choose N values from vector with weights. All weights calculated as score.powf(4.)

What i expected

I expect values will be chosen so many times, in relation to their weights. Here is the weights after powf function:

value=1 has weight=0.0000409600
value=2 has weight=0.0000000037
value=3 has weight=0.0000000207

As you can see, the lowest weight has value 2, and i expect this value will be chosen least frequent.

What i got

got following results: {2: 100000, 3: 1, 1: 99999}

As you can see, value 2 had been choosen 100k times, more than value 3, that is only chosen once. Here is the score of values 2 and 3:

2: 0.0000000037
3: 0.0000000207

Note

I noticed that this happens with very small floats. Also, i guess that the initial order is matters. Both of this not mentioned in documentation, so this is either docs issue, or bug. Also i do not exclude that i may missed something, so maybe i am wrong.

Code sample

use std::collections::HashMap;

use rand::{prelude::SliceRandom, thread_rng};

fn main() {
    let values: Vec<(i32, f64)> =
        vec![(1, 0.080), (2, 0.0078), (3, 0.012)]
            .into_iter()
            .map(|v| (v.0, f64::powf(v.1, 4.0)))
            .collect();

    for v in &values {
        println!("value={} has weight={:.10}", v.0, v.1);
    }

    let mut counts = HashMap::new();

    for i in 0..100_000 {
        let winning: Vec<i32> = values
            .choose_multiple_weighted(&mut thread_rng(), 2, |a| a.1)
            .unwrap()
            .map(|v| v.0)
            .collect();

        for winner in winning {
            counts.entry(winner).and_modify(|v| *v += 1).or_insert(1);
        }
    }

    println!("got following results: {counts:?}");

    assert!(counts.get(&1).unwrap() > counts.get(&2).unwrap());
    assert!(counts.get(&1).unwrap() > counts.get(&3).unwrap());
    assert!(counts.get(&3).unwrap() > counts.get(&2).unwrap());
}
@heavycharged heavycharged added the X-bug Type: bug report label Jul 23, 2024
@dhardy
Copy link
Member

dhardy commented Jul 23, 2024

I can reproduce this. More specifically,

got following results:
(1, 1): 0
(1, 2): 0
(1, 3): 1
(2, 1): 0
(2, 2): 0
(2, 3): 96875
(3, 1): 3124
(3, 2): 0
(3, 3): 0

@dhardy
Copy link
Member

dhardy commented Jul 23, 2024

The problem is in sample_efraimidis_spirakis (https://doi.org/10.1016/j.ipl.2005.11.003), which calculates key = rng.random::<f64>().powf(1.0 / weight); usually the result rounds to 0 due to the very small weights used.

Note that our f64 random samples have limited precision. We have considered (and even implemented) algorithms for high-precision f64 values. This was not adopted since there is performance overhead and usually it's not necessary. Increasing precision could not be expected to solve this particular issue anyway.

More to the point, this algorithm does not appear to be suitable for the small weights used as inputs here. Pre-scaling of weights into some more suitable range may be a solution, however a key point of the algorithm used is that it requires only a single iteration over a list of weights (which are in this case generated via call to the weight function for each value). Without knowing the largest input weight in advance we would have to iterate twice over the weights (storing or regenerating). At this point a different API may be better; or we could allow the user to input the maximum weight (or a weight scaling function) as a hint.

For this particular example, a partial fix is to scale by the inverse of the largest weight in the weight fn: choose_multiple_weighted(&mut thread_rng(), 2, |a| a.1 / largest_weight). The results are better but still likely wrong since key often rounds to zero for values 2 and 3:

got following results:
(1, 1): 0
(1, 2): 6
(1, 3): 58
(2, 1): 5272
(2, 2): 0
(2, 3): 0
(3, 1): 94664
(3, 2): 0
(3, 3): 0

@TheIronBorn have you got any suggestions? It seems that the algorithm is just a poor choice for small weights.

@dhardy
Copy link
Member

dhardy commented Jul 24, 2024

We could keep this algorithm but document its limitations while documenting its limitations.

In this case, the best (or at least easiest) alternative may be to use WeightedIndex, re-sampling in the case that identical samples are returned (assuming that there are enough items available and likely to be selected, this should be reasonably efficient).

Choose-multiple-weighted is not a problem I've put much thought into; there are probably other algorithms.

@dhardy dhardy added the A-degenerate Algorithm fails on some input label Jul 26, 2024
@dhardy
Copy link
Member

dhardy commented Jul 26, 2024

@teryror I see you have experience with this algorithm (#1365). Any thoughts here? Wikipedia suggests using ln(r) / w_i keys instead.

@dhardy dhardy mentioned this issue Jul 28, 2024
1 task
dhardy added a commit that referenced this issue Aug 12, 2024
dhardy added a commit to dhardy/rand that referenced this issue Nov 19, 2024
…ust-random#1476

Also improves choose_two_weighted_indexed time by 23% (excluding new test)
@dhardy
Copy link
Member

dhardy commented Nov 23, 2024

For the above problem, I would expect the following average results of 100'000 samples:

  • (1, 2) or (2, 1): 15147
  • (1, 3) or (3, 1): 84853
  • (2, 3) or (3, 2): 0

Above, I got:

  • (1, 2) or (2, 1): 5278
  • (1, 3) or (3, 1): 94722
  • (2, 3) or (3, 2): 0

Re-running over master now, I get:

  • (1, 2) or (2, 1): 5516
  • (1, 3) or (3, 1): 94484
  • (2, 3) or (3, 2): 0

With #1530, I get:

  • (1, 2) or (2, 1): 15142
  • (1, 3) or (3, 1): 84858
  • (2, 3) or (3, 2): 0

So seems pretty good.

(Noting also that choose_multiple_weighted does not promise anything about output order.)

Code
use rand::{prelude::IndexedRandom, rng};

#[test]
fn main() {
    let values: Vec<(i32, f64)> =
        vec![(1, 0.080), (2, 0.0078), (3, 0.012)]
            .into_iter()
            .map(|v| (v.0, f64::powf(v.1, 4.0)))
            .collect();

    for v in &values {
        println!("value={} has weight={:.10}", v.0, v.1);
    }
    let weight_sum: f64 = values.iter().map(|(_, w)| w).sum();
    let weights: Vec<f64> = values.iter().map(|(_, w)| w / weight_sum).collect();
    println!("expected: {weights:?}");

    let largest_weight = values[0].1;

    let mut results = [0u32; 9];

    for _ in 0..100_000 {
        let mut iter = values.choose_multiple_weighted(&mut rng(), 2, |a| a.1 / largest_weight).unwrap();
        let a = iter.next().unwrap().0;
        let b = iter.next().unwrap().0;
        assert!(iter.next().is_none());

        let i = (a - 1) * 3 + b - 1;
        results[i as usize] += 1;
    }

    println!("got following results:");
    for i in 0..9 {
        let i0 = i / 3 + 1;
        let i1 = i % 3 + 1;
        println!("({}, {}):\t{}", i0, i1, results[i]);
    }
    assert!(false);
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-degenerate Algorithm fails on some input X-bug Type: bug report
Projects
None yet
Development

No branches or pull requests

2 participants