Provide function for binning one-dimensional data #487

Open
gvwilson opened this issue Aug 24, 2020 · 8 comments

@gvwilson
Contributor

The library has quantile functions, but unless I've missed it, there isn't a function to bin data into N equal-width buckets for creating histograms. (Yes, we can rely on plotting software to do this, but there are cases where having the empirical PDF is useful.) I propose:

  1. bin(data: Array<number>, bins: number): Array<number> returns an array of length bins (integer > 0) with the count of values in each bin.
  2. binBoundaries(data: Array<number>, bins: number): Array<number> returns an array of length bins+1 with the inter-bin boundary values.
  3. Add minmax to min and max to calculate the minimum and maximum values in an array in a single pass.

If this proposal is accepted, I volunteer to do the work.
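
For concreteness, here is a minimal sketch of what items 1 and 2 might look like. Everything in it is illustrative rather than existing library code: the function bodies, the hypothetical helpers `minMax` and `checkBins`, and the choice to treat each bin as half-open with the maximum folded into the last bin are all assumptions.

```javascript
// Illustrative sketch only: these functions are proposed, not existing library code.

// Hypothetical helper: minimum and maximum in a single pass (item 3).
function minMax(data) {
  if (data.length === 0) {
    throw new RangeError("data must contain at least one value");
  }
  let lo = data[0];
  let hi = data[0];
  for (const x of data) {
    if (x < lo) lo = x;
    if (x > hi) hi = x;
  }
  return [lo, hi];
}

// Hypothetical helper: the proposal requires bins to be an integer > 0.
function checkBins(bins) {
  if (!Number.isInteger(bins) || bins <= 0) {
    throw new RangeError("bins must be a positive integer");
  }
}

// Returns bins + 1 equally spaced boundaries from the minimum to the maximum.
function binBoundaries(data, bins) {
  checkBins(bins);
  const [lo, hi] = minMax(data);
  const width = (hi - lo) / bins;
  const edges = [];
  for (let i = 0; i < bins; i++) {
    edges.push(lo + i * width);
  }
  edges.push(hi); // use the exact maximum rather than lo + bins * width
  return edges;
}

// Returns an array of length bins with the count of values in each bin.
function bin(data, bins) {
  checkBins(bins);
  const [lo, hi] = minMax(data);
  const width = (hi - lo) / bins;
  const counts = new Array(bins).fill(0);
  for (const x of data) {
    // Assumption: bins are [lower, upper), and the maximum goes in the last bin.
    // If every value is identical (width 0), everything lands in bin 0.
    const i = width === 0 ? 0 : Math.min(bins - 1, Math.floor((x - lo) / width));
    counts[i] += 1;
  }
  return counts;
}

console.log(binBoundaries([1, 2, 3, 4], 2)); // [1, 2.5, 4]
console.log(bin([1, 2, 3, 4], 2)); // [2, 2]
```

The two functions repeat the min/max scan, so one of them could reasonably be built on the other.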

@Yomguithereal
Member

Yomguithereal commented Aug 24, 2020

Hello @gvwilson. Just chiming in regarding point 3: the library already has the extent function for this purpose. But I agree that it should probably be in the Basic Descriptives Stats part of the docs :)

@gvwilson
Contributor Author

👍 thank you

@tmcw
Member

tmcw commented Aug 25, 2020

1 seems great, and as Guillaume points out, yep, we've got 3. 2: my main question is whether the binBoundaries function should operate on data, or on binned output. Like would the ideal API look like:

let data = [1, 2, 3, 4];
let boundaries = binBoundaries(data, 2);

Or

let data = [1, 2, 3, 4];
let boundaries = binBoundaries(bin(data, 2));

My suspicion is that API 2 is closer to ideal, because folks will likely want both binned data and bin boundaries, and the binBoundaries method will in API 1 have to implement a lot of the same logic as the bin() function.

@gvwilson
Contributor Author

Since we need boundaries in order to bin data, what about enriching the signature to bin?

  1. binBoundaries(data, N) returns N+1 bin boundary values, where r[0] and r[r.length - 1] are guaranteed to be the min and max values in the data.

  2. bin(data, N, boundaries=null, round='down') bins the data. If no boundaries are provided, it calls binBoundaries; otherwise, it assumes (or requires) boundaries to be sorted and returns a vector of bin IDs. If a value doesn't fall into a bin because the manually-defined boundaries don't span the data, the index is -1. By default, bin puts a value that falls exactly on a boundary in the lower bin; other options are 'up' and 'random' (coin toss).
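
A rough sketch of how that enriched signature could behave, under some assumptions: `equalWidthBoundaries` and `locate` are hypothetical helpers (the first stands in for binBoundaries), the minimum and maximum boundaries always belong to the first and last bins, and boundaries are taken to be sorted with no duplicates.

```javascript
// Illustrative sketch only: not existing library code.

// Hypothetical stand-in for binBoundaries: n + 1 equally spaced edges.
function equalWidthBoundaries(data, n) {
  const lo = Math.min(...data);
  const hi = Math.max(...data);
  const edges = [];
  for (let i = 0; i < n; i++) {
    edges.push(lo + (i * (hi - lo)) / n);
  }
  edges.push(hi);
  return edges;
}

// Hypothetical helper: bin ID for one value, or -1 if the edges do not span it.
function locate(x, edges, round) {
  const last = edges.length - 1;
  if (x < edges[0] || x > edges[last]) return -1;
  if (x === edges[0]) return 0; // the lowest edge belongs to the first bin
  if (x === edges[last]) return last - 1; // the highest edge belongs to the last bin
  for (let i = 1; i < last; i++) {
    if (x < edges[i]) return i - 1;
    if (x === edges[i]) {
      // The value sits exactly on an interior boundary.
      if (round === "down") return i - 1;
      if (round === "up") return i;
      return Math.random() < 0.5 ? i - 1 : i; // 'random': coin toss
    }
  }
  return last - 1;
}

function bin(data, n, boundaries = null, round = "down") {
  if (!["down", "up", "random"].includes(round)) {
    throw new RangeError("round must be 'down', 'up', or 'random'");
  }
  const edges = boundaries === null ? equalWidthBoundaries(data, n) : boundaries;
  return data.map((x) => locate(x, edges, round));
}

console.log(bin([1, 2, 3, 4], 2)); // [0, 0, 1, 1]
console.log(bin([0, 3, 5, 7, 10, 11], 2, [0, 5, 10])); // [0, 0, 0, 1, 1, -1]
console.log(bin([0, 3, 5, 7, 10, 11], 2, [0, 5, 10], "up")); // [0, 0, 1, 1, 1, -1]
```

When boundaries are passed explicitly, N is redundant with boundaries.length - 1; the sketch ignores it in that case rather than checking that the two agree.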

@tmcw
Member

tmcw commented Aug 31, 2020

I'm a bit hesitant to go beyond the 3-argument range with JavaScript APIs: as silly as it sounds, JavaScript's lack of named function arguments makes positional arguments pretty sloppy at best: in situ, that call would look like bin(data, 3, null, 'down') which is not very informative and pretty easy to mistype.

@gvwilson
Contributor Author

Removing the last argument and always binning down would get this down to three, and I think the order is reasonably intuitive - good enough?

@tmcw
Member

tmcw commented Aug 31, 2020

👍 yep, sounds good!

@jberryman

(FWIW I had hoped a function like this, or a more straightforward histogram function, was present here)


4 participants