DistoGram is a library for computing histograms over streaming data, including in distributed environments. The implementation follows the algorithm described in Ben-Haim and Tom-Tov's paper, A Streaming Parallel Decision Tree Algorithm.
First create a compressed representation of a distribution:
import numpy as np
import distogram
distribution = np.random.normal(size=10000)
# Create and feed a distogram from the distribution.
# In real usage, the data would come from an event stream.
h = distogram.Distogram()
for i in distribution:
    h = distogram.update(h, i)
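Distograms built independently, for example on separate partitions of a stream, can be combined with distogram.merge. The following is a minimal sketch of that workflow; the two partial streams are purely illustrative:

import numpy as np
import distogram

# Build one distogram per partition of the data
h1 = distogram.Distogram()
for x in np.random.normal(size=5000):
    h1 = distogram.update(h1, x)

h2 = distogram.Distogram()
for x in np.random.normal(size=5000):
    h2 = distogram.update(h2, x)

# Merge the partial distograms into a single one covering all the data
merged = distogram.merge(h1, h2)
print(distogram.count(merged))  # 10000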
Compute statistics on the distribution:
nmin, nmax = distogram.bounds(h)
print("count: {}".format(distogram.count(h)))
print("mean: {}".format(distogram.mean(h)))
print("stddev: {}".format(distogram.stddev(h)))
print("min: {}".format(nmin))
print("5%: {}".format(distogram.quantile(h, 0.05)))
print("25%: {}".format(distogram.quantile(h, 0.25)))
print("50%: {}".format(distogram.quantile(h, 0.50)))
print("75%: {}".format(distogram.quantile(h, 0.75)))
print("95%: {}".format(distogram.quantile(h, 0.95)))
print("max: {}".format(nmax))
count: 10000
mean: -0.005082954640481095
stddev: 1.0028524290149186
min: -3.5691130319855047
5%: -1.6597242392338374
25%: -0.6785107421744653
50%: -0.008672960012168916
75%: 0.6720718926935414
95%: 1.6476822301131866
max: 3.8800560034877427
Compute and display the histogram of the distribution:
import pandas as pd
import plotly.express as px

hist = distogram.histogram(h)
df_hist = pd.DataFrame(np.array(hist), columns=["bin", "count"])
fig = px.bar(df_hist, x="bin", y="count", title="distogram")
fig.update_layout(height=300)
fig.show()
DistoGram is available on PyPI and can be installed with pip:
pip install distogram
You can test this library directly on this live notebook.
Distogram is designed for fast updates when using Python native types. The following numbers show the results of the benchmark program located in the examples directory.
On an i7-9800X Intel CPU, performance is:
Interpreter | Operation | Numpy | Req/s |
---|---|---|---|
pypy 7.3 | update | no | 6563311 |
pypy 7.3 | update | yes | 111318 |
CPython 3.7 | update | no | 436709 |
CPython 3.7 | update | yes | 251603 |
On a modest 2014 13" MacBook Pro, performance is:
Interpreter | Operation | Numpy | Req/s |
---|---|---|---|
pypy 7.3 | update | no | 3572436 |
pypy 7.3 | update | yes | 37630 |
CPython 3.7 | update | no | 112749 |
CPython 3.7 | update | yes | 81005 |
As these numbers show, you are encouraged to use PyPy with Python native types: PyPy's JIT is penalised by numpy types, causing a huge performance hit. Moreover, the streaming philosophy of Distogram is better suited to Python native types, while numpy is optimized for batch computations, even with CPython.
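The actual benchmark program lives in the examples directory. For illustration only, a minimal way to measure update throughput yourself could look like the sketch below; the sample size and value distribution are arbitrary choices, not the project's benchmark settings:

import random
import time

import distogram

n = 100_000
values = [random.normalvariate(0, 1) for _ in range(n)]

# Time n successive updates on native Python floats
h = distogram.Distogram()
start = time.perf_counter()
for v in values:
    h = distogram.update(h, v)
elapsed = time.perf_counter() - start
print("{:.0f} updates/s".format(n / elapsed))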
Although this code was written by following the aforementioned research paper, some parts are also inspired by Carson Farmer's implementation.
Thanks to John Belmonte for his help on performances and accuracy improvements.
Distogram can also be used in distributed pipelines: each worker feeds its own distogram and the partial results are then merged. For example, as an Apache Beam CombineFn:

import apache_beam as beam
import distogram


class DistogramFn(beam.CombineFn):
    def create_accumulator(self):
        return distogram.Distogram()

    def add_input(self, distogram_var, input):
        return distogram.update(distogram_var, input)

    def merge_accumulators(self, accumulators):
        h = accumulators[0]
        for other in accumulators[1:]:
            h = distogram.merge(h, other)
        return h

    def extract_output(self, distogram_var):
        return distogram_var
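A hypothetical way to plug this CombineFn into a pipeline could look like the following sketch; the in-memory element source is only there to make the example self-contained:

import apache_beam as beam
import distogram

with beam.Pipeline() as p:
    (
        p
        | beam.Create([1.5, 2.0, 0.3, 4.2, 1.1])
        | beam.CombineGlobally(DistogramFn())
        | beam.Map(lambda h: print(distogram.quantile(h, 0.5)))
    )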