Public partitions as a cartesian product of dimensions #277
Labels
Good first issue 🎓
Perfect for beginners, welcome to OpenMined!
Type: New Feature ➕
Introduction of a completely new addition to the codebase
Context
Note: here is more about terminology.
Definitions (from terminology page)
A partition is a subset of the data corresponding to a given value of the aggregation criterion. Usually we want to aggregate each partition separately. For example, if we count visits to restaurants, the visits for one particular restaurant are a single partition, and the count of visits to that restaurant would be the aggregate for that partition.
Public partitions are partition keys that are publicly known and hence don’t leak any user information. An example of public partitions could be week days.
DPEngine.aggregate is API function that performs DP aggregation.
public_partitions
is an argument ofDPEngine.aggregate()
. It might be Python iterable (when it's small enough to fit in memory and to efficiently distributed among workers) or distributed collection (PCollection for beam, RDD for spark).In short, public partition selection consists of 2 stages:
public_partitions
(code which does this).public_partitions
which are not in input data (code which does this).Downsides of the current state.
Let’s consider the case when partitions are cartesian products of multiple dimensions, for example (country, date).
The user needs to do generation of cross-join: that’s additional steps from users, so more possibilities for bugs and this cross-join might be very large (as a result performance impact).
What can be done better?
The user can specify values of each dimensions, and PipelineDP internally can do join: this would be easier to use for users and it might be done much more effectively from performance point of view inside PipelideDP.
Goals
Allow to specify
public_partitions
as a product of different dimensions. Steps to implement (it might be split in sevaral PRs)1.Device a nice API for specifying public_paritions in arguments of
DPEngine.aggregate
(a separate argument, or maybe some class object which specifies product). For beginning we can assume that dimensions values are Python iterable.2. Implement steps 1 & 2 of public_partitiosn algorithm (see in section above).
3. Propagate these public partitoins in all places where they used (e.g. in Beam API, e.g in Spark API).
The text was updated successfully, but these errors were encountered: