
Detail question regarding StreamingAggregator methods and characteristics #127

Closed
@javafanboy

Description


In order to implement digest calculation over all binary data in a partitioned cache, I started looking at the streaming aggregator and found several aspects of it well suited to the task.
When I started developing with it, however, I realized there were a few things I was not sure I had understood correctly.

First, I am not sure how some of the characteristics really work:

PARALLEL - felt fairly obvious and is an option I generally always want. In combination with BY_MEMBER, one aggregator will run on each storage-enabled node; with BY_PARTITION, as many aggregators will run on each storage-enabled node as the number of partitions it owns. What is not clear to me is whether it is the instance itself carrying the characteristic that is parallelized, or the "sub-aggregators" created by calling its supply method. Assuming the initial aggregator has the options PARALLEL and BY_PARTITION, what options should the aggregator(s) created by supply have - just PARALLEL, or something else? Are characteristics only valid/obeyed on the initial top-level aggregator object? If not, what is the effect of applying them to member- or partition-level aggregator objects? (See the sketch just below for the concrete shape of the question.)
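
To make that concrete, here is a trimmed sketch of what I have so far (all names are my own, the flag combination is exactly what I am unsure about, and serialization is omitted for brevity):

```java
import com.tangosol.util.InvocableMap;

// Sketch of my digest aggregator, reduced to the characteristics/supply
// question. Key/value types are Object because I work on the binary forms.
public class DigestAggregator
        implements InvocableMap.StreamingAggregator<Object, Object, byte[], byte[]> {

    @Override
    public int characteristics() {
        // Declared on the instance passed to aggregate() - but is this also
        // read on the copies created via supply() on each member?
        return PARALLEL | BY_PARTITION;
    }

    @Override
    public InvocableMap.StreamingAggregator<Object, Object, byte[], byte[]> supply() {
        // Should this copy declare the same characteristics, or just PARALLEL?
        return new DigestAggregator();
    }

    // accumulate()/getPartialResult() are shown further down; combine() and
    // finalizeResult() would merge the per-partition digests on the caller.
    @Override
    public boolean accumulate(InvocableMap.Entry<?, ?> entry) { return true; }

    @Override
    public boolean combine(byte[] partialResult) { return true; }

    @Override
    public byte[] getPartialResult() { return new byte[0]; }

    @Override
    public byte[] finalizeResult() { return new byte[0]; }
}
```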

PRESENT_ONLY, however, is a weird one - when would an aggregator EVER be passed non-existent entries? I do not get the meaning of this one...

RETAIN_ENTRIES is not clear to me either. In my case, while an instance of a digest aggregator is running, I will be holding on to the passed-in entries so that I can order them by binary key into a canonical order before feeding them into the digest calculation (so that I get the same result irrespective of the order data was inserted into the cache, or anything else that could cause a different processing order between different caches holding the same data). Am I then supposed to specify this characteristic? My guess is that it is meant to isolate the aggregator from other operations mutating the entries it holds on to - something that is not a concern for me, as I only intend this operation to be used when no mutation is going on.
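
For reference, the retention I am describing looks roughly like this (a sketch; I am relying on com.tangosol.util.Binary comparing lexicographically so that a TreeMap gives the canonical key order):

```java
import com.tangosol.util.Binary;
import com.tangosol.util.BinaryEntry;
import com.tangosol.util.InvocableMap;

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// The accumulate/getPartialResult side of the same sketch: entries are
// buffered and only hashed once they can be visited in binary-key order.
public abstract class DigestAccumulationSketch
        implements InvocableMap.StreamingAggregator<Object, Object, byte[], byte[]> {

    private final TreeMap<Binary, Binary> entriesByKey = new TreeMap<>();

    @Override
    public boolean accumulate(InvocableMap.Entry<?, ?> entry) {
        // This is the retention I mean: the binary key/value outlive the
        // accumulate() call - which is what I suspect RETAIN_ENTRIES covers.
        BinaryEntry<?, ?> binEntry = (BinaryEntry<?, ?>) entry;
        entriesByKey.put(binEntry.getBinaryKey(), binEntry.getBinaryValue());
        return true; // keep streaming further entries
    }

    @Override
    public byte[] getPartialResult() {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (Map.Entry<Binary, Binary> e : entriesByKey.entrySet()) {
                digest.update(e.getKey().toByteArray());
                digest.update(e.getValue().toByteArray());
            }
            return digest.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```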

I would also like to know HOW the aggregators that run on the members (storage-enabled nodes) of the cluster are created. Initially I assumed the supply method would be called to provide an aggregator instance that is then serialized, sent, and executed as many times as requested (BY_MEMBER or BY_PARTITION), but from my program's behavior I am now guessing that the originally passed-in aggregator object itself is serialized, and that once it is received by each member, its supply method is called there to create one or more "sub-aggregators" - is that right?
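
That guess comes from debug output produced by a variant roughly like this (my own trace code only, nothing from the API):

```java
import com.tangosol.net.CacheFactory;
import com.tangosol.util.InvocableMap;

// Debug-only variant: logging the local member from the constructor and from
// supply(). As far as I can tell from the output, supply() is invoked on the
// members rather than on the client - hence the guess above.
public class TracingDigestAggregator extends DigestAggregator {

    public TracingDigestAggregator() {
        log("constructed");
    }

    @Override
    public InvocableMap.StreamingAggregator<Object, Object, byte[], byte[]> supply() {
        log("supply() called");
        return new TracingDigestAggregator();
    }

    private static void log(String msg) {
        System.out.println(CacheFactory.getCluster().getLocalMember() + ": " + msg);
    }
}
```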

As I understand the documentation, the operations described by the StreamingAggregator interface can logically be grouped as belonging either to the top-level aggregator (supply, combine and finalizeResult) or to the sub-aggregators executed once per member or partition (accumulate, getPartialResult) - is this right?
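
For context, I am invoking it in the straightforward way, so by "top-level aggregator" I mean the instance passed here:

```java
import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.util.filter.AlwaysFilter;

public class DigestExample {
    public static void main(String[] args) {
        NamedCache<Object, Object> cache = CacheFactory.getCache("data");

        // The instance passed here is the "top-level" aggregator: my reading
        // is that combine() and finalizeResult() run against it, while
        // accumulate()/getPartialResult() run on the supply()-ed copies.
        byte[] digest = cache.aggregate(new AlwaysFilter<>(), new DigestAggregator());

        System.out.println("cache digest length: " + digest.length);
    }
}
```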

I would also like to understand whether the accumulate method that takes a streamer is always the one that is called, with its default implementation delegating to the single-entry accumulate method. And how does the aggregate method relate to the accumulate ones?
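
In other words, as I read it, the default streamer variant is roughly equivalent to this (my paraphrase of the Javadoc, not the actual source):

```java
import com.tangosol.util.InvocableMap;
import com.tangosol.util.Streamer;

// My mental model of how the default accumulate(Streamer) relates to the
// single-entry accumulate(Entry): it drains the streamer, stopping early
// when a per-entry call returns false.
public abstract class StreamerAccumulateSketch
        implements InvocableMap.StreamingAggregator<Object, Object, byte[], byte[]> {

    @Override
    public boolean accumulate(Streamer<? extends InvocableMap.Entry<?, ?>> streamer) {
        while (streamer.hasNext()) {
            if (!accumulate(streamer.next())) {
                return false; // a sub-aggregator can cut the stream short
            }
        }
        return true;
    }
}
```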

Finally, is it correct to say that aggregation will always have just two "levels" - first a single top-level aggregator, and then a second tier running either once per member or once per partition? Or is it possible to create deeper architectures, where an aggregator running once per member could in turn specify BY_PARTITION and aggregate a third level of aggregators running once per partition?
