Add documentation for aggregation #122

Merged: 30 commits, Jun 26, 2021
Changes from 5 commits

Commits
197a1d4
Add aggregation docs without FlowGroups
kidrahahjo Feb 24, 2021
61f9c08
Add documentation for aggregation with FlowGroups
kidrahahjo Feb 24, 2021
cf56bd7
Merge remote-tracking branch 'origin/master' into feature/aggregation…
kidrahahjo May 21, 2021
b65f9f3
Address reviews and update docs with current API
kidrahahjo May 21, 2021
cc7de69
Improve wordings
kidrahahjo May 21, 2021
5d4b5c6
Address code review
kidrahahjo May 22, 2021
6d19ec7
Merge remote-tracking branch 'origin/master' into feature/aggregation…
kidrahahjo Jun 24, 2021
120db74
Improvements to doc
kidrahahjo Jun 24, 2021
c527af3
Improve wording
kidrahahjo Jun 24, 2021
a2fa6d8
Update docs/source/aggregation.rst
bdice Jun 25, 2021
971d828
Update docs/source/aggregation.rst
bdice Jun 25, 2021
5a3804e
Update docs/source/aggregation.rst
bdice Jun 25, 2021
1283361
Update docs/source/aggregation.rst
bdice Jun 25, 2021
5d64f3f
Update docs/source/aggregation.rst
bdice Jun 25, 2021
76ec6d4
Update docs/source/aggregation.rst
bdice Jun 25, 2021
c82be52
Update docs/source/aggregation.rst
bdice Jun 25, 2021
7ffd383
Update docs/source/aggregation.rst
bdice Jun 25, 2021
35cbf3b
Update docs/source/aggregation.rst
bdice Jun 25, 2021
c00bb86
Update docs/source/aggregation.rst
bdice Jun 25, 2021
49c31f2
Update docs/source/aggregation.rst
bdice Jun 25, 2021
375b3cd
Update docs/source/aggregation.rst
bdice Jun 25, 2021
7d5b3d5
Update docs/source/aggregation.rst
bdice Jun 25, 2021
573eb10
Update docs/source/aggregation.rst
bdice Jun 25, 2021
16b1c97
Explain that operations are like aggregate operations acting on aggre…
bdice Jun 26, 2021
8047341
Add aggregation to table of contents.
bdice Jun 26, 2021
d16f43a
Rename section to match FlowGroup.
bdice Jun 26, 2021
9cd9995
Unitalicize.
bdice Jun 26, 2021
23576c6
Fix links to pre/post.
bdice Jun 26, 2021
abd7544
Fix intersphinx references.
bdice Jun 26, 2021
aca6752
Use :py: role prefix for consistency with other docs.
bdice Jun 26, 2021
172 changes: 172 additions & 0 deletions docs/source/aggregation.rst
@@ -0,0 +1,172 @@
.. _aggregation:

===========
Aggregation
===========

This chapter provides information about passing aggregates of jobs to operation functions.


.. _aggregator_definition:

aggregator
==========

An :class:`~flow.aggregator` is used as a decorator for an operation function. It defines the type of aggregation to perform when the operation is run or submitted.

.. code-block:: python

    # project.py
    from flow import FlowProject, aggregator

    class Project(FlowProject):
        pass

    # Aggregate operation: receives all jobs in the project at once.
    @aggregator()
    @Project.operation
    def op1(*jobs):
        pass

    # Normal operation: receives a single job.
    @Project.operation
    def op2(job):
        pass

    if __name__ == '__main__':
        Project().main()

By default, when :class:`~flow.aggregator` is used as a decorator with no arguments, a single aggregate containing all the jobs in the project is created.
In the above example, ``op1`` is an *aggregate operation*: all the jobs in the project are passed to it as variadic positional arguments (``*args``). ``op2`` is a *normal operation* that receives a single job as its parameter.


.. _types_of_aggregation:

Types of Aggregation
====================

Currently, **signac-flow** allows users to aggregate jobs by:

- Grouping them on a state point key, an iterable of state point keys whose values define the groupings, or an arbitrary callable of :class:`~signac.contrib.job.Job`.
- Generating aggregates of a given size.
- Using a custom aggregator function when greater flexibility is needed.

Group By
---------

:class:`~flow.aggregator.groupby` allows users to aggregate jobs by grouping them on a state point key, an iterable of state point keys whose values define the groupings, or an arbitrary callable of :class:`~signac.contrib.job.Job`.

.. code-block:: python

    @aggregator.groupby('temperature')
    @Project.operation
    def op3(*jobs):
        pass

In the above example, the jobs are aggregated by the state point key ``temperature``:
all jobs with the same value of ``temperature`` in their state point are placed in the same aggregate.
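
The grouping idea can be sketched in plain Python without **signac-flow**. This is only an illustration of the concept, not the library's implementation: the ``statepoints`` list and the ``group_by_key`` helper are hypothetical stand-ins for real jobs and for ``aggregator.groupby``.

```python
from collections import defaultdict

# Hypothetical stand-ins for job state points; these are not real signac jobs.
statepoints = [
    {"temperature": 300, "pressure": 1},
    {"temperature": 350, "pressure": 1},
    {"temperature": 300, "pressure": 2},
]


def group_by_key(statepoints, key):
    """Collect the state points that share the same value for ``key``."""
    groups = defaultdict(list)
    for sp in statepoints:
        groups[sp[key]].append(sp)
    # Each list of state points plays the role of one aggregate.
    return list(groups.values())


aggregates = group_by_key(statepoints, "temperature")
for aggregate in aggregates:
    print(aggregate)
```

Here the two state points with ``temperature`` 300 end up in one aggregate and the remaining state point forms an aggregate of its own.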

Groups Of
---------

:class:`~flow.aggregator.groupsof` allows users to aggregate jobs by generating aggregates of a given size.

.. code-block:: python

    @aggregator.groupsof(2)
    @Project.operation
    def op4(job1, job2=None):
        pass

In the above example, the jobs are aggregated in groups of 2, so up to two jobs are passed to the operation at once.

.. note::

    If the number of jobs in the project is odd, the last aggregate contains only a single job. Parameters without default values should therefore be used with care in an *aggregate operation*.
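
The effect of an odd number of jobs can be seen in a small plain-Python sketch. The ``groups_of`` helper and the string job ids are hypothetical and only mimic the chunking behavior described above. They are not part of **signac-flow**.

```python
def groups_of(items, size):
    """Split ``items`` into consecutive chunks of at most ``size`` elements."""
    return [tuple(items[i:i + size]) for i in range(0, len(items), size)]


def op(job1, job2=None):
    """Mimic an aggregate operation that tolerates a final aggregate of one job."""
    return (job1, job2)


# Five hypothetical job ids: the last aggregate holds a single job.
job_ids = ["a", "b", "c", "d", "e"]
aggregates = groups_of(job_ids, 2)

# Each aggregate is unpacked into the operation's parameters.
results = [op(*aggregate) for aggregate in aggregates]
print(results)
```

Without the default value for ``job2``, the call for the final aggregate would raise a ``TypeError`` because only one argument is supplied.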

@bdice (Member), Jun 25, 2021

We could add more examples for some or all of the following:

  • Group by state point keys: The aggregates are grouped by multiple state point keys.
  • Group by arbitrary key function: The aggregates are grouped by keys determined by a key-function that expects an instance of :class:`~.signac.contrib.job.Job` and returns the grouping key.
  • Using a completely custom aggregator function when even greater flexibility is needed.
  • Using sorting/selection in conjunction with other aggregator parameters.

@bdice (Member), Jun 26, 2021

I created a new issue for this. #146.

Sorting jobs for aggregation
----------------------------

**signac-flow** allows users to sort jobs before the aggregates are created: the ``sort_by`` parameter names the state point parameter to sort by, and the ``sort_ascending`` parameter sets the direction of the sort.
By default, when no ``sort_by`` parameter is specified, the jobs are aggregated in the order in which they are iterated in a **signac** project.

.. code-block:: python

    @aggregator.groupsof(2, sort_by='temperature', sort_ascending=False)
    @Project.operation
    def op5(job1, job2):
        pass

Member:

Do we need to use *jobs or job2=None to support a final aggregate with one job?
(I also worry that showing examples with job1, job2 will be more confusing than *jobs.)

Collaborator Author:

Since we've made our point of using non default arguments carefully, I think we should go with *jobs here.

.. note::

    In the above example, all the jobs will be sorted by the state point parameter ``temperature`` in descending order and then be aggregated as groups of 2.
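
The sort-then-group order can be sketched in plain Python. The ``statepoints`` list is a hypothetical stand-in for jobs, and the code only illustrates the described behavior, not how **signac-flow** implements it.

```python
# Hypothetical state points standing in for jobs.
statepoints = [{"temperature": t} for t in (300, 400, 350, 250)]

# Step 1: sort by temperature in descending order (sort_ascending=False).
ordered = sorted(statepoints, key=lambda sp: sp["temperature"], reverse=True)

# Step 2: form groups of 2 from the sorted sequence.
aggregates = [tuple(ordered[i:i + 2]) for i in range(0, len(ordered), 2)]

# Show only the temperatures in each aggregate.
temperatures = [[sp["temperature"] for sp in aggregate] for aggregate in aggregates]
print(temperatures)
```

Because sorting happens before grouping, the two highest temperatures share the first aggregate.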

Selecting jobs for aggregation
------------------------------

**signac-flow** allows users to choose which jobs are passed to operation functions through the ``select`` parameter, a callable that is evaluated for each job.

.. code-block:: python

    # Only jobs with a positive temperature are included in the aggregate.
    @aggregator(select=lambda job: job.sp.temperature > 0)
    @Project.operation
    def op6(*jobs):
        pass
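
Selection can be pictured as a filter applied before aggregation. The following plain-Python sketch uses bare temperature values in place of jobs, and the ``select`` function is a hypothetical predicate mirroring the lambda above.

```python
# Hypothetical temperatures standing in for job.sp.temperature of four jobs.
temperatures = [-10, 0, 25, 300]


def select(temperature):
    """Keep only entries with a positive temperature."""
    return temperature > 0


# Entries rejected by the predicate never reach the aggregate.
aggregate = tuple(t for t in temperatures if select(t))
print(aggregate)
```

The number of selected jobs is usually not known in advance, which is why accepting ``*jobs`` is the safer signature for such an operation.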


.. _aggregate_id:

Aggregate ID
============

Similar to the concept of a job id, an aggregate id is a unique hash identifying an aggregate of jobs.
The aggregate id is sensitive to the order of the jobs in the aggregate.


.. note::

    The id of an aggregate containing one job is that job's id.

To distinguish aggregate ids from job ids, the id of an aggregate containing more than one job always has the prefix ``agg-``.

Users can generate the aggregate id of an aggregate using :func:`flow.get_aggregate_id`.
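
The properties described above (order sensitivity, the single-job case, and the ``agg-`` prefix) can be sketched in plain Python. The ``sketch_aggregate_id`` helper is hypothetical: the choice of MD5 and of joining ids with commas is an assumption made for illustration, and the text above does not specify how **signac-flow** computes the hash.

```python
import hashlib


def sketch_aggregate_id(job_ids):
    """Return an illustrative, order-sensitive id for a sequence of job ids."""
    if len(job_ids) == 1:
        # An aggregate of one job is identified by that job's id.
        return job_ids[0]
    # Assumption: hash the comma-joined ids so that order affects the result.
    joined = ",".join(job_ids)
    return "agg-" + hashlib.md5(joined.encode("utf-8")).hexdigest()


print(sketch_aggregate_id(["abc"]))
print(sketch_aggregate_id(["abc", "def"]))
print(sketch_aggregate_id(["def", "abc"]))
```

The last two calls contain the same jobs in a different order and therefore produce different ids.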

.. tip::

    Users can also pass an aggregate id to the ``--job-id`` command-line flag provided by **signac-flow** in ``run``, ``submit``, and ``exec``.


.. _aggregation_with_flow_groups:

Aggregation with FlowGroups
===========================

To associate an aggregator object with a :py:class:`FlowGroup`, **signac-flow** provides a ``group_aggregator`` parameter in :meth:`~flow.FlowProject.make_group`. By default, no aggregation takes place for a :py:class:`FlowGroup`.

.. note::

    Currently, **signac-flow** allows only a single :class:`~flow.aggregator` per group, i.e., all the operations present in a :py:class:`FlowGroup` use the same :class:`~flow.aggregator` object.

.. code-block:: python

    # project.py
    from flow import FlowProject, aggregator

    class Project(FlowProject):
        pass

    # All operations in this group share the group's aggregator.
    group = Project.make_group('agg-group', group_aggregator=aggregator())

    @group
    @aggregator()
    @Project.operation
    def op1(*jobs):
        pass

    # No aggregator of its own: receives a single job when run outside the group.
    @group
    @Project.operation
    def op2(*jobs):
        pass

    if __name__ == '__main__':
        Project().main()

In the above example, when the group ``agg-group`` is executed, all the jobs in the project are passed as arbitrary arguments for ``op1`` and ``op2``. But if only ``op2`` is executed, only a single job is passed as a parameter.