Support requesting TaskGraph in providers #113

Closed
jl-wynen opened this issue Jan 31, 2024 · 11 comments
Labels
enhancement (New feature or request), wontfix (This will not be worked on)

Comments

@jl-wynen
Member

jl-wynen commented Jan 31, 2024

When building metadata for the output (e.g. in scipp/essreflectometry#6), it can be useful or even necessary to inspect the pipeline or task graph. If we want to do everything within the pipeline instead of in a separate step afterwards, we need to be able to request the task graph in a provider.

So I suggest supporting providers like this:

def make_metadata(task_graph: sciline.TaskGraph, other_args...):
    ...

where Sciline detects TaskGraph and automatically provides the current task graph.
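
For illustration, detection could be based on the provider's parameter annotations. A minimal sketch, assuming exactly that rule (wants_task_graph is a hypothetical helper, not existing Sciline API):

import typing

from sciline.task_graph import TaskGraph

def wants_task_graph(provider) -> bool:
    # Hypothetical rule: the provider declares a parameter annotated with
    # TaskGraph, so Sciline would inject the current task graph for it.
    hints = typing.get_type_hints(provider)
    hints.pop('return', None)
    return TaskGraph in hints.values()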

Of course, this is only usable if we have methods to inspect the task graph, as discussed in #107.

@jl-wynen added the enhancement (New feature or request) label on Jan 31, 2024
@SimonHeybrock
Member

Without having thought too deeply about it, I think this is likely a dangerous idea that might cause a lot of pain. Can you explain why other solutions are not feasible?

@jl-wynen
Member Author

Since make_metadata needs to inspect the actual task graph to figure out which corrections were applied, it cannot run as a provider right now.

The only alternative I see is doing this 'in post'. E.g.

pipeline = ...
task_graph = pipeline.get(Result)
result = task_graph.compute(Result)
metadata = make_metadata(task_graph)
save(result, metadata)

However, make_metadata needs access to intermediate results (in ORSO: the wavelength range and the scattering angle). So it either has to recompute those, or we need to request all of them manually in the code above.
Also, this means that we cannot use bind_and_call for saving the data but have to provide all parameters manually.
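
Requesting the intermediates manually might look roughly like this (WavelengthRange and ScatteringAngle are stand-ins for the actual domain types):

task_graph = pipeline.get((Result, WavelengthRange, ScatteringAngle))
results = task_graph.compute()
metadata = make_metadata(task_graph, results[WavelengthRange], results[ScatteringAngle])
save(results[Result], metadata)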

The only danger I see is that the provider could modify the task graph. But that would be just as illegal as modifying any other input.

@SimonHeybrock
Member

The only danger I see is that the provider could modify the task graph. But that would be just as illegal as modifying any other input.

Apart from the architectural monster that we would be creating, this will be a nightmare to test, maintain, and modify in the future.

Why not a couple lines of Python?

def run_and_save(pipeline: sciline.Pipeline):
    graph = pipeline.get((Result, MetaData))
    results = graph.compute()
    metadata = assemble_metadata(results[MetaData], graph)
    save(results[Result], metadata)

@jl-wynen
Member Author

jl-wynen commented Feb 1, 2024

Because that function needs to be copied and adjusted in every workflow.

Can you elaborate what you mean by 'architectural monster'?

@SimonHeybrock
Member

Because that function needs to be copied and adjusted in every workflow.

How is that different from your suggested make_metadata above? Doesn't that also have to be copied and adjusted for every workflow?

Can you elaborate what you mean by 'architectural monster'?

I think we would need to do something like this, but probably there is more:

  • Add special handling to prevent insertion of providers or params that provide sciline.TaskGraph.
  • In the graph building logic, add special handling if a provider depends on TaskGraph and insert a placeholder.
  • Once the graph is built, iterate over the graph, look for the placeholder, and replace it with a reference to the graph (or probably a copy); a toy sketch of this step follows the list.
  • Add special handling to the visualization code.
  • Figure out how to store such task graphs (which we plan to do), because this looks very non-standard.
  • Write lots of documentation, explaining how the mechanism works, and the pitfalls (such as infinite recursion, ...).
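
To make the placeholder step concrete, a toy sketch on a plain dict-based graph (PLACEHOLDER, inject_task_graph, and the graph layout are purely illustrative, not Sciline internals):

# Illustrative only: a dask-style graph mapping keys to (func, *args) tuples.
PLACEHOLDER = object()

def inject_task_graph(graph: dict) -> dict:
    # After the graph has been built, walk it and replace the placeholder
    # argument with a reference to the (now complete) graph itself.
    out = dict(graph)
    for key, (func, *args) in graph.items():
        out[key] = (func, *(out if a is PLACEHOLDER else a for a in args))
    return out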

Given the simple solution provided above, I do not see a justification for adding such complexity.

@jl-wynen
Member Author

jl-wynen commented Feb 1, 2024

How is that different from your suggested make_metadata above? Doesn't that also have to be copied and adjusted for every workflow?

I suspect the automatic metadata is rarely good enough on its own. So I was thinking of having a post-processing step that a user can hook into by adding their own provider to the graph. With your function, that would look like this:

def run_and_save(pipeline: sciline.Pipeline, post_process: Callable[[MetaData], MetaData]):
    graph = pipeline.get((Result, MetaData))
    results = graph.compute()
    metadata = assemble_metadata(results[MetaData], graph)
    metadata = post_process(metadata)
    save(results[Result], metadata)

So it is possible but builds a new customisation mechanism outside of the graph.

And then there is the question of whether a single Result type makes sense, or whether we compute several distinct results and write them all out to the same (set of) files, all sharing the same metadata.

I think we would need to do something like this, but probably there is more:

How about every pipeline starting with a provider for TaskGraph? When building the actual graph, that provider would need to be given a (weak) ref to the task graph. Everything else would stay the same.

and the pitfalls (such as infinite recursion, ...).

How would this lead to an infinite recursion?

@SimonHeybrock
Member

How is that different from your suggested make_metadata above? Doesn't that also have to be copied and adjusted for every workflow?

I suspect the automatic metadata is rarely good enough on its own. So I was thinking of having a post-processing step that a user can hook into by adding their own provider to the graph. With your function, that would look like this:

def run_and_save(pipeline: sciline.Pipeline, post_process: Callable[[MetaData], MetaData]):
    graph = pipeline.get((Result, MetaData))
    results = graph.compute()
    metadata = assemble_metadata(results[MetaData], graph)
    metadata = post_process(metadata)
    save(results[Result], metadata)

So it is possible but builds a new customisation mechanism outside of the graph.

And then there is the question of whether a single Result type makes sense, or whether we compute several distinct results and write them all out to the same (set of) files, all sharing the same metadata.

I am not sure what you are trying to say here, or what would be better about the self-referential solution.

I think we would need to do something like this, but probably there is more:

How about every pipeline starting with a provider for TaskGraph? When building the actual graph, that provider would need to be given a (weak) ref to the task graph. Everything else would stay the same.

Not sure what that changes. You would then also need to add special-case handling to the "check that all inputs/providers are used" logic, requiring modifications to the graph structure analysis, ...

and the pitfalls (such as infinite recursion, ...).

How would this lead to an infinite recursion?

Call graph.compute() in make_metadata?
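
For instance, a provider along these lines (purely hypothetical, just to illustrate the pitfall) would never terminate:

def make_metadata(task_graph: TaskGraph) -> MetaData:
    # Computing MetaData requires calling make_metadata, which computes
    # MetaData again, and so on: infinite recursion.
    return task_graph.compute(MetaData)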

@jl-wynen
Member Author

jl-wynen commented Feb 1, 2024

Not sure what that changes. You would then also need to add special-case handling to the "check that all inputs/providers are used" logic, requiring modifications to the graph structure analysis, ...

How so? I'm suggesting a provider like this:

from typing import NewType
import weakref

from sciline import Pipeline
from sciline.task_graph import TaskGraph

class TaskGraphProvider:
    def __init__(self) -> None:
        self._task_graph: weakref.ReferenceType | None = None

    def set_task_graph(self, tg: TaskGraph) -> None:
        self._task_graph = weakref.ref(tg)

    def __call__(self) -> TaskGraph:
        if self._task_graph is None:
            raise RuntimeError("Broken task graph: TaskGraph reference has not been initialized")
        if (tg := self._task_graph()) is None:
            raise RuntimeError("Broken task graph: TaskGraph reference has been cleared")
        return tg

RawData = NewType('RawData', int)
ReducedData = NewType('ReducedData', int)

def load() -> RawData:
    return RawData(3)

def reduce(corrected: RawData) -> ReducedData:
    return ReducedData(corrected + 3)

pl = Pipeline((load, reduce))
pl[TaskGraph] = TaskGraphProvider()

tg = pl.get(ReducedData)
find_task_graph_provider(tg).set_task_graph(tg)  # hypothetical helper, see note below

result = tg.compute(ReducedData)

Except that pl[TaskGraph] = TaskGraphProvider() would be done in Pipeline.__init__ and find_task_graph_provider(tg).set_task_graph(tg) in Pipeline.build. Apart from that, everything should work as is. And I think it already would work if #114 were possible.

How would this lead to an infinite recursion?

Call graph.compute() in make_metadata?

Only if the requested output depends on the task graph. In that case, yes, it should of course not be done. But I don't see that as a strong reason against supporting this.

@SimonHeybrock
Member

I am not claiming it is impossible, but I still don't see any reason for significantly complicating the design/architecture and adding "unnecessary" features, given that there is a very simple solution that solves the "problem" differently.

Have you actually tried this in practice, and gained experience that such a solution is clearly not feasible?

Unless it turns out not to be feasible, my answer would be a categorical "No, this will not be implemented".

@jl-wynen
Member Author

jl-wynen commented Feb 5, 2024

I have not tried it yet. But this will be part of the ORSO file writer: scipp/essreflectometry#27.

@SimonHeybrock
Member

I suggest we try it for 6-12 months in practice. Marking as won't fix until we have information to the contrary.

@github-project-automation bot moved this from Next to Done in Development Board on Feb 6, 2024
@SimonHeybrock added the wontfix (This will not be worked on) label on Feb 6, 2024