Add a method for connecting providers to different inputs #192

jl-wynen · 2025-01-22T14:47:16Z

This is only a draft because implementing proper support for generics is non-trivial and I would like to get feedback before doing so. However, even in its current state, this implementation is useful.

This PR proposes a mechanism for replacing the inputs of providers with arbitrary nodes. I initially outlined the idea on Slack after our discussion about workflows on 2025-01-15. override_inputs allows side-stepping domain types and telling Sciline to use a different type instead.

Benefits

'Lying' domain types

We have some pipelines where users can switch off operations. E.g., footprint correction in ESSreflectometry. Doing so keeps a FootprintCorrectedData node in the graph which is misleading.

Given a provider like

def correct_by_footprint(raw: RawData, footprint: Footprint) -> FootprintCorrectedData:
    ...

and using

pl = sciline.Pipeline([correct_by_footprint, ...])
pl = pl.override_input(FootprintCorrectedData, RawData)

effectively removes footprint correction from the graph. The resulting pipeline still contains correct_by_footprint and FootprintCorrectedData but they are no longer ancestors to any other nodes. This means that the correction won't be used by compute. And it won't show up in visualisations or when serialising to JSON.

Choosing alternative algorithms

In powder diffraction, we allow the user to select between normalisation by proton charge and monitor. This can be achieved like so:

Create dedicated domain types ProtonChargeNormalizedRunData and HistogramMonitorNormalizedRunData and use those as return types of normalisation providers.
Create a type NormalizedRunData and use that as input type for all providers consuming normalised data.
Create a pipeline and add all providers for all normalisations.
Select a normalisation using

pl = pl.override_input(NormalizedRunData, ProtonChargeNormalizedRunData)

Any providers that request NormalizedRunData will now get ProtonChargeNormalizedRunData instead.

I implemented this in TODO

Downsides

This is a one time operation.

The providers are not removed. This means that adding providers later can reintroduce the overridden type. This can partially be ameliorated by removing the overridden nodes. Or fixed full by creating an alias instead that affects all providers added later. (A bit like aliases in coord transforms)

No type safety

We normally benefit from some level of type safety by relying on type hints. This is lost if we allow arbitrary overrides. I don't think this is a major problem given that Dask has no type safety either.

Complex implementation

Especially for generic types. This may not be worth the effort. But as I showed above, even a simple implementation fixes the most pressing issues.

SimonHeybrock · 2025-01-23T06:05:10Z

'Lying' domain types

We have some pipelines where users can switch off operations. E.g., footprint correction in ESSreflectometry. Doing so keeps a FootprintCorrectedData node in the graph which is misleading.

Given a provider like
def correct_by_footprint(raw: RawData, footprint: Footprint) -> FootprintCorrectedData:
    ...
and using
pl = sciline.Pipeline([correct_by_footprint, ...])
pl = pl.override_input(FootprintCorrectedData, RawData)
effectively removes footprint correction from the graph. The resulting pipeline still contains correct_by_footprint and FootprintCorrectedData but they are no longer ancestors to any other nodes. This means that the correction won't be used by compute. And it won't show up in visualisations or when serialising to JSON.

I need some more time to wrap my head around this, but I feel this mechanism is backwards. What you don't show above is the follow-up, namely a provider

def next_correction(da: FootprintCorrectedData) -> MoreCorrectedData: ...

There are two cases:

The type hint is correct and the provider really needs FootprintCorrectedData.
- Adding an alias with override_input would be incorrect.
The type hint is a lie, since this works also with data that does not have the previous correction.
- Adding an alias would be "correct", in the sense that it covers up the lying type hint. But the type hint is still wrong in the Python module, including in unit tests that to not operate on the Sciline level.

Taken together, the mechanism is thus either incorrect (case 1), or we have covered up one lie with a second lie (case 2). All this also only works because the types are all aliases for the same underlying data type. While I have no concrete suggestion for now, I feel it would be better to look into how the problems in the underlying code (definition of next_correction) can be addressed.

Maybe the root cause of our trouble is that what the functions (providers) actually want is to define a certain input contract. For lack of a better mechanism we have done so using domain types, but this is a "one dimensional" approach and too entangled into how Sciline builds the graphs. Is there a better way to define input contracts and relate them to a graph structure that does not have this problem?

jl-wynen · 2025-01-23T08:34:18Z

I need some more time to wrap my head around this, but I feel this mechanism is backwards. What you don't show above is the follow-up, namely a provider
def next_correction(da: FootprintCorrectedData) -> MoreCorrectedData: ...
There are two cases:
1. The type hint is correct and the provider really needs `FootprintCorrectedData`.
   
   * Adding an alias with `override_input` would be incorrect.

2. The type hint is a lie, since this works also with data that does not have the previous correction.
   
   * Adding an alias would be "correct", in the sense that it covers up the lying type hint. But the type hint is still wrong in the Python module, including in unit tests that to not operate on the Sciline level.
Taken together, the mechanism is thus either incorrect (case 1), or we have covered up one lie with a second lie (case 2). All this also only works because the types are all aliases for the same underlying data type. While I have no concrete suggestion for now, I feel it would be better to look into how the problems in the underlying code (definition of next_correction) can be addressed.

I agree that this proposal can't work in all cases. It is a global change of the graph. So it can't deal with local changes that you described. When it comes to skipping steps, I see two distinct cases:

A user wants to experiment with what would happen if a certain step was not performed. In this case, quickly rewriting the graph is convenient. But it does not matter that the final result may be wrong. It was never intended to skip the step. And regardless of whether we skip it globally or only for part of the graph, the final result will be wrong to some degree.

There is an optional step. In this case, we need to build the pipeline around this fact. One way to do this is to turn the types around. Instead of using override_input to disable a step, we can use it to enable a step. This is more akin to the example in diffraction. That is, we use

def fn_that_can_use_any_data(da: RawData) -> Processed:  ...
def fn_that_needs_footprint_corrected(da: FootprintCorrectedData) -> MoreCorrectedData: ...

and

pl = pl.override_input(RawData, FootprintCorrectedData)

This would insert the footprint correction before any use of RawData. And fn_that_needs_footprint_corrected would still get the data it expects with or without using override_inputs.

jl-wynen · 2025-05-22T13:09:00Z

Closing as there seems to be no interest in pursuing this further.

Start DataGraph.override_input

f046fd0

jl-wynen mentioned this pull request Jan 22, 2025

Select run norm via override_input scipp/essdiffraction#121

Closed

Add another test

777218b

SimonHeybrock mentioned this pull request May 1, 2025

Propose pattern for pipelines with multiple/optional correction steps #211

Draft

jl-wynen closed this May 22, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add a method for connecting providers to different inputs #192

Add a method for connecting providers to different inputs #192

Uh oh!

jl-wynen commented Jan 22, 2025 •

edited

Loading

Uh oh!

SimonHeybrock commented Jan 23, 2025 •

edited

Loading

'Lying' domain types

Uh oh!

jl-wynen commented Jan 23, 2025

Uh oh!

jl-wynen commented May 22, 2025

Uh oh!

Uh oh!

Add a method for connecting providers to different inputs #192

Add a method for connecting providers to different inputs #192

Uh oh!

Conversation

jl-wynen commented Jan 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benefits

'Lying' domain types

Choosing alternative algorithms

Downsides

This is a one time operation.

No type safety

Complex implementation

Uh oh!

SimonHeybrock commented Jan 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

'Lying' domain types

Uh oh!

jl-wynen commented Jan 23, 2025

Uh oh!

jl-wynen commented May 22, 2025

Uh oh!

Uh oh!

jl-wynen commented Jan 22, 2025 •

edited

Loading

SimonHeybrock commented Jan 23, 2025 •

edited

Loading