Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a method for connecting providers to different inputs #192

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jl-wynen
Copy link
Member

@jl-wynen jl-wynen commented Jan 22, 2025

This is only a draft because implementing proper support for generics is non-trivial and I would like to get feedback before doing so. However, even in its current state, this implementation is useful.

This PR proposes a mechanism for replacing the inputs of providers with arbitrary nodes. I initially outlined the idea on Slack after our discussion about workflows on 2025-01-15. override_inputs allows side-stepping domain types and telling Sciline to use a different type instead.

Benefits

'Lying' domain types

We have some pipelines where users can switch off operations. E.g., footprint correction in ESSreflectometry. Doing so keeps a FootprintCorrectedData node in the graph which is misleading.

Given a provider like

def correct_by_footprint(raw: RawData, footprint: Footprint) -> FootprintCorrectedData:
    ...

and using

pl = sciline.Pipeline([correct_by_footprint, ...])
pl = pl.override_input(FootprintCorrectedData, RawData)

effectively removes footprint correction from the graph. The resulting pipeline still contains correct_by_footprint and FootprintCorrectedData but they are no longer ancestors to any other nodes. This means that the correction won't be used by compute. And it won't show up in visualisations or when serialising to JSON.

Choosing alternative algorithms

In powder diffraction, we allow the user to select between normalisation by proton charge and monitor. This can be achieved like so:

  1. Create dedicated domain types ProtonChargeNormalizedRunData and HistogramMonitorNormalizedRunData and use those as return types of normalisation providers.
  2. Create a type NormalizedRunData and use that as input type for all providers consuming normalised data.
  3. Create a pipeline and add all providers for all normalisations.
  4. Select a normalisation using
pl = pl.override_input(NormalizedRunData, ProtonChargeNormalizedRunData)

Any providers that request NormalizedRunData will now get ProtonChargeNormalizedRunData instead.

I implemented this in TODO

Downsides

This is a one time operation.

The providers are not removed. This means that adding providers later can reintroduce the overridden type. This can partially be ameliorated by removing the overridden nodes. Or fixed full by creating an alias instead that affects all providers added later. (A bit like aliases in coord transforms)

No type safety

We normally benefit from some level of type safety by relying on type hints. This is lost if we allow arbitrary overrides. I don't think this is a major problem given that Dask has no type safety either.

Complex implementation

Especially for generic types. This may not be worth the effort. But as I showed above, even a simple implementation fixes the most pressing issues.

@SimonHeybrock
Copy link
Member

SimonHeybrock commented Jan 23, 2025

'Lying' domain types

We have some pipelines where users can switch off operations. E.g., footprint correction in ESSreflectometry. Doing so keeps a FootprintCorrectedData node in the graph which is misleading.

Given a provider like

def correct_by_footprint(raw: RawData, footprint: Footprint) -> FootprintCorrectedData:
    ...

and using

pl = sciline.Pipeline([correct_by_footprint, ...])
pl = pl.override_input(FootprintCorrectedData, RawData)

effectively removes footprint correction from the graph. The resulting pipeline still contains correct_by_footprint and FootprintCorrectedData but they are no longer ancestors to any other nodes. This means that the correction won't be used by compute. And it won't show up in visualisations or when serialising to JSON.

I need some more time to wrap my head around this, but I feel this mechanism is backwards. What you don't show above is the follow-up, namely a provider

def next_correction(da: FootprintCorrectedData) -> MoreCorrectedData: ...

There are two cases:

  1. The type hint is correct and the provider really needs FootprintCorrectedData.
    • Adding an alias with override_input would be incorrect.
  2. The type hint is a lie, since this works also with data that does not have the previous correction.
    • Adding an alias would be "correct", in the sense that it covers up the lying type hint. But the type hint is still wrong in the Python module, including in unit tests that to not operate on the Sciline level.

Taken together, the mechanism is thus either incorrect (case 1), or we have covered up one lie with a second lie (case 2). All this also only works because the types are all aliases for the same underlying data type. While I have no concrete suggestion for now, I feel it would be better to look into how the problems in the underlying code (definition of next_correction) can be addressed.

Maybe the root cause of our trouble is that what the functions (providers) actually want is to define a certain input contract. For lack of a better mechanism we have done so using domain types, but this is a "one dimensional" approach and too entangled into how Sciline builds the graphs. Is there a better way to define input contracts and relate them to a graph structure that does not have this problem?

@jl-wynen
Copy link
Member Author

I need some more time to wrap my head around this, but I feel this mechanism is backwards. What you don't show above is the follow-up, namely a provider

def next_correction(da: FootprintCorrectedData) -> MoreCorrectedData: ...

There are two cases:

1. The type hint is correct and the provider really needs `FootprintCorrectedData`.
   
   * Adding an alias with `override_input` would be incorrect.

2. The type hint is a lie, since this works also with data that does not have the previous correction.
   
   * Adding an alias would be "correct", in the sense that it covers up the lying type hint. But the type hint is still wrong in the Python module, including in unit tests that to not operate on the Sciline level.

Taken together, the mechanism is thus either incorrect (case 1), or we have covered up one lie with a second lie (case 2). All this also only works because the types are all aliases for the same underlying data type. While I have no concrete suggestion for now, I feel it would be better to look into how the problems in the underlying code (definition of next_correction) can be addressed.

I agree that this proposal can't work in all cases. It is a global change of the graph. So it can't deal with local changes that you described. When it comes to skipping steps, I see two distinct cases:

A user wants to experiment with what would happen if a certain step was not performed. In this case, quickly rewriting the graph is convenient. But it does not matter that the final result may be wrong. It was never intended to skip the step. And regardless of whether we skip it globally or only for part of the graph, the final result will be wrong to some degree.

There is an optional step. In this case, we need to build the pipeline around this fact. One way to do this is to turn the types around. Instead of using override_input to disable a step, we can use it to enable a step. This is more akin to the example in diffraction. That is, we use

def fn_that_can_use_any_data(da: RawData) -> Processed:  ...
def fn_that_needs_footprint_corrected(da: FootprintCorrectedData) -> MoreCorrectedData: ...

and

pl = pl.override_input(RawData, FootprintCorrectedData)

This would insert the footprint correction before any use of RawData. And fn_that_needs_footprint_corrected would still get the data it expects with or without using override_inputs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants