Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add method to get provider from type #135

Closed
wants to merge 12 commits into from
Closed

Conversation

jokasimr
Copy link
Contributor

Fixes #98 and #108

@jokasimr
Copy link
Contributor Author

I think that now after adding the type matching logic we don't need to treat subproviders separately anymore.

That would simplify adding the __getitem__ and __contains__ methods. So therefore I'm considering making that refactoring here, what do you think @SimonHeybrock ?

Comment on lines 542 to 547
matches = [
(provider, bound)
for ptype, provider in self._providers.items()
if (bound := _find_bounds_to_make_compatible_type_tuple((tp,), (ptype,)))
is not None
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This replaces a constant-time lookup by a linear-time iteration. And this is called for every requirement, recursively. That is we can the complexity of Pipeline.get from something like O(N) to O(N^2).

I do not know how large pipelines will get, but this may be severe problem we should not simply ignore.

Can you somehow keep the rest of your refactor, but still have a constant-time lookup by default, falling back to iteration only if there is no match?

Copy link
Contributor Author

@jokasimr jokasimr Feb 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely correct. We could group the providers by their origin type ahead of the iteration to achieve the same effect as the old constant time lookup.

However, I tested this for the esssans workflow using

%%timeit
iofq = pipeline.get(BackgroundSubtractedIofQ)

and the result is that the time goes from 2ms to 10ms.

I'd say that's not particularly significant and right now I don't think it is worth complicating the code for.
We can always add this in the future if it becomes an issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I disagree. The ESSsans workflow is "tiny", and I am surprised that even there this takes 5x the time! Since the effect is quadratic I do not think we should ignore this.

Why do yousuggest we would have to pre-group? Can't we look for an exact match, simply calling self._providers.get(tp) and use that if found?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we look for an exact match, simply calling self._providers.get(tp) and use that if found?

Yeah sure we can do that.

I don't agree 5x is dramatic. Consider that if the effect was truly O(n^2) we could expect a slowdown proportional to the number of providers in the pipeline, that is, >50.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we look for an exact match, simply calling self._providers.get(tp) and use that if found?

Yeah sure we can do that.

I don't agree 5x is dramatic. Consider that if the effect was truly O(n^2) we could expect a slowdown proportional to the number of providers in the pipeline, that is, >50.

I would not expect 50x, since the code also does other things, apart from provider lookup.

But to clarify:

  • Switching to an algorithm with much worse scaling (here O(N^2) instead ofO(N) should never be done lightheartedly and without great care, in particular if we are not in full control over all the contexts where the algorithm is used (see next point).
  • Sciline is a generic package that is meant to be useful well beyond the context of the neutron-scattering data-reduction workflows that we are writing. Therefore, the benchmark you ran above is not a sufficient data point (or rather, it shows that we should not make the change).

Comment on lines 542 to 544
if tp in self._providers:
# Optimization to quickly find non-generic providers
matches = [(self._providers[tp], {})]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicking, but you replaced a single lookup by two lookups. Can we use the walrus approach like the original implementation?

@jokasimr jokasimr requested a review from jl-wynen February 20, 2024 09:03
@jokasimr jokasimr self-assigned this Feb 20, 2024
@jokasimr jokasimr marked this pull request as ready for review February 20, 2024 09:04
@jokasimr jokasimr changed the title refactor: remove subprovider concept feat: add method to get provider from type Feb 20, 2024
@@ -434,6 +433,18 @@ def __setitem__(self, key: Type[T], param: T) -> None:
)
self._set_provider(key, Provider.parameter(param))

def __getitem__(self, tp: Union[Type[T], Item[T]]) -> Provider:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes Provider part of the public interface. So can you import it as well as ArgSpec, ProviderLocation, ProviderKind, ToProvider in the top level init.py? (ToProvider maybe should be exposed in typing instead)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm maybe we want to avoid that actually.
How about we make the interface such that it returns the same type as was passed into the pipeline.
That is, if the tp matches a provider with kind='function' it returns a function, and if tp matches a provider with kind='parameter' then it returns the value?

I'm not sure how the parameter tables should fit here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the conceptual issues, I was wrong stating that this makes Pipeline a Mapping. It does not because we would have to also implement __iter__ and __len__. And actually, since there is a __setitem__, if anything, it should be MutableMapping, thus we would also need __delitem__.
However, Pipeline implements part of the interface (and also the interface of Sequence). I'm not entirely sure how it fits in.

More concretely, implementing __getitem__ makes Python think that pipelines are iterable with integer indices:

import sciline as sl

def to_string(i: int) -> str:
    return str(i)

pl = sl.Pipeline([to_string], params={int: 3})
list(pl)

raises

sciline.handler.UnsatisfiedRequirement: ('No provider found for type', 0)

This not great. So you would also need to implement __iter__. And since you would be going for a dict-like interface, you would also need keys, values, and items.

And that leaves us with the asymmetry between __getitem__ and __setitem__. For generic providers, the former currently returns different values and accepts different keys than the latter. Even if __setitem__ were extended to support providers on top of parameters.
Also, what about parameter tables or sentinels? Would __getitem__ return those special, internal providers? Or how would it represent them?

With the proposed semantics, the new functions allow checking what a pipeline can produce and how. So it does not allow treating Pipeline like a container. So named methods seem more appropriate to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes a lot of sense.
How about adding a get_provider method instead of __getitem__?

Also, what about parameter tables or sentinels? Would getitem return those special, internal providers? Or how would it represent them?

I don't know, what do you think? I'm thinking that it's likely that a user wants to access the parameter tables that a pipeline uses, but they should preferably be returned in the same form that they were inserted, i.e. as sl.ParamTable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, what do you think? I'm thinking that it's likely that a user wants to access the parameter tables that a pipeline uses, but they should preferably be returned in the same form that they were inserted, i.e. as sl.ParamTable.

Just like the discussion about using a graph data structure inside Pipeline, this shows that the implicit way in which we handle param-tables is problematic. We may need to re-think that (come up with a more explicit interface, or way the express the nesting?).

@jokasimr jokasimr requested a review from jl-wynen February 23, 2024 09:44
provider = self._get_unique_provider(tp, HandleAsBuildTimeException())[0]
if provider.kind == 'function':
return provider.func
return provider.func()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about other kinds?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's something we need to talk about. What other cases are there, I looked at the ProviderKinds type but didn't recognize all of them. However, I think it's unlikely that a user will request other types, so for now I'm open to just raising NotImplemented in the other cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, users might request Series. But that, along with the other kinds, may be subject to change if param tables are removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's what I thought as well, so I don't think it makes sense to invest time to solve that now.

parameter the method returns the value of the parameter.'''
provider = self._get_unique_provider(tp, HandleAsBuildTimeException())[0]
if provider.kind == 'function':
return provider.func
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does it return the function and not the provider object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I agreed with the concern that the Provider class isn't currently exposed, and returning a Provider here would expose it. I think just returning the same type that was provided by the user keeps things simpler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not concerned about exposing it. This can be useful in its own right.

Your solution makes it difficult to distinguish between providers and parameters. You can't check the type of the return value of get_provider because parameters can be callable and providers can now be any callable object, not just functions. So you would have to compare the returned type to tp.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only concern about expose Provider is that it would increase the api surface area. I agree that it's a problem that you can't generally determine if the returned provider is a parameter or a function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Provide access to code of providers via pipeline object
3 participants