-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add method to get provider from type #135
Conversation
I think that now after adding the type matching logic we don't need to treat subproviders separately anymore. That would simplify adding the |
src/sciline/pipeline.py
Outdated
matches = [ | ||
(provider, bound) | ||
for ptype, provider in self._providers.items() | ||
if (bound := _find_bounds_to_make_compatible_type_tuple((tp,), (ptype,))) | ||
is not None | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This replaces a constant-time lookup by a linear-time iteration. And this is called for every requirement, recursively. That is we can the complexity of Pipeline.get
from something like O(N)
to O(N^2)
.
I do not know how large pipelines will get, but this may be severe problem we should not simply ignore.
Can you somehow keep the rest of your refactor, but still have a constant-time lookup by default, falling back to iteration only if there is no match?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're absolutely correct. We could group the providers by their origin
type ahead of the iteration to achieve the same effect as the old constant time lookup.
However, I tested this for the esssans workflow using
%%timeit
iofq = pipeline.get(BackgroundSubtractedIofQ)
and the result is that the time goes from 2ms to 10ms.
I'd say that's not particularly significant and right now I don't think it is worth complicating the code for.
We can always add this in the future if it becomes an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I disagree. The ESSsans workflow is "tiny", and I am surprised that even there this takes 5x the time! Since the effect is quadratic I do not think we should ignore this.
Why do yousuggest we would have to pre-group? Can't we look for an exact match, simply calling self._providers.get(tp)
and use that if found?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we look for an exact match, simply calling self._providers.get(tp) and use that if found?
Yeah sure we can do that.
I don't agree 5x is dramatic. Consider that if the effect was truly O(n^2)
we could expect a slowdown proportional to the number of providers in the pipeline, that is, >50.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we look for an exact match, simply calling self._providers.get(tp) and use that if found?
Yeah sure we can do that.
I don't agree 5x is dramatic. Consider that if the effect was truly
O(n^2)
we could expect a slowdown proportional to the number of providers in the pipeline, that is, >50.
I would not expect 50x, since the code also does other things, apart from provider lookup.
But to clarify:
- Switching to an algorithm with much worse scaling (here
O(N^2)
instead ofO(N)
should never be done lightheartedly and without great care, in particular if we are not in full control over all the contexts where the algorithm is used (see next point). - Sciline is a generic package that is meant to be useful well beyond the context of the neutron-scattering data-reduction workflows that we are writing. Therefore, the benchmark you ran above is not a sufficient data point (or rather, it shows that we should not make the change).
src/sciline/pipeline.py
Outdated
if tp in self._providers: | ||
# Optimization to quickly find non-generic providers | ||
matches = [(self._providers[tp], {})] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpicking, but you replaced a single lookup by two lookups. Can we use the walrus approach like the original implementation?
src/sciline/pipeline.py
Outdated
@@ -434,6 +433,18 @@ def __setitem__(self, key: Type[T], param: T) -> None: | |||
) | |||
self._set_provider(key, Provider.parameter(param)) | |||
|
|||
def __getitem__(self, tp: Union[Type[T], Item[T]]) -> Provider: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes Provider
part of the public interface. So can you import it as well as ArgSpec, ProviderLocation, ProviderKind, ToProvider in the top level init.py? (ToProvider maybe should be exposed in typing instead)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm maybe we want to avoid that actually.
How about we make the interface such that it returns the same type as was passed into the pipeline.
That is, if the tp
matches a provider with kind='function'
it returns a function, and if tp
matches a provider with kind='parameter'
then it returns the value?
I'm not sure how the parameter tables should fit here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About the conceptual issues, I was wrong stating that this makes Pipeline a Mapping. It does not because we would have to also implement __iter__
and __len__
. And actually, since there is a __setitem__
, if anything, it should be MutableMapping
, thus we would also need __delitem__
.
However, Pipeline implements part of the interface (and also the interface of Sequence
). I'm not entirely sure how it fits in.
More concretely, implementing __getitem__
makes Python think that pipelines are iterable with integer indices:
import sciline as sl
def to_string(i: int) -> str:
return str(i)
pl = sl.Pipeline([to_string], params={int: 3})
list(pl)
raises
sciline.handler.UnsatisfiedRequirement: ('No provider found for type', 0)
This not great. So you would also need to implement __iter__
. And since you would be going for a dict-like interface, you would also need keys
, values
, and items
.
And that leaves us with the asymmetry between __getitem__
and __setitem__
. For generic providers, the former currently returns different values and accepts different keys than the latter. Even if __setitem__
were extended to support providers on top of parameters.
Also, what about parameter tables or sentinels? Would __getitem__
return those special, internal providers? Or how would it represent them?
With the proposed semantics, the new functions allow checking what a pipeline can produce and how. So it does not allow treating Pipeline like a container. So named methods seem more appropriate to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes a lot of sense.
How about adding a get_provider
method instead of __getitem__
?
Also, what about parameter tables or sentinels? Would getitem return those special, internal providers? Or how would it represent them?
I don't know, what do you think? I'm thinking that it's likely that a user wants to access the parameter tables that a pipeline uses, but they should preferably be returned in the same form that they were inserted, i.e. as sl.ParamTable
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know, what do you think? I'm thinking that it's likely that a user wants to access the parameter tables that a pipeline uses, but they should preferably be returned in the same form that they were inserted, i.e. as sl.ParamTable.
Just like the discussion about using a graph data structure inside Pipeline
, this shows that the implicit way in which we handle param-tables is problematic. We may need to re-think that (come up with a more explicit interface, or way the express the nesting?).
provider = self._get_unique_provider(tp, HandleAsBuildTimeException())[0] | ||
if provider.kind == 'function': | ||
return provider.func | ||
return provider.func() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about other kinds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's something we need to talk about. What other cases are there, I looked at the ProviderKinds
type but didn't recognize all of them. However, I think it's unlikely that a user will request other types, so for now I'm open to just raising NotImplemented
in the other cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, users might request Series
. But that, along with the other kinds, may be subject to change if param tables are removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's what I thought as well, so I don't think it makes sense to invest time to solve that now.
parameter the method returns the value of the parameter.''' | ||
provider = self._get_unique_provider(tp, HandleAsBuildTimeException())[0] | ||
if provider.kind == 'function': | ||
return provider.func |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does it return the function and not the provider object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I agreed with the concern that the Provider
class isn't currently exposed, and returning a Provider
here would expose it. I think just returning the same type that was provided by the user keeps things simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not concerned about exposing it. This can be useful in its own right.
Your solution makes it difficult to distinguish between providers and parameters. You can't check the type of the return value of get_provider
because parameters can be callable and providers can now be any callable object, not just functions. So you would have to compare the returned type to tp
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only concern about expose Provider
is that it would increase the api surface area. I agree that it's a problem that you can't generally determine if the returned provider is a parameter or a function.
Fixes #98 and #108