Vmpy #26
base: master
Conversation
Thanks for the PR! I will review your changes. Note to self: enable notifications on personal repositories so that I get notified about PRs... 😅
I added a lot of comments to your PR; don't take it as a sign that I don't like it... Before I consider merging, I think we should align on the hierarchy and dependency between the libraries. I do not know where you want to take vmpy, but I sense a risk of a circular dependency where vmpy wants to use logic from athletic_pandas and vice versa. I want the hierarchy between these packages to be clear before I add the dependency on vmpy.
Hi, how do I see the comments? I don't see anything in the review tab.

Short answer: the idea is to have one general-purpose library with algorithms that can be used anywhere else. Either athletic_pandas serves only the WorkoutDataFrame and all algos move to vmpy, or one package serves both the WDF and the algos.

Long answer: I think we are chasing the same goal, and it would be a shame to do the work twice or, even worse, create a fragmented package space. The only difference I see between the athletic_pandas algos and the vmpy algos is that vmpy preserves the input/output data types. Basically, if I pass a power stream as a list, I expect a list of power zones in return. This is convenient when working with REST resources. If the input is an ndarray, an ndarray comes out, so integration with athletic_pandas is also seamless. It is an extra line of code and extra testing for every function, but I think it helps wide adoption of the package in the long term. In general, I'm rooting for creating one, largely data-type-unopinionated Python package with algorithms, so other packages/apps can build on top of it. What I want to avoid is the same algorithms being implemented in different libraries.

Option 1: Move all algorithms to vmpy; athletic_pandas would only take care of providing a convenient tool for quick data exploration. This PR is an example of how I see this working, wrapping vmpy calls in the module.py methods. I did not do anything with the athletic_pandas algorithm implementations yet and simply added a To-Do comment there to indicate the intention of eventual deprecation if we agree on this strategy.

Option 2: Merge the libraries together and serve both the algorithms and the WorkoutDataFrame. In this case, I believe it is still a good idea to make the algorithms work with any array-like data type and not only pd.Series, for ease of integration.

What do you think?
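To make Option 1 concrete, here is a minimal sketch of the wrapper pattern: the WorkoutDataFrame method contains no business logic and just delegates to an algorithm function. `algo_mean_max_power` is a stand-in for a vmpy call; all names and signatures here are illustrative, not the PR's actual code.

```python
import pandas as pd

def algo_mean_max_power(power, duration):
    """Stand-in for the external algorithm (e.g. a vmpy metric):
    best average power over a rolling window of `duration` samples."""
    s = pd.Series(power, dtype=float)
    return s.rolling(duration).mean().max()

class WorkoutDataFrame(pd.DataFrame):
    # No business logic here: just column plumbing around the algo call,
    # so the same algorithm stays usable outside the WDF.
    def mean_max_power(self, duration):
        return algo_mean_max_power(self['power'], duration)

wdf = WorkoutDataFrame({'power': [100, 200, 300, 400]})
best = wdf.mean_max_power(duration=2)
```

The point of the pattern is that `algo_mean_max_power` remains callable on its own, while the WDF method only adds convenience.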
Parameters
----------
power: pd.Series
True for now, but I probably want to change it to ndarray. Do you agree that's better?
Good that you are adding docstrings, I did not get to that yet...
Comments are based on what it does at the moment. In general, I prefer to have it array-like {list, ndarray, pd.Series}, with the output having the same type as the input.
-------
pd.Series

TO-DO: to be replaced by vmpy.algorithms.power_duration_curve()
This module does not exist yet right? What's your idea behind this module?
Also, what is the difference between your implementation and mine?
It does; I've moved it to vmpy.metrics.power_duration_curve() at the last moment. It calculates the accumulated energy first and then takes the difference of two shifted arrays. That speeds the calculation up by about 4x: an 8-hour ride takes about 10 seconds instead of almost a minute on my PC.
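A sketch of the cumulative-energy trick described above (an assumed reconstruction; the actual vmpy.metrics.power_duration_curve() may differ): the running energy is computed once, and every window sum then becomes a single vectorized difference of two shifted slices of that array.

```python
import numpy as np

def power_duration_curve(power):
    """Mean-max power for every window length (illustrative sketch)."""
    power = np.asarray(power, dtype=float)
    # energy[i] = sum of the first i samples; computed once, reused for every window
    energy = np.concatenate(([0.0], np.cumsum(power)))
    n = len(power)
    curve = np.empty(n)
    for w in range(1, n + 1):
        # all window sums of length w in one vectorized shifted diff
        window_sums = energy[w:] - energy[:-w]
        curve[w - 1] = window_sums.max() / w
    return curve
```

The naive alternative recomputes each window mean from scratch, which is where the reported speedup comes from.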
Oh fancy! Did you benchmark it against my implementation or yours?
@@ -45,10 +75,39 @@ def mean_max_bests(power, duration, amount):


def weighted_average_power(power):
    """Weighted average power

    Also known as the Normalized Power
Normalized Power®️ (😉) is a trademarked name and cannot be used commercially without permission from Peaksware. That reason alone is enough for me to avoid using that name.
Ok, didn't know that, but knowing Coggan's behavior on the forums I should have guessed it. WAP is good, fits well with the Strava community.
-------
number

replaced by vmpy.metrics.normalized_power
Idem: What are the differences between your implementation and mine?
For NP, none, but the vmpy function can also calculate xPower, which uses an exponentially weighted average.
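To illustrate the relation between the two variants (a hedged sketch; the real vmpy signature and smoothing windows may differ): the classic weighted average power smooths with a 30 s rolling mean, while xPower swaps in a 25 s exponentially weighted average before the fourth-power mean.

```python
import pandas as pd

def weighted_average_power(power, kind="WAP"):
    # Illustrative sketch, not the actual vmpy implementation.
    s = pd.Series(power, dtype=float)
    if kind == "WAP":
        # classic formulation: 30 s rolling mean
        smoothed = s.rolling(30, min_periods=1).mean()
    else:
        # xPower-style: exponentially weighted average (25 s span assumed)
        smoothed = s.ewm(span=25, min_periods=1).mean()
    # fourth-power mean, then fourth root
    return (smoothed ** 4).mean() ** 0.25
```

Only the smoothing step differs, which is why one function with a `kind` switch can serve both.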
class WorkoutDataFrame(BaseWorkoutDataFrame):
    _metadata = ['athlete']

    @requires(columns=['power'])
    def compute_power_zones(self, **kwargs):
I think this method touches a sensitive subject ;) : In my view the WorkoutDataFrame should not contain any business logic and should call a method from algorithms as directly as possible. That means adding a method to the algorithms module (which in turn could call vmpy) and call it from here. The WorkoutDataFrame is just a more high level way to work with these algorithms but a user should be able to use all logic in athletic_pandas outside of the WorkoutDataFrame.
Does that make sense?
It makes sense, especially in light of the email I wrote yesterday. Let's see how we separate Frame from the algos in the most efficient way.
class WorkoutDataFrame(BaseWorkoutDataFrame):
    _metadata = ['athlete']

    @requires(columns=['power'])
Although I am not sure I still like the @requires decorator and consider removing it, for now it should also contain athlete=['ftp'].
In general, I think the requires decorator limits the usage. If you get data from Strava, for example, the power stream is called 'watts' there. I would propose to ditch requires in favor of sane defaults with the ability to override them in the function call, something along the lines of wdf.calculate_awesome_power(arg='power', **kwargs).
In this particular case I've removed the athlete requirement to allow optionally specifying 'ftp' via keywords. It falls back on athlete['ftp'] as a default.
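A minimal sketch of the proposed default-with-override pattern (function and column names here are illustrative, not athletic_pandas code):

```python
import pandas as pd

def mean_power(df, column='power'):
    # Sane default: the conventional 'power' column; the caller can point
    # the function at a differently named column (e.g. Strava's 'watts').
    return df[column].mean()

# A frame as it might arrive from Strava, where the stream is called 'watts'
strava_df = pd.DataFrame({'watts': [100, 200, 300]})
avg = mean_power(strava_df, column='watts')  # override the default name
```

The default keeps the common case zero-configuration, while the keyword makes renaming columns unnecessary.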
I think this is really similar to the discussion about accepting multiple input argument types, and I am not in favor of it. I think as a developer you have a responsibility to make the interface of your library as well defined as possible; accepting multiple input types, or accepting that certain data is stored in different columns, makes the interface unclear and therefore more difficult to use. For me this would also make the purpose of the dataframe redundant, as you cannot just use the class methods "as is" without understanding the underlying data structure.
Anyway, I am writing a longer response where I will dive into this (even) more...
> accepting that certain data is stored in different columns makes the interface unclear and therefore more difficult to use

The problem with too-simple interfaces is that they limit the use. By requiring 'power' only, you force the user to figure out how to rename the columns in the pd.DataFrame to get it working. The power/watts duality is one example forced on us by Strava, but with heart rate you get a full multitude of possibilities: {hr, bpm, heart_rate, you_name_it}.

The methods can simply be written with defaults in mind, as def method(self, arg='power'), and can be called as wdf.method() for the defaults, or with a user-defined name for other cases: wdf.method(arg='watts'). This pattern is in fact not that different from how pandas already exposes matplotlib methods.

Another aspect is the sole reliance on the Athlete object. I think it is less canonical, and therefore less obvious for the user, to expect that the only way to parametrize wdf methods is through the Athlete object. It is a great placeholder for defaults, but it is also quite natural to expect the parameters to be exposed in the function signature.

Imagine you want to compare two W'bal calculations on the same data set, but with different W' and CP values. The only way to do it now is by redefining the Athlete properties, while I would naturally expect W' and CP to be available as inputs to the method call.
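For illustration, a sketch of what such a parameterized call could look like, using the differential W'bal model with CP and W' as explicit keyword arguments rather than Athlete properties (the model choice and all names are assumptions, not athletic_pandas code):

```python
def w_prime_balance(power, cp, w_prime):
    """Differential W'bal model (illustrative sketch).

    Above CP the balance is drained by the power surplus; below CP it
    recovers proportionally to how depleted it is.
    """
    balance = []
    w_bal = w_prime
    for p in power:
        if p > cp:
            w_bal -= p - cp                                   # expend above CP
        else:
            w_bal += (w_prime - w_bal) * (cp - p) / w_prime   # recover below CP
        balance.append(w_bal)
    return balance

# Compare two parameterizations on the same data without touching any Athlete:
low = w_prime_balance([300] * 10, cp=200, w_prime=20000)
high = w_prime_balance([300] * 10, cp=250, w_prime=25000)
```

With the parameters in the signature, the side-by-side comparison is two plain function calls.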
@@ -105,8 +105,8 @@ def test_extended_critical_power(self, version, expected):
    power = pd.Series(range(500))
    mean_max_power = algorithms.mean_max_power(power)
    ecp = algorithms.extended_critical_power(mean_max_power, version=version)
-   assert ecp[1] == expected[0]
-   assert ecp[250] == expected[1]
+   assert ecp[1] == pytest.approx(expected[0], 0.1)
Oh I like the pytest.approx, did not know it! Thanks for the tip :)
wdf = models.WorkoutDataFrame(
-   pd.read_csv('tests/example_files/workout_1_short.csv')
+   pd.read_csv(os.path.join(current_dir, 'example_files/workout_1_short.csv'))
I am working on moving all pytest fixtures to a conftest.py file in the root of /tests; that will remove vague file paths like this.
@@ -181,6 +212,7 @@ def test_compute_weighted_average_power_missing_power(self, wdf_big):


def test_compute_power_per_kg(self, wdf_big):
    # wdf_big.athlete.ftp = 300
    # wdf_big.athlete.weight = 80
?
It was just a reminder to myself of what the actual weight is. Can be removed.
I forgot to press "Submit review", now my comments should be visible.
I think it all boils down to 3 questions:

- Merge the two packages, or split the algos into a separate package?
- Do you have any strong view on treating inputs/outputs as array-like, with lists as first-class citizens?
- Drop @requires in favor of sane default names for expected columns?
I think we're aligned on most things, that's good. 👍 About this:

What do you mean exactly by "stream"? What the Strava REST API returns, or any list-like object? And why is this specifically convenient for REST resources? Right now I am feeling more positive about your option 2, for 2 reasons:
Let me clarify the array-like inputs. The Strava API, for example, when you request power, returns serialized JSON that contains a list of power numbers. If you are developing a Flask-based web app/REST API, sometimes you simply want to return serializable calculations performed on this list - for example, turn the power numbers into the respective power zones and send them back to the client. With array-like support this would mean writing something along the lines of:
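A minimal sketch of that list-in/list-out convenience (the zone edges and the function name are illustrative, not the actual vmpy API): the deserialized Strava list goes in, and the result is JSON-serializable as-is.

```python
import json

def power_zones(power, ftp):
    # Illustrative zone edges as fractions of FTP (assumed, not vmpy's).
    edges = [0.55, 0.75, 0.90, 1.05, 1.20]

    def zone(p):
        for i, edge in enumerate(edges, start=1):
            if p <= edge * ftp:
                return i
        return len(edges) + 1

    # list in, list out: no type conversion needed around the call
    return [zone(p) for p in power]

stream = [100, 180, 250]                             # deserialized Strava JSON
payload = json.dumps(power_zones(stream, ftp=200))   # serializable as-is
```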
With support for ndarray only, all the streams (just Strava nomenclature for the time-series data it returns) would need to be converted back and forth:
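A sketch of the back-and-forth conversion an ndarray-only API would require (function name and zone edges are again illustrative assumptions):

```python
import numpy as np

def power_zones_nd(power, ftp):
    """ndarray-only variant (illustrative): digitize against zone edges."""
    edges = np.array([0.55, 0.75, 0.90, 1.05, 1.20]) * ftp
    return np.digitize(power, edges, right=True) + 1

stream = [100, 180, 250]                           # list from the Strava JSON
zones = power_zones_nd(np.asarray(stream), 200)    # convert in...
payload = zones.tolist()                           # ...and convert back out
```

The two extra conversion lines are exactly the boilerplate that the array-like-preserving API avoids.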
I've found this pattern occurring so frequently in the velometria code that I simply added a util routine at every output of a vmpy function to convert it to the original array-like type. It adds overhead to testing, in making sure that the types match, but you would need to do the same for ndarray and pd.Series support as well, wouldn't you? Just to make it clear: under the hood, all calculations are performed on ndarrays/pd.Series, so the only performance overhead is an extra type-conversion step.
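The type-preserving routine described above could look roughly like this (a hypothetical sketch; the actual vmpy util may differ):

```python
import numpy as np
import pandas as pd

def preserve_type(func):
    """Decorator sketch: run the wrapped algorithm on an ndarray, then
    convert the result back to the caller's original array-like type."""
    def wrapper(arg, *args, **kwargs):
        result = func(np.asarray(arg), *args, **kwargs)
        if isinstance(arg, pd.Series):
            return pd.Series(result, index=arg.index)
        if isinstance(arg, list):
            return list(result)
        return result  # ndarray in, ndarray out
    return wrapper

@preserve_type
def double(power):
    # trivial stand-in algorithm; real functions compute on the ndarray
    return power * 2
```

Each algorithm stays a pure ndarray computation; only the wrapper pays the one-time conversion cost.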
On the merging: I do not have emotional attachments to Structure-wise, AP already has all the necessary scaffolding, but the name implies a bit too narrow a scope for my liking. I like the ease of use of the WDF, except for the @requires part (more on that later), but it should not be the main deliverable of the library, and it would be nice to reflect that in the name.
Ok that sounds like a plan!
I don't have a name in mind yet, but the one thing I learned with vmpy is that "Velo" should not be part of the name - most of the computations are valid for other endurance sports, so I would keep it broad. In fact, athletic is already a pretty decent name. The new organization is a good idea - it might help remove friction in getting more contributors.
Good point, I didn't think about keeping it broader than cycling. Let's take a few days to come up with a catchier name? Or do you want me to create the new organization already and copy my repo there so we can start working? I think we can always rename.
Idea about the name/organization: We can ask Mark Liversedge if we can include the repo in the GoldenCheetah organization. I think it sort of fits in and it could give some extra traction to our repo. |
Great idea with GoldenCheetah. If they don't mind accommodating this repo and are fine with permissive licenses (I think the GNU GPL they use is OK), it would be great to harness their knowledge and even give back to their ML efforts.
See changes.rst
I did not go too deep yet... still need to wrap my head around the heavy algorithm implementations. Some things I did not want to touch before discussing them with you - will submit issues for those.
Tests are passing.