This repository has been archived by the owner on Mar 10, 2018. It is now read-only.

Vmpy #26

Open
wants to merge 8 commits into master
Conversation

@sladkovm commented Mar 2, 2018

See changes.rst

I did not go too deep yet... still need to wrap my head around the heavy algorithm implementations. Some things I did not want to touch before discussing them with you - will submit issues for those.

Tests are passing.

@AartGoossens (Owner):

Thanks for the PR! I will review your changes.

Note to self: enable notifications on personal repositories so that I get notified about PRs... 😅

@AartGoossens (Owner) commented Mar 4, 2018

I added a lot of comments to your PR; don't take it as a sign that I don't like it...

Before I consider merging, I think we should align on the hierarchy and dependencies between the libraries:
I see athletic_pandas as a general-purpose workout analysis library that is agnostic about where the data comes from or where the output is used, and that contains a complete range of algorithms to analyze the data. In the future I also see an educational purpose for the documentation of the library, and maybe even the source code. Using algorithms from other libraries can certainly fit in that vision.

I do not know where you want to take vmpy, but I sense there is a risk of a circular dependency, where vmpy wants to use logic from athletic_pandas and vice versa. I want the hierarchy between these packages to be clear before I add a dependency on vmpy.
Where do you want to take vmpy? Do you want it to be general purpose, or tailored to your needs for velometria? Do you think there is a chance you will want to use algorithms from athletic_pandas in the future? How can we avoid such a circular dependency?

@sladkovm (Author) commented Mar 4, 2018

Hi, how do I see the comments? I don't see anything in the review tab.

Short answer: the idea is to have one general-purpose library with algorithms that can be used anywhere else. Either athletic_pandas serves only the WorkoutDataFrame and all the algorithms move to vmpy, or one package serves both the WDF and the algorithms.

Long answer.

I think we are chasing the same goal, and it would be a shame if we did double work or, even worse, created a fragmented package space.

The only difference I see between the athletic_pandas algorithms and the vmpy algorithms is that vmpy preserves the input/output data types. Basically, if I pass a power stream as a list, I expect a list of power zones in return. This is convenient when working with REST resources. If the input is an ndarray, an ndarray is output, so integration with athletic_pandas is also seamless. It is an extra line of code for every function and extra testing, but I think it helps long-term with wide adoption of the package.
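A minimal sketch of this type-preserving idea, assuming a toy zone calculation (the zone boundaries and helper names are illustrative, not vmpy's actual implementation): the work is done on an ndarray internally, and a small helper casts the result back to the caller's type.

```python
import numpy as np
import pandas as pd

def cast_to_input_type(result, original):
    """Cast an ndarray result back to the type of the original input."""
    if isinstance(original, pd.Series):
        return pd.Series(result, index=original.index)
    if isinstance(original, list):
        return result.tolist()
    return result  # ndarray in, ndarray out

def power_zones(power, ftp):
    """Toy zone calculation: list in -> list out, Series in -> Series out."""
    arr = np.asarray(power, dtype=float)
    # Illustrative zone boundaries at 55%, 75% and 90% of FTP
    zones = np.digitize(arr, bins=[0.55 * ftp, 0.75 * ftp, 0.90 * ftp]) + 1
    return cast_to_input_type(zones, power)

power_zones([100, 200, 300], ftp=300)  # -> [1, 2, 4], a plain list
```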

In general, I'm rooting for creating one, largely data-type-unopinionated, Python package with algorithms, so other packages/apps can build on top of it. What I want to avoid is the same algorithms being implemented in different libraries.

Option 1: Move all algorithms to vmpy, and athletic_pandas only takes care of providing a convenient tool for quick data exploration. This PR is an example of how I see this working, with vmpy calls wrapped in the module.py methods. I did not do anything with the athletic_pandas algorithm implementations yet, and simply added a to-do comment there to indicate an intention of eventual deprecation if we agree on this strategy.

Option 2: Merge the libraries together and serve both the algorithms and the WorkoutDataFrame. In this case, I believe it is still a good idea to make the algorithms work with any array-like data type, not only pd.Series, for ease of integration.

What do you think?


Parameters
----------
power: pd.Series
@AartGoossens (Owner):
True for now, but I probably want to change it to ndarray. Do you agree that's better?
Good that you are adding docstrings, I did not get to that yet...

@sladkovm (Author):

Comments are based on what it does at the moment. In general, I prefer it array-like {list, ndarray, pd.Series}, with the output having the same type as the input.

-------
pd.Series

TO-DO: to be replaced by vmpy.algorithms.power_duration_curve()
@AartGoossens (Owner):

This module does not exist yet, right? What's your idea behind this module?
Also, what is the difference between your implementation and mine?

@sladkovm (Author):

It does; I've moved it to vmpy.metrics.power_duration_curve() at the last moment. It calculates the accumulated energy first and then takes the difference of two shifted arrays. This speeds up the calculation by about 4x: an 8-hour ride takes about 10 seconds instead of almost a minute on my PC.
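My reconstruction of the accumulated-energy approach described here (a sketch, not the actual vmpy code): because any window sum is the difference of two cumulative-sum entries, the best mean for each duration comes from one vectorized pass instead of a rolling mean per window.

```python
import numpy as np

def power_duration_curve(power, max_duration=None):
    """Mean-max power for every window length from 1 to max_duration."""
    arr = np.asarray(power, dtype=float)
    max_duration = max_duration or len(arr)
    csum = np.concatenate(([0.0], np.cumsum(arr)))  # accumulated energy
    return np.array([
        (csum[d:] - csum[:-d]).max() / d  # best window sum of length d, then its mean
        for d in range(1, max_duration + 1)
    ])

power_duration_curve([0, 10, 20, 30])  # -> [30., 25., 20., 15.]
```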

@AartGoossens (Owner):

Oh, fancy! Did you benchmark it against my implementation or yours?

@@ -45,10 +75,39 @@ def mean_max_bests(power, duration, amount):


def weighted_average_power(power):
"""Weighted average power

Also known as the Normalized Power
@AartGoossens (Owner):

Normalized Power®️ (😉) is a trademarked name and cannot be used commercially without permission from Peaksware. That reason alone is enough for me to avoid using that name.

@sladkovm (Author):

Ok, I didn't know that, but knowing Coggan's behavior on forums, I should have guessed it. WAP is good; it fits well with the Strava community.

-------
number

replaced by vmpy.metrics.normalized_power
@AartGoossens (Owner):

Idem: What are the differences between your implementation and mine?

@sladkovm (Author):

For NP, none, but the vmpy function can also calculate xPower, which uses an exponentially weighted average.
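For context, the two variants differ only in the smoother applied before the fourth-power mean. A hedged sketch, assuming 1 Hz data (the `algorithm` flag and the exact smoothing parameters are my assumptions, not vmpy's real signature): the NP-style calculation uses a 30 s moving average, xPower a 25 s exponentially weighted one.

```python
import numpy as np
import pandas as pd

def weighted_average_power(power, algorithm='WAP'):
    """Fourth root of the mean fourth power of a smoothed power series."""
    s = pd.Series(power, dtype=float)
    if algorithm == 'xPower':
        smoothed = s.ewm(span=25).mean()                 # exponentially weighted, ~25 s
    else:
        smoothed = s.rolling(30, min_periods=1).mean()   # 30 s moving average
    return float(np.mean(smoothed ** 4) ** 0.25)
```

For a constant power series both variants return that constant, which makes a handy sanity check.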



class WorkoutDataFrame(BaseWorkoutDataFrame):
_metadata = ['athlete']

@requires(columns=['power'])
def compute_power_zones(self, **kwargs):
@AartGoossens (Owner):

I think this method touches a sensitive subject ;) : In my view the WorkoutDataFrame should not contain any business logic and should call a method from algorithms as directly as possible. That means adding a method to the algorithms module (which in turn could call vmpy) and call it from here. The WorkoutDataFrame is just a more high level way to work with these algorithms but a user should be able to use all logic in athletic_pandas outside of the WorkoutDataFrame.
Does that make sense?

@sladkovm (Author):

It makes sense, especially in light of the email I wrote yesterday. Let's see how we can separate the Frame from the algorithms in the most efficient way.



class WorkoutDataFrame(BaseWorkoutDataFrame):
_metadata = ['athlete']

@requires(columns=['power'])
@AartGoossens (Owner):

Although I am not sure I still like the @requires decorator and am considering removing it, for now it should also contain athlete=['ftp'].

@sladkovm (Author):

In general, I think the @requires decorator limits the usage. If you get data from Strava, for example, the power stream is called 'watts' there. I would propose ditching @requires in favor of sane defaults with the ability to override during the function call, something along the lines of wdf.calculate_awesome_power(arg='power', **kwargs).

In this particular case, I've removed the requirement for an athlete, to allow optionally specifying 'ftp' via keywords. It falls back on athlete['ftp'] as a default.

@AartGoossens (Owner):

I think this is really similar to the discussion about accepting multiple input argument types, and I am not in favor of it. I think as a developer you have a responsibility to make the interface of your library as well defined as possible; accepting multiple input types, or accepting that certain data is stored in different columns, makes the interface unclear and therefore more difficult to use. For me this would also make the purpose of the dataframe redundant, as you cannot just use the class methods "as is" without understanding the underlying data structure.
Anyway, I am writing a longer response where I will dive into this (even) more...

@sladkovm (Author):

accepting that certain data is stored in different columns makes the interface unclear and therefore more difficult to use

The problem with too-simple interfaces is that they limit the use. By requiring 'power' only, you force the user to figure out how to rename the columns in the pd.DataFrame to get it working. The power/watts duality is one example, forced on us by Strava, but with heart rate you will get a full multitude of possibilities: {hr, bpm, heart_rate, you_name_it}.

The methods can simply be written with defaults in mind, as def method(self, arg='power'), and called as wdf.method() with the defaults, or with a user-defined name for other cases: wdf.method(arg='watts'). This pattern is in fact not that different from how pandas already exposes matplotlib methods.

Another aspect is the sole reliance on the Athlete object. I think it is less canonical, and therefore less obvious to the user, that the only way to parametrize wdf methods is through the Athlete object. It is a great placeholder for defaults, but it is also quite natural to expect the parameters to be exposed in the function signature.

Imagine you want to compare two W'bal calculations on the same data set but with different W' and CP values. The only way to do it now is by redefining the Athlete properties, while I would naturally expect W' and CP to be available as inputs to the method call.
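A minimal sketch of the override-with-fallback pattern being proposed (class and method bodies are illustrative assumptions, not the actual athletic_pandas code): column names and athlete parameters are keyword arguments, with the Athlete object supplying the defaults.

```python
import pandas as pd

class Athlete:
    """Minimal stand-in for an athlete model holding default parameters."""
    def __init__(self, ftp=None, weight=None):
        self.ftp = ftp
        self.weight = weight

class WorkoutDataFrame(pd.DataFrame):
    _metadata = ['athlete']  # attribute preserved through pandas operations

    @property
    def _constructor(self):
        return WorkoutDataFrame

    def compute_power_per_kg(self, column='power', weight=None):
        # Per-call override wins; the Athlete attribute is the fallback default
        weight = weight if weight is not None else self.athlete.weight
        return self[column] / weight

wdf = WorkoutDataFrame({'watts': [200, 300]})
wdf.athlete = Athlete(weight=80)
wdf.compute_power_per_kg(column='watts')              # falls back on athlete.weight
wdf.compute_power_per_kg(column='watts', weight=100)  # per-call override
```

The same fallback idiom would let W' and CP be passed per call to a W'bal method while still defaulting to the Athlete values.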

@@ -105,8 +105,8 @@ def test_extended_critical_power(self, version, expected):
power = pd.Series(range(500))
mean_max_power = algorithms.mean_max_power(power)
ecp = algorithms.extended_critical_power(mean_max_power, version=version)
assert ecp[1] == expected[0]
assert ecp[250] == expected[1]
assert ecp[1] == pytest.approx(expected[0], 0.1)
@AartGoossens (Owner):

Oh, I like pytest.approx, I did not know it! Thanks for the tip :)
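For anyone else discovering it here: `pytest.approx` wraps the expected value so that `==` compares within a tolerance; the second positional argument, as used in the diff above, is the relative tolerance.

```python
import pytest

assert 0.30000001 == pytest.approx(0.3)        # default relative tolerance is 1e-6
assert 1.05 == pytest.approx(1.0, 0.1)         # second positional arg: relative tolerance
assert [1.001, 2.002] == pytest.approx([1.0, 2.0], rel=1e-2)  # element-wise on sequences
```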

wdf = models.WorkoutDataFrame(
pd.read_csv('tests/example_files/workout_1_short.csv')
pd.read_csv(os.path.join(current_dir, 'example_files/workout_1_short.csv'))
@AartGoossens (Owner):

I am working on moving all pytest fixtures to a conftest.py file in the root of /tests; that will remove vague file paths like this.

@@ -181,6 +212,7 @@ def test_compute_weighted_average_power_missing_power(self, wdf_big):

def test_compute_power_per_kg(self, wdf_big):
# wdf_big.athlete.ftp = 300
# wdf_big.athlete.weight = 80
@AartGoossens (Owner):

?

@sladkovm (Author):

It was just a reminder for myself of what the actual weight is. It can be removed.

@AartGoossens (Owner):

I forgot to press "Submit review"; now my comments should be visible.
I will reply to your reply later today.

@AartGoossens self-assigned this Mar 5, 2018
@sladkovm (Author) left a comment:

I think it all boils down to 3 questions:

  1. Do we merge the two packages, or split the algorithms into a separate package?

  2. Do you have any strong view on treating inputs/outputs as array-like, with lists as first-class citizens?

  3. Shall we drop @requires in favor of sane default names for the expected columns?


@AartGoossens (Owner):

I think we're aligned on most things, that's good. 👍

About this:

The only difference I see between the athletic_pandas algorithms and the vmpy algorithms is that vmpy preserves the input/output data types. Basically, if I pass a power stream as a list, I expect a list of power zones in return. This is convenient when working with REST resources. If the input is an ndarray, an ndarray is output, so integration with athletic_pandas is also seamless. It is an extra line of code for every function and extra testing, but I think it helps long-term with wide adoption of the package.

What do you mean exactly by "stream"? What the Strava REST API returns, or any list-like object? And why is this specifically convenient for REST resources?
Because Python is dynamically typed, it is indeed possible to be agnostic of input types and return the same type as the input argument, but I think type casting should not be a responsibility of the library but of the code that is using the library (separation of concerns). I am not saying I want to add type checking (although I do like the idea of mypy), and I certainly do not want to enforce input types, but I think it is good practice to write code for a specific input type so you can make use of type-specific optimizations (I am thinking specifically about numpy.ndarray). This could very well mean that other list-like input types work in some cases, but not all. Another argument against accepting all input types is that testing can become a pain: will we create a test case for each of numpy.ndarray, pandas.Series, list, etc.? The library could (and maybe even should) include a helper method to cast other list-like types to the required input argument type, similar to vmpy's cast_array_to_original_type().

Right now I am feeling more positive about your option 2, for two reasons:

  • For me the dataframe is a very important interface to the algorithms, because it makes using the algorithms easy even for Python beginners. From that point of view, having the algorithms and the dataframe in two separate libraries makes no sense to me, which excludes option 1.
    Merging can be done in 3 ways, I think: vmpy->ap (which I slightly prefer), ap->vmpy (which you probably prefer), or vmpy+ap->new_library.
    The last way is a bit more work but has an advantage: I think both vmpy and athletic_pandas in their current state are already a bit tailored to our personal needs/preferences. Creating a new library from scratch allows us to discuss its architecture with a clean slate, instead of having to worry about legacy code. The algorithms we can just copy-paste as needed.

@sladkovm (Author) commented Mar 5, 2018

Let me clarify on array-like inputs.

The Strava API, for example, when you request power, returns serialized JSON that contains a list of power numbers. If you are developing a Flask-based web app/REST API, sometimes you simply want to return serializable calculations performed on this list, for example turning the power numbers into their respective power zones and sending them back to the client.

With array-like support, this would mean writing something along the lines of:

>>> power = strava.get('power')
>>> zones = vmpy.power_zones(power, ftp=300)
>>> api_endpoint.push(zones)

With support for ndarray only, all the streams (just Strava nomenclature for the time-series data it returns) would need to be converted back and forth:

>>> power = strava.get('power')
>>> nd_power = np.array(power)
>>> nd_zones = vmpy.power_zones(nd_power, ftp=300)
>>> zones = list(nd_zones)
>>> api_endpoint.push(zones)

I found this pattern occurring so frequently in the velometria code that I simply added a util routine at the output of every vmpy function to convert the result to the original array-like type.

It adds an overhead to testing in making sure that the types match, but you would need to do the same for ndarray and pd.Series support as well, wouldn't you?

Just to make it clear: under the hood, all calculations are performed on ndarrays/pd.Series, so the only performance overhead is an extra type-conversion step.

@sladkovm (Author) commented Mar 5, 2018

On the merging: I do not have an emotional attachment to vmpy. Even more, it would be better for the longevity of the project if somebody with more hands-on dev experience were in charge.

Structure-wise, AP already has all the necessary scaffolding, but the name implies a bit too narrow a scope for my liking. I like the ease of use of the WDF, except for the @requires part (more on that later), but it should not be the main deliverable of the library, and it would be nice to reflect that in the name.

@AartGoossens (Owner):

Ok, that sounds like a plan!
I agree that if we shift focus away from the dataframe, then it should have a different name. Do you have an alternative name in mind? Shall we then also move it to a separate GitHub organization, so that it doesn't look like a personal project of mine?

@sladkovm (Author) commented Mar 6, 2018

I don't have a name in mind yet, but the one thing I learned with vmpy is that "velo" should not be part of the name: most of the computations are valid for other endurance sports, so I would keep it broad.

In fact, athletic is already a pretty decent name.

The new organization is a good idea; it might help remove friction in getting more contributors.

@AartGoossens (Owner):

Good point. I didn't think about keeping it broader than cycling. Shall we take a few days to come up with a catchier name? Or do you want me to create the new organization already and copy my repo there so we can start working? I think we can always rename it later.

@AartGoossens (Owner):

Idea about the name/organization: we could ask Mark Liversedge if we can include the repo in the GoldenCheetah organization. I think it sort of fits in, and it could give some extra traction to our repo.

@sladkovm (Author) commented Mar 6, 2018

Great idea with GoldenCheetah. If they don't mind accommodating this repo and are fine with permissive licenses (I think the GPL they use is OK), it would be great to harness their knowledge and even give back to their ML efforts.
