This repository has been archived by the owner on Mar 10, 2018. It is now read-only.

Vmpy #26

Open
wants to merge 8 commits into master
Conversation

@sladkovm commented Mar 2, 2018

See changes.rst

I did not go too deep yet... still need to wrap my head around the heavy algorithm implementations. Some things I did not want to touch before discussing them with you - will submit issues for those.

Tests are passing.

@AartGoossens (Owner):

Thanks for the PR! I will review your changes.

Note to self: enable notifications on personal repositories so that I get notified about PRs... 😅

@AartGoossens (Owner) commented Mar 4, 2018

I added a lot of comments to your PR; don't take it as a sign that I don't like it...

Before I consider merging, I think we should align on the hierarchy and dependencies between the libraries:
I see athletic_pandas as a general-purpose workout analysis library that is agnostic about where the data comes from or where the output is used, and that contains a complete range of algorithms to analyze the data. In the future I also see an educational purpose for the documentation of the library, and maybe even the source code. Using algorithms from other libraries can certainly fit in that vision.

I do not know where you want to take vmpy, but I sense there is a risk of a circular dependency, where vmpy wants to use logic from athletic_pandas and vice versa. I want the hierarchy between these packages to be clear before I add a dependency on vmpy.
Where do you want to take vmpy? Do you want it to be general purpose, or tailored to your needs for velometria? Do you think there is a chance you will want to use algorithms from athletic_pandas in the future? How can we avoid such a circular dependency?

@sladkovm (Author) commented Mar 4, 2018

Hi, how do I see the comments? I don't see anything in the review tab.

Short answer: the idea is to have one general-purpose library with algorithms that can be used anywhere else. Either athletic_pandas serves only the WorkoutDataFrame and all the algorithms move to vmpy, or one package serves both the WDF and the algorithms.

Long answer.

I think we are chasing the same goal, and it would be a shame if we did double work or, even worse, created a fragmented package space.

The only difference I see between the athletic_pandas algorithms and the vmpy algorithms is that vmpy preserves the input/output data types. Basically, if I pass a power stream as a list, I expect a list of power zones in return. This is convenient when working with REST resources. If the input is an ndarray, an ndarray is output, so integration with athletic_pandas is also seamless. It is an extra line of code for every function and extra testing, but I think it helps long-term with wide adoption of the package.
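A minimal sketch of this type-preserving idea, assuming a toy zone calculation (the zone boundaries and helper names are illustrative, not vmpy's actual implementation): the work is done on an ndarray internally, and a small helper casts the result back to the caller's type.

```python
import numpy as np
import pandas as pd

def cast_to_input_type(result, original):
    """Cast an ndarray result back to the type of the original input."""
    if isinstance(original, pd.Series):
        return pd.Series(result, index=original.index)
    if isinstance(original, list):
        return result.tolist()
    return result  # ndarray in, ndarray out

def power_zones(power, ftp):
    """Toy zone calculation: list in -> list out, Series in -> Series out."""
    arr = np.asarray(power, dtype=float)
    # Illustrative zone boundaries at 55%, 75% and 90% of FTP
    zones = np.digitize(arr, bins=[0.55 * ftp, 0.75 * ftp, 0.90 * ftp]) + 1
    return cast_to_input_type(zones, power)

power_zones([100, 200, 300], ftp=300)  # -> [1, 2, 4], a plain list
```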

In general, I'm rooting for creating one, largely data-type-unopinionated, Python package with algorithms, so other packages/apps can build on top of it. What I want to avoid is the same algorithms being implemented in different libraries.

Option 1: Move all algorithms to vmpy, and athletic_pandas only takes care of providing a convenient tool for quick data exploration. This PR is an example of how I see this working, with vmpy calls wrapped in the module.py methods. I did not do anything with the athletic_pandas algorithm implementations yet, and simply added a to-do comment there to indicate an intention of eventual deprecation if we agree on this strategy.

Option 2: Merge the libraries together and serve both the algorithms and the WorkoutDataFrame. In this case, I believe it is still a good idea to make the algorithms work with any array-like data type, not only pd.Series, for ease of integration.

What do you think?


Parameters
----------
power: pd.Series
@AartGoossens (Owner):
True for now, but I probably want to change it to ndarray. Do you agree that's better?
Good that you are adding docstrings, I did not get to that yet...

@sladkovm (Author):

Comments are based on what it does at the moment. In general, I prefer it array-like {list, ndarray, pd.Series}, with the output having the same type as the input.

-------
pd.Series

TO-DO: to be replaced by vmpy.algorithms.power_duration_curve()
@AartGoossens (Owner):

This module does not exist yet, right? What's your idea behind this module?
Also, what is the difference between your implementation and mine?

@sladkovm (Author):

It does; I've moved it to vmpy.metrics.power_duration_curve() at the last moment. It calculates the accumulated energy first and then takes the difference of two shifted arrays. This speeds up the calculation by about 4x: an 8-hour ride takes about 10 seconds instead of almost a minute on my PC.
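My reconstruction of the accumulated-energy approach described here (a sketch, not the actual vmpy code): because any window sum is the difference of two cumulative-sum entries, the best mean for each duration comes from one vectorized pass instead of a rolling mean per window.

```python
import numpy as np

def power_duration_curve(power, max_duration=None):
    """Mean-max power for every window length from 1 to max_duration."""
    arr = np.asarray(power, dtype=float)
    max_duration = max_duration or len(arr)
    csum = np.concatenate(([0.0], np.cumsum(arr)))  # accumulated energy
    return np.array([
        (csum[d:] - csum[:-d]).max() / d  # best window sum of length d, then its mean
        for d in range(1, max_duration + 1)
    ])

power_duration_curve([0, 10, 20, 30])  # -> [30., 25., 20., 15.]
```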

@AartGoossens (Owner):

Oh, fancy! Did you benchmark it against my implementation or yours?

@@ -45,10 +75,39 @@ def mean_max_bests(power, duration, amount):


def weighted_average_power(power):
"""Weighted average power

Also known as the Normalized Power
@AartGoossens (Owner):

Normalized Power®️ (😉) is a trademarked name and cannot be used commercially without permission from Peaksware. That reason alone is enough for me to avoid using that name.

@sladkovm (Author):

Ok, I didn't know that, but knowing Coggan's behavior on forums, I should have guessed it. WAP is good; it fits well with the Strava community.

-------
number

replaced by vmpy.metrics.normalized_power
@AartGoossens (Owner):

Idem: What are the differences between your implementation and mine?

@sladkovm (Author):

For NP, none, but the vmpy function can also calculate xPower, which uses an exponentially weighted average.
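For context, the two variants differ only in the smoother applied before the fourth-power mean. A hedged sketch, assuming 1 Hz data (the `algorithm` flag and the exact smoothing parameters are my assumptions, not vmpy's real signature): the NP-style calculation uses a 30 s moving average, xPower a 25 s exponentially weighted one.

```python
import numpy as np
import pandas as pd

def weighted_average_power(power, algorithm='WAP'):
    """Fourth root of the mean fourth power of a smoothed power series."""
    s = pd.Series(power, dtype=float)
    if algorithm == 'xPower':
        smoothed = s.ewm(span=25).mean()                 # exponentially weighted, ~25 s
    else:
        smoothed = s.rolling(30, min_periods=1).mean()   # 30 s moving average
    return float(np.mean(smoothed ** 4) ** 0.25)
```

For a constant power series both variants return that constant, which makes a handy sanity check.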



class WorkoutDataFrame(BaseWorkoutDataFrame):
_metadata = ['athlete']

@requires(columns=['power'])
def compute_power_zones(self, **kwargs):
@AartGoossens (Owner):

I think this method touches a sensitive subject ;) : In my view the WorkoutDataFrame should not contain any business logic and should call a method from algorithms as directly as possible. That means adding a method to the algorithms module (which in turn could call vmpy) and call it from here. The WorkoutDataFrame is just a more high level way to work with these algorithms but a user should be able to use all logic in athletic_pandas outside of the WorkoutDataFrame.
Does that make sense?

@sladkovm (Author):

It makes sense, especially in light of the email I wrote yesterday. Let's see how we can separate the Frame from the algorithms in the most efficient way.



class WorkoutDataFrame(BaseWorkoutDataFrame):
_metadata = ['athlete']

@requires(columns=['power'])
@AartGoossens (Owner):

Although I am not sure I still like the @requires decorator and am considering removing it, for now it should also contain athlete=['ftp'].

@sladkovm (Author):

In general, I think the @requires decorator limits the usage. If you get data from Strava, for example, the power stream is called 'watts' there. I would propose ditching @requires in favor of sane defaults with the ability to override during the function call, something along the lines of wdf.calculate_awesome_power(arg='power', **kwargs).

In this particular case, I've removed the requirement for an athlete, to allow optionally specifying 'ftp' via keywords. It falls back on athlete['ftp'] as a default.

@AartGoossens (Owner):

I think this is really similar to the discussion about accepting multiple input argument types, and I am not in favor of it. I think as a developer you have a responsibility to make the interface of your library as well defined as possible; accepting multiple input types, or accepting that certain data is stored in different columns, makes the interface unclear and therefore more difficult to use. For me this would also make the purpose of the dataframe redundant, as you cannot just use the class methods "as is" without understanding the underlying data structure.
Anyway, I am writing a longer response where I will dive into this (even) more...

@sladkovm (Author):

accepting that certain data is stored in different columns makes the interface unclear and therefore more difficult to use

The problem with too-simple interfaces is that they limit the use. By requiring 'power' only, you force the user to figure out how to rename the columns in the pd.DataFrame to get it working. The power/watts duality is one example, forced on us by Strava, but with heart rate you will get a full multitude of possibilities: {hr, bpm, heart_rate, you_name_it}.

The methods can simply be written with defaults in mind, as def method(self, arg='power'), and called as wdf.method() with the defaults, or with a user-defined name for other cases: wdf.method(arg='watts'). This pattern is in fact not that different from how pandas already exposes matplotlib methods.

Another aspect is the sole reliance on the Athlete object. I think it is less canonical, and therefore less obvious to the user, that the only way to parametrize wdf methods is through the Athlete object. It is a great placeholder for defaults, but it is also quite natural to expect the parameters to be exposed in the function signature.

Imagine you want to compare two W'bal calculations on the same data set but with different W' and CP values. The only way to do it now is by redefining the Athlete properties, while I would naturally expect W' and CP to be available as inputs to the method call.
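A minimal sketch of the override-with-fallback pattern being proposed (class and method bodies are illustrative assumptions, not the actual athletic_pandas code): column names and athlete parameters are keyword arguments, with the Athlete object supplying the defaults.

```python
import pandas as pd

class Athlete:
    """Minimal stand-in for an athlete model holding default parameters."""
    def __init__(self, ftp=None, weight=None):
        self.ftp = ftp
        self.weight = weight

class WorkoutDataFrame(pd.DataFrame):
    _metadata = ['athlete']  # attribute preserved through pandas operations

    @property
    def _constructor(self):
        return WorkoutDataFrame

    def compute_power_per_kg(self, column='power', weight=None):
        # Per-call override wins; the Athlete attribute is the fallback default
        weight = weight if weight is not None else self.athlete.weight
        return self[column] / weight

wdf = WorkoutDataFrame({'watts': [200, 300]})
wdf.athlete = Athlete(weight=80)
wdf.compute_power_per_kg(column='watts')              # falls back on athlete.weight
wdf.compute_power_per_kg(column='watts', weight=100)  # per-call override
```

The same fallback idiom would let W' and CP be passed per call to a W'bal method while still defaulting to the Athlete values.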

@@ -105,8 +105,8 @@ def test_extended_critical_power(self, version, expected):
power = pd.Series(range(500))
mean_max_power = algorithms.mean_max_power(power)
ecp = algorithms.extended_critical_power(mean_max_power, version=version)
assert ecp[1] == expected[0]
assert ecp[250] == expected[1]
assert ecp[1] == pytest.approx(expected[0], 0.1)
@AartGoossens (Owner):

Oh, I like pytest.approx, I did not know it! Thanks for the tip :)
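For anyone else discovering it here: `pytest.approx` wraps the expected value so that `==` compares within a tolerance; the second positional argument, as used in the diff above, is the relative tolerance.

```python
import pytest

assert 0.30000001 == pytest.approx(0.3)        # default relative tolerance is 1e-6
assert 1.05 == pytest.approx(1.0, 0.1)         # second positional arg: relative tolerance
assert [1.001, 2.002] == pytest.approx([1.0, 2.0], rel=1e-2)  # element-wise on sequences
```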

wdf = models.WorkoutDataFrame(
pd.read_csv('tests/example_files/workout_1_short.csv')
pd.read_csv(os.path.join(current_dir, 'example_files/workout_1_short.csv'))
@AartGoossens (Owner):

I am working on moving all pytest fixtures to a conftest.py file in the root of /tests; that will remove vague file paths like this.

@@ -181,6 +212,7 @@ def test_compute_weighted_average_power_missing_power(self, wdf_big):

def test_compute_power_per_kg(self, wdf_big):
# wdf_big.athlete.ftp = 300
# wdf_big.athlete.weight = 80
@AartGoossens (Owner):

?

@sladkovm (Author):

It was just a reminder for myself of what the actual weight is. It can be removed.

@AartGoossens (Owner):

I forgot to press "Submit review"; now my comments should be visible.
I will reply to your reply later today.

@AartGoossens self-assigned this Mar 5, 2018
@sladkovm (Author) left a comment:

I think it all boils down to 3 questions:

  1. Do we merge the two packages, or split the algorithms into a separate package?

  2. Do you have any strong view on treating inputs/outputs as array-like, with lists as first-class citizens?

  3. Shall we drop @requires in favor of sane default names for the expected columns?


@AartGoossens (Owner):

I think we're aligned on most things, that's good. 👍

About this:

The only difference I see between the athletic_pandas algorithms and the vmpy algorithms is that vmpy preserves the input/output data types. Basically, if I pass a power stream as a list, I expect a list of power zones in return. This is convenient when working with REST resources. If the input is an ndarray, an ndarray is output, so integration with athletic_pandas is also seamless. It is an extra line of code for every function and extra testing, but I think it helps long-term with wide adoption of the package.

What do you mean exactly by "stream"? What the Strava REST API returns, or any list-like object? And why is this specifically convenient for REST resources?
Because Python is dynamically typed, it is indeed possible to be agnostic of input types and return the same type as the input argument, but I think type casting should not be a responsibility of the library but of the code that is using the library (separation of concerns). I am not saying I want to add type checking (although I do like the idea of mypy), and I certainly do not want to enforce input types, but I think it is good practice to write code for a specific input type so you can make use of type-specific optimizations (I am thinking specifically about numpy.ndarray). This could very well mean that other list-like input types work in some cases, but not all. Another argument against accepting all input types is that testing can become a pain: will we create a test case for each of numpy.ndarray, pandas.Series, list, etc.? The library could (and maybe even should) include a helper method to cast other list-like types to the required input argument type, similar to vmpy's cast_array_to_original_type().

Right now I am feeling more positive about your option 2, for two reasons:

  • For me the dataframe is a very important interface to the algorithms, because it makes using the algorithms easy even for Python beginners. From that point of view, having the algorithms and the dataframe in two separate libraries makes no sense to me, which excludes option 1.
    Merging can be done in 3 ways, I think: vmpy->ap (which I slightly prefer), ap->vmpy (which you probably prefer), or vmpy+ap->new_library.
    The last way is a bit more work but has an advantage: I think both vmpy and athletic_pandas in their current state are already a bit tailored to our personal needs/preferences. Creating a new library from scratch allows us to discuss its architecture with a clean slate, instead of having to worry about legacy code. The algorithms we can just copy-paste as needed.

@sladkovm (Author) commented Mar 5, 2018

Let me clarify on array-like inputs.

The Strava API, for example, when you request power, returns serialized JSON that contains a list of power numbers. If you are developing a Flask-based web app/REST API, sometimes you simply want to return serializable calculations performed on this list, for example turning the power numbers into their respective power zones and sending them back to the client.

With array-like support, this would mean writing something along the lines of:

>>> power = strava.get('power')
>>> zones = vmpy.power_zones(power, ftp=300)
>>> api_endpoint.push(zones)

With support for ndarray only, all the streams (just Strava nomenclature for the time-series data it returns) would need to be converted back and forth:

>>> power = strava.get('power')
>>> nd_power = np.array(power)
>>> nd_zones = vmpy.power_zones(nd_power, ftp=300)
>>> zones = list(nd_zones)
>>> api_endpoint.push(zones)

I found this pattern occurring so frequently in the velometria code that I simply added a util routine at the output of every vmpy function to convert the result to the original array-like type.

It adds an overhead to testing in making sure that the types match, but you would need to do the same for ndarray and pd.Series support as well, wouldn't you?

Just to make it clear: under the hood, all calculations are performed on ndarrays/pd.Series, so the only performance overhead is an extra type-conversion step.

@sladkovm (Author) commented Mar 5, 2018

On the merging: I do not have an emotional attachment to vmpy. Even more, it would be better for the longevity of the project if somebody with more hands-on dev experience were in charge.

Structure-wise, AP already has all the necessary scaffolding, but the name implies a bit too narrow a scope for my liking. I like the ease of use of the WDF, except for the @requires part (more on that later), but it should not be the main deliverable of the library, and it would be nice to reflect that in the name.

@AartGoossens (Owner):

Ok, that sounds like a plan!
I agree that if we shift focus away from the dataframe, then it should have a different name. Do you have an alternative name in mind? Shall we then also move it to a separate GitHub organization, so that it doesn't look like a personal project of mine?

@sladkovm (Author) commented Mar 6, 2018

I don't have a name in mind yet, but the one thing I learned with vmpy is that "velo" should not be part of the name: most of the computations are valid for other endurance sports, so I would keep it broad.

In fact, athletic is already a pretty decent name.

The new organization is a good idea; it might help remove friction in getting more contributors.

@AartGoossens (Owner):

Good point. I didn't think about keeping it broader than cycling. Shall we take a few days to come up with a catchier name? Or do you want me to create the new organization already and copy my repo there so we can start working? I think we can always rename it later.

@AartGoossens (Owner):

Idea about the name/organization: we could ask Mark Liversedge if we can include the repo in the GoldenCheetah organization. I think it sort of fits in, and it could give some extra traction to our repo.

@sladkovm (Author) commented Mar 6, 2018

Great idea with GoldenCheetah. If they don't mind accommodating this repo and are fine with permissive licenses (I think the GPL they use is OK), it would be great to harness their knowledge and even give back to their ML efforts.
