[MRG] ENH add a model to fit and predict heart-rate from power #6
Conversation
Codecov Report
@@ Coverage Diff @@
## master #6 +/- ##
=========================================
+ Coverage 97.04% 97.54% +0.5%
=========================================
Files 23 25 +2
Lines 440 530 +90
Branches 48 52 +4
=========================================
+ Hits 427 517 +90
Misses 6 6
Partials 7 7
Continue to review full report at Codecov.
@AartGoossens I modified the function which was estimating and predicting the heart rate so that it follows the scikit-learn API. I have a couple of suggestions:
I also want to improve the current tests. Strictly checking floating-point values is a bit flaky. I don't have a clear idea yet, though. Let me know if you have any comments.
I forgot to ping @sladkovm as well, sorry.
Nice work! And good to see that you're picking up this model first because I think there are many interesting use cases for this (thinking e.g. OpenData).
Regarding the output type, I really think it would be a plus to return the same type as the input, since we just need to reuse the index. It comes almost for free and saves the user from doing it themselves.
I see. We could resample the data using pandas in case we get a DataFrame. In the case of numpy we just assume that the data are at 1 Hz. It should be explained in the docstring, though.
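To make that behaviour concrete, here is a hypothetical sketch (the helper name `check_power_input` and the exact conversion are assumptions, not the PR's code): pandas input is resampled to 1 Hz, while a bare ndarray is trusted to already be sampled at 1 Hz:

```python
import numpy as np
import pandas as pd

def check_power_input(X):
    """Hypothetical helper: resample pandas input to 1 Hz,
    assume a plain ndarray is already sampled at 1 Hz."""
    if hasattr(X, 'loc'):  # duck-type pandas objects
        return X.resample('1s').mean().to_numpy().ravel()
    return np.asarray(X, dtype=float).ravel()

# A DataFrame sampled every 0.5 s is averaged down to 1 Hz ...
idx = pd.date_range('2018-01-01', periods=4, freq='500ms')
df = pd.DataFrame({'power': [100.0, 100.0, 200.0, 200.0]}, index=idx)
print(check_power_input(df))  # [100. 200.]

# ... while an ndarray passes through untouched.
print(check_power_input(np.array([120, 130, 140])))  # [120. 130. 140.]
```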
return np.sum(_model_residual(params, heart_rate, power) ** 2)

class HeartRateRegressor(BaseEstimator, RegressorMixin):
Can you explain the benefit of `BaseEstimator` and `RegressorMixin`? I don't see them used anywhere in the `fit` and `predict` methods. Is there other functionality of these classes that can be useful?
Either way, I like implementing this model as a regressor class; it makes the API much more clear.
Basically, it allows the tool to be used with scikit-learn pipelines, grid-search, and cross-validation.
The mixin just defines a `score` function, which is the R^2 score, while `BaseEstimator` provides `get_params` and `set_params` to deal with the parameters.
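A toy example of what the two base classes provide for free (the estimator below is invented for illustration; it is not the heart-rate model):

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class MeanRegressor(BaseEstimator, RegressorMixin):
    """Toy regressor predicting the (shifted) training mean."""

    def __init__(self, shift=0.0):
        self.shift = shift

    def fit(self, X, y):
        self.mean_ = np.mean(y) + self.shift
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

reg = MeanRegressor()
print(reg.get_params())    # {'shift': 0.0}   <- from BaseEstimator
reg.set_params(shift=1.0)  # also inherited from BaseEstimator

X = np.zeros((3, 1))
y = np.array([1.0, 2.0, 3.0])
print(reg.fit(X, y).score(X, y))  # R^2 score <- from RegressorMixin
```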
Either way, I like implementing this model as a regressor class, it makes the API much more clear.

We can also make the model public to the user. We just need a good name.
sksports/model/heart_rate.py
Outdated
params_init = [self.hr_rest, self.hr_max, self.hr_slope, self.hr_drift,
               self.rate_growth, self.rate_decay]

if self.solver == 'leastsq':
Can you explain this if-else? To me it seems the flow for `self.solver == 'leastsq'` is redundant, i.e. why can't `minimize` be used for least-squares?
`leastsq` is actually using the Levenberg-Marquardt algorithm. I was just playing with the different ways of minimizing the problem. I am not really sure what the best solution is; we can of course fix the solver as well.
NB: Levenberg-Marquardt will use the gradient while an algorithm such as the simplex (Nelder-Mead) will not. So depending on the optimization problem and its convexity, one or the other may be more appropriate.
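For illustration, both scipy entry points on a toy linear fit (an assumed example, unrelated to the heart-rate cost): `scipy.optimize.leastsq` (Levenberg-Marquardt) consumes the residual vector and uses gradients, while `scipy.optimize.minimize` with Nelder-Mead only sees the scalar sum of squares:

```python
import numpy as np
from scipy.optimize import leastsq, minimize

# Toy problem: recover a and b from y = a * x + b.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

def residuals(params):
    a, b = params
    return a * x + b - y

# Levenberg-Marquardt works on the residual vector directly ...
params_lm, _ = leastsq(residuals, x0=[0.0, 0.0])

# ... while Nelder-Mead is a gradient-free simplex on the scalar cost.
result_nm = minimize(lambda p: np.sum(residuals(p) ** 2),
                     x0=[0.0, 0.0], method='Nelder-Mead')

print(params_lm)    # close to [2. 1.]
print(result_nm.x)  # close to [2. 1.]
```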
sksports/model/heart_rate.py
Outdated
self.rate_growth = rate_growth
self.rate_decay = rate_decay

def fit(self, X, y):
Side note: It still itches when I see Python variables starting with a capital... 😅 (but I recognize that it's accepted for these cases)
Yep, this is kind of imposed by scikit-learn estimators. `power` and `heart_rate` would be much better.
sksports/model/heart_rate.py
Outdated
def __init__(self, solver='Nelder-Mead', hr_rest=75, hr_max=200,
             hr_slope=0.30, hr_drift=3e-5, rate_growth=24, rate_decay=30):
    self.solver = solver
    self.hr_rest = hr_rest
Minor: I'd like to propose renaming `hr_rest` to `hr_start` or `hr_0`, because this model parameter is not related to the actual resting heart rate of the athlete and therefore might cause confusion.
Good point, I was confused :) `hr_start` seems a good idea.
On a simple example (which will be in the tests) it seems that Nelder-Mead is the only solver leading to the right parameters. I think we could simplify and keep the solver you proposed at the beginning.
"""
if y is None:
    is_df = True if hasattr(X, 'loc') else False
Although this is valid in most cases, why not just go for `isinstance(X, pd.DataFrame)`?
It was to duck-type. It usually avoids importing the library (which we do anyway) and accepts data structures which inherit from it.
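A short illustration of the difference (both class names below are invented): subclasses of `DataFrame` pass both checks, but only the duck-type check accepts a foreign object that merely exposes `.loc`:

```python
import pandas as pd

class TaggedFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass a caller might pass in."""

class NotPandas:
    """Hypothetical non-pandas object that still exposes .loc."""
    loc = None

tf = TaggedFrame({'power': [1.0, 2.0]})
print(isinstance(tf, pd.DataFrame))  # True: subclasses pass both checks
print(hasattr(tf, 'loc'))            # True

obj = NotPandas()
print(hasattr(obj, 'loc'))            # True: duck typing accepts it
print(isinstance(obj, pd.DataFrame))  # False: isinstance rejects it
```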
sksports/model/heart_rate.py
Outdated
""" | ||
if y is None: | ||
is_df = True if hasattr(X, 'loc') else False | ||
X_ = X.resample('1s').mean() if is_df else X |
I think this raises a `TypeError` for a df with a non datetime-like index. Should we catch that here or add a check for a datetime-like index?
Yes, a DataFrame check is not enough; we should check the index.
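A possible guard (the helper name is hypothetical), since `.resample` raises a `TypeError` on anything but a datetime-like index:

```python
import pandas as pd

def has_datetime_index(df):
    """Hypothetical check before calling df.resample('1s')."""
    return isinstance(df.index, (pd.DatetimeIndex, pd.TimedeltaIndex))

ok = pd.DataFrame({'power': [100.0, 110.0]},
                  index=pd.date_range('2018-01-01', periods=2, freq='1s'))
bad = pd.DataFrame({'power': [100.0, 110.0]})  # default RangeIndex

print(has_datetime_index(ok))   # True
print(has_datetime_index(bad))  # False: bad.resample('1s') would raise TypeError
```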
sksports/model/heart_rate.py
Outdated
is_df = True if hasattr(X, 'loc') else False
X_ = X.resample('1s').mean() if is_df else X
X_ = check_array(X_)
return X_.ravel(), is_df
You still need to return `y` here as the second return argument, right? Else line 193 will raise a `ValueError` (unpack error).
Since it is confusing, I should always return `X_.ravel(), y, is_df` and ignore `y` when it is None. Right now, I am returning different things depending on whether it is `fit` or `predict`.
Ah now I get it. That could work indeed but I think it's less confusing if a method always returns the same number of arguments.
sksports/model/heart_rate.py
Outdated
""" | ||
check_is_fitted(self, ['hr_start_', 'hr_max_', 'hr_slope_', | ||
'hr_drift_', 'rate_growth_', 'rate_decay_']) | ||
power, is_df = self._check_inputs(X) |
I think `self._check_inputs` returns 3 items.
Only 2 at `predict`.
from sksports.model import exp_heart_rate_model
from sksports.model import HeartRateRegressor

POWER = np.array([100] * 10 + [240] * 100 + [160] * 100)
You can consider moving these constants to a pytest fixture. It's a matter of preference but I think these fixtures are really nice to work with and also well reusable in other tests. But I'm fine with doing improvements like this later.
Good point. I was playing with the tests for the moment :)
reg = HeartRateRegressor()
reg.fit(power, heart_rate)

assert reg.hr_start_ == pytest.approx(100, 1.0)
More of a question than a comment: what's best practice for models like this for deciding how and which model parameters to test? Do you have any examples from scikit-learn?
Usually we run a classifier or a regressor on a known dataset and check the score. It is non-trivial to actually test parameters. Here I think we can test those 2 parameters with a large enough tolerance to be sure that the optimization (or any future change) does not lead to complete nonsense.
Basically, I think that we could do something like:
assert reg.score(X) > 0.9
to be sure that we are fitting the data properly.
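A self-contained sketch of that testing style (the R^2 helper below mirrors what `RegressorMixin.score` computes; the numbers are made up):

```python
import numpy as np

def r2_score_simple(y_true, y_pred):
    """Minimal R^2, the metric behind RegressorMixin.score."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

heart_rate = np.array([120.0, 150.0, 170.0, 160.0])
heart_rate_pred = np.array([122.0, 149.0, 168.0, 161.0])

# A coarse threshold is robust to solver tweaks, unlike exact
# floating-point comparisons of the fitted parameters.
assert r2_score_simple(heart_rate, heart_rate_pred) > 0.9
```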
This is really interesting. What kind of results did you get? I expected the other solvers to work at least reasonably well. I'm already looking forward to using OpenData to play around with different solvers... (it's time for me to get some more work into the OpenData tooling)
return power

@pytest.fixture(params=[
This looks interesting! If I'm correct you're not parametrizing the test but the fixture?...
Yes. I would think that fixtures are good for data (parametrizing them allows building the different supported types). Parametrizing the test is useful for passing parameters to the function or something specific to the test. This is just my rule of thumb, and I am not sure it is the best ;)
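A sketch of the pattern being described (the fixture, helper, and test names are invented): parametrizing the fixture makes every test that requests it run once per supported input type:

```python
import numpy as np
import pandas as pd
import pytest

def make_power(kind):
    """Build the toy power signal as one of the supported input types."""
    power = np.array([100.0] * 10 + [240.0] * 100 + [160.0] * 100)
    if kind == 'dataframe':
        index = pd.date_range('2018-01-01', periods=power.size, freq='1s')
        return pd.DataFrame({'power': power}, index=index)
    return power

@pytest.fixture(params=['ndarray', 'dataframe'])
def power_data(request):
    # Each test requesting this fixture runs twice, once per type.
    return make_power(request.param)

def test_power_length(power_data):
    assert len(power_data) == 210
```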
Hello @glemaitre! Thanks for updating the PR.
Comment last updated on June 03, 2018 at 09:43 Hours UTC
Force-pushed ab4c1e9 to ab457a0
Force-pushed 63913f9 to 0a3a7fa
Force-pushed d2ef5e1 to 446707a
examples/model/plot_heart_rate.py
Outdated
                                 15, 40)
data['heart-rate'] = heart_rate
data.plot(legend=True)
plt.title('Simulated power of a cyclist and inferred heart-rate.')
remove final dot
power.plot(legend=True)
heart_rate.plot(legend=True)
heart_rate_initial.plot(legend=True)
heart_rate_pred.plot(legend=True)
add title
@AartGoossens I was checking the results of the optimization using Nelder-Mead and it can lead to fitted parameters without any meaning even if the fit is good (e.g. Somehow I found that the best for the moment is to use Nelder-Mead first and then refine with L-BFGS using bounds. One drawback is that the fitting becomes slower (i.e. we are minimizing twice). I am also not 100% sure whether combining both methods will always be stable, or whether L-BFGS can give erroneous results as it does when used alone.
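The two-stage idea in miniature (a toy quadratic cost standing in for the heart-rate model's sum of squared residuals; the bounds are made-up placeholders):

```python
import numpy as np
from scipy.optimize import minimize

# Toy cost with a known minimum at (2, 1).
def cost(p):
    return (p[0] - 2.0) ** 2 + (p[1] - 1.0) ** 2

x0 = np.array([10.0, -5.0])

# Stage 1: gradient-free Nelder-Mead gets near the basin of attraction.
stage1 = minimize(cost, x0, method='Nelder-Mead')

# Stage 2: L-BFGS-B refines the estimate, with bounds keeping the
# parameters in a plausible range.
bounds = [(0.0, 5.0), (0.0, 5.0)]
stage2 = minimize(cost, stage1.x, method='L-BFGS-B', bounds=bounds)

print(stage2.x)  # close to [2. 1.]
```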
Also
That's very interesting indeed. Is combining optimizations common practice? I haven't heard of it before. If this is only about a better fit for the
In conclusion: in order to properly fit/minimize the hr_max we need activities with high power (1), but we also know that for these activities the model fit is not that good (2)... Either way, if the fit is improved by combining these optimizations it's worth it of course.
Partially addressing #5
Implement a scikit-learn like regressor to fit and predict heart-rate data from power.