Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Misc notes from delivery #23

Open
alanocallaghan opened this issue Mar 26, 2024 · 3 comments
Open

Misc notes from delivery #23

alanocallaghan opened this issue Mar 26, 2024 · 3 comments

Comments

@alanocallaghan
Copy link
Contributor

episode 1: "their response the new drug" -> "their response to the new drug"

episode 2: This is a style thing but the sentence structure is often needlessly complicated. eg, "It is often the case that our data includes categorical values." can be simplified to "Datasets like these often include categorical values."
Similarly "In our case, for example, the binary outcome we are trying to predict - in hospital mortality - is recorded as “ALIVE” and “EXPIRED”." can be simplified to "In our case, the binary outcome we are trying to predict (hospital mortality) is recorded as ALIVE and EXPIRED".

It is extremely weird to drop the categorical outcome variable and use it as y, including the encoded numeric variable in x. I realise this is an intro lesson but this seems to me a coding mistake that would be common for novices

To avoid data leaking between our training and test sets, we take the median from the training set only. The training median is then used to impute missing values in the held-out test set.

This isn't really explained at all. The data imputation section generally is a bit short. It'd be good to mention why imputing with the median is a bad idea in arguably most cases

@tompollard
Copy link
Collaborator

Thanks @alanocallaghan, these are all great points.

It is extremely weird to drop the categorical outcome variable and use it as y, including the encoded numeric variable in x. I realise this is an intro lesson but this seems to me a coding mistake that would be common for novices

Yes, this is a definite bug!

@alanocallaghan
Copy link
Contributor Author

From evaluation:

metrics.plot_roc_curve(reg, x_test, y_test)

no longer works

from sklearn.metrics import RocCurveDisplay
reg_disp = RocCurveDisplay.from_estimator(reg, x_test, y_test)

@alanocallaghan
Copy link
Contributor Author

Bootstrap episode KDE plot claims to be on the test set, but it's actually using training

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants