Source of randomness in logistic regression #65

Open
dmitrijsk opened this issue Nov 20, 2021 · 0 comments
dmitrijsk commented Nov 20, 2021

I would like to suggest elaborating a little on the randomness in logistic regression as compared to linear regression. This is mentioned on page 45, lines -3 to -1 of the Draft (April 30, 2021). I think the sentence "the randomness in classification is statistically modeled by the class probability construction 𝑝(𝑦 = 𝑚 | x) instead of an additive noise 𝜀" may not be enough for a reader encountering logistic regression for the first time. It would help to mention at least that the class labels are distributed as Bernoulli(g(x)); the source of randomness then becomes much clearer. You do mention this on page 54: "In binary logistic regression the output distribution 𝑝(𝑦 | x; 𝜽) is a Bernoulli distribution", but that is too far from the discussion of randomness in logistic regression.
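
To make the contrast concrete, here is a minimal sketch of the two models side by side (writing g for the logistic function is my notation, not necessarily the book's):

% Linear regression: the randomness is additive noise on the output.
y = \theta^{\top}\mathbf{x} + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^{2})

% Logistic regression: the randomness is the Bernoulli label draw,
% with success probability given by the logistic function.
y \mid \mathbf{x} \sim \mathrm{Bernoulli}\bigl(g(\mathbf{x})\bigr), \qquad
g(\mathbf{x}) = \frac{1}{1 + e^{-\theta^{\top}\mathbf{x}}}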

If you ever consider adding exercises to your wonderful book, let me suggest one. It was highly insightful for me when I first simulated a dataset for binary logistic regression: the simulation shows in practice exactly where the randomness enters. Here is my suggested Python code for the exercise:

import numpy as np
from sklearn.linear_model import LogisticRegression

def logistic(z):
    return 1 / (1 + np.exp(-z))

# Set random seed.
np.random.seed(0)
# True theta coefficients.
theta = np.array([4, -2])
# Number of training data points.
n = 100000
# Number of features.
p = len(theta)
# Generate feature values from U[0,1].
X = np.random.rand(n, p)
# Calculate logits.
z = X @ theta.reshape(-1, 1)
# Calculate probabilities.
prob = logistic(z)
# Generate labels by sampling from Bernoulli(prob).
y = np.random.binomial(1, prob.flatten())
# Train a logistic regression model without regularization.
# (penalty=None requires scikit-learn >= 1.2; older versions used penalty="none".)
clf = LogisticRegression(fit_intercept=False, penalty=None).fit(X, y)
# Check the learned coefficients; they should be close to the true values.
print(f"Learnt theta: {np.round(clf.coef_, 2)} (true theta was {theta})")

# Out: Learnt theta: [[ 4.01 -2.  ]] (true theta was [ 4 -2])
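
A note on the choices above: n is deliberately large so that the maximum-likelihood estimate lands close to the true θ (with a much smaller n the recovered coefficients would be noticeably noisier); fit_intercept=False matches the data-generating model, which has no intercept term; and the penalty is disabled so that regularization does not shrink the coefficients away from the true values.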