Controls and Offset variables in GLM #16524

Open
arunaryasomayajula opened this issue Feb 11, 2025 · 0 comments

We have a couple of requests for GLMs that we hope can be added to H2O’s H2OGeneralizedLinearEstimator. The requests are:

  1. A parameter to specify control variables (see context 1 below) and to remove the effects of these variables for prediction and calculation of model metrics
  2. An additional option to remove the effects of the offset column (see context 2 below) for prediction and calculation of model metrics

1) Context: Use of Control variables:

Control variables are sometimes used in our model builds in addition to offset. These variables are not one of the predictors we are modeling, but have effects that we would like to control for in the model. Thus, we would like to request a parameter that does the following:
When fitting the GLM, control variables are also fitted in the model, in the same way as a regular predictor.
After the model is fitted, when predicting with the model and calculating metrics, the control variables' effects are removed.

To illustrate this request through an example, let’s say we would like to build a GLM with 3 predictors X1, X2, X3. We would also like to specify X4 and X5 as control variables. When fitting the GLM, we would like the GLM fitted on all 5 of these variables. There would be a coefficient estimated for each of the predictors (e.g. B1, B2, B3), each of the controls (e.g. B4, B5), and the intercept (e.g. B0).

During predictions, we would like the prediction to be calculated as follows. Please note the prediction below excludes the effects of the control variables X4 and X5:

y_pred = g^-1(B0 + B1*X1 + B2*X2 + B3*X3)

*g is the link function and g^-1 is the inverse of the link function
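
The requested prediction can be sketched in plain Python. This is only an illustration of the formula above, not H2O code: the coefficient values, the example row, the helper name `predict_without_controls`, and the choice of a log link are all assumptions made for the example.

```python
import math

def predict_without_controls(coefs, row, controls, inv_link=math.exp):
    """Apply the inverse link to the linear predictor, skipping control terms."""
    eta = coefs["Intercept"]
    for name, value in row.items():
        if name in controls:
            continue  # control variables were fitted, but are left out here
        eta += coefs[name] * value
    return inv_link(eta)

# Hypothetical fitted coefficients B0..B5 (all six are estimated during fitting).
coefs = {"Intercept": 0.5, "X1": 0.2, "X2": -0.1, "X3": 0.3, "X4": 0.7, "X5": -0.4}
row = {"X1": 1.0, "X2": 2.0, "X3": 0.5, "X4": 3.0, "X5": 1.0}

# Passing an empty control set reproduces the usual prediction.
with_controls = predict_without_controls(coefs, row, controls=set())
# Passing X4 and X5 as controls gives g^-1(B0 + B1*X1 + B2*X2 + B3*X3).
without_controls = predict_without_controls(coefs, row, controls={"X4", "X5"})
print(with_controls, without_controls)
```

The fit itself is unchanged in this sketch; only the scoring step ignores the control terms.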

In addition to the prediction, we would like the model metrics to have the control effects removed as well. In these cases, if the model metrics use predicted values as part of the calculation, we would like the metrics to be calculated using the prediction with the control variable effects removed (as specified above).
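
As a rough sketch of what this means for a metric, the snippet below computes mean squared error twice: once from the full predictions and once from predictions with the control term left out. It assumes an identity link, and the data, coefficients, and helper names are hypothetical; it does not reflect how H2O computes its metrics internally.

```python
def linear_predictor(coefs, row, skip=()):
    """Intercept plus coefficient * value for every column not in `skip`."""
    return coefs["Intercept"] + sum(
        coefs[name] * value for name, value in row.items() if name not in skip
    )

def mse(actual, predicted):
    """Mean squared error between two equally long sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical model with one predictor (X1) and one control variable (X4).
coefs = {"Intercept": 1.0, "X1": 2.0, "X4": 0.5}
rows = [{"X1": 1.0, "X4": 2.0}, {"X1": 2.0, "X4": 4.0}]
actual = [4.0, 7.0]

pred_full = [linear_predictor(coefs, r) for r in rows]
pred_no_ctrl = [linear_predictor(coefs, r, skip={"X4"}) for r in rows]

mse_full = mse(actual, pred_full)        # metric as computed today
mse_no_ctrl = mse(actual, pred_no_ctrl)  # metric with control effects removed
print(mse_full, mse_no_ctrl)
```

The two metrics generally differ, so a user would need to know which variant a reported number refers to.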

We would also like the control variables option described above to be available for both numerical and categorical variables. If our understanding is correct, H2O one-hot encodes the categorical variables behind the scenes. For example, let’s say X5 is a categorical variable with levels A, B, C, with A being the base level in the GLM. There would be a coefficient estimated for level B and one for level C (e.g. B5_B, B5_C). In this case, the process described above should still work, and the prediction would be calculated with the same formula as above, without the control variables' effects (e.g. B5_B and B5_C would be excluded during prediction). Since level A is the base level, there is no separate coefficient for level A to exclude. However, conceptually, the effect of level A is included in the coefficient for the intercept, so this would be predicting with predictor X5 set to base level A. Can you confirm whether our interpretation of prediction for categorical control variables is correct?
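
The interpretation in this paragraph can be checked algebraically with a small sketch. It assumes reference-level dummy coding exactly as described above (base level A has no coefficient); the coefficient values and helper names are hypothetical, and H2O's actual encoding may differ depending on its settings.

```python
LEVELS = ["A", "B", "C"]  # "A" is the base level, so it gets no dummy column

def one_hot(level):
    """Dummy-encode X5, dropping the base level A."""
    return {f"X5_{lvl}": 1.0 if level == lvl else 0.0 for lvl in LEVELS[1:]}

def eta(coefs, features, skip_prefix=None):
    """Linear predictor, optionally skipping columns that start with a prefix."""
    total = coefs["Intercept"]
    for name, value in features.items():
        if skip_prefix and name.startswith(skip_prefix):
            continue
        total += coefs[name] * value
    return total

# Hypothetical coefficients: B0, B1, B5_B, B5_C.
coefs = {"Intercept": 1.0, "X1": 0.5, "X5_B": 0.3, "X5_C": -0.2}

row_c = {"X1": 2.0, **one_hot("C")}  # observed row has X5 = C
row_a = {"X1": 2.0, **one_hot("A")}  # same row with X5 forced to base level A

dropped = eta(coefs, row_c, skip_prefix="X5_")  # exclude B5_B and B5_C
at_base = eta(coefs, row_a)                     # keep all terms, X5 = A
print(dropped, at_base)
```

Under this coding, dropping the dummy coefficients and predicting at the base level give the same linear predictor.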

2) Context – Remove offset variable effects:

H2O already has an option to specify an offset column during model fit. We would like to request an additional option to remove the offset effect during prediction and calculation of model metrics. To clarify, the offset will still be included during the model fit as it is today, but its effects will be removed during predictions and calculation of model metrics. If this can be a toggle the user can turn on and off, that would be great.

For example, let’s say we fit a GLM with 3 predictors X1, X2, X3, and an offset. If remove_offset_effect is set to False, the prediction would be calculated the same way as today with offset effect included:

y_pred = g^-1(B0 + B1*X1 + B2*X2 + B3*X3 + offset)

On the other hand, if remove_offset_effect is set to True, the prediction would be calculated with offset effect excluded:

y_pred = g^-1(B0 + B1*X1 + B2*X2 + B3*X3)

In addition to the prediction, we would like the model metrics to have the offset effects removed as well (remove_offset_effect=True).
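
A minimal sketch of the requested toggle, in plain Python rather than H2O: `remove_offset_effect` is the option name proposed in this issue, while the function name `predict`, the coefficients, and the log link with a log-exposure offset are assumptions chosen for illustration.

```python
import math

def predict(coefs, row, offset, remove_offset_effect=False, inv_link=math.exp):
    """Score one row; the offset is added to the linear predictor unless removed."""
    eta = coefs["Intercept"] + sum(coefs[k] * v for k, v in row.items())
    if not remove_offset_effect:
        eta += offset
    return inv_link(eta)

# Hypothetical coefficients B0..B3 and one example row.
coefs = {"Intercept": -1.0, "X1": 0.5, "X2": 0.25, "X3": 0.1}
row = {"X1": 2.0, "X2": 4.0, "X3": 0.0}
offset = math.log(10.0)  # e.g. log(exposure) under a log link

with_offset = predict(coefs, row, offset)
without_offset = predict(coefs, row, offset, remove_offset_effect=True)
print(with_offset, without_offset)
```

With a log link, removing a log-exposure offset rescales the prediction by the exposure, so here the two predictions differ by a factor of 10.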

Describe the solution you'd like
I think implementation of (1) would take at least 3 weeks (an optimistic estimate, assuming it is easy to reuse the Plug Values logic internally and to deal with standardization). Realistically, I think it would take a little over a month.
While looking at possible ways to implement this, I found a partial workaround: it won't produce cross-validation metrics without the control variables, but metrics for a given frame can be generated without them. The trick is to use missing_values_handling="PlugValues" and set the plug values to 0. It appears that plug values are not standardized even when standardize=True is set. I think this should be considered a bug, because it makes it really hard to impute, e.g., the median: the user would first have to standardize the frame and then calculate the median from that, without knowing whether their standardization yields the same values as ours, since we do it internally and I don't think we show the values anywhere. For now, though, I'd say we can exploit this behavior and plug in zeros for numerical variables and base levels for categorical variables.
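
To see why the scale of a plug value matters, here is a small algebraic sketch. It is not a description of H2O internals: it only uses the standard identity that a model fitted on z = (x - mean) / sd can be rewritten on the raw x scale, and all numbers and the helper name `destandardize` are hypothetical.

```python
def destandardize(intercept_std, coef_std, mean, sd):
    """Convert coefficients fitted on z = (x - mean) / sd back to the raw x scale."""
    coef_raw = coef_std / sd
    intercept_raw = intercept_std - coef_std * mean / sd
    return intercept_raw, coef_raw

# Hypothetical column statistics and standardized-scale coefficients.
mean, sd = 4.0, 2.0
intercept_std, coef_std = 3.0, 1.0
intercept_raw, coef_raw = destandardize(intercept_std, coef_std, mean, sd)

# Plugging 0 on the raw scale removes the term and leaves the raw intercept.
pred_raw_zero = intercept_raw + coef_raw * 0.0
# Plugging 0 on the standardized scale is the same as plugging x = mean.
pred_std_zero = intercept_std + coef_std * 0.0
pred_raw_mean = intercept_raw + coef_raw * mean
print(pred_raw_zero, pred_std_zero, pred_raw_mean)
```

A zero plug value therefore removes a variable's contribution only if it is interpreted on the same scale as the coefficients used for scoring.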

"However, conceptually, the effect of level A is included in the coefficient for the intercept, so this would be predicting with predictor X5 set to base level A. Can you confirm if our interpretation of prediction for categorical control variables works?"

Yes, and if I'm not mistaken, when standardize=False, numerical variables with a non-zero mean can also influence the intercept.

Here is a small example of the workaround, which demonstrates that it works correctly for both the standardized and the non-standardized case:

library(h2o)
h2o.init()  # start or connect to a local H2O cluster

h2o_iris <- as.h2o(iris)

glm_model_plug_values_standardized <- h2o.glm(y="Sepal.Length",
                                              training_frame=h2o_iris,
                                              missing_values_handling="PlugValues",
                                              plug_values=as.h2o(data.frame(
                                                Sepal.Width=0,
                                                Petal.Length=0,
                                                Petal.Width=0,
                                                Species="versicolor")),
                                              standardize = TRUE)

glm_model_plug_values_NOT_standardized <- h2o.glm(y="Sepal.Length",
                                              training_frame=h2o_iris,
                                              missing_values_handling="PlugValues",
                                              plug_values=as.h2o(data.frame(
                                                Sepal.Width=0,
                                                Petal.Length=0,
                                                Petal.Width=0,
                                                Species="versicolor")),
                                              standardize = FALSE)

h2o.performance(glm_model_plug_values_standardized, h2o_iris)
h2o.performance(glm_model_plug_values_NOT_standardized, h2o_iris)

h2o.performance(glm_model_plug_values_standardized, h2o_iris[,c(-4,-5)])
h2o.performance(glm_model_plug_values_NOT_standardized, h2o_iris[,c(-4,-5)])

predict(glm_model_plug_values_standardized, h2o_iris)
predict(glm_model_plug_values_NOT_standardized, h2o_iris)

predict(glm_model_plug_values_standardized, h2o_iris[,c(-4,-5)])
predict(glm_model_plug_values_NOT_standardized, h2o_iris[,c(-4,-5)])

# Standardized model
glm_model_plug_values_standardized@model$coefficients
# Intercept     Species.setosa Species.versicolor  Species.virginica        Sepal.Width       Petal.Length        Petal.Width 
# 1.5834912          0.5716142          0.0000000         -0.2420645          0.5199133          0.7813752         -0.3135125 
h2o_iris[,c(-4,-5)]
#   Sepal.Length Sepal.Width Petal.Length
# 1          5.1         3.5          1.4
# 2          4.9         3.0          1.4
# 3          4.7         3.2          1.3
# 4          4.6         3.1          1.5
# 5          5.0         3.6          1.4
# 6          5.4         3.9          1.7

# Intercept + b_SW*SW + b_PL*PL  (and -0.3135125 * 0 (0 is the plug value; otherwise it would be 0.2) + 0 for versicolor)
abs(1.5834912+0.5199133*3.5+0.7813752*1.4 - predict(glm_model_plug_values_standardized, h2o_iris[,c(-4,-5)])[1]) # => 4.067362e-08


# Not standardized model
glm_model_plug_values_NOT_standardized@model$coefficients
# Intercept     Species.setosa Species.versicolor  Species.virginica        Sepal.Width       Petal.Length        Petal.Width 
# 1.6463719          0.5445527          0.0000000         -0.2283761          0.5138489          0.7678637         -0.3043910 
h2o_iris[,c(-4,-5)]
#   Sepal.Length Sepal.Width Petal.Length
# 1          5.1         3.5          1.4
# 2          4.9         3.0          1.4
# 3          4.7         3.2          1.3
# 4          4.6         3.1          1.5
# 5          5.0         3.6          1.4
# 6          5.4         3.9          1.7

# Intercept + b_SW*SW + b_PL*PL  (and -0.3043910 * 0 (0 is the plug value; otherwise it would be 0.2) + 0 for versicolor)
abs(1.6463719+0.5138489*3.5+0.7678637*1.4 - predict(glm_model_plug_values_NOT_standardized, h2o_iris[,c(-4,-5)])[1]) # => 1.814509e-07
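
The arithmetic in the standardized-model check can be reproduced without a cluster. The sketch below copies the coefficients printed above and uses the first iris row (Sepal.Width 3.5, Petal.Length 1.4, Petal.Width 0.2, Species setosa; the species comes from the standard iris data set, not from the output shown here). It only recomputes the sums and makes no claim about what H2O returns.

```python
# Coefficients copied from the standardized model's output above.
coefs = {
    "Intercept": 1.5834912,
    "Species.setosa": 0.5716142,
    "Sepal.Width": 0.5199133,
    "Petal.Length": 0.7813752,
    "Petal.Width": -0.3135125,
}

# Linear predictor with Petal.Width and Species plugged (both contribute 0).
reduced = coefs["Intercept"] + coefs["Sepal.Width"] * 3.5 + coefs["Petal.Length"] * 1.4

# Contribution the plugged columns would have made for the first row.
removed = coefs["Petal.Width"] * 0.2 + coefs["Species.setosa"] * 1.0

# Linear predictor if nothing had been plugged.
full = reduced + removed
print(round(reduced, 6), round(removed, 6), round(full, 6))
```

The difference between `full` and `reduced` is exactly the effect that the plug values strip out of the prediction.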

(2)

"H2O already has an option to specify an offset column during model fit. We would like to request an additional option to remove the offset effect during prediction and calculation of model metrics. To clarify, the offset will still be included during the model fit as it is today, but the effects will be removed during predictions and calculation of model metrics. If this can be a toggle user can turn on and off, that would be great."

It looks to me like this can be done without touching the GLM code at all, so it could be faster to implement it for all models in the Model class. I think I could do that in ~3 weeks, but I'm not very familiar with that part of h2o-3, so Adam Valenta may have a better estimate.

H2O.ai Devs only
https://support.h2o.ai/a/tickets/110095

@arunaryasomayajula arunaryasomayajula added cust-statefarm feature reporter-support Reported as a support issue by cuetomer labels Feb 11, 2025
@maurever maurever added this to the 3.48.0.1 milestone Feb 13, 2025
@maurever maurever changed the title Feature Request for Controls and Offset variables in H2OGeneralizedLinearEstimator Controls and Offset variables in GLM Feb 13, 2025