Controls and Offset variables in GLM #16524

arunaryasomayajula · 2025-02-11T22:47:54Z

We have two requests for GLMs that we would like to see added to H2O’s H2OGeneralizedLinearEstimator:

1. A parameter to specify control variables (see 1 below) and to remove the effects of these variables from predictions and from the calculation of model metrics.
2. An additional option to remove the effects of the offset column (see 2 below) from predictions and from the calculation of model metrics.

1) Context: Use of Control variables:

Control variables are sometimes used in our model builds in addition to offset. These variables are not one of the predictors we are modeling, but have effects that we would like to control for in the model. Thus, we would like to request a parameter that does the following:
When fitting the GLM, control variables are fitted in the model in the same way as regular predictors.
After the model is fitted, the effects of the control variables are removed when predicting with the model and when calculating metrics.

To illustrate this request with an example, let’s say we would like to build a GLM with 3 predictors X1, X2, X3. We would also like to specify X4 and X5 as control variables. When fitting the GLM, we would like it fitted on all 5 of these variables. There would be a coefficient estimated for each of the predictors (e.g. B1, B2, B3), each of the controls (e.g. B4, B5), and the intercept (e.g. B0).

During predictions, we would like the prediction to be calculated as follows. Please note the prediction below excludes the effects of the control variables X4 and X5:

y_pred = g^-1(B0 + B1*X1 + B2*X2 + B3*X3)

*g is the link function and g^-1 is its inverse

In addition to the prediction, we would like the model metrics to have the control effects removed as well. In these cases, if the model metrics use predicted values as part of the calculation, we would like the metrics to be calculated using the prediction with the control variable effects removed (as specified above).
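
The requested behavior can be illustrated with a minimal sketch in plain Python. The coefficient values, the `predict` and `mse` helpers, and the `control_vars` argument are hypothetical and are not part of the H2O API; a log link is assumed purely for illustration.

```python
import math

# Hypothetical fitted coefficients: X1..X3 are predictors, X4 and X5 are controls.
INTERCEPT = 0.1
COEFS = {"X1": 0.5, "X2": -0.25, "X3": 1.0, "X4": 2.0, "X5": -1.0}


def predict(row, control_vars=(), inverse_link=math.exp):
    """Return g^-1(B0 + sum of Bi*Xi), skipping any variable in control_vars."""
    eta = INTERCEPT + sum(
        beta * row[name] for name, beta in COEFS.items() if name not in control_vars
    )
    return inverse_link(eta)


def mse(rows, targets, control_vars=()):
    """Mean squared error based on predictions with control effects removed."""
    errors = [(predict(r, control_vars) - y) ** 2 for r, y in zip(rows, targets)]
    return sum(errors) / len(errors)


row = {"X1": 2.0, "X2": 4.0, "X3": 1.0, "X4": 3.0, "X5": 5.0}
with_controls = predict(row)                                # all five terms
without_controls = predict(row, control_vars=("X4", "X5"))  # B0..B3 only
```

Any metric that depends on predicted values would then be computed from `without_controls`-style predictions, as `mse` does here.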

We would also like the control variables option described above to be available for both numerical and categorical variables. If our understanding is correct, H2O one hot encodes the categorical variables behind the scenes. For example, let’s say X5 is a categorical variable with levels A, B, C, with A being the base level in the GLM. There would be a coefficient estimated for level B and one for level C (e.g. B5_B, B5_C). In this case, the process described above should still work and the prediction would be calculated with the same formula as above without the control variables effects (e.g. B5_B, B5_C would be excluded during prediction). Since level A is the base level, there is no separate coefficient for level A to exclude. However, conceptually, the effect of level A is included in the coefficient for the intercept, so this would be predicting with predictor X5 set to base level A. Can you confirm if our interpretation of prediction for categorical control variables works?
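
The interpretation for categorical controls can be checked with a small sketch. The coefficient values and the `linear_predictor` helper are hypothetical; the only assumption carried over from the text is that base level A has no coefficient of its own. Dropping the X5 terms then gives the same linear predictor as scoring every row with X5 set to A.

```python
# Hypothetical coefficients; "X5.B" and "X5.C" are the one-hot terms for the
# categorical control X5, whose base level "A" has no coefficient of its own.
INTERCEPT = 0.2
COEFS = {"X1": 0.4, "X5.B": 0.7, "X5.C": -0.3}


def linear_predictor(x1, x5, drop_control=False):
    """Return B0 + B1*X1, plus the X5 level term unless drop_control is set."""
    eta = INTERCEPT + COEFS["X1"] * x1
    if not drop_control:
        eta += COEFS.get("X5." + x5, 0.0)  # base level A contributes 0
    return eta


# Dropping the X5 terms gives the same result as scoring with X5 = "A".
removed = [linear_predictor(1.5, level, drop_control=True) for level in "ABC"]
at_base = linear_predictor(1.5, "A")
```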

2) Context – Remove offset variable effects:

H2O already has an option to specify an offset column during model fit. We would like to request an additional option to remove the offset effect during prediction and calculation of model metrics. To clarify, the offset will still be included during the model fit as it is today, but the effects will be removed during predictions and calculation of model metrics. If this can be a toggle user can turn on and off, that would be great.

For example, let’s say we fit a GLM with 3 predictors X1, X2, X3, and an offset. If remove_offset_effect is set to False, the prediction would be calculated the same way as today with offset effect included:

y_pred = g^-1(B0 + B1*X1 + B2*X2 + B3*X3 + offset)

On the other hand, if remove_offset_effect is set to True, the prediction would be calculated with offset effect excluded:

y_pred = g^-1(B0 + B1*X1 + B2*X2 + B3*X3)

In addition to the prediction, we would like the model metrics to have the offset effects removed as well when remove_offset_effect=True.
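
The two formulas above can be sketched as a single function with a toggle. This is only an illustration of the requested semantics: `remove_offset_effect` is the parameter name proposed in this request, not an existing H2O option, and the coefficients, the `predict` helper, and the log link are assumptions.

```python
import math

# Hypothetical fitted coefficients for the three predictors.
INTERCEPT = 0.2
COEFS = {"X1": 0.3, "X2": -0.1, "X3": 0.05}


def predict(row, offset, remove_offset_effect=False, inverse_link=math.exp):
    """Return g^-1(eta), adding the offset to eta unless the toggle is on."""
    eta = INTERCEPT + sum(beta * row[name] for name, beta in COEFS.items())
    if not remove_offset_effect:
        eta += offset
    return inverse_link(eta)


row = {"X1": 1.0, "X2": 2.0, "X3": 4.0}
offset = math.log(2.0)  # e.g. log(exposure) under a log link

with_offset = predict(row, offset)                                # current behavior
without_offset = predict(row, offset, remove_offset_effect=True)  # requested behavior
```

With a log link the offset acts as a multiplier, so in this sketch `with_offset` is twice `without_offset`.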

Describe the solution you'd like
I think implementing (1) would take at least 3 weeks. That is an optimistic estimate, assuming it is easy to reuse the Plug Values logic internally and to deal with standardization. Realistically, I think it would take a little over a month.
While looking at possible ways to implement this, I found a partial workaround: it won't produce cross-validation metrics without the control variables, but metrics for a given frame can be generated without them. The trick is to use PlugValues, set the plug values to 0, and score on a frame with the control columns removed. It appears that Plug Values are not standardized even when standardize=True is set. I think this should be considered a bug: it makes it really hard to impute, e.g., the median, since the user would first have to standardize the frame and then calculate the median from that, without knowing whether their standardization yields the same values, because we standardize internally and I don't think we show the values anywhere. But for now I'd say we can exploit this behavior of Plug Values and plug in zeros for numerical variables and base levels for categorical variables.

However, conceptually, the effect of level A is included in the coefficient for the intercept, so this would be predicting with predictor X5 set to base level A. Can you confirm if our interpretation of prediction for categorical control variables works?
Yes, and if I'm not mistaken, when standardize=False, numerical variables with a non-zero mean can also influence the intercept.

Here is a small example of the workaround, demonstrating that it works correctly in both the standardized and non-standardized cases:

library(h2o)
h2o.init()  # start or connect to a local H2O cluster

h2o_iris <- as.h2o(iris)

glm_model_plug_values_standardized <- h2o.glm(y="Sepal.Length",
                                              training_frame=h2o_iris,
                                              missing_values_handling="PlugValues",
                                              plug_values=as.h2o(data.frame(
                                                Sepal.Width=0,
                                                Petal.Length=0,
                                                Petal.Width=0,
                                                Species="versicolor")),
                                              standardize = TRUE)

glm_model_plug_values_NOT_standardized <- h2o.glm(y="Sepal.Length",
                                                  training_frame=h2o_iris,
                                                  missing_values_handling="PlugValues",
                                                  plug_values=as.h2o(data.frame(
                                                    Sepal.Width=0,
                                                    Petal.Length=0,
                                                    Petal.Width=0,
                                                    Species="versicolor")),
                                                  standardize = FALSE)

# Metrics on the full frame (all effects included)
h2o.performance(glm_model_plug_values_standardized, h2o_iris)
h2o.performance(glm_model_plug_values_NOT_standardized, h2o_iris)

# Metrics with Petal.Width and Species dropped, so their plug values (0 and "versicolor") are used
h2o.performance(glm_model_plug_values_standardized, h2o_iris[,c(-4,-5)])
h2o.performance(glm_model_plug_values_NOT_standardized, h2o_iris[,c(-4,-5)])

# Predictions on the full frame
predict(glm_model_plug_values_standardized, h2o_iris)
predict(glm_model_plug_values_NOT_standardized, h2o_iris)

# Predictions with Petal.Width and Species dropped
predict(glm_model_plug_values_standardized, h2o_iris[,c(-4,-5)])
predict(glm_model_plug_values_NOT_standardized, h2o_iris[,c(-4,-5)])

# Standardized model
glm_model_plug_values_standardized@model$coefficients
# Intercept     Species.setosa Species.versicolor  Species.virginica        Sepal.Width       Petal.Length        Petal.Width 
# 1.5834912          0.5716142          0.0000000         -0.2420645          0.5199133          0.7813752         -0.3135125 
h2o_iris[,c(-4,-5)]
#   Sepal.Length Sepal.Width Petal.Length
# 1          5.1         3.5          1.4
# 2          4.9         3.0          1.4
# 3          4.7         3.2          1.3
# 4          4.6         3.1          1.5
# 5          5.0         3.6          1.4
# 6          5.4         3.9          1.7

# Intercept + b_SW*SW + b_PL*PL  (and -0.3135125 * 0 (0 is the plug value; otherwise it would be 0.2) + 0 for versicolor)
abs(1.5834912+0.5199133*3.5+0.7813752*1.4 - predict(glm_model_plug_values_standardized, h2o_iris[,c(-4,-5)])[1]) # => 4.067362e-08


# Not standardized model
glm_model_plug_values_NOT_standardized@model$coefficients
# Intercept     Species.setosa Species.versicolor  Species.virginica        Sepal.Width       Petal.Length        Petal.Width 
# 1.6463719          0.5445527          0.0000000         -0.2283761          0.5138489          0.7678637         -0.3043910 
h2o_iris[,c(-4,-5)]
#   Sepal.Length Sepal.Width Petal.Length
# 1          5.1         3.5          1.4
# 2          4.9         3.0          1.4
# 3          4.7         3.2          1.3
# 4          4.6         3.1          1.5
# 5          5.0         3.6          1.4
# 6          5.4         3.9          1.7

# Intercept + b_SW*SW + b_PL*PL  (and -0.3043910 * 0 (0 is the plug value; otherwise it would be 0.2) + 0 for versicolor)
abs(1.6463719+0.5138489*3.5+0.7678637*1.4 - predict(glm_model_plug_values_NOT_standardized, h2o_iris[,c(-4,-5)])[1]) # => 1.814509e-07
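
The hand-computed side of the two checks above can be reproduced without a running cluster. The following sketch only redoes the arithmetic from the comments, using the coefficients and the first iris row copied from the R output; the `manual_prediction` helper is hypothetical, and the model predictions themselves still have to come from H2O.

```python
# Coefficients copied from the R output above; the Petal.Width and Species
# terms are omitted because they are 0 at the plug values.
MODELS = {
    "standardized": {
        "Intercept": 1.5834912, "Sepal.Width": 0.5199133, "Petal.Length": 0.7813752,
    },
    "not_standardized": {
        "Intercept": 1.6463719, "Sepal.Width": 0.5138489, "Petal.Length": 0.7678637,
    },
}

# First row of h2o_iris with Petal.Width and Species dropped.
first_row = {"Sepal.Width": 3.5, "Petal.Length": 1.4}


def manual_prediction(coefs, row):
    """Intercept + b_SW*SW + b_PL*PL for one row."""
    return coefs["Intercept"] + sum(coefs[name] * value for name, value in row.items())


manual = {name: manual_prediction(coefs, first_row) for name, coefs in MODELS.items()}
for name, value in manual.items():
    print(f"{name}: {value:.7f}")
```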

(2)

H2O already has an option to specify an offset column during model fit. We would like to request an additional option to remove the offset effect during prediction and calculation of model metrics. To clarify, the offset will still be included during the model fit as it is today, but the effects will be removed during predictions and calculation of model metrics. If this can be a toggle user can turn on and off, that would be great.
It looks to me like this can be done without touching the GLM code at all, so I think it could be faster to implement it for all models in the Model class. I think I'd be able to do that in ~3 weeks, but I'm not very familiar with that part of h2o-3, so maybe Adam Valenta would have a better estimate.

H2O.ai Devs only
https://support.h2o.ai/a/tickets/110095


arunaryasomayajula added cust-statefarm feature reporter-support Reported as a support issue by cuetomer labels Feb 11, 2025

arunaryasomayajula assigned arunaryasomayajula and maurever and unassigned arunaryasomayajula Feb 12, 2025

maurever added this to the 3.48.0.1 milestone Feb 13, 2025

maurever changed the title ~~Feature Request for Controls and Offset variables in H2OGeneralizedLinearEstimator~~ Controls and Offset variables in GLM Feb 13, 2025


arunaryasomayajula commented Feb 11, 2025 • edited by maurever
