Unstable (wrong?) fits when using L2 norm + L1 norm of part of parameters #2132
Problem

Hi community, I'm coming to you with a question that has been bothering me and a few other people for a while now, and that I haven't been able to resolve myself. As I'm not an expert in convex optimization, it's possible that I'm just missing something fundamental in my problem setup, so please bear with me.

I'm trying to fit a time series using least squares with L1-norm regularization on part of the parameters. In the design matrix, the unregularized parameters are all well constrained and have full rank. The parameters to be regularized include linearly dependent functions (hence the regularization) and fit transient periods. The time series contains jumps that are fit by unregularized step functions. When performing the fit, those jumps are fit pretty horribly, which is very apparent both in the time series and in what "should" be the right values for the steps. Here is the plot output from the MWE (including example data) that I've attached below: [attached plot: lasso_fit_mwe.jpg]

On the left is the overall fit of three different models to the data (black dots). The only difference between the models is the weight matrix (just rescaled from the original one). On the right is a zoom-in on a period with steps in close succession. The orange line uses the actual data variance, the green line reduces the variance (increases the weight) by a factor of 100, and the blue line increases the variance (reduces the weight) by a factor of 10.

Clearly, the orange and blue model fits are way off, and I see no reason they shouldn't fit the steps just as well as the green line, since the steps are not regularized. Because the steps come one after another, I could easily produce a better fit to the data by manually reducing the magnitudes of the model steps pairwise: this would reduce the L2 norm of the data residuals with no change to the regularized parameters, and therefore no change to the L1-norm term (see the sketch after the MWE code below). Manually rescaling the variance to get better results is of course also not feasible, because the scaling value would need to be optimized, would differ between datasets, and would change the physical meaning of the fit.

So here's my question: why does this happen? Did I formulate the CVXPY problem incorrectly? Do I need to add an extra parameter estimating a variance scaling? Or is this a numerical issue within the convex optimizer that can or cannot be worked around? Things I've already tried to no satisfactory conclusion:
I should mention that the main reason I'm using CVXPY for this problem is that I can specify which parameters to regularize (in contrast with scikit-learn's approach, where all parameters are regularized equally), and that I can add non-negativity constraints (not used here in my MWE). Thank you for any help you can give!

Code

""" MWE for weird lasso fit behavior """
# imports
import numpy as np
import pandas as pd
import cvxpy as cp
import matplotlib.pyplot as plt
# example data for Gm + error = d
G = np.loadtxt("G.txt") # mapping matrix
W = np.diag(np.loadtxt("W.txt")) # weights matrix = 1 / variance
reg_indices = np.loadtxt("reg_indices.txt").astype(bool) # which parameters to regularize
time = pd.to_datetime(np.loadtxt("ts.time.txt")) # timestamps for d
d = np.loadtxt("ts.values.txt") # observations d
penalty = 10
# solver function
def solve(G, W, d, penalty, reg_indices):
    # normal-equations quantities
    GtWd = G.T @ W @ d
    GtWG = G.T @ W @ G
    m = cp.Variable(GtWG.shape[1])
    lambd = cp.Parameter(shape=(), value=penalty, pos=True)
    # L2 norm of the normal-equations residual plus L1 norm of the regularized subset
    objective = cp.norm2(GtWG @ m - GtWd) + cp.norm1(cp.multiply(lambd, m[reg_indices]))
    problem = cp.Problem(cp.Minimize(objective))
    problem.solve(enforce_dpp=True, solver="CVXOPT", kktsolver="robust")
    return m.value
# solve for different values of W
m1 = solve(G, W * 0.1, d, penalty, reg_indices)
m2 = solve(G, W * 1, d, penalty, reg_indices)
m3 = solve(G, W * 100, d, penalty, reg_indices)
# plot
fig, ax = plt.subplots(ncols=2, sharey=True, constrained_layout=True, figsize=(8, 3))
for a in ax:
    a.plot(time, d, marker=".", c="k", label="Data")
    a.plot(time, G @ m1, label="W*0.1")
    a.plot(time, G @ m2, label="W*1")
    a.plot(time, G @ m3, label="W*100")
ax[0].legend()
ax[1].set_xlim(time[1500], time[1600])
plt.savefig("lasso_fit_mwe.jpg")
plt.show()
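For concreteness, here is a minimal sketch of the pairwise check described above, reusing G, W, d, and m2 from the MWE. The column indices i and j are hypothetical placeholders for two successive, unregularized step functions, so the L1 term is untouched by the nudges:

i, j = 120, 121  # hypothetical: columns of two successive steps (reg_indices is False for both)

def weighted_rss(m):
    r = G @ m - d
    return r @ W @ r  # (G m - d)^T W (G m - d)

base = weighted_rss(m2)
best = base
for delta in np.linspace(-1.0, 1.0, 41):
    m_adj = m2.copy()
    m_adj[i] += delta  # opposite-signed nudges cancel after the second step,
    m_adj[j] -= delta  # so only the short interval between the two steps changes
    best = min(best, weighted_rss(m_adj))
print(base, best)  # best < base would mean m2 is not optimal for weighted LS + L1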
Example data

G.txt

In the mapping matrix G, the columns represent the following spanning functions: [table of spanning functions not recovered]
Replies: 1 comment, 12 replies
Hi,
You can still do the weighted least squares with an L1 norm:

    minimize over m:  (G m - d)^T W (G m - d) + lambda * ||m[reg_indices]||_1

(I may have multiplied the squared error incorrectly.) This would have the smaller dimensionality you want when you solve the optimization problem. I don't think there is a substantial advantage to solving norm2 + norm1 over quadratic penalty + norm1; the general behavior should be similar.
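For illustration, here is a minimal CVXPY sketch of that formulation, reusing G, W, d, penalty, and reg_indices from the MWE above (solve_wls is a hypothetical name; the element-wise square root of the diagonal weights just rewrites the quadratic form as a sum of squares):

import numpy as np
import cvxpy as cp

def solve_wls(G, W, d, penalty, reg_indices):
    # minimize (G m - d)^T W (G m - d) + penalty * ||m[reg_indices]||_1
    m = cp.Variable(G.shape[1])
    w_sqrt = np.sqrt(np.diag(W))  # element-wise sqrt of the diagonal weights
    objective = (cp.sum_squares(cp.multiply(w_sqrt, G @ m - d))
                 + penalty * cp.norm1(m[reg_indices]))
    cp.Problem(cp.Minimize(objective)).solve()
    return m.value

Note that this penalizes the weighted data residual G @ m - d directly, whereas the MWE penalizes the norm of the normal-equations residual GtWG @ m - GtWd; those are different objectives and can give different minimizers.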