Unstable (wrong?) fits when using L2 norm + L1 norm of part of parameters #2132
Problem

Hi community, I'm coming to you with a question that has been bothering me and a few other people for a while now, and that I haven't been able to resolve myself. As I'm not an expert in convex optimization, it's possible that I'm just missing something fundamental in my problem setup, so please bear with me.

I'm trying to fit a time series using least squares with L1-norm regularization on part of the parameters. In the design matrix, the unregularized parameters are all well constrained and have full rank. The parameters to be regularized include linearly dependent functions (hence the regularization) and fit transient periods. The time series contains jumps that are fit by unregularized step functions. When performing the fit, those jumps are fit pretty horribly, which is very apparent both in the time series and in what "should" be the right values for the steps. Here is the plot output from the MWE (including example data) that I've attached below: [attached plot: lasso_fit_mwe.jpg]

On the left is the overall fit of three different models to the data (black dots). The only difference between the models is the weight matrix (just rescaled from the original one). On the right is a zoom-in on a period with steps in close succession. The orange line uses the actual data variance, the green line reduces the variance (increases the weight) by a factor of 100, and the blue line increases the variance (reduces the weight) by a factor of 10.

Clearly, the orange and blue model fits are way off, and I see no reason they shouldn't fit the steps just as well as the green line, since the steps are not regularized. Because the steps come one after another, I could easily produce a better fit to the data by manually reducing the magnitudes of the model steps pairwise: this would reduce the L2 norm of the data residuals with no change to the regularized parameters, and therefore no change to the L1-norm term (see the sketch after the MWE code below). Manually rescaling the variance to get better results is of course also not feasible, because the scaling value would need to be optimized, would differ between datasets, and would change the physical meaning of the fit.

So here's my question: why does this happen? Did I formulate the CVXPY problem incorrectly? Do I need to add an extra parameter estimating a variance scaling? Or is this a numerical issue within the convex optimizer that can or cannot be worked around? Things I've already tried to no satisfactory conclusion:
I should mention that the main reason I'm using CVXPY for this problem is that I can specify which parameters to regularize (in contrast with scikit-learn's approach, where all parameters are regularized equally), and that I can add non-negativity constraints (not used here in my MWE). Thank you for any help you can give!

Code

""" MWE for weird lasso fit behavior """
# imports
import numpy as np
import pandas as pd
import cvxpy as cp
import matplotlib.pyplot as plt
# example data for Gm + error = d
G = np.loadtxt("G.txt") # mapping matrix
W = np.diag(np.loadtxt("W.txt")) # weights matrix = 1 / variance
reg_indices = np.loadtxt("reg_indices.txt").astype(bool) # which parameters to regularize
time = pd.to_datetime(np.loadtxt("ts.time.txt")) # timestamps for d
d = np.loadtxt("ts.values.txt") # observations d
penalty = 10
# solver function
def solve(G, W, d, penalty, reg_indices):
    # normal-equations quantities
    GtWd = G.T @ W @ d
    GtWG = G.T @ W @ G
    m = cp.Variable(GtWG.shape[1])
    lambd = cp.Parameter(shape=(), value=penalty, pos=True)
    # L2 norm of the normal-equations residual plus L1 norm of the regularized subset
    objective = cp.norm2(GtWG @ m - GtWd) + cp.norm1(cp.multiply(lambd, m[reg_indices]))
    problem = cp.Problem(cp.Minimize(objective))
    problem.solve(enforce_dpp=True, solver="CVXOPT", kktsolver="robust")
    return m.value
# solve for different values of W
m1 = solve(G, W * 0.1, d, penalty, reg_indices)
m2 = solve(G, W * 1, d, penalty, reg_indices)
m3 = solve(G, W * 100, d, penalty, reg_indices)
# plot
fig, ax = plt.subplots(ncols=2, sharey=True, constrained_layout=True, figsize=(8, 3))
for a in ax:
    a.plot(time, d, marker=".", c="k", label="Data")
    a.plot(time, G @ m1, label="W*0.1")
    a.plot(time, G @ m2, label="W*1")
    a.plot(time, G @ m3, label="W*100")
ax[0].legend()
ax[1].set_xlim(time[1500], time[1600])
plt.savefig("lasso_fit_mwe.jpg")
plt.show()
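For concreteness, here is a minimal sketch of the pairwise check described above, reusing G, W, d, and m2 from the MWE. The column indices i and j are hypothetical placeholders for two successive, unregularized step functions, so the L1 term is untouched by the nudges:

i, j = 120, 121  # hypothetical: columns of two successive steps (reg_indices is False for both)

def weighted_rss(m):
    r = G @ m - d
    return r @ W @ r  # (G m - d)^T W (G m - d)

base = weighted_rss(m2)
best = base
for delta in np.linspace(-1.0, 1.0, 41):
    m_adj = m2.copy()
    m_adj[i] += delta  # opposite-signed nudges cancel after the second step,
    m_adj[j] -= delta  # so only the short interval between the two steps changes
    best = min(best, weighted_rss(m_adj))
print(base, best)  # best < base would mean m2 is not optimal for weighted LS + L1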
Example data

G.txt

In the mapping matrix G, the columns represent the following spanning functions: [table of spanning functions not recovered]
Replies: 1 comment, 12 replies
Hi,
You can still do the weighted least squares with an L1 norm:

    minimize over m:  (G m - d)^T W (G m - d) + lambda * ||m[reg_indices]||_1

(I may have multiplied the squared error incorrectly.) This would have the smaller dimensionality you want when you solve the optimization problem. I don't think there is a substantial advantage to solving norm2 + norm1 over quadratic penalty + norm1; the general behavior should be similar.
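For illustration, here is a minimal CVXPY sketch of that formulation, reusing G, W, d, penalty, and reg_indices from the MWE above (solve_wls is a hypothetical name; the element-wise square root of the diagonal weights just rewrites the quadratic form as a sum of squares):

import numpy as np
import cvxpy as cp

def solve_wls(G, W, d, penalty, reg_indices):
    # minimize (G m - d)^T W (G m - d) + penalty * ||m[reg_indices]||_1
    m = cp.Variable(G.shape[1])
    w_sqrt = np.sqrt(np.diag(W))  # element-wise sqrt of the diagonal weights
    objective = (cp.sum_squares(cp.multiply(w_sqrt, G @ m - d))
                 + penalty * cp.norm1(m[reg_indices]))
    cp.Problem(cp.Minimize(objective)).solve()
    return m.value

Note that this penalizes the weighted data residual G @ m - d directly, whereas the MWE penalizes the norm of the normal-equations residual GtWG @ m - GtWd; those are different objectives and can give different minimizers.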