Regularization: Lasso vs Ridge

Why regularization exists

Linear regression has one goal: minimise the sum of squared errors on the training set.

The standard ordinary least squares (OLS) loss is:

$$ J(\beta) = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 $$

When you have many features relative to observations, the model overfits: it memorises noise in the sample. In-sample error drops close to zero, while out-of-sample error explodes.
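A small simulation makes the gap concrete. The sketch below is illustrative only: the data-generating process (50 features, 60 training observations, three features carrying real signal) and all names are assumptions, not figures from a real portfolio.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, p = 60, 1000, 50

# Hypothetical data-generating process: only the first 3 features matter.
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]

def simulate(n):
    """Draw n observations with standard normal features and unit-variance noise."""
    X = rng.standard_normal((n, p))
    y = X @ beta_true + rng.standard_normal(n)
    return X, y

def add_intercept(X):
    """Prepend a column of ones so the fit includes an intercept."""
    return np.column_stack([np.ones(len(X)), X])

X_train, y_train = simulate(n_train)
X_test, y_test = simulate(n_test)

# Plain OLS fit on all 50 features.
coef, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)

def mse(X, y):
    """Mean squared error of the fitted OLS model on (X, y)."""
    return float(np.mean((y - add_intercept(X) @ coef) ** 2))

train_mse = mse(X_train, y_train)
test_mse = mse(X_test, y_test)
print(f"train MSE: {train_mse:.3f}, test MSE: {test_mse:.3f}")
```

With almost as many parameters as observations, the training error should fall well below the noise variance of one, while the error on fresh data should sit well above it.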

The expected prediction error decomposes into three terms:

$$ \mathbb{E}\left[(y - \hat{y})^2\right] = \text{Bias}(\hat{y})^2 + \text{Var}(\hat{y}) + \sigma^2 $$

Bias is the systematic error that comes from approximating the real world with an oversimplified model.

Variance is how much the estimate would change if you changed the training sample.

The third term is the irreducible noise.

When we add features, we reduce bias, but we inflate variance. This is the so-called bias-variance trade-off.

Regularization is the approach that controls this trade-off.

Regularization adds a penalty to the loss function proportional to the size of the coefficients. Two penalty shapes matter: L2 (Ridge) and L1 (Lasso).

Important remark: features need to be standardized (mean zero and variance one) before fitting, so that the penalty treats each feature equally.
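As a minimal sketch of that standardization step, assuming NumPy and two hypothetical features on very different scales (an income in euros and a debt-to-income ratio), the scaling parameters are estimated on the training data and then reused on any new data. The function names are illustrative, not part of any library.

```python
import numpy as np

def fit_standardizer(X_train):
    """Estimate per-feature mean and standard deviation on the training set."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant columns unscaled
    return mean, std

def standardize(X, mean, std):
    """Apply the training-set scaling to any data set."""
    return (X - mean) / std

rng = np.random.default_rng(0)
# Hypothetical features: income in euros and a debt-to-income ratio.
income = rng.normal(40_000, 12_000, size=500)
ratio = rng.uniform(0.0, 0.6, size=500)
X_train = np.column_stack([income, ratio])

mean, std = fit_standardizer(X_train)
Z_train = standardize(X_train, mean, std)

print(Z_train.mean(axis=0))  # approximately zero for each feature
print(Z_train.std(axis=0))   # approximately one for each feature
```

Without this step, a coefficient on income in euros would be tiny and barely penalised, while a coefficient on the ratio would be large and heavily penalised, purely because of the units.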

Ridge (L2)

Ridge adds the sum of squared coefficients:

$$ J(\beta)_{\text{Ridge}} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

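The parameter lambda (non-negative) controls how strongly large coefficients are penalised: lambda equal to zero gives back OLS, and larger values shrink the coefficients more.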

Ridge shrinks coefficients toward zero but never sets them to exactly zero.
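The shrinkage can be checked numerically. The sketch below relies on the standard closed-form Ridge solution, (X'X + lambda I)^(-1) X'y, which is not derived in this article, and assumes standardized features and a centred target so that no intercept is needed. The data and the "true" coefficients are made up for illustration.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimise sum((y - X b)^2) + lam * sum(b^2), assuming X and y are centred."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize the features

# Hypothetical true coefficients: two of the five features are irrelevant.
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.5]) + rng.standard_normal(200)
y = y - y.mean()  # centre the target

lambdas = [0.0, 10.0, 100.0, 1000.0]
coefs = [ridge_closed_form(X, y, lam) for lam in lambdas]
for lam, b in zip(lambdas, coefs):
    print(f"lambda={lam:>6}: {np.round(b, 3)}")
```

As lambda grows, every coefficient moves toward zero, but even the coefficients of the irrelevant features stay different from zero.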

Lasso (L1)

Lasso (Least absolute shrinkage and selection operator) adds the sum of the absolute values of coefficients:

$$ \min_{\beta} \left\{ \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p}|\beta_j| \right\} $$

In general no closed-form solution exists, because the absolute value function is not differentiable at zero, so optimisation relies on iterative algorithms such as coordinate descent.
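A minimal sketch of coordinate descent for this objective is shown below. It assumes standardized features and a centred target (so the intercept drops out), and it updates one coefficient at a time with the soft-thresholding operator. The threshold is lambda / 2 because the objective above has no factor of 1/2 in front of the squared error. Function names and data are illustrative assumptions, not a production solver.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma, returning exactly zero when |z| <= gamma."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200, tol=1e-10):
    """Minimise sum((y - X b)^2) + lam * sum(|b|) by cycling over coordinates."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)   # x_j' x_j for each feature
    residual = y - X @ beta         # current residual (a new array, y is untouched)
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j.
            rho = X[:, j] @ residual + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam / 2.0) / col_sq[j]
            change = new_bj - beta[j]
            if change != 0.0:
                residual -= X[:, j] * change  # keep the residual in sync
                beta[j] = new_bj
                max_change = max(max_change, abs(change))
        if max_change < tol:
            break
    return beta

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)
# Hypothetical true coefficients: features 2 and 4 are irrelevant.
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.5]) + rng.standard_normal(200)
y = y - y.mean()

lam = 120.0
beta_lasso = lasso_coordinate_descent(X, y, lam)
print(np.round(beta_lasso, 3))
```

The soft-thresholding step is what produces exact zeros: whenever a feature's correlation with the partial residual is no larger than lambda / 2 in absolute value, its coefficient is set to zero rather than merely made small.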

Lasso shrinks coefficients, and it can also set some of them to exactly zero. That is automatic feature selection.

The geometric difference is about corners

Why does Lasso act as a selection operator while Ridge does not?

Consider the OLS problem in two dimensions, with two coefficients to estimate, beta_1 and beta_2. The loss surface is a paraboloid.

The OLS error contours are ellipses centred at the unconstrained solution. The penalty defines a constraint region the coefficients must stay inside:

  • Ridge: a circle defined by beta_1 squared plus beta_2 squared less than or equal to t

  • Lasso: a diamond defined by the absolute value of beta_1 plus the absolute value of beta_2 less than or equal to t

The optimal solution is the point where the error ellipse first touches the constraint region. On a circle, every boundary point looks the same: smooth, one tangent line. On a diamond, four corners stick out along the axes, exactly where one coefficient is zero.

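The error contours expanding outward from the OLS centre often hit a corner of the diamond first. A corner is a kink in the boundary, where it is not differentiable, which means many ellipse positions and orientations all snap to the same point. Touching along a flat edge instead requires the contour to be tangent to the edge at the contact point, that is, to arrive there with exactly the 45 degree slope. On the circle no boundary point is special, so landing exactly on an axis would be a coincidence.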
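The corner effect can be illustrated numerically in a simplified special case. The sketch below assumes an orthonormal design (X'X = I), so the error contours are circles and both penalised problems have simple per-coefficient solutions: Ridge divides the OLS estimate by (1 + lambda), while Lasso soft-thresholds it at lambda / 2. The OLS centres are drawn at random from a hypothetical square and are not real estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1.0

# Hypothetical OLS solutions for two coefficients, uniform on [-2, 2] x [-2, 2].
beta_ols = rng.uniform(-2.0, 2.0, size=(10_000, 2))

# Orthonormal design: each penalised problem separates coordinate by coordinate.
beta_ridge = beta_ols / (1.0 + lam)
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2.0, 0.0)

# Share of solutions in which at least one coefficient is exactly zero.
ridge_zero_share = float(np.mean(np.any(beta_ridge == 0.0, axis=1)))
lasso_zero_share = float(np.mean(np.any(beta_lasso == 0.0, axis=1)))

print(f"Ridge solutions with an exact zero: {ridge_zero_share:.3f}")
print(f"Lasso solutions with an exact zero: {lasso_zero_share:.3f}")
```

In this setup every OLS centre with a coordinate within lambda / 2 of zero is pulled onto an axis by Lasso, which is a whole band of starting points rather than a single one. Ridge only rescales, so it lands on an axis only if the OLS estimate was already there.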

Why this matters for credit risk modelling

PD models usually start with 50 to 100 candidate features. Most of them carry redundant or noisy information.

OLS would keep all of them and the output would be an overfitted model.

Ridge would keep all of them with smaller coefficients, which stabilises the solution but does not simplify it.

Lasso would drop the irrelevant features and return a simplified model that regulators can read, validators can challenge, and credit officers can explain.