Regularization in Machine Learning

Elastic-Net

ElasticNet Objective Function (Updated Notation):

\[\min_{\beta} \left( \frac{1}{2} | y - X \beta |_2^2 + \lambda \alpha | \beta |_1 + \frac{\lambda (1 - \alpha)}{2} | \beta |_2^2 \right)\]

where $y \in \mathbb{R}^n$ is the response vector, $X \in \mathbb{R}^{n \times p}$ is the design matrix with columns $X_1, \dots, X_p$, $\beta \in \mathbb{R}^p$ is the coefficient vector, $\lambda \ge 0$ controls the overall strength of regularization, and $\alpha \in [0, 1]$ balances the L1 and L2 penalties ($\alpha = 1$ recovers the lasso, $\alpha = 0$ recovers ridge regression).
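
As a concrete reference point, below is a minimal NumPy sketch of this objective; the function name `elastic_net_objective` and the random data are purely illustrative, not part of any library API.

```python
import numpy as np

def elastic_net_objective(beta, X, y, lam, alpha):
    """0.5*||y - X beta||_2^2 + lam*alpha*||beta||_1 + 0.5*lam*(1-alpha)*||beta||_2^2."""
    residual = y - X @ beta
    rss = 0.5 * residual @ residual             # 0.5 * ||y - X beta||_2^2
    l1 = lam * alpha * np.sum(np.abs(beta))     # L1 (lasso) penalty
    l2 = 0.5 * lam * (1 - alpha) * beta @ beta  # L2 (ridge) penalty
    return rss + l1 + l2

# Illustrative usage with random data
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = rng.standard_normal(50)
beta = rng.standard_normal(5)
print(elastic_net_objective(beta, X, y, lam=1.0, alpha=0.5))
```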

Update Rule for ElasticNet:

In coordinate descent we update one coefficient $\beta_j$ at a time, keeping the other coefficients fixed. Let’s derive the update for $\beta_j$.

  1. Residual Sum of Squares (RSS):

    • The derivative of $\frac{1}{2} | y - X \beta |_2^2$ with respect to $\beta_j$ is $-X_j^T (y - X \beta)$.
    • The residual $r$ is defined as \(r = y - X \beta\), which can be expanded as \(y - X \beta = y - \sum_{k=1}^p X_k \beta_k\)

where $X_k$ is the $k$-th column of $X$, and $\beta_k$ is the corresponding coefficient.

Splitting off the $j$-th term, the residual can also be written as \(y - X \beta = (y - X_{-j} \beta_{-j}) - X_j \beta_j\), where $X_{-j} \beta_{-j}$ is the contribution of all predictors except $X_j$.
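
This split is easy to check numerically. The short sketch below, with illustrative NumPy arrays not tied to any particular dataset, verifies that $y - X \beta = (y - X_{-j} \beta_{-j}) - X_j \beta_j$ for one coordinate $j$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
y = rng.standard_normal(50)
beta = rng.standard_normal(5)
j = 2

# Full residual r = y - X beta
r = y - X @ beta

# Residual with the j-th predictor's contribution removed: y - X_{-j} beta_{-j}
r_minus_j = y - (X @ beta - X[:, j] * beta[j])

# Check the split r = (y - X_{-j} beta_{-j}) - X_j beta_j
assert np.allclose(r, r_minus_j - X[:, j] * beta[j])
```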

In coordinate descent, we aim to update $\beta_j$ while keeping all other coefficients fixed. The gradient of the residual sum of squares term with respect to $\beta_j$ is:

\(-X_j^T (y - X \beta)\). By substituting $y - X \beta$ with $(y - X_{-j} \beta_{-j}) - X_j \beta_j$, we get:

\[-X_j^T \left( (y - X_{-j} \beta_{-j}) - X_j \beta_j \right)\]

Expanding this term gives:

\[-X_j^T (y - X_{-j} \beta_{-j}) + X_j^T X_j \beta_j\]

Writing $X_j^T X_j$ as $| X_j |_2^2$, we isolate the term involving $\beta_j$:

\[-X_j^T (y - X_{-j} \beta_{-j}) + | X_j |_2^2 \beta_j\]

The vector $y - X_{-j} \beta_{-j}$ is often called the partial residual for $\beta_j$, since it is the residual computed without the contribution of $X_j \beta_j$; the term $X_j^T (y - X_{-j} \beta_{-j})$ measures how strongly $X_j$ is correlated with this partial residual.
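
The sketch below, again with illustrative NumPy arrays, computes this partial residual term and checks that it equals $X_j^T (y - X \beta) + | X_j |_2^2 \beta_j$, so that the two forms of the RSS gradient derived above agree.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 5))
y = rng.standard_normal(50)
beta = rng.standard_normal(5)
j = 2

# Partial residual term for coordinate j: X_j^T (y - X_{-j} beta_{-j})
r_minus_j = y - (X @ beta - X[:, j] * beta[j])
partial_residual = X[:, j] @ r_minus_j

# It equals X_j^T (y - X beta) + ||X_j||_2^2 * beta_j
assert np.allclose(partial_residual,
                   X[:, j] @ (y - X @ beta) + (X[:, j] @ X[:, j]) * beta[j])

# Gradient of 0.5*||y - X beta||_2^2 w.r.t. beta_j, written both ways
grad_rss_j = -X[:, j] @ (y - X @ beta)
assert np.allclose(grad_rss_j,
                   -partial_residual + (X[:, j] @ X[:, j]) * beta[j])
```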

  2. L1 Penalty:

    • The derivative of $\lambda \alpha | \beta |_1$ with respect to $\beta_j$ is $\lambda \alpha \cdot \text{sign}(\beta_j)$ for $\beta_j \neq 0$ (the penalty is non-differentiable at zero, which is what gives rise to the soft-thresholding operator below).
  3. L2 Penalty:

    • The derivative of $\frac{\lambda (1 - \alpha)}{2} | \beta |_2^2$ with respect to $\beta_j$ is $\lambda (1 - \alpha) \beta_j$.
  4. Combining these, the partial derivative of the objective function with respect to $\beta_j$ (for $\beta_j \neq 0$) is:

\[\frac{\partial \mathcal{L}}{\partial \beta_j} = - X_j^T (y - X_{-j} \beta_{-j}) + | X_j |_2^2 \beta_j + \lambda (1 - \alpha) \beta_j + \lambda \alpha \cdot \text{sign}(\beta_j)\]
  5. To update $\beta_j$, we set the partial derivative to zero and rearrange:
\[\beta_j \left( | X_j |_2^2 + \lambda (1 - \alpha) \right) = X_j^T (y - X_{-j} \beta_{-j}) - \lambda \alpha \cdot \text{sign}(\beta_j)\]

where $X_{-j} \beta_{-j}$ represents the prediction excluding $X_j \beta_j$.

Because $\text{sign}(\beta_j)$ depends on the unknown $\beta_j$, we solve this equation case by case ($\beta_j > 0$, $\beta_j < 0$, $\beta_j = 0$) and divide through by $| X_j |_2^2 + \lambda (1 - \alpha)$, which gives the closed-form update:
\[\beta_j = \frac{S \left( X_j^T (y - X_{-j} \beta_{-j}), \lambda \alpha \right)}{| X_j |_2^2 + \lambda (1 - \alpha)}\]

where $S(z, \lambda)$ is the soft-thresholding operator defined as: \(S(z, \lambda) = \text{sign}(z) \cdot \max(|z| - \lambda, 0)\)
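
Putting the derivation together, here is a minimal coordinate-descent sketch based on the update rule above. The names `soft_threshold` and `elastic_net_cd` are illustrative, and the loop omits refinements a production solver would include (convergence checks, intercept handling, and different scaling conventions for the loss term).

```python
import numpy as np

def soft_threshold(z, threshold):
    """S(z, threshold) = sign(z) * max(|z| - threshold, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

def elastic_net_cd(X, y, lam, alpha, n_iters=100):
    """Coordinate descent for the elastic-net objective above (no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq_norms = np.sum(X ** 2, axis=0)  # ||X_j||_2^2 for each column j
    for _ in range(n_iters):
        for j in range(p):
            # Partial residual term: X_j^T (y - X_{-j} beta_{-j})
            r_minus_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_minus_j
            # Update rule: beta_j = S(rho, lam*alpha) / (||X_j||_2^2 + lam*(1-alpha))
            beta[j] = soft_threshold(rho, lam * alpha) / (col_sq_norms[j] + lam * (1 - alpha))
    return beta

# Illustrative usage: recover a sparse coefficient vector from noisy data
rng = np.random.default_rng(3)
X = rng.standard_normal((100, 10))
true_beta = np.zeros(10)
true_beta[:3] = [2.0, -1.5, 0.5]
y = X @ true_beta + 0.1 * rng.standard_normal(100)
print(elastic_net_cd(X, y, lam=1.0, alpha=0.9).round(3))
```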
