Two regularization methods are commonly used: L1 and L2. $L_2$ regularization is also known as Tikhonov regularization or ridge regression.

Question:

I have a theoretical question related to my research on L2 (ridge) and L1 (lasso) regularization. I know the formulas and understand the goals of the two procedures. The L1 penalty is lambda times the sum of the absolute values of the coefficients, while the L2 penalty is lambda times the sum of their squares. However, I don't see why L2 regularization cannot shrink parameters to exactly zero while L1 can, since this is not evident from the formulas. Can someone please clarify this with an example?

L1 regularization: penalty term $\lambda \sum_j |\beta_j|$

L2 regularization: penalty term $\lambda \sum_j \beta_j^2$

Thanks for your help!

Solution 1:

Let's compare the ridge coefficient estimates with the estimates obtained from OLS.

In matrix form, the expression that minimizes the ridge loss is:

$\hat{\beta}^{\text{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$
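As a rough numerical illustration of the ridge closed form, here is a minimal sketch using NumPy. The function name `ridge_closed_form`, the synthetic data, and the choice `lam=10.0` are assumptions made for this example only; they do not come from ESL.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Return the beta that solves (X^T X + lam * I) beta = X^T y."""
    n_features = X.shape[1]
    # Solving the linear system is preferred over forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data (hypothetical): the second true coefficient is exactly 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_beta = np.array([2.0, 0.0, -1.0])
y = X @ true_beta + 0.1 * rng.normal(size=50)

beta_ols = ridge_closed_form(X, y, lam=0.0)     # lam = 0 reduces to OLS
beta_ridge = ridge_closed_form(X, y, lam=10.0)  # shrunk toward 0

print("OLS:  ", beta_ols)
print("ridge:", beta_ridge)
```

In this sketch the ridge estimates are smaller in norm than the OLS estimates, but none of them is exactly 0, even for the predictor whose true coefficient is 0.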

According to the authors of "The Elements of Statistical Learning" (ESL), when the columns of $\mathbf{X}$ are orthonormal, each component of $\hat{\beta}^{\text{ridge}}$ can be written in terms of the OLS estimate $\hat{\beta}_j$ as:

$\hat{\beta}_j^{\text{ridge}} = \frac{\hat{\beta}_j}{1+\lambda}$

It is now apparent why this coefficient cannot be exactly 0: for any finite $\lambda$, dividing by $1+\lambda$ shrinks $\hat{\beta}_j$ but yields 0 only if the OLS estimate is already 0. This is in contrast to the lasso, whose coefficients can be exactly 0. With orthonormal columns of the design matrix, the lasso estimate is

$\hat{\beta}_j^{\text{lasso}} = \text{sign}(\hat{\beta}_j)\,(|\hat{\beta}_j| - \lambda)_{+}$

where $\hat{\beta}_j$ is the OLS estimate for the $j$th predictor, $\lambda$ is the tuning parameter, and $(x)_{+}$ denotes the positive part of $x$: it equals $x$ if $x > 0$ and 0 otherwise. In the orthonormal case it is therefore evident why the lasso can estimate certain coefficients as exactly 0: any coefficient with $|\hat{\beta}_j| \le \lambda$ is set to 0.
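To make the orthonormal-case formulas concrete, here is a small sketch that applies both of them to the same made-up OLS estimates. The function names `ridge_orthonormal` and `lasso_orthonormal`, the example values, and `lam = 1.0` are assumptions chosen purely for illustration.

```python
import numpy as np

def ridge_orthonormal(beta_ols, lam):
    """Ridge estimate under an orthonormal design: proportional shrinkage."""
    return beta_ols / (1.0 + lam)

def lasso_orthonormal(beta_ols, lam):
    """Lasso estimate under an orthonormal design: soft-thresholding."""
    # (|beta| - lam)_+ is the positive part, so small coefficients become 0.
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

# Hypothetical OLS estimates; two of them are smaller than lam in magnitude.
beta_ols = np.array([3.0, 0.5, -1.5, -0.2])
lam = 1.0

beta_ridge = ridge_orthonormal(beta_ols, lam)
beta_lasso = lasso_orthonormal(beta_ols, lam)

print("ridge:", beta_ridge)
print("lasso:", beta_lasso)
print("exact zeros (ridge):", int(np.sum(beta_ridge == 0)))
print("exact zeros (lasso):", int(np.sum(beta_lasso == 0)))
```

Ridge rescales every coefficient by the same factor, so a nonzero input stays nonzero. The soft-threshold subtracts a fixed amount and clips at 0, so every coefficient with magnitude at most `lam` is removed entirely.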

I'm simply quoting from ESL here. Now, let me make a more general point.

Solution 2:

The difference comes from the gradients of the two penalty terms.

For L1 regularization, the penalty $|w|$ has a gradient of magnitude 1 (that is, $\pm 1$ depending on the sign of $w$), as shown on the right side of the image, regardless of the magnitude of $w$. Each update therefore moves $w$ toward 0 by a constant amount, so $|w|$ shrinks rapidly to 0.

For L2 regularization, the gradient of the penalty $w^2$ is $2w$. As the weight $w$ approaches 0, this gradient also approaches 0, so each update shrinks $w$ by less and less, and reaching exactly 0 becomes increasingly difficult.
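The gradient argument can be sketched by applying repeated updates to a single weight using only the penalty term (no data loss). This is a simplified illustration under stated assumptions: the names `l1_step` and `l2_step`, the starting weight, and the step size `lr = 0.125` are made up for the example. The L1 update is also clipped at 0 so that it cannot overshoot and flip sign; that clipping is a soft-threshold style choice made here and is not something the answer above specifies.

```python
import numpy as np

def l2_step(w, lr):
    # The gradient of w**2 is 2*w, so the step size is proportional to w.
    return w - lr * 2.0 * w

def l1_step(w, lr):
    # The gradient of |w| has magnitude 1, so the step size is constant.
    # Assumption: clip at 0 so the update cannot jump past zero.
    return np.sign(w) * max(abs(w) - lr, 0.0)

lr = 0.125
w_l1 = 1.0
w_l2 = 1.0
l1_steps_to_zero = None

for step in range(1, 21):
    w_l1 = l1_step(w_l1, lr)
    w_l2 = l2_step(w_l2, lr)
    if l1_steps_to_zero is None and w_l1 == 0.0:
        l1_steps_to_zero = step

print("L1 weight after 20 steps:", w_l1)
print("L2 weight after 20 steps:", w_l2)
print("L1 reached exactly 0 at step:", l1_steps_to_zero)
```

With these settings the L1 weight loses a fixed 0.125 per step and lands on exactly 0 after a finite number of steps. The L2 weight is multiplied by 0.75 at each step, so it decays geometrically and stays positive.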