Ridge regression, LASSO, and elastic net explained

glmnet is a R package for ridge regression, LASSO regression, and elastic net. The authors of the package, Trevor Hastie and Junyang Qian, have written a beautiful vignette accompanying the package to demonstrate how to use the package: here is the link to the version hosted on the homepage of T. Hastie (and an ealier version written in 2014).

Below are some background information one probably loves to know before using the package to solve any real-world problem. I summarized them from various resources; in case you have questions, critics, or comments, please do not hesitate to contact me.

Ridge regression

Compared with ordinary linear regression with the least-square method, ridge regression is more suited when some covariates are correlated, a situation known as collinearity. When collinearity exists, the least-square method will lead to huge variance of the estimates. Because ridge regression adds bias to the estimates, it can reduce the variance of the estimates, thanks to the so-called bias-variance tradeoff.

In essence, ridge regression applies a regularization which can be tuned by a non-negative parameter \(\gamma\): when \(\gamma\)=0, no regularization is applied, and the solution becomes identical to the least-square solution; otherwise, the regularization is applied, and consequently large regression coefficients are shrinked, with the desired outcome of reducing overfitting.

More details can be found on the Wikipedia page about ridge regression. The method has many different names because it was developed independently in many contexts. In statistics it is also known as Tikhonov regularization.

Lasso regression

LASSO, or Least Aboslute Shrinkage and Selection Operator, is another regression method that performs both variable selection and regularization. It selects only a subset of provided covariates in the final model rather than using all of them for regression. As a consequence, the resulting model has fewer variables and is easier to interpret, at the same time the model maintains a high accuracy of prediction.

Technically, LASSO achieves its goals by forcing the sum of the absolute value of the regression coefficients to be less than a fixed value, and therefore forces certain coefficients be set to zero. This process is also known as \(l1\)-norm regularization, in contrast to the \(l2\)-norm regularization normalization applied in ridge regression, which forces the sum of the squares of the coefficients to be less than a fixed value.

One limitation of LASSO when applied to computational biology questions is that when the number of covariates is larger than the same size (\(p>n\)), LASSO can only select \(n\) covariates, even when more are associated with the outcome, and it tends to select only one covariate from any set of highly correlated covariates.

To understand this limitation in a real setting, imagine a prediction problem using gene expression data to predict the toxicological outcome of a compound treatment. In case a group of genes that are highly correlated with each other, say genes involved in the process of DNA damage and repair, are associated with the outcome, LASSO will likely select only one gene from them. While this may be fine from a pure prediction point of view, the result of feature selection can be misleading if one wants to test which biological functions are enriched in the genes that are associated with the outcome, because while many genes associated with DNA damage and repair are associated, only one of them will be reported by LASSO.

Elastic net

In order to overcome the limitations of LASSO, Hastie and others developed the elastic net method.

Elastic net can be viewed as a hybrid of ridge and LASSO. It extends LASSO by adding a ridge regression-like penalty which improves performance when \(p>n\). The result is that strongly correlated covariates can be selected together.

The elastic-net penalty is controlled by a parameter \(\alpha\) between 0 and 1: it bridges the gap between LASSO (\(\alpha\)=1) and ridge (\(\alpha\)=0).

From the discussions in the previous sections, we understand that ridge penalty shrinks the coefficients of correlated predictors towards each other while LASSO tends to pick one of them and discard the others. The elastic-net penalty mixes these two: if predictors are correlated in groups, an \(\alpha\)=0.5 tends to select the groups in or out together.

Another advantage of using \(\alpha\) is for numerical stability; for example, the elastic net with \(\alpha=1−\epsilon\) for some small \(\epsilon\)>0 performs much like the LASSO, but removes any degeneracies and wild behavior caused by extreme correlations.

Conclusion

Ridge regression, LASSO, and elastic net can be very useful in the field of computational biology, especially when it comes to regression problems with many covariates that are potentially correlated with each other, and situations where feature selection is desired.