cat articles/ensemble-linear

Finding optimal weighted-ensemble coefficients with constrained least squares

created 2022-06-21

In the Kaggle competition U.S. Patent Phrase to Phrase Matching, I created about 20 ensemble candidates near the final stage by combining several pretrained models and multiple promising parameter settings. We then tried different ways to mix those models. The final submissions were a 6-model ensemble and a 9-model ensemble.

At that point, asking a human to choose the optimal ensemble weights from model performance and intuition is not very realistic. Of course, a domain expert may sometimes make better choices, but I wanted the machine to calculate the weights automatically. Changing the mixing ratio to minimize MSE is just a linear combination, so I thought the optimum should be computable.

A Simple Example

Suppose we have data like this. y is the true value, and X contains the predictions from each ensemble model.

import numpy as np
y = np.array([0.5, 0.75, 0.25, 1.0, 0.5])
X = np.array([
    [0.52, 0.9, 0.41, 0.99, 0.51],
    [0.52, 0.7, 0.41, 0.99, 0.51],
    [0.48, 0.73, 0.12, 0.97, 0.47],
    [0.45, 0.35, 0.25, 0.9, 0.49],
])

First, look at the MSE for each row of X.

np.square(X - y).mean(axis=1)
=> array([0.00974, 0.00574, 0.0039 , 0.03452])

If we simply average the predictions, the MSE is below. It is worse than the best single model.

np.square(X - y).mean(axis=0).mean(axis=0)
=> 0.013475

Least Squares

One way to find optimal coefficients is least squares, or linear regression. Let's try it.

from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X.T, y)
reg.coef_
=> array([ 0.43575566, -0.05397578,  0.46076883,  0.21063718])

Ignoring the negative coefficient for a moment, using these parameters as weights gives the least-squares prediction.

X.T @ reg.coef_
=>array([0.51448131, 0.76448131, 0.26448131, 1.01448131, 0.51448131])

The values look close to the target, so calculate the MSE. It is good.

np.square(X.T @ reg.coef_ - y).mean(axis=0)
=> 0.00020970822203200185

But should an ensemble model really receive a negative coefficient? I want the coefficients to be positive.

Solving with Constrained Least Squares

So we can treat this as a least-squares problem with the constraint that coefficients must be non-negative. If there is an implementation of a constrained least-squares solver, this should be easy. Searching for restricted least square method scipy led me directly to one in SciPy. scipy.optimize.nnls solves non-negative least squares, so I used it.

weights, rnorm = scipy.optimize.nnls(X.T, y)
weights
=> array([0.29260857, 0.08404164, 0.52487508, 0.12761238])

Now we have positive coefficients. Use them as ensemble weights.

X.T @ weights
=> array([0.50522372, 0.75      , 0.24931469, 0.99686367, 0.50131296])
np.square(X.T @ weights - y).mean(axis=0)
=> 7.863453999510499e-06

This gives a minimum MSE with positive coefficients. For simple linear blending of ensemble results like this, solving with scipy.optimize.nnls looks easy and useful.

The ensemble predictions calculated with these weights correlated well between CV and both public and private LB. Being able to quickly compute the optimal blending ratio without hand-tuning weights was helpful in the final stage of a Kaggle competition, when time was short and everyone was mentally tired. This example uses simple linear-combination weights. There are many other ways to weight ensembles, so the implementation of "optimal weighting" will depend on the purpose.

After the competition ended, I learned that this method is a way of solving the linear problem in what is called stacking. I also tried optimizing ensembles by context with LightGBM, or GBDT. It optimized CV but did not work well for LB score. According to this discussion, neural networks can overfit easily, so GBDT is not well suited here. That makes sense.

People who know linear algebra can probably derive these coefficients naturally. In my case, I had recently read Optimization Mathematics Starting from Vectors and Matrices, and that was the first time I understood, at least a little, what linearity means. If I had not read that book, I probably would not have thought of keywords such as "constrained non-negative least squares". Fundamentals matter.