cat articles/kaggle-fp3-comp

Kaggle Feedback Prize - English Language Learning: team gold medal, 15th place, and Kaggle Master

The Kaggle competition Feedback Prize - English Language Learning has ended, and our team won a gold medal, finishing 15th out of about 2,650 teams. I now have two gold medals and one silver medal, which meets the requirements for Kaggle Competitions Master, the title I had aimed for when I first started competing. I do not think I could have reached this result alone, so I am grateful to my teammates @masakiaota and @olivineryo.

At the end of the competition we were 8th on the public LB among about 2,700 teams, inside the gold-medal range. However, the public LB was computed on only 26% of the test data, and scores were displayed only to two decimal places, so the ranking was coarse. In addition, some submissions that were quite bad on our CV ranked oddly high on the public LB, so we could not observe a reliable CV-LB correlation. Because of that, we did not trust the public LB and expected a large shake-up in the final private LB ranking.

For the final submission, we selected three versions: the best CV model, the best LB model, and the best CV model without pseudo labels. Pseudo labels made CV overwhelmingly better, but we were concerned that they over-optimized CV or leaked information. This competition had another prize category, so we had three submission slots and chose those three. If only two submissions had been allowed, the choice would have been very painful.

The result was that although we were 8th on public LB, we dropped to 17th on private LB and unfortunately missed the gold medal range, which was 15th or higher at the time. Missing gold by two places was extremely frustrating. Still, some teams dropped tens or hundreds of ranks, and the public LB first-place team dropped 175 places, so a nine-place drop could have been much worse. It reconfirmed how hard it is to win gold.

15th place

Update: after invalid users were removed and the ranking was finalized, we moved up to 15th and received a gold medal. I am very happy.


Competition Task

This was the third competition in the Feedback Prize, or FP, series, which evaluates English written by students. The task was to predict scores, in 0.5-point increments from 1.0 to 5.0, for six columns such as grammar, vocabulary, and syntax. The evaluation metric was MCRMSE, mean columnwise root mean squared error: the RMSE is computed for each column, and the six values are averaged. Since data from the previous FP1 and FP2 competitions existed, we tried various approaches while thinking about whether that data could be used for pseudo labeling and how to optimize the six evaluation axes.
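
As a minimal sketch of the metric described above, MCRMSE can be computed with NumPy as follows. The function name and the toy arrays are only illustrative, and two columns are used instead of six to keep the example short.

```python
import numpy as np

def mcrmse(y_true, y_pred):
    """Mean columnwise RMSE: RMSE per target column, then the mean over columns."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # axis=0 averages over essays, leaving one RMSE per column
    per_column_rmse = np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))
    return float(per_column_rmse.mean())

# Toy data (made up): 3 essays, 2 score columns
y_true = [[3.0, 2.5], [4.0, 3.0], [2.0, 3.5]]
y_pred = [[3.0, 3.0], [4.0, 3.0], [2.0, 3.0]]
score = mcrmse(y_true, y_pred)
```

Because every column contributes equally to the mean, a column that is hard to predict cannot be offset by simply having more samples. This is one reason it can pay off to treat the columns separately.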

Solution

We submitted three ensemble models combining the ideas below. The best private LB result came from an ensemble of 11 models that did not use pseudo labels.

Backbone model selection

On the public LB, deberta-v3-base scored better than deberta-v3-large, even though deberta-v3-large usually generalizes better. I accepted that result and trained and optimized around it. I thought that for this competition, v3-base, with 12 layers, might score better than the more complex v3-large, with 24 layers. Since the essays were written by students and were not complex prose, perhaps the simpler model was better.

In the final private LB, however, v3-large was indeed strong. We should not have trusted LB and should have built our strategy around CV and v3-large.

  • deberta-v3-base
    • Overwhelmingly strong on this competition's public LB.
  • deberta-v1-large
    • Somewhat effective on both CV and LB.
  • deberta-v3-large
    • Effective on CV, but mixing it in caused public LB to drop considerably.
    • On private LB, that was not the case.

Our team's own approaches

These were things not discussed publicly, or at least not widely adopted as "this is effective, let's use it". Most of these ideas came from my teammates, and I did not contribute much to them. I am grateful to have had excellent teammates.

  • Attention pooling separately for the six columns
    • Improved the score substantially compared with mean pooling.
  • Applying LayerNorm after pooling
    • This also improved the score substantially.
  • maxlen: 640
    • Truncating to 640 improved scores compared with longer maxlen.
    • For deberta-v3-large only, using a longer maxlen and training with splits using Sliding Window Attention gave better scores.
  • Pseudo labeling
    • Very effective for CV, but not very effective for LB.
    • Using pseudo labels with leaked information improved CV even more, so we carefully removed, or tried to remove, information leaks.
    • Pseudo-label data used past FP1 and FP2 competition data.
    • Pseudo-label scores were created by ensembling predictions from trained models.
    • A single-model pseudo label was learned too quickly, but using ensembled pseudo labels kept improving CV.
    • A single model trained with pseudo labels, not the strongest on CV but moderately improved, was best on private LB. It hurt that we did not choose it for final submission.
  • Post-processing
    • Clamp values below 1.0 to 1.0 and values above 5.0 to 5.0. This slightly improved the score.
    • When ensembling, find the optimal weights separately for each of the six columns by solving an optimization problem that minimizes the CV score. For example, with four ensemble members, solve for and apply 4 x 6 = 24 weights. This slightly improved the score.
    • We used scipy.optimize.minimize to define and solve the optimization problem. I had not known that SciPy could solve it so easily; it was convenient.
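
The per-column ensemble weighting and the clamping described in the post-processing items can be sketched as below. This is not our competition code. The solver (L-BFGS-B), the 0 to 1 bounds, the uniform starting point, the helper names, and the synthetic out-of-fold predictions are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def mcrmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0)).mean())

def blend(weights, oof_preds):
    """weights: (n_models, n_cols); oof_preds: (n_models, n_samples, n_cols)."""
    blended = np.einsum("mc,msc->sc", weights, oof_preds)
    return np.clip(blended, 1.0, 5.0)  # clamp to the valid score range

def fit_column_weights(oof_preds, y_true):
    n_models, _, n_cols = oof_preds.shape
    x0 = np.full(n_models * n_cols, 1.0 / n_models)  # start from a plain average

    def objective(flat_w):
        return mcrmse(y_true, blend(flat_w.reshape(n_models, n_cols), oof_preds))

    # Assumption: bounded weights and L-BFGS-B; other solvers/constraints also work
    result = minimize(objective, x0, method="L-BFGS-B",
                      bounds=[(0.0, 1.0)] * x0.size)
    return result.x.reshape(n_models, n_cols)

# Synthetic stand-in for out-of-fold predictions: 4 models, 200 essays, 6 columns
rng = np.random.default_rng(0)
y_true = np.round(rng.uniform(1.0, 5.0, size=(200, 6)) * 2) / 2
noise_levels = [0.3, 0.4, 0.5, 0.6]  # hypothetical per-model error levels
oof_preds = np.stack([y_true + rng.normal(0.0, s, size=y_true.shape)
                      for s in noise_levels])

weights = fit_column_weights(oof_preds, y_true)  # 4 x 6 = 24 weights
average_score = mcrmse(y_true, blend(np.full((4, 6), 0.25), oof_preds))
tuned_score = mcrmse(y_true, blend(weights, oof_preds))
```

Since the weights are fitted on the same out-of-fold predictions that define CV, the improvement over a plain average is optimistic to some degree. That fits with the gain being small in practice.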

Approaches discussed publicly

  • Layer reinitialization
    • Reinitialize the final n layers of the pretrained model. Reinitializing only the final layer worked best.
    • This helped a lot.
  • LLRD, or Layer-wise Learning Rate Decay
    • Gradually decrease the learning rate by layer.
    • We multiplied the LR by a decay factor of 0.8 per layer for large models and 0.7225 per layer for base models.
    • This helped a lot.
  • Layer freezing
    • Do not train the first n layers. When LLRD is applied, the early layers of a 24-layer large model are barely trained anyway, so freezing the first 12 layers of large models sped up training.
  • AWP
  • Multi-sample dropout
    • The score barely changed, but training became more stable, so we used it.
    • We used five dropout samples with a dropout rate of 0.2.
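
To make the interaction between LLRD and layer freezing concrete, here is a small sketch that only computes the per-layer learning rates. The base learning rate of 2e-5 and the convention that the top encoder layer receives the undecayed rate are assumptions; the decay factors and the 12 frozen layers come from the list above.

```python
def layerwise_learning_rates(num_layers, base_lr, decay, num_frozen=0):
    """Return one learning rate per encoder layer, index 0 = closest to the input.

    The top layer gets base_lr; every layer below it is multiplied by `decay`
    one more time. Layers with index < num_frozen get 0.0 (not trained).
    """
    rates = []
    for layer in range(num_layers):
        depth_from_top = num_layers - 1 - layer
        lr = 0.0 if layer < num_frozen else base_lr * decay ** depth_from_top
        rates.append(lr)
    return rates

# base_lr=2e-5 is a hypothetical value chosen for illustration
large = layerwise_learning_rates(24, base_lr=2e-5, decay=0.8, num_frozen=12)
base = layerwise_learning_rates(12, base_lr=2e-5, decay=0.7225)

for index, lr in enumerate(large):
    print(f"large layer {index:2d}: lr = {lr:.3e}")
```

Without freezing, the bottom layer of a 24-layer model with a decay of 0.8 would train at 0.8^23 of the top-layer rate, roughly 0.6%. Skipping those layers therefore costs little and saves computation.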

Useful to learn, even though it did not improve our score

RAPIDS SVR

In the RAPIDS SVR approach, neural networks are used only to extract feature embeddings, and an SVR, Support Vector Regression, model is then trained on those embeddings on CUDA using cuML. SVR training itself finishes instantly or within a few seconds, depending on the GPU. I doubted that such a method could work, but it produced a better score than my early baseline. Some top solutions also seem to have used RAPIDS SVR, so it is quite possible that I simply failed to use it effectively to improve our score.

SVR was useful not only because it quickly produced reasonably good scores, but also because it correlated to some extent with the performance of NN backbone models. NN models that scored poorly with SVR often also scored poorly after full NN training. That made SVR useful as a guide when choosing backbone models, especially near the end when we had to consider many models for the final ensemble.

I also had not known about NVIDIA's cuML library itself, so learning about it was valuable. cuML can run basic machine learning algorithms similar to those in scikit-learn on CUDA, and some algorithms become extremely fast. Its interface is usually sklearn-compatible, so it is easy to understand. In the future, if a machine learning task takes minutes or more on CPU, I would like to consider using cuML.
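
A rough sketch of the SVR-on-embeddings idea follows. It assumes one SVR per target column, uses random arrays in place of real backbone embeddings, and falls back to scikit-learn when cuML is not installed. The helper names and hyperparameters are illustrative, not the settings we used.

```python
import numpy as np

try:
    from cuml.svm import SVR  # GPU implementation; needs RAPIDS and a CUDA device
except ImportError:
    from sklearn.svm import SVR  # CPU fallback with the same fit/predict interface

def fit_svr_per_column(embeddings, targets, C=1.0):
    """Fit one SVR per target column on fixed (not fine-tuned) embeddings."""
    models = []
    for col in range(targets.shape[1]):
        model = SVR(C=C)
        model.fit(embeddings, targets[:, col])
        models.append(model)
    return models

def predict_columns(models, embeddings):
    return np.column_stack([np.asarray(m.predict(embeddings)) for m in models])

# Random stand-ins for embeddings extracted from a pretrained backbone
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(120, 16)).astype(np.float32)
targets = (3.0 + embeddings @ rng.normal(scale=0.2, size=(16, 6))).astype(np.float32)

models = fit_svr_per_column(embeddings, targets)
preds = predict_columns(models, embeddings)
```

In practice the embeddings would come from a forward pass of each candidate backbone, and the predictions would be scored out of fold rather than on the training rows as in this toy example.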


After the Competition

With this result, I should finally become a Kaggle Competitions Master, a title I had long hoped for. Winning a gold or silver medal in each of my first three medal competitions was possible largely because of my teammates. The one exception was the solo competition, and even there the earlier team experience helped. I am especially grateful to Aota, who invited me into Kaggle competitions.

That said, I felt this during my previous solo competition, and this competition made it clear again: I was not the person who came up with the idea that won gold. At my current level, I do not have the skill needed to win a solo gold medal, which is required to become a Kaggle Competitions Grandmaster. There is a large wall in front of me. If I know the approach for a problem domain, I feel I might be able to win solo silver after trying a few times, but I cannot yet imagine winning solo gold. People who have won solo gold medals are truly impressive.

So far, I have only joined competitions solvable with natural language processing Transformer encoder approaches. If I join another competition, rather than prioritizing medals by solving a similar task, I would like to try a task where my score may be weaker but I can gain different knowledge. I am still a beginner, and most machine learning algorithms and problem-solving approaches stimulate my curiosity. I would like to keep enjoying the process while broadening my knowledge.

cat related_articles/kaggle-fp3-comp.yaml

  1. Solo silver medal, 43rd place, in Kaggle Feedback Prize - Predicting Effective Arguments
     I joined Feedback Prize - Predicting Effective Arguments solo, finished 43rd out of 1,566 teams, and wrote down what worked, what failed, and what solo participation felt like.
  2. My first Kaggle competition ended with a team gold medal, 8th place
     I joined my first Kaggle competition through a strong team, learned how collaborative competition work is organized, and ended up with a gold medal in the U.S. Patent Phrase to Phrase Matching competition.
  3. RAPIDS SVR and SVC: fast training without fine-tuning, evaluated on MARC-ja
     An introduction to RAPIDS SVR and SVC, using neural-network embeddings as features without fine-tuning and evaluating the approach on the Japanese MARC-ja classification dataset.