
Solo silver medal, 43rd place, in Kaggle Feedback Prize - Predicting Effective Arguments

created 2022-08-24

I joined the Kaggle competition Feedback Prize - Predicting Effective Arguments solo and finished 43rd out of 1,566 teams, earning a silver medal. This was still a provisional ranking at the time, so the final rank could change slightly.

In my previous competition, my first Kaggle competition, I was blessed with a team and happened to win a gold medal. Through that experience I learned how fun Kaggle can be, and I wanted to join another competition. I also thought I would prefer a team, because staying motivated alone seemed hard. This time, however, I started without a team, or more precisely, I did not know many people and could not form one. The previous competition also taught me that team participation can sometimes lead to a gold medal even without enough individual skill, as in my case, so I wanted to see what result I could leave as a solo participant.

At first I worried that my motivation would not last. But my first baseline was already in the silver-medal range, so I started with the feeling that maybe the gold range was not impossible, and that kept me motivated through the end. In the end, I was nowhere near the gold range, and I recognized both my lack of skill and the narrowness of my toolbox. Still, solo participation gave me its own lessons. Unlike on a team, I had to write every piece of code myself, understand the intent of that code, and try a wide range of methods considered good in similar competitions. That broadened my knowledge. On the other hand, I did not get the sense of unity that comes from team participation, nor the unexpected knowledge that teammates can bring. Both styles have tradeoffs.

The PPPM competition I joined previously also used Transformer encoder models for NLP, and this competition called for similar models. A good part of the knowledge from the previous competition carried over. I think that was one reason I could stay motivated as a solo participant: I was not starting from a place where I knew nothing at all.

Near the end, before the team-merge deadline, several teams invited me to merge. That gave me a real sense that these things happen when you participate solo. By that point, leaving a result as a solo participant had also become one of my goals, so although I appreciated the invitations, I declined them this time.

What Kind of Competition Was It?

This competition was a variant of the earlier Feedback Prize - Evaluating Student Writing. The task was a classification problem: given part of an essay written by a U.S. student, predict whether it was Ineffective, Adequate, or Effective.

The actual data had about 37,000 rows. Each essay_id had a separate long essay text associated with it, and discourse_text was one part of that essay. Many public baselines created inputs like discourse_type + discourse_text + [SEP] + essay_text and trained on them. The problem with this approach is that essay_text appears repeatedly, causing overfitting very quickly. It also seemed that rows from the same essay were written by the same student, so each essay had its own tendency across Ineffective, Adequate, and Effective. That tendency seemed useful, because a student who writes a good essay probably writes generally good discourse segments.
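The row-based input described above can be sketched as follows. This is a minimal illustration with toy data; the helper name build_row_input and the example rows are assumptions, not code from any particular public baseline. It shows how the same essay_text ends up in every input built from that essay.

```python
def build_row_input(discourse_type, discourse_text, essay_text, sep_token="[SEP]"):
    """Join one discourse row with its full essay, as many public baselines did."""
    return f"{discourse_type} {discourse_text} {sep_token} {essay_text}"


# Toy data, not taken from the competition files.
essays = {"e1": "Cars pollute. Buses are cheaper. So ban cars."}
rows = [
    {"essay_id": "e1", "discourse_type": "Lead", "discourse_text": "Cars pollute."},
    {"essay_id": "e1", "discourse_type": "Claim", "discourse_text": "Buses are cheaper."},
]

inputs = [
    build_row_input(r["discourse_type"], r["discourse_text"], essays[r["essay_id"]])
    for r in rows
]
for text in inputs:
    print(text)  # the full essay appears once per row
```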

Solution Notes

This was not a gold-medal solution and may not be broadly useful, but these are the methods I tried, what worked, what improved training efficiency, and what did not work well for me.

Look at essays, not rows

Instead of treating the problem row by row, I looked at it by essay_id. There are about 37,000 rows but only about 4,200 essays, roughly one ninth as many, so each epoch needs far fewer forward passes and training is faster. Since each target discourse_text is contained in the essay, the task can be seen as classifying specific spans that appear within one text. In that form, a model similar to those used for NER can classify the spans.

For example, I created essay text with special tokens [TAR_START] and [TAR_END] around target spans, like this, and turned the region between those tokens into the classification target.

[TAR_START]Lead so you want to take all the cars out of the city ok cool[TAR_END]. [TAR_START]Claim this will save the pepol like 1,000,000 dolers a year[TAR_END] and will [TAR_START]Claim reduce polution[TAR_END] and stuf i dont know. [TAR_START]Counterclaim i gess its a good idea but on the other hand nah i mean lookif pepal want to ruin the world with gas thats there choce man[TAR_END]. but i just relised that the thing seid to agrewith the pasige or whatever so yah. ummmmmmm i dont know [TAR_START]Position the eirth is cool so why distroit with gasis or something[TAR_END]. look to be honist...
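A marked essay like the one above could be built roughly as follows. This is a sketch under assumptions: the helper mark_essay is hypothetical, it expects the discourse segments in essay order, and it relies on exact substring matching, which real data may not always allow.

```python
TAR_START, TAR_END = "[TAR_START]", "[TAR_END]"


def mark_essay(essay_text, discourses):
    """Wrap each (discourse_type, discourse_text) span found in the essay.

    Returns the marked text and the indices of the discourses that were found,
    so labels can be aligned with the markers afterwards.
    """
    pieces, kept, cursor = [], [], 0
    for index, (dtype, dtext) in enumerate(discourses):
        start = essay_text.find(dtext, cursor)
        if start == -1:
            continue  # not found verbatim; real data would need fuzzier matching
        pieces.append(essay_text[cursor:start])
        pieces.append(f"{TAR_START}{dtype} {dtext}{TAR_END}")
        cursor = start + len(dtext)
        kept.append(index)
    pieces.append(essay_text[cursor:])
    return "".join(pieces), kept


essay = ("so you want to take all the cars out of the city ok cool. "
         "this will save money and will reduce polution and stuf.")
discourses = [
    ("Lead", "so you want to take all the cars out of the city ok cool"),
    ("Claim", "this will save money"),
    ("Claim", "reduce polution"),
]
marked, kept = mark_essay(essay, discourses)
print(marked)
```

With Hugging Face Transformers, the two markers would typically be registered through tokenizer.add_special_tokens({"additional_special_tokens": ["[TAR_START]", "[TAR_END]"]}) followed by model.resize_token_embeddings(len(tokenizer)), so that each marker maps to a single token.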

For the classification representation, I tried mean pooling over the tokens between [TAR_START] and [TAR_END], using only the [TAR_START] and [TAR_END] tokens, using only [TAR_START], and other variants. Using only [TAR_START] worked best. I also tried dropping the special tokens, using only CLS and SEP, removing [TAR_END], replacing it with SEP, and so on. The custom [TAR_START] and [TAR_END] special tokens worked better than these alternatives.
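The variant that uses only [TAR_START] can be sketched in PyTorch as below. The token id, tensor sizes, and the linear head are assumptions for illustration; in a real model the hidden states would come from the encoder's last layer.

```python
import torch
from torch import nn


def pool_at_marker(hidden_states, input_ids, marker_id):
    """Pick the hidden state at every marker position.

    hidden_states: (batch, seq_len, hidden), input_ids: (batch, seq_len).
    Returns a (num_markers, hidden) tensor, ordered by batch then position.
    """
    mask = input_ids == marker_id
    return hidden_states[mask]


torch.manual_seed(0)
TAR_START_ID = 7  # hypothetical id assigned to [TAR_START] by the tokenizer
input_ids = torch.tensor([[1, 7, 5, 6, 8, 7, 9, 8, 2]])  # two marked spans
hidden = torch.randn(1, 9, 4)  # stand-in for encoder output

head = nn.Linear(4, 3)  # 3 classes: Ineffective, Adequate, Effective
span_vectors = pool_at_marker(hidden, input_ids, TAR_START_ID)
logits = head(span_vectors)  # one row of logits per marked span
print(logits.shape)
```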

The tokenizer max_length was 1024. Even at that length, some essays overflowed, so I added some ad hoc processing to pack the discourse_text appropriately.

In the end, I trained the following models and ensembled them:
- A 4-fold deberta-v3-large model: CV 0.5911, LB 0.585, private LB 0.586.
- A deberta-large model: CV 0.6034, LB 0.595, private LB 0.597.
- Earlier row-based models such as deberta-v3-large: CV 0.6179, LB 0.608, private LB 0.604.

The ensemble scored LB 0.582 and private LB 0.583.
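The write-up does not say how the models were combined, so the following is only a sketch of one common choice: a weighted average of class probabilities, scored with multi-class log loss (the competition metric, where lower is better). The function names and the toy probabilities are assumptions.

```python
import math


def average_probs(model_probs, weights=None):
    """Blend per-model class probabilities with a weighted average."""
    n_models = len(model_probs)
    weights = weights or [1.0 / n_models] * n_models
    n_rows, n_classes = len(model_probs[0]), len(model_probs[0][0])
    blended = []
    for i in range(n_rows):
        row = [sum(w * p[i][c] for w, p in zip(weights, model_probs))
               for c in range(n_classes)]
        total = sum(row)
        blended.append([v / total for v in row])  # renormalize to sum to 1
    return blended


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(max(p[y], eps)) for y, p in zip(y_true, probs)) / len(y_true)


# Toy predictions for two rows and three classes, not real competition output.
model_a = [[0.7, 0.2, 0.1], [0.2, 0.5, 0.3]]
model_b = [[0.5, 0.4, 0.1], [0.1, 0.7, 0.2]]
y_true = [0, 1]

blended = average_probs([model_a, model_b])
print(log_loss(y_true, model_a), log_loss(y_true, model_b), log_loss(y_true, blended))
```

Because the negative log is convex, an equal-weight blend never scores worse than the average of the individual log losses, which is one reason simple averaging is a reasonable starting point.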

Methods that worked

AWP, or Adversarial Weight Perturbation
- Improved the score by about 0.05 to 0.1.
- AWP has perturbation width and range parameters. Increasing the perturbation width by epoch improved the score a little. If I had more time, I would have liked to try changing it with a scheduler.
Lowercasing and removing symbols in text processing
- Improved the score by about 0.03.
Back-translation augmentation
- I ran retranslated data through the model and used only essays where the model's accuracy exceeded 90%.
- Improved the score by about 0.03.
- Using only retranslated text caused overfitting. Training progressed stably when I also used data in which about 20% of the tokens were replaced with [MASK].
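The AWP item above can be sketched as follows. This is a simplified version of the pattern commonly shared in Kaggle notebooks, not the exact code used here; the class, the parameter names adv_lr (perturbation width) and adv_eps (perturbation range), and the tiny linear model are all assumptions for illustration.

```python
import torch
from torch import nn


class AWP:
    """Simplified Adversarial Weight Perturbation helper (illustrative only)."""

    def __init__(self, model, adv_param="weight", adv_lr=1.0, adv_eps=0.01):
        self.model = model
        self.adv_param = adv_param  # perturb only parameters whose name contains this
        self.adv_lr = adv_lr        # hypothetical name for the perturbation width
        self.adv_eps = adv_eps      # hypothetical name for the perturbation range
        self.backup = {}

    def perturb(self):
        """Call after loss.backward(): push weights along their gradient."""
        for name, param in self.model.named_parameters():
            if not (param.requires_grad and param.grad is not None
                    and self.adv_param in name):
                continue
            grad_norm = param.grad.norm()
            if grad_norm == 0 or torch.isnan(grad_norm):
                continue
            original = param.data.clone()
            self.backup[name] = original
            step = self.adv_lr * param.grad / (grad_norm + 1e-6) * (original.norm() + 1e-6)
            bound = self.adv_eps * original.abs()
            # Keep each perturbed weight within +/- adv_eps * |w| of its original value.
            param.data = torch.min(torch.max(original + step, original - bound),
                                   original + bound)

    def restore(self):
        """Put the original weights back before the optimizer step."""
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}


def train_step(model, awp, optimizer, loss_fn, x, y):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                  # gradients at the original weights
    awp.perturb()                    # move to the adversarial weights
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()  # gradients at the perturbed weights
    awp.restore()                    # back to the original weights
    optimizer.step()                 # update them with the adversarial gradients
    return loss.item()


torch.manual_seed(0)
model = nn.Linear(4, 3)
awp = AWP(model, adv_lr=1.0, adv_eps=0.01)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randint(0, 3, (8,))
weight_before = model.weight.detach().clone()
loss_value = train_step(model, awp, optimizer, nn.CrossEntropyLoss(), x, y)
```

Increasing the perturbation width by epoch would then amount to assigning a larger value to awp.adv_lr at the start of each epoch, for example from a small schedule function.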

Methods that did not work well

This only means I could not make them work. Someone with different knowledge might well make them effective.

LSTM or Bi-LSTM before the final output.
Pseudo-labeling using the previous Feedback Prize - Evaluating Student Writing data.
- This still seems useful if done properly.
- https://www.kaggle.com/code/rolianklay/pseudo-labeling-how-to-get-pseudo-labels
Adding class weights to torch.nn.CrossEntropyLoss.
- The data was imbalanced, so I tried weights, but it did not work.
Removing essays with very few discourse_id values.
- For example, this removes essays that have only one discourse_id.
deberta-xlarge and deberta-xxlarge.
- Training did not progress.
allenai/longformer-large-4096.
- Gradients became NaN and training did not progress.
- allenai/longformer-base-4096 trained, but the score was poor.
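For reference, the class-weight attempt from the list above would look roughly like this. The write-up does not say how the weights were chosen, so inverse-frequency weighting is an assumption here, and the class counts are toy numbers rather than the real label distribution.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy class counts in the order Ineffective, Adequate, Effective (not real data).
class_counts = torch.tensor([100.0, 300.0, 200.0])

# Inverse-frequency weights: rarer classes contribute more to the loss.
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

loss_fn = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(4, 3)            # stand-in for model output
labels = torch.tensor([0, 1, 2, 1])
loss = loss_fn(logits, labels)
print(class_weights, loss.item())
```

With the default mean reduction, PyTorch divides the weighted sum of per-example losses by the sum of the weights of the target classes, so the loss scale stays comparable to the unweighted case.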

Training-efficiency methods

The text was fairly long in this competition, so training speed and memory efficiency were important. The article "Optimization approaches for Transformers" summarizes these techniques. That article has the details, but here is a rough summary.

8-bit optimizers
- Keep the optimizer state in 8 bits instead of the usual 32 bits to save memory.
- Specifically, bitsandbytes can replace AdamW and worked smoothly.
- I did not notice a score drop, and memory usage really decreased.
- Replacing torch.nn.Embedding with bnb.nn.StableEmbedding did not work smoothly for me. If I had replaced it well, the result might have improved further.
Gradient checkpointing
- Discards most intermediate activations during the forward pass instead of keeping them all in memory. The discarded activations are recomputed when the backward pass needs them, so training is slower.
- With Transformers, it can be enabled with model.gradient_checkpointing_enable().
- The score barely changed. Training slowed down, but memory usage decreased greatly.
Automatic mixed precision, or AMP
- Runs the operations that are safe in reduced precision in fp16 instead of fp32.
- The important point is to use it with GradScaler, which scales the loss so that small gradients do not underflow to zero in fp16.
Gradient accumulation
- When memory is limited, the batch size becomes small. By accumulating gradients over several small batches before each optimizer step, you can train much as you would with a larger batch size.
Freezing
- Layers near the input are often given lower learning rates, and sometimes not training them at all gives better results.
- So those layers can be frozen.
Fast tokenizers
- Use Transformers tokenizers written in Rust.
- Recently, Rust implementations are used by default when available, so there is often nothing special to do.
Uniform dynamic padding
- With dynamic padding, each batch is padded to the longest token length in that batch. If examples are sorted by token length beforehand, a single unusually long example is less likely to inflate the padding of a whole batch.
- It is hard to use during training because you usually want random ordering, but it can speed up inference.
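Several of the items above (AMP with GradScaler, gradient accumulation, and freezing) can be combined in one training loop, sketched below. The tiny model, the random batches, and the value of accum_steps are assumptions for illustration; fp16 autocast is only enabled when a CUDA GPU is available, so on CPU the loop runs in plain fp32.

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # this sketch uses fp16 autocast only on GPU
amp_dtype = torch.float16 if use_amp else torch.bfloat16

torch.manual_seed(0)
# Tiny stand-in for a Transformer: embedding (input side) plus a classifier head.
model = nn.Sequential(nn.Embedding(100, 16), nn.Flatten(), nn.Linear(16 * 8, 3)).to(device)

# Freezing: stop training the layer nearest the input.
for p in model[0].parameters():
    p.requires_grad = False
embedding_before = model[0].weight.detach().clone()

optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # no-op when AMP is disabled
loss_fn = nn.CrossEntropyLoss()
accum_steps = 4  # 4 micro-batches of 2 behave like one batch of 8

# Random toy batches: (token ids, labels).
batches = [(torch.randint(0, 100, (2, 8)), torch.randint(0, 3, (2,))) for _ in range(8)]

optimizer_steps = 0
optimizer.zero_grad()
for step, (x, y) in enumerate(batches, start=1):
    x, y = x.to(device), y.to(device)
    with torch.autocast(device_type=device, dtype=amp_dtype, enabled=use_amp):
        # Divide so the accumulated gradient matches the mean over the large batch.
        loss = loss_fn(model(x), y) / accum_steps
    scaler.scale(loss).backward()  # gradients accumulate across micro-batches
    if step % accum_steps == 0:
        scaler.step(optimizer)     # unscales gradients, then steps the optimizer
        scaler.update()
        optimizer.zero_grad()
        optimizer_steps += 1

print(optimizer_steps)
```

The 8-bit optimizer would slot into the same loop by swapping torch.optim.AdamW for bnb.optim.AdamW8bit from bitsandbytes; it is left out of the sketch because it needs a CUDA GPU.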

After the Competition

At first I thought maybe I had a chance at the gold range. In the end, I was in the middle of the silver range and nowhere near gold. That helped me reconfirm both my current position and my lack of skill. Small incremental improvements were not enough to reach a gold-medal score; a more drastic improvement would have been needed. I am looking forward to reading the top solutions.

Even though this was an NLP competition similar to the previous one, using a Transformer encoder model, I finished in the middle of the silver range. For Kaggle competitions on other topics, whether I can win a medal at all is still uncertain.

Trying it individually instead of as a team was good. I was able to leave some result on my own, stay motivated by myself, and enjoy the work. One more silver medal would make me a Kaggle Competitions Master, so I would like to keep joining competitions that interest me, whether as a team or solo. Most competitions are probably tasks I have never done before, so even if I do not win a medal, I expect to gain a lot of knowledge from any of them.