cat articles/ncd-classifier
Implementing and trying gzip + kNN text classification from the paper that beats BERT
The recently published paper “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors (Jiang et al., Findings 2023) says that it uses the length of data compressed with gzip and performs better than BERT on text classification tasks. That sounded interesting, so I implemented the method myself and tried it. As a result, for text category classification using the livedoor news corpus, it actually achieved a better result than Japanese BERT.
What kind of method is it?
What it does is simple. First, calculate NCD, or Normalized Compression Distance. The examples use gzip as the compression algorithm.
- Compress individual data
xandy, and call their compressed lengthsC(x)andC(y). - Concatenate the two pieces of data into
xy, compress it, and call the compressed lengthC(xy). - Calculate NCD:
NCD(x, y) = [C(xy) - min(C(x), C(y))] / max(C(x), C(y))
If the information is similar, NCD becomes small. "Hello world!" concatenated with "Hello!" should compress well and have a small NCD, while "Hello world!" and "Good!" should be farther apart.
Then sort the training data by this distance, take the top-k items, and use the most common category among those top-k items as the prediction. In other words, it decides from the k nearest training items under NCD distance, so it is kNN.
Implementation
The original paper's implementation is available at https://github.com/bazingagin/npc_gzip. The method should be simple, but the code was hard to use, so I implemented a version that can be used quickly with sklearn-like fit and predict interfaces.
Trying it on the livedoor news corpus
Using this implementation, let's classify categories in the livedoor news corpus. In this Japanese BERT implementation article, 1475 items, about 20% of the whole dataset, were used as test data, and the accuracy was 0.9261.
This time I split the dataset into train and test at 8:2, with 5894 train items and 1473 test items.
The accuracy was 0.9457, which appears to beat Japanese BERT. It is impressive for such a simple mechanism. You may wonder why I did not prepare validation data, but this method only calculates distance, or NCD, and does not pretrain anything, so validation data would not be meaningful here.
# Accuracy
0.9456890699253224
# Confusion matrix
[[150 0 0 0 1 1 0 0 0]
[ 0 166 2 2 3 0 0 0 2]
[ 0 1 164 0 0 2 0 0 0]
[ 0 0 1 156 0 0 5 0 0]
[ 5 1 3 2 103 5 0 3 4]
[ 5 0 1 1 5 148 0 4 3]
[ 0 0 0 5 1 0 182 0 0]
[ 1 0 1 0 3 7 0 151 0]
[ 0 0 0 0 0 0 0 0 173]]
Trying it on MARC-ja
Next, let's evaluate MARC-ja from the JGLUE dataset. MARC-ja has about 190,000 items with positive and negative labels, with roughly 90% positive and 10% negative. Japanese BERT reportedly gets accuracy 0.958. If everything were predicted as positive, accuracy would be around 0.9.
With NCD Classifier, the accuracy was 0.802. Very bad. At first I thought the implementation must be wrong and checked it repeatedly, but the dataset distribution is heavily skewed, and many texts are too short. Livedoor news articles, for example, have a reasonable length. It seems this method does not perform well under these conditions.
# Accuracy
0.8020870180403255
# Confusion matrix
[[4077 755]
[ 364 458]]
Trying it on AGNews
The paper reports a score of 0.937 on AGNews, but my implementation only reached about 0.898. I do not know the cause of this difference. It may come from implementation differences, the data used, or some difference in data processing. My implementation may be wrong, so please let me know if you notice a problem.
# Accuracy
0.8976315789473684
# Confusion matrix
[[1718 47 83 52]
[ 20 1838 23 19]
[ 72 31 1635 162]
[ 81 37 151 1631]]
Rough summary
I confirmed that simple text classification with NCD using gzip plus kNN, without training, can outperform BERT in some cases. As the paper says, it seems likely to work well on small datasets of a few hundred to a few thousand items where the text length is reasonably long, such as news articles. Among the datasets I tried, the livedoor news corpus is exactly that kind of dataset.
This paper's approach looks worth trying as one classifier for casual text classification that does not require pretraining. The implementation is also simple.
However, as the paper says, computational cost grows as the data grows. At prediction time, if M is the number of train items and N is the number of items to predict, the cost is roughly O(M*N). The more data you have, the more prediction cost you pay. It is not learning features; it is doing a direct full calculation, so it is slow. For example, the MARC-ja dataset has M = 187528 and N = 5654, so the computation cost is large. Even using 32 virtual cores on a Ryzen 7950X CPU fully, predicting 5654 items takes about 30 minutes.
Still, it was an interesting paper showing that a simple approach without pretraining can produce enough performance depending on the use case.