cat articles/weekly-ai-news

Starting Weekly AI News: automated summaries with clustering and GPT

AI-related topics continue to be numerous, and I thought it would be useful to know roughly what became a topic each week. So I started a weekly newsletter on Substack. The content is created fully automatically. For example, the AI news summary for the week going back from July 28, 2023 looked like this:

I would not call it perfect, but I think it gathers reasonably notable topics in a decent way. If you are interested, please subscribe or read it through a feed reader.

Substack signup form

The newsletter title is a tribute to Weekly Kaggle News.


That would be only publicity, so I will also write about the internal implementation. It has not changed drastically from the material I linked before, but roughly:

  • Generate features, or sentence vectors, from title + summary using multilingual-e5-small.
    • I use small so that it runs on the low-spec VPS environment. Subjectively, small did not feel much less accurate.
  • Add a standardized article timestamp vector to the 384 dimensions from e5-small, making a 385-dimensional representation.
  • Run KMeans without dimensionality reduction. The number of clusters is total article count divided by 8, chosen roughly. With about 250 target articles, this gives around 30 clusters.
    • Reducing dimensions with UMAP or PCA did not produce very good results.
  • Look at overall distances, extract only articles near each cluster center, and use clusters where at least N articles remain.

This extracts clusters that look meaningful as groups of articles from the week. Then I generate titles and summaries for those clusters with gpt-3.5-turbo. It is basically ordinary BERTopic-like clustering plus GPT-based topic representation. In other words, a topic model implementation.


Recent BERTopic implementations on GitHub also seem to include OpenAI and LLM-based features, such as creating sentence vectors with OpenAI embeddings, or ada-v2, in addition to sentence-transformers, and creating topic representations with ChatGPT or GPT-4. By default, it seems to include c-TF-IDF keyword extraction in the prompt for generation. If you want to try this quickly with a library, BERTopic may be a good option.


Incidentally, newsletter article creation is fully automatic, but Substack itself does not seem to have a mechanism that lets me send the newsletter by calling an API. The final delivery flow is manual through the Web UI, which is unfortunate for my own workload.

cat related_articles/weekly-ai-news.yaml

  1. Launching AI News and how I used OpenAI behind itI launched AI News, a site that collects AI, data science, and machine learning topics and summarizes them into three lines with AI. This article describes why I built it and how I used OpenAI APIs for classification and summarization.
  2. Releasing a Japanese StaticEmbedding Model for Practical 100x Faster Text EmbeddingsI released static-embedding-japanese, a fast non-Transformer embedding model for Japanese and English text, and evaluated it on JMTEB.
  3. Quantizing fastText to build a practical 1.7 MB text classifierI built a text classifier for AI News with fastText and quantization, reducing the model to 1.7 MB while keeping practical accuracy and recall for filtering AI-related English articles.