cat articles/weekly-ai-news
Starting Weekly AI News: automated summaries with clustering and GPT
AI-related topics continue to be numerous, and I thought it would be useful to know roughly what became a topic each week. So I started a weekly newsletter on Substack. The content is created fully automatically. For example, the AI news summary for the week going back from July 28, 2023 looked like this:
I would not call it perfect, but I think it gathers reasonably notable topics in a decent way. If you are interested, please subscribe or read it through a feed reader.
The newsletter title is a tribute to Weekly Kaggle News.
That would be only publicity, so I will also write about the internal implementation. It has not changed drastically from the material I linked before, but roughly:
- Generate features, or sentence vectors, from title + summary using multilingual-e5-small.
- I use small so that it runs on the low-spec VPS environment. Subjectively, small did not feel much less accurate.
- Add a standardized article timestamp vector to the 384 dimensions from e5-small, making a 385-dimensional representation.
- Run KMeans without dimensionality reduction. The number of clusters is total article count divided by 8, chosen roughly. With about 250 target articles, this gives around 30 clusters.
- Reducing dimensions with UMAP or PCA did not produce very good results.
- Look at overall distances, extract only articles near each cluster center, and use clusters where at least N articles remain.
This extracts clusters that look meaningful as groups of articles from the week. Then I generate titles and summaries for those clusters with gpt-3.5-turbo. It is basically ordinary BERTopic-like clustering plus GPT-based topic representation. In other words, a topic model implementation.
Recent BERTopic implementations on GitHub also seem to include OpenAI and LLM-based features, such as creating sentence vectors with OpenAI embeddings, or ada-v2, in addition to sentence-transformers, and creating topic representations with ChatGPT or GPT-4. By default, it seems to include c-TF-IDF keyword extraction in the prompt for generation. If you want to try this quickly with a library, BERTopic may be a good option.
Incidentally, newsletter article creation is fully automatic, but Substack itself does not seem to have a mechanism that lets me send the newsletter by calling an API. The final delivery flow is manual through the Web UI, which is unfortunate for my own workload.