cat articles/ainews

Launching AI News and how I used OpenAI behind it

created 2023-04-24

I launched a site called AI News. It collects topics related to AI, data science, and machine learning, summarizes them into three lines with AI, and publishes them. It is also available through Twitter @AINewsDev and an Atom feed. I have been running it for a few days, and although it is my own site, it has already been useful for collecting information. English articles are summarized in Japanese too, which is convenient.

Why I built it

I used to follow ML-related news through ML-News, a site made by @syou6162. It seems to have become unavailable around the time Twitter API pricing became an issue.

That made it harder to follow data science and machine learning topics, and I had been thinking about building a similar site someday. Then ChatGPT and GPT-4 appeared as genuinely useful LLMs, and as everyone knows, the volume of AI news exploded. There was too much to read, but I still wanted to read the things I cared about properly. What I wanted was a reliable overview to look at first, so that I could decide whether to read the full article. The description in an article's HTML is often just the first hundred characters or so, which is not enough for that judgment. I realized that this was exactly the kind of task an LLM such as ChatGPT could do, so I built it.

Implementation notes

Web scraping, article body extraction, and the website implementation are ordinary web development topics, so I will leave them aside for now and focus first on ChatGPT and the OpenAI API.

Article classification

I needed something that takes scraped web articles and decides whether each one is related to AI. For anyone who does machine learning, binary classification of AI-related versus not is easy once correct labels exist. The problem is that creating those labels is tedious. I wanted AI itself to judge the articles instead of labeling everything by hand.

So I first asked GPT-3.5 to label the data. However, asking it to score how AI-like a topic is on a numeric scale from 0.0 to 1.0 was surprisingly unstable. I tried hard to write prompts that would make the output look like a probability distribution, but my prompting ability was not enough. What I wanted was a softmax-like probability distribution, so instead of describing it in prose, I wrote more directly that the values should be as if passed through a softmax function. That worked better. The final prompt is here:

https://gist.github.com/hotchpotch/8cb74d7a2ed1730faf1ec1ba089f93cf

I made it evaluate multiple AI-like categories and an "Others" category. When I fed it the roughly 400-character summaries described later, I often got output like this. Each value is between 0.0 and 1.0, and the total is 1.0, so it feels softmax-like.

{
    "AI": 0.0,
    "Machine Learning": 0.0,
    "Data Science": 0.0,
    "Data Analysis": 0.0,
    "Statistics": 0.8,
    "Deep Learning": 0.0,
    "kaggle": 0.0,
    "ChatGPT": 0.0,
    "MLOps": 0.0,
    "Generative AI": 0.0,
    "LLM": 0.0,
    "Others": 0.2
}

But sometimes it produced output like this. The values are between 0.0 and 1.0, but the total is greater than 1.0. What happened to softmax?

{
    "AI": 0.5,
    "Machine Learning": 1.0,
    "Data Science": 1.0,
    "Data Analysis": 1.0,
    "Statistics": 0.5,
    "Deep Learning": 0.0,
    "kaggle": 0.0,
    "ChatGPT": 0.8,
    "MLOps": 0.0,
    "Generative AI": 0.0,
    "LLM": 0.0,
    "Others": 0.2
}

If I pass that output through an actual softmax function, I get this. The values form a 0.0 to 1.0 distribution and sum to 1.0, so I can use this.

{
    "AI": 0.08285351386643752,
    "Machine Learning": 0.13660235066384355,
    "Data Science": 0.13660235066384355,
    "Data Analysis": 0.13660235066384355,
    "Statistics": 0.08285351386643752,
    "Deep Learning": 0.05025319642492017,
    "kaggle": 0.05025319642492017,
    "ChatGPT": 0.11184054543123119,
    "MLOps": 0.05025319642492017,
    "Generative AI": 0.05025319642492017,
    "LLM": 0.05025319642492017,
    "Others": 0.06137939271976228
}
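As a minimal sketch of that normalization step (my own illustration, not necessarily the code the site runs), a softmax over the score dictionary can be written with the standard library alone. The function name `softmax` and the variable names are assumptions.

```python
import math

def softmax(scores):
    """Normalize a dict of raw scores into a distribution that sums to 1.0."""
    # Subtracting the maximum before exponentiating keeps exp() stable.
    peak = max(scores.values())
    exps = {label: math.exp(value - peak) for label, value in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}

# The over-budget GPT-3.5 output shown above.
raw = {
    "AI": 0.5, "Machine Learning": 1.0, "Data Science": 1.0,
    "Data Analysis": 1.0, "Statistics": 0.5, "Deep Learning": 0.0,
    "kaggle": 0.0, "ChatGPT": 0.8, "MLOps": 0.0,
    "Generative AI": 0.0, "LLM": 0.0, "Others": 0.2,
}
normalized = softmax(raw)
for label, value in normalized.items():
    print(f"{label}: {value:.4f}")
```

Because softmax is applied even when the model already returned values summing to 1.0, every output goes through the same path and ends up on the same scale.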

Using this data, I treated items where "Others" had the highest score as non-AI, then checked the labels and corrected the wrong ones by hand. That gave me N=550 labels: 200 AI-related and 350 others. Measured against the corrected labels, GPT-3.5's original labels were about 94% accurate, which is quite high. The check is biased, because I used the AI output to decide which items were likely mistakes, but even so the accuracy was good, and many of the mistakes were borderline cases. I could probably improve it further by tuning the prompt or using GPT-4, but the goal was to create correct labels for training a classifier. That was achieved, so I considered this good enough for now.
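The "Others is highest" rule can be sketched in a few lines. This is an illustration under assumptions: the function name `is_ai_related` is hypothetical, and the tie-breaking behavior (the first key in dictionary order wins) is my choice, since the article does not say how ties were handled.

```python
def is_ai_related(scores):
    """Return False when "Others" has the highest score, True otherwise."""
    # max() returns the first label in dict order when scores are tied.
    top_label = max(scores, key=scores.get)
    return top_label != "Others"

print(is_ai_related({"Statistics": 0.8, "Others": 0.2}))
print(is_ai_related({"AI": 0.1, "Others": 0.9}))
```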

I listed many categories because when I wrote the task as something like "machine learning, AI, data science, or anything else", GPT-3.5's output felt less stable.

I corrected the labels by hand on a review screen. It was much easier than labeling everything from scratch, although still tedious.

Building a classifier

Creating 550 correct labels was manageable, so next I built a classifier to decide whether an article is AI-related. To generate features, I used the text-embedding-ada-002 model of OpenAI's Embeddings API to convert article bodies into 1536-dimensional vectors. Its price per 1K tokens is only 20% of gpt-3.5-turbo's, which is nice.
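A rough sketch of the embedding step follows. It assumes the current `openai` Python package (version 1 or later), where the call is `client.embeddings.create`. The site was built in April 2023, so the original code most likely used the older interface of that library. The function name `embed_texts` is hypothetical.

```python
def embed_texts(client, texts, model="text-embedding-ada-002"):
    """Return one embedding vector (a list of floats) per input text.

    `client` is assumed to be an openai.OpenAI instance (openai>=1.0).
    """
    response = client.embeddings.create(model=model, input=texts)
    # Sort by index so the vectors line up with the input order.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]

# Usage (needs the openai package and OPENAI_API_KEY in the environment):
# from openai import OpenAI
# vectors = embed_texts(OpenAI(), ["article body ..."])
# len(vectors[0])  # 1536 dimensions for text-embedding-ada-002
```

Passing the client in as an argument keeps the function easy to test with a stub and avoids an API call at import time.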

According to OpenAI's blog, text-similarity-davinci-001 seems to be more accurate for classification prediction. Still, I want to use embeddings for various things in the future, so I chose text-embedding-ada-002 for its generality.

Now I had 1536-dimensional features, so I split the labeled data into train, validation, and test sets and built a classifier. This time I used LightGBM, which is familiar to Kagglers. It had been several months since I last used LightGBM, and reading the documentation while implementing it felt like a chore, so I asked ChatGPT. It quickly produced working code, which surprised me. I was able to use it almost as-is.

https://gist.github.com/hotchpotch/81cf130279f4df9aeccd20e51678cff4

The code in that gist splits the data into 80% train, 10% validation, and 10% test, but because I did not have much data in the end, I changed the split to 60% train, 30% validation, and 10% test. The trained model reached a validation accuracy of 0.987 and a test accuracy of 1.0. Test accuracy of 100%! Of course, the test set is only about 55 items, so it may be chance. When I tried a few other random seeds, accuracy ranged from 0.96 to 1.0. Even with features from text-embedding-ada-002 without any fine-tuning, that is a very good score for a classification task. For NLP classification with only 330 training samples (60% of 550), it is impressive.
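To make the split sizes concrete, here is a small sketch of a 60/30/10 split over 550 items using only the standard library. It is not the code from the gist. The function name `split_indices` and the seed are assumptions.

```python
import random

def split_indices(n_items, val_ratio=0.3, test_ratio=0.1, seed=0):
    """Shuffle item indices and split them into train, validation, and test."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = round(n_items * test_ratio)
    n_val = round(n_items * val_ratio)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]  # whatever remains, about 60%
    return train, val, test

train, val, test = split_indices(550)
print(len(train), len(val), len(test))  # 330 165 55
```

With only 55 test items, a single misclassification moves test accuracy by about 1.8 points, which is why the score shifts between seeds.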

This completed the classifier for deciding whether an article is AI-related. Since then I have added various data sources, so at the moment some non-AI articles occasionally slip through and get displayed. I plan to retrain the classifier later and make it smarter.

Creating article summaries

For article summaries, if money were no issue, asking gpt-4 to summarize the whole article would be the most accurate. But its token cost is 15 times that of gpt-3.5-turbo. Fifteen times. That is a lot for a hobby project, so I wanted to keep the cost as low as possible.
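The arithmetic behind that ratio can be sketched as follows. The per-token prices are assumptions based on the public price list for input tokens around April 2023. They are consistent with the 15x figure above but may no longer be current, and the name `prompt_cost` is hypothetical.

```python
# Assumed USD prices per 1K input tokens (April 2023 price list).
PRICE_PER_1K = {"gpt-3.5-turbo": 0.002, "gpt-4": 0.03}

def prompt_cost(model, n_tokens):
    """Estimated USD cost of sending n_tokens of input to the model."""
    return PRICE_PER_1K[model] * n_tokens / 1000

cheap = prompt_cost("gpt-3.5-turbo", 4000)
strong = prompt_cost("gpt-4", 4000)
ratio = strong / cheap
print(f"gpt-3.5-turbo: ${cheap:.3f}  gpt-4: ${strong:.3f}  ratio: {ratio:.0f}x")
```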

I asked GPT-4 and GPT-3.5 to summarize roughly the first 4K tokens of article text into about 400 Japanese characters and compared the results subjectively. GPT-4 produced better summaries, but they did not feel overwhelmingly better than GPT-3.5. Considering cost and processing time, I first use GPT-3.5 to summarize the first roughly 4K tokens into about 400 characters.

The difference between GPT-4 and GPT-3.5 became clearer when compressing the information further. When I reduced summaries to around 80 characters for Twitter posts, GPT-4 was much better. When the prompt specified a constraint such as "around 80 characters in Japanese", GPT-4 followed the constraint much more closely. GPT-3.5 sometimes produced much longer text, so GPT-4's ability to respect the character-limit constraint was valuable for Twitter posting.
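One way to guard against overly long output before posting is a plain length check. The sketch below is an assumption on my part, not something the site is known to do. The names `within_limit` and `first_fitting` and the slack of 20 characters are hypothetical.

```python
def within_limit(text, target=80, slack=20):
    """Return True when text has at most target + slack characters."""
    # len() counts Unicode code points, so one kana or kanji counts as 1.
    return len(text) <= target + slack

def first_fitting(candidates, target=80, slack=20):
    """Return the first candidate within the limit, or None if none fits."""
    for text in candidates:
        if within_limit(text, target, slack):
            return text
    return None
```

Note that some emoji consist of several code points, so `len()` can overcount text that contains them.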

GPT-4 is also much better at handling several tasks in one prompt. If I ask it in one request for an "around 80 characters" summary, an "around 80 characters in a casual style" summary, "a three-emoji summary", and "a three-line summary", GPT-4 almost always delivers all of them. GPT-3.5 seems to struggle with multiple tasks in a single run.

At the moment I use this prompt:

https://gist.github.com/hotchpotch/427b2c24a1368a6f54d79d3f282c9445

Running that prompt through GPT-4 gives results for multiple tasks like this:

{
  "Bullets": ["約2650チーム中15位で金メダル獲得", "Kaggle Competitions Masterの称号取得", "CV・LB相関が観測できず、最終結果は大幅なshake予想"],
  "Summary": "Kaggleのコンペティションで15位の金メダルを獲得し、Kaggle Competitions Masterの称号を手に入れた。",
  "SummaryEmojis": "🏆Kaggleのコンペで15位の金メダル🥇を獲得し、Kaggle Competitions Master👑の称号を手に入れた🎉",
  "Emojis": "🥇👑🎉"
}
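Since the model returns JSON, it is worth validating the response before using it. The following is a minimal sketch: the key names come from the output above, while the function name `parse_summaries` and the specific checks are assumptions.

```python
import json

REQUIRED_KEYS = ("Bullets", "Summary", "SummaryEmojis", "Emojis")

def parse_summaries(raw_text):
    """Parse the model's JSON reply and check that every expected key exists."""
    data = json.loads(raw_text)  # raises json.JSONDecodeError on bad JSON
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if not isinstance(data["Bullets"], list):
        raise ValueError("Bullets must be a list")
    return data
```

A failed parse is a natural point to retry the request instead of publishing a broken summary.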

So summary generation is split into two steps:

First, summarize to about 400 characters with GPT-3.5 to save money and time.
Then, create multiple shorter summaries from that 400-character summary with GPT-4. This costs more money and time, but the quality is higher.
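The two steps above can be sketched as a small pipeline. The model calls are passed in as plain functions so that the example stays independent of any particular API client. `two_step_summary`, `cheap_model`, and `strong_model` are hypothetical names.

```python
from typing import Callable, Dict

def two_step_summary(
    article_text: str,
    cheap_model: Callable[[str], str],
    strong_model: Callable[[str], Dict[str, object]],
) -> Dict[str, object]:
    """Summarize with a cheap model first, then refine with a stronger one."""
    # Step 1: the cheaper model condenses the article to about 400 characters.
    draft = cheap_model(article_text)
    # Step 2: the stronger model sees only the draft, so its input stays small.
    variants = strong_model(draft)
    return {"Draft": draft, **variants}
```

The expensive model never reads the full article, only the roughly 400-character draft, which is where the savings come from.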

As people often say, prompts written in English generally produced better output than prompts written in Japanese. I am not good at English, so I translated my prompts with DeepL, but even that produced better results in English.

Ordinary web development

In addition to the OpenAI and machine learning work described above, I also implemented the following. This part took around 70% of the total development time, which is about what I expected.

Scrapers for various sites in Python
Saving data to the backend database and related systems in Python
Batch job implementation
Website implementation with Next.js, TypeScript, and Chakra UI

I wrote much of this implementation based on code generated in the ChatGPT web UI with GPT-3.5 and GPT-4. For Python, which I also write at work, I sometimes felt it would have been faster to write the code myself. Still, it was very useful for small pieces of implementation, such as a simple function or a regular expression.

It had been a long time since I used Next.js and TypeScript, and Chakra UI was new to me, so ChatGPT-generated code was especially useful there because I had less knowledge. However, the Next.js code was probably based on version 11 or 12 from ChatGPT's training data rather than the current version 13, so it sometimes produced deprecated structures. That is part of the charm.

GPT-4 produced higher-quality code, but it was slower, so I mostly used GPT-3.5 for small code generation. I used GPT-4 when GPT-3.5's output looked suspicious or when the code had to satisfy many conditions. Use the right tool for the job.

Even with the current GPT-3.5 and GPT-4, better VS Code integration alone would make development much more convenient. If code generation gets smarter over the next one or two years, a workflow where I define requirements, review diffs, press y/N, and occasionally give feedback starts to feel realistic.

The future of AI News

For now I have only built the minimum necessary pieces, so I plan to keep improving it bit by bit. It is a personal sandbox, and it is also a website I can use conveniently myself, so maintaining it is fun in the way tending a bonsai might be, although I have never actually tended one. The article embeddings are currently used only for binary classification, but they should be useful for many other things too. I expect I will keep tinkering with it for a while.