cat articles/gpt-image

Generating answers from images with ChatGPT 3.5 and extracting information through BLIP-2 prompts

This is about how image-based answer generation, similar to what can be done with the ChatGPT Web UI and GPT-4, can also be useful with ChatGPT 3.5 and BLIP-2 if the requirements match. What I really want to talk about is information extraction through prompts using BLIP-2.

recipe
recipe

The reason I tried this was that I wondered how GPT-4 generates answers from images. While looking into it, I found BLIP-2 explained as an example of incorporating VQA tasks into an LLM in Current Status and Prospects of Vision and Language (GPT-4).

vqa_blip
vqa_blip

I had roughly thought of BLIP as something for generating captions from images. BLIP-2, however, trains with an open LLM behind it, which lets it learn information that cannot be represented by image-caption pairs alone and improves accuracy.

blip-2
blip-2

What is useful for users is that it is not limited to ordinary caption generation. You can extract information by making it solve a Q&A task about an image.

tokyo_tower
tokyo_tower

For example, "a night view of Tokyo Tower is shown" is generated by having GPT-3.5 answer based on a simple caption. The flow for the opening example, "please suggest several recipes", is as follows:

  • Use image_question_prompt with "What ingredients are there?" to extract information from the image with BLIP-2.
    • The QA task only accepts English, so translate with Meta's NLLB-200, which is for non-commercial use only.
    • The prompt becomes Question: What are the ingredients? Answer:.
  • The extracted text becomes "Vegetables, carrots, cabbage, cauliflower, broccoli, and potatoes".
  • After that it is easy. Use that text and have GPT-3.5 answer with few-shot prompting.

That is the simple mechanism. If you write the information-extraction prompt directly in English for image_question_prompt, you can process it without the translation step.


This cannot handle advanced context in the way ChatGPT with GPT-4 can. But if the task can be combined with information extraction that BLIP-2 can perform, it can work reasonably well.

What I really want to say in this article is what I wrote above: with an appropriate prompt, BLIP-2 can perform a certain amount of information extraction. ChatGPT's own API will probably become able to handle image-based information extraction tasks, but BLIP-2 plus open LLM models should also continue to improve. I expect zero-shot information extraction with open models to become more accurate, and I am looking forward to that.

References

cat related_articles/gpt-image.yaml

  1. Enjoying Stable Diffusion again from a technical perspectiveAfter using Stable Diffusion again through stable-diffusion-webui, I wrote notes on the surrounding techniques I had not followed closely: ControlNet, LoRA, textual inversion embeddings, and checkpoint merging.
  2. How secon.dev was implemented, December 2022 editionA snapshot of how secon.dev worked in late 2022: Markdown files synced through Dropbox, static builds with Next.js, related-article generation, image processing on GCS, and the parts I wanted to improve next.
  3. Launching AI News and how I used OpenAI behind itI launched AI News, a site that collects AI, data science, and machine learning topics and summarizes them into three lines with AI. This article describes why I built it and how I used OpenAI APIs for classification and summarization.