cat articles/gpt-image

Generating answers from images with ChatGPT 3.5 and extracting information through BLIP-2 prompts

created 2023-07-05

This is about how image-based answer generation, similar to what can be done with the ChatGPT Web UI and GPT-4, can also be useful with ChatGPT 3.5 and BLIP-2 if the requirements match. What I really want to talk about is information extraction through prompts using BLIP-2.

The reason I tried this was that I wondered how GPT-4 generates answers from images. While looking into it, I found BLIP-2 explained as an example of incorporating VQA tasks into an LLM in Current Status and Prospects of Vision and Language (GPT-4).

I had roughly thought of BLIP as something for generating captions from images. BLIP-2, however, trains with an open LLM behind it, which lets it learn information that cannot be represented by image-caption pairs alone and improves accuracy.

What is useful for users is that it is not limited to ordinary caption generation. You can extract information by making it solve a Q&A task about an image.

For example, "a night view of Tokyo Tower is shown" is generated by having GPT-3.5 answer based on a simple caption. The flow for the opening example, "please suggest several recipes", is as follows:

Use image_question_prompt with "What ingredients are there?" to extract information from the image with BLIP-2.
- The QA task only accepts English, so translate with Meta's NLLB-200, which is for non-commercial use only.
- The prompt becomes Question: What are the ingredients? Answer:.
The extracted text becomes "Vegetables, carrots, cabbage, cauliflower, broccoli, and potatoes".
After that it is easy. Use that text and have GPT-3.5 answer with few-shot prompting.

That is the simple mechanism. If you write the information-extraction prompt directly in English for image_question_prompt, you can process it without the translation step.

This cannot handle advanced context in the way ChatGPT with GPT-4 can. But if the task can be combined with information extraction that BLIP-2 can perform, it can work reasonably well.

What I really want to say in this article is what I wrote above: with an appropriate prompt, BLIP-2 can perform a certain amount of information extraction. ChatGPT's own API will probably become able to handle image-based information extraction tasks, but BLIP-2 plus open LLM models should also continue to improve. I expect zero-shot information extraction with open models to become more accurate, and I am looking forward to that.

References