
Enjoying Stable Diffusion again from a technical perspective

Recently I used Stable Diffusion again through stable-diffusion-webui, and there were several technical things I did not know. These are my notes.

ControlNet

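ControlNet is an extremely powerful neural network that steers image generation so the output follows a given condition. It runs alongside the base model rather than replacing it, so it can be combined with different checkpoints built on the architecture it was trained for.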

As of v1.1, it supports conditions such as depth, semantic segmentation, human pose, fake scribbles, HED boundary, M-LSD lines, and Canny edge. Starting from a source image, it can condition generation on composition, pose, segmentation, extracted edges, completion of a masked region, and many other signals. The way it combines existing datasets and architectures is also exciting, and the range of applications is wide.

It is innovative enough that anyone who has not used ControlNet image generation should try it. In the illustration-generation context, people often focus only on pose control, but it can reproduce many kinds of composition. It is seriously impressive. More types of conditional image generation will probably become possible from here.

There is also a ControlNet extension for SD-WebUI, sd-webui-controlnet, so it is easy to use from SD-WebUI.

Clear explanations and related material:

LoRA: Low-rank Adaptation for Fast Text-to-Image Diffusion Fine-tuning

The original LoRA paper, LoRA: Low-Rank Adaptation of Large Language Models, targets LLMs, that is, Transformers. LoRA for Stable Diffusion usually relies on implementations adapted for diffusion models, such as cloneofsimo/lora. The cloneofsimo/lora implementation can also train with Pivotal Tuning Inversion (PTI) for higher-quality output.

LoRA keeps the base model's weights frozen, adds small low-rank matrices alongside them, and trains only those additions. This reduces training cost. Because the trained parameters are small, both the parameter file and the memory usage are smaller as well.
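
As a rough illustration of the idea, here is a minimal sketch in plain Python. The sizes (64 by 64, rank 4) and the helper names are made up for this example. Starting one factor at zero follows the original LoRA paper. None of this is code from an actual Stable Diffusion implementation.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def count_params(m):
    """Number of scalar entries in a matrix."""
    return sum(len(row) for row in m)

def effective_weight(base, lora_b, lora_a, scale=1.0):
    """Return base + scale * (lora_b @ lora_a) without modifying base."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]

random.seed(0)
D_OUT, D_IN, RANK = 64, 64, 4  # illustrative sizes only

# Frozen base weight: never updated during LoRA training.
base_w = [[random.gauss(0.0, 0.02) for _ in range(D_IN)] for _ in range(D_OUT)]

# Trainable factors. B starts at zero, so the update is a no-op before training.
lora_b = [[0.0] * RANK for _ in range(D_OUT)]
lora_a = [[random.gauss(0.0, 0.02) for _ in range(D_IN)] for _ in range(RANK)]

full_params = count_params(base_w)
lora_params = count_params(lora_b) + count_params(lora_a)
print(full_params, lora_params)  # 4096 512
```

In this toy case the two LoRA factors hold 512 values against 4096 in the full matrix, which is where the smaller file size comes from.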

From the point of view of someone using LoRA for image generation, LoRA is easy to layer. You can apply separately trained LoRA weights B and C on top of base model A and, without major changes, generate images with characteristics of both. In SD-WebUI, you can quickly specify which LoRA to apply and at what strength directly in the text prompt, such as <lora:model_a:1.0>, <lora:model_b:0.7>.
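
The layering can be sketched as follows, assuming each LoRA has already been expanded into a weight delta of the same shape as the base weight. The regular expression and the functions `parse_lora_tags` and `apply_loras` are hypothetical helpers written for this note; they are not SD-WebUI's actual parser, which may accept other forms of the tag.

```python
import re

# Hypothetical pattern for tags of the form <lora:name:strength>.
LORA_TAG = re.compile(r"<lora:([^:>]+):([0-9]*\.?[0-9]+)>")

def parse_lora_tags(prompt):
    """Split a prompt into plain text and a list of (name, strength) pairs."""
    tags = [(name, float(strength)) for name, strength in LORA_TAG.findall(prompt)]
    text = " ".join(LORA_TAG.sub("", prompt).split())
    return text, tags

def apply_loras(base, deltas, tags):
    """Return base + sum(strength * delta) over the requested LoRAs."""
    merged = [row[:] for row in base]  # copy so the base weight stays untouched
    for name, strength in tags:
        for i, row in enumerate(deltas[name]):
            for j, value in enumerate(row):
                merged[i][j] += strength * value
    return merged

# Toy 2x2 weight and two toy LoRA deltas (illustrative values only).
base = [[1.0, 0.0], [0.0, 1.0]]
deltas = {
    "model_a": [[0.5, 0.0], [0.0, 0.0]],
    "model_b": [[0.0, 0.0], [0.0, 1.0]],
}

prompt_text, tags = parse_lora_tags(
    "a castle at sunset <lora:model_a:1.0> <lora:model_b:0.7>"
)
merged = apply_loras(base, deltas, tags)
print(prompt_text)
print(tags)
print(merged)
```

Because each LoRA only contributes an additive delta, changing the strength or dropping one LoRA does not require touching the base weights.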

A clear Japanese deck about LoRA for LLMs is here.

Textual Inversion embeddings

In Stable Diffusion, one of the inputs to generation is the output of the CLIP text encoder, that is, the embeddings computed from the prompt. My understanding is a little vague, but textual inversion trains the embedding of a new pseudo-word so that it captures a specific feature, and that word is then used in the prompt to steer generation toward the intended output.

Normally, humans adjust output by putting strings into the text prompt. With textual inversion, you can directly insert text embeddings that have learned a specific expression, so you can make finer adjustments than ordinary text can express. After training, the result is only a few vectors with the same dimensionality as the ordinary text embeddings, so the file size is extremely small.
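
A toy sketch of where the learned embedding ends up, under heavy simplification: the four-dimensional vectors, the whitespace tokenizer, the token name `<my-style>`, and the helper functions are all assumptions for illustration. A real text encoder uses subword tokens and much larger vectors, and the learned vector is produced by training while everything else stays frozen; here it is simply hard-coded.

```python
EMBED_DIM = 4  # illustrative; real text encoders use far larger vectors

# Toy vocabulary and frozen embedding table (one row per known word).
vocab = {"a": 0, "photo": 1, "of": 2, "cat": 3}
embedding_table = [
    [0.1, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.4],
]

def add_learned_token(name, vector):
    """Register a new pseudo-word whose vector came from textual inversion."""
    if len(vector) != EMBED_DIM:
        raise ValueError("learned vector must match the embedding size")
    vocab[name] = len(embedding_table)
    embedding_table.append(list(vector))

def embed_prompt(prompt):
    """Look up one vector per whitespace-separated word."""
    return [embedding_table[vocab[word]] for word in prompt.split()]

# The whole "trained file" is this single vector (hard-coded stand-in).
add_learned_token("<my-style>", [0.3, -0.2, 0.7, 0.1])

vectors = embed_prompt("a photo of <my-style> cat")
print(len(vectors), vectors[3])
```

The existing rows are never changed; only a new row is added, which is why the resulting file can be so small.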

Well-known examples include EasyNegative, which learns characteristics of words used in negative prompts and lets you add a good negative prompt easily, and badhandv4 and bad_prompt, which suppress strange arms and fingers.

Merging checkpoint models

Checkpoint merging combines checkpoint models to create another checkpoint. What it does is surprisingly simple: it linearly combines the model parameters, essentially adding them together with weights. This alone can create a model C that has characteristics of both models A and B. The many models named XxxMix that you see around are merged models made from multiple models. That said, not every merged model is a single uniform linear combination. Some seem to use techniques such as changing the ratio per network layer, though that is still a kind of linear combination.
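
The weighted sum can be sketched in a few lines. This is a minimal illustration, assuming both checkpoints are dictionaries with identical parameter names and shapes; the function `merge_checkpoints`, the `layer_ratios` option, and the parameter names are hypothetical and do not come from SD-WebUI.

```python
def merge_checkpoints(ckpt_a, ckpt_b, ratio, layer_ratios=None):
    """Weighted sum of two checkpoints with identical parameter names.

    Each parameter becomes (1 - r) * A + r * B, where r is `ratio` unless
    a prefix in `layer_ratios` matches the parameter name.
    """
    layer_ratios = layer_ratios or {}
    merged = {}
    for name, values_a in ckpt_a.items():
        values_b = ckpt_b[name]
        r = ratio
        for prefix, layer_ratio in layer_ratios.items():
            if name.startswith(prefix):
                r = layer_ratio  # per-layer override of the global ratio
                break
        merged[name] = [(1.0 - r) * a + r * b for a, b in zip(values_a, values_b)]
    return merged

# Toy checkpoints: flat lists stand in for real weight tensors.
model_a = {"input.weight": [0.0, 0.0], "output.weight": [0.0, 0.0]}
model_b = {"input.weight": [1.0, 1.0], "output.weight": [1.0, 1.0]}

uniform = merge_checkpoints(model_a, model_b, ratio=0.25)
layered = merge_checkpoints(model_a, model_b, ratio=0.25,
                            layer_ratios={"output.": 0.75})
print(uniform)  # {'input.weight': [0.25, 0.25], 'output.weight': [0.25, 0.25]}
print(layered)  # {'input.weight': [0.25, 0.25], 'output.weight': [0.75, 0.75]}
```

The second call shows the per-layer variant: the ratio changes by layer, but every parameter is still a linear combination of the two sources.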


SD-WebUI keeps making these new techniques easy to use. For example, LoRA files, embeddings, and base models work just by placing them in the appropriate directories, and extensions such as sd-webui-controlnet can add features. The UI is not exactly approachable, but the system is well made. There is also a wiki page that roughly explains all the features. With plain Stable Diffusion you have to find out for yourself what is possible, but SD-WebUI usually picks up current trends, so it is also useful for understanding what is popular.

For image generation AI, I mostly use Midjourney. But touching Stable Diffusion again made me appreciate how the ecosystem has evolved, and how interesting it has become, because the software is open source and checkpoints are published. Looked at closely, many parts are interesting, and I could easily get pulled in deep, so I am deliberately not digging too far. Still, it seems like a rewarding area to get absorbed in.

After writing this much, I remembered that A New Era of AI Art: Image Generation Technology and Applications Using CLIP and Stable Diffusion had been sitting unread, so I started reading it. It covers the topics in this note, of course, and also explores many experimental approaches around the CLIP encoder as well as Stable Diffusion. It includes results showing what happens when each approach is applied, so you can understand how generated images change. Reading it is definitely enjoyable.
