RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths




Text-to-image generation has recently witnessed remarkable achievements. We introduce a text-conditional image diffusion model, termed RAPHAEL, to generate highly artistic images, which accurately portray the text prompts, encompassing multiple nouns, adjectives, and verbs. This is achieved by stacking tens of mixture-of-experts (MoEs) layers, i.e., space-MoE and time-MoE layers, enabling billions of diffusion paths (routes) from the network input to the output. Each path intuitively functions as a “painter” for depicting a particular textual concept onto a specified image region at a diffusion timestep. Comprehensive experiments reveal that RAPHAEL outperforms recent cutting-edge models, such as Stable Diffusion, ERNIE-ViLG 2.0, DeepFloyd, and DALL-E 2, in terms of both image quality and aesthetic appeal. Firstly, RAPHAEL exhibits superior performance in switching images across diverse styles, such as Japanese comics, realism, cyberpunk, and ink illustration. Secondly, a single model with three billion parameters, trained on 1,000 A100 GPUs for two months, achieves a state-of-the-art zero-shot FID score of 6.61 on the COCO dataset. Furthermore, RAPHAEL significantly surpasses its counterparts in human evaluation on the ViLG-300 benchmark. We believe that RAPHAEL holds the potential to propel the frontiers of image generation research in both academia and industry, paving the way for future breakthroughs in this rapidly evolving field. More details can be found on a project webpage: this https URL.

Zero-shot Learning

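First, what is zero-shot learning? Taken literally, it means that no training samples at all are provided for certain classes; in other words, a transfer task in which those classes have no labeled samples is called zero-shot learning.

Zero-shot learning aims to recognize categories of data that appear at test time but were never encountered during training. We learn a mapping X -> Y; if this mapping is good enough, it can handle classes that were never seen, which is why zero-shot learning can be regarded as a form of transfer learning.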
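The idea of a mapping that carries over to unseen classes can be sketched as follows. This is a minimal illustration, assuming each class (including the unseen one) comes with a hand-written attribute description and that some learned model has already mapped the input into that attribute space. The attribute names, the values in `CLASS_ATTRIBUTES`, and the function names are all hypothetical, and nearest-description matching by cosine similarity is just one common way to use such a mapping.

```python
import math

# Hypothetical class descriptions in a shared attribute space.
# Attribute order: (striped, hooved, can_fly). Values are illustrative only.
CLASS_ATTRIBUTES = {
    "horse": (0.0, 1.0, 0.0),  # seen during training
    "tiger": (1.0, 0.0, 0.0),  # seen during training
    "eagle": (0.0, 0.0, 1.0),  # seen during training
    "zebra": (1.0, 1.0, 0.0),  # unseen: no training samples, only a description
}


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def zero_shot_classify(predicted_attributes, class_attributes):
    """Return the class whose description is most similar to the predicted attributes."""
    return max(
        class_attributes,
        key=lambda name: cosine_similarity(predicted_attributes, class_attributes[name]),
    )


# Stand-in for the output of a learned mapping applied to a zebra image.
print(zero_shot_classify((0.9, 0.8, 0.1), CLASS_ATTRIBUTES))  # prints "zebra"
```

The classifier never saw a zebra sample; it can still pick that class because the predicted attributes land closest to the zebra description.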


One-shot Learning

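What is one-shot learning? One-shot learning provides only one (or very few) training samples for certain classes; that is, a transfer task with only a single labeled sample per class is called one-shot learning.

One-shot learning means we can still make predictions when training samples are scarce, even when there is only one. The key is to learn a good mapping X -> Y and then apply it to other problems.

One-shot learning is similar to zero-shot learning. The difference is that zero-shot learning provides no labeled samples for the target classes, whereas one-shot learning provides one sample, or a few, in which case it is also called few-shot learning. A definition found online is explained below.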

FID score

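In other words: the model has only a small labeled training set S containing N samples, where y_i denotes the label of the i-th sample. Since every sample in the test set has exactly one correct class, the goal is that when a new test sample x' arrives, the model correctly predicts its label y'.

Note: replacing the single sample per class y_i with k samples gives k-shot learning; few-shot generally means k is no more than 20. Reference: "What is few-shot learning"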
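The setup above (a small labeled set S, a new sample x', and a predicted label y') can be sketched in code. This is only an illustration under stated assumptions: it uses a nearest-class-mean (prototype) rule, which is one simple way to predict from a few labeled samples and is not claimed to be the method behind the definition. The function names and the toy data are hypothetical.

```python
from collections import Counter


def shots_per_class(support):
    """Count labeled samples per class; k-shot means every class has k samples."""
    return Counter(label for _, label in support)


def class_prototypes(support):
    """Average the feature vectors of each class in the support set S."""
    sums, counts = {}, Counter()
    for features, label in support:
        if label in sums:
            sums[label] = [s + v for s, v in zip(sums[label], features)]
        else:
            sums[label] = list(features)
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(support, query):
    """Predict y' for x' as the class whose prototype is closest to the query."""
    prototypes = class_prototypes(support)

    def squared_distance(label):
        return sum((p - q) ** 2 for p, q in zip(prototypes[label], query))

    return min(prototypes, key=squared_distance)


# Toy support set S: 2 classes with k = 2 labeled samples each (2-shot).
S = [
    ((0.0, 0.0), "cat"),
    ((0.2, 0.0), "cat"),
    ((1.0, 1.0), "dog"),
    ((1.2, 1.0), "dog"),
]

print(shots_per_class(S))        # each class has 2 samples
print(predict(S, (0.1, 0.1)))    # prints "cat"
```

With one sample per class the same code performs one-shot prediction, since each prototype is then just that single sample.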


1. A first look at zero-shot/one-shot learning

2. [Stable Diffusion] What are FID, CLIP, and cfg-scales



