
Sora 原理解释

Sora:可作为世界模拟器的视频生成模型

本文翻译自 OpenAI 官网的技术报告;更多观点性内容看这里:报道和观点


原文:https://openai.com/research/video-generation-models-as-world-simulators

📗

以下是大家对这篇技术报告的更多解读,或许可以帮你更好地理解

中学生能看懂:Sora 原理解读

这篇讲解的更容易懂:天才程序员周弈帆:OpenAI 视频模型 Sora 科研贡献速览

https://mp.weixin.qq.com/s/Prn1G_EpXvnM4me9a_SPBw

清熙-解读OpenAI Sora文生视频技术原理

https://mp.weixin.qq.com/s/Y-vmxmPu4_-tHaeP35hDJg

量子位-爆火Sora参数规模仅30亿?谢赛宁等大佬技术分析来了(这篇还引用了咱们社区里的聊天记录)

https://mp.weixin.qq.com/s/ZpZJ9XpbH8QYarMbxXM6SQ

We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.

我们探索了在视频数据上大规模训练生成模型。具体来说,我们在时长、分辨率和宽高比各不相同的视频和图像上联合训练文本条件扩散模型。我们采用一种 Transformer 架构,它在视频和图像潜空间编码的时空分块(patch)上运行。我们最大的模型 Sora 能够生成一分钟的高保真视频。结果表明,扩展视频生成模型的规模是构建物理世界通用模拟器的一条有前途的途径。

This technical report focuses on (1) our method for turning visual data of all types into a unified representation that enables large-scale training of generative models, and (2) qualitative evaluation of Sora’s capabilities and limitations. Model and implementation details are not included in this report.

本技术报告侧重于(1)我们将各类视觉数据转换为统一表示的方法,该方法使得大规模训练生成模型成为可能,以及(2)对Sora能力和局限性的定性评估。模型和实现细节不包括在本报告中。

Much prior work has studied generative modeling of video data using a variety of methods, including recurrent networks, generative adversarial networks, autoregressive transformers, and diffusion models. These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. Sora is a generalist model of visual data—it can generate videos and images spanning diverse durations, aspect ratios and resolutions, up to a full minute of high definition video.

以往的许多研究使用各种方法对视频数据进行生成建模,包括循环网络、生成对抗网络、自回归 Transformer 和扩散模型。这些工作通常只关注某一狭窄类别的视觉数据、较短的视频或固定尺寸的视频。Sora 是一个视觉数据的通用模型:它可以生成时长、纵横比和分辨率各异的视频和图像,最长可达一分钟的高清视频。

Turning visual data into patches

将视觉数据转化为分块

We take inspiration from large language models which acquire generalist capabilities by training on internet-scale data. The success of the LLM paradigm is enabled in part by the use of tokens that elegantly unify diverse modalities of text—code, math and various natural languages. In this work, we consider how generative models of visual data can inherit such benefits. Whereas LLMs have text tokens, Sora has visual *patches*. Patches have previously been shown to be an effective representation for models of visual data. We find that patches are a highly-scalable and effective representation for training generative models on diverse types of videos and images.

我们从大型语言模型中获得灵感:它们通过在互联网规模的数据上训练获得了通用能力。LLM 范式的成功部分得益于 token 的使用,token 优雅地统一了代码、数学和各种自然语言等不同的文本模态。在这项工作中,我们考虑视觉数据的生成模型如何继承这些好处。LLM 拥有文本 token,而 Sora 拥有视觉分块(patch)。先前的研究已经表明,分块是视觉数据模型的一种有效表示。我们发现,在多种类型的视频和图像上训练生成模型时,分块是一种高度可扩展且有效的表示。

At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space, and subsequently decomposing the representation into spacetime patches.

在高层次上,为了将视频转换为分块,我们首先将视频压缩到一个低维潜在空间,然后将该表示分解为时空块(spacetime patches)。

Video compression network

视频压缩网络

We train a network that reduces the dimensionality of visual data. This network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on and subsequently generates videos within this compressed latent space. We also train a corresponding decoder model that maps generated latents back to pixel space.

我们训练了一个降低视觉数据维度的网络。这个网络接受原始视频作为输入,并输出在时间和空间上都被压缩的潜在表示。Sora在这个压缩的潜在空间上进行训练,并随后生成视频。我们还训练了一个相应的解码器模型,将生成的潜在表示映射回像素空间。
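
下面用 PyTorch 给出一个视频压缩网络的极简示意(假设性实现,OpenAI 并未公开模型细节):用 3D 卷积把原始视频在时间和空间上同时下采样为潜在表示,再用对应的解码器映射回像素空间;其中的层数、通道数和下采样倍率都是为演示而假设的。

```python
import torch
import torch.nn as nn

class VideoCompressor(nn.Module):
    def __init__(self, in_channels=3, latent_channels=8):
        super().__init__()
        # 编码器:时间维约 2 倍、空间维约 8 倍下采样(层数与步长仅为示意)
        self.encoder = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(64, latent_channels, kernel_size=3, stride=(1, 2, 2), padding=1),
        )
        # 解码器:把生成的潜在表示映射回像素空间
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 64, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(32, in_channels, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=1),
        )

    def forward(self, video):              # video: (B, 3, T, H, W)
        latent = self.encoder(video)       # 在时间和空间上都被压缩的潜在表示
        recon = self.decoder(latent)       # 解码回像素空间
        return latent, recon

video = torch.randn(1, 3, 8, 128, 128)     # 假设的 8 帧 128x128 输入
latent, recon = VideoCompressor()(video)
print(latent.shape, recon.shape)           # 期望输出:(1, 8, 4, 16, 16) 与 (1, 3, 8, 128, 128)
```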

Spacetime Latent Patches

时空潜空间块

Given a compressed input video, we extract a sequence of spacetime patches which act as transformer tokens. This scheme works for images too since images are just videos with a single frame. Our patch-based representation enables Sora to train on videos and images of variable resolutions, durations and aspect ratios. At inference time, we can control the size of generated videos by arranging randomly-initialized patches in an appropriately-sized grid.

给定一个压缩后的输入视频,我们从中提取一系列时空分块,作为 Transformer 的 token。这个方案同样适用于图像,因为图像只是单帧的视频。我们这种基于分块的表示使 Sora 能够在不同分辨率、时长和纵横比的视频和图像上训练。在推理时,我们可以通过在适当大小的网格中排列随机初始化的分块来控制生成视频的尺寸。
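
下面是"时空分块"这一表示方式的极简示意(假设性实现):把压缩后的潜在视频切成固定大小的时空块并摊平成 token 序列;推理时按目标时长和分辨率排布 token 网格即可控制输出尺寸。其中 patchify 函数和分块大小都是演示用的假设。

```python
import torch

def patchify(latent, pt=2, ph=2, pw=2):
    """latent: (C, T, H, W) -> tokens: (N, C*pt*ph*pw),每个时空块对应一个 token。"""
    C, T, H, W = latent.shape
    x = latent.reshape(C, T // pt, pt, H // ph, ph, W // pw, pw)
    x = x.permute(1, 3, 5, 0, 2, 4, 6)        # (T/pt, H/ph, W/pw, C, pt, ph, pw)
    return x.reshape(-1, C * pt * ph * pw)     # 摊平成 Transformer 的 token 序列

latent = torch.randn(16, 8, 32, 32)            # 压缩后的潜在视频(示意尺寸)
tokens = patchify(latent)
print(tokens.shape)                            # (4*16*16, 16*2*2*2) = (1024, 128)
```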

Scaling transformers for video generation

用于视频生成的可扩展 Transformer

Sora is a diffusion model; given input noisy patches (and conditioning information like text prompts), it’s trained to predict the original “clean” patches. Importantly, Sora is a diffusion *transformer*. Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling, computer vision, and image generation.

Sora 是一个扩散模型;给定输入的带噪分块(以及文本提示等条件信息),它被训练来预测原始的"干净"分块。重要的是,Sora 是一个扩散 Transformer。Transformer 在语言建模、计算机视觉和图像生成等多个领域都展现出了出色的扩展特性。
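
下面用一个玩具规模的模型示意扩散 Transformer 的训练目标(假设性实现,与 Sora 的真实结构无关):给干净的时空 token 加噪,让 Transformer 在文本条件下回归出干净 token。TinyDiT、加噪方式、把文本作为额外 token 拼接等做法都是演示用的假设,实际系统可能采用 cross-attention 或 AdaLN 等条件机制。

```python
import torch
import torch.nn as nn

class TinyDiT(nn.Module):
    def __init__(self, dim=128, text_dim=128, depth=2, heads=4):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_tokens, text_emb):
        # 把文本条件当作一个额外 token 拼进序列(仅为示意)
        cond = self.text_proj(text_emb).unsqueeze(1)
        x = torch.cat([cond, noisy_tokens], dim=1)
        x = self.blocks(x)
        return self.out(x[:, 1:])                  # 只取视觉 token 的预测

model = TinyDiT()
clean = torch.randn(2, 256, 128)                   # 干净的时空 token
text = torch.randn(2, 128)                         # 假设的文本嵌入
t = torch.rand(2, 1, 1)                            # 噪声强度(示意)
noisy = (1 - t) * clean + t * torch.randn_like(clean)
loss = nn.functional.mse_loss(model(noisy, text), clean)   # 训练目标:预测"干净"分块
loss.backward()
```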

In this work, we find that diffusion transformers scale effectively as video models as well. Below, we show a comparison of video samples with fixed seeds and inputs as training progresses. Sample quality improves markedly as training compute increases.

在这项工作中,我们发现扩散 Transformer 作为视频模型同样能够有效扩展。下面,我们展示了在固定随机种子和输入的情况下,视频样本随训练进程的对比。随着训练计算量的增加,样本质量显著提升。

Base compute

4x compute

32x compute

Variable durations, resolutions, aspect ratios

可变持续时间、分辨率、纵横比

Past approaches to image and video generation typically resize, crop or trim videos to a standard size – e.g., 4 second videos at 256x256 resolution. We find that instead training on data at its native size provides several benefits.

以往的图像和视频生成方法通常会将视频调整大小、裁剪或截短到一个标准尺寸,例如 256x256 分辨率的 4 秒视频。我们发现,改为在数据的原生尺寸上训练能带来若干好处。

Sampling flexibility

采样灵活性

Sora can sample widescreen 1920x1080p videos, vertical 1080x1920 videos and everything in between. This lets Sora create content for different devices directly at their native aspect ratios. It also lets us quickly prototype content at lower sizes before generating at full resolution—all with the same model.

Sora 能够采样宽屏 1920x1080p 视频、竖屏 1080x1920 视频以及介于两者之间的各种尺寸。这让 Sora 可以直接按不同设备的原生纵横比创建内容。它还让我们能够在以全分辨率生成之前,先用较小的尺寸快速制作内容原型,而这一切都使用同一个模型。

Improved framing and composition

改进的取景和构图

We empirically find that training on videos at their native aspect ratios improves composition and framing. We compare Sora against a version of our model that crops all training videos to be square, which is common practice when training generative models. The model trained on square crops (left) sometimes generates videos where the subject is only partially in view. In comparison, videos from Sora (right) have improved framing.

我们通过实验发现,在视频的原生纵横比上训练可以改善构图和取景。我们将 Sora 与另一个版本的模型进行比较,该版本把所有训练视频都裁剪成正方形,这是训练生成模型时的常见做法。在正方形裁剪上训练的模型(左)有时会生成主体只有部分出现在画面中的视频。相比之下,Sora 生成的视频(右)取景更好。

Language understanding 语言理解

Training text-to-video generation systems requires a large amount of videos with corresponding text captions. We apply the re-captioning technique introduced in DALL·E 3 to videos. We first train a highly descriptive captioner model and then use it to produce text captions for all videos in our training set. We find that training on highly descriptive video captions improves text fidelity as well as the overall quality of videos.

训练文本到视频生成系统需要大量带有对应文本描述的视频。我们将 DALL·E 3 中引入的重新标注(re-captioning)技术应用于视频。我们首先训练一个高度描述性的标注模型,然后用它为训练集中的所有视频生成文本描述。我们发现,在高度描述性的视频文本描述上训练,既能提高文本保真度,也能提高视频的整体质量。

Similar to DALL·E 3, we also leverage GPT to turn short user prompts into longer detailed captions that are sent to the video model. This enables Sora to generate high quality videos that accurately follow user prompts.

与 DALL·E 3 类似,我们还利用 GPT 将简短的用户提示转换为更长的详细描述,再发送给视频模型。这使 Sora 能够生成准确遵循用户提示的高质量视频。
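
报告只说明"利用 GPT 把简短提示扩写成详细描述",并未给出实现。下面是一个假设性的示意,其中 expand_with_llm 是代表任意对话式大模型接口的假想函数:

```python
from typing import Callable

INSTRUCTION = (
    "把下面的简短提示改写成一段细节丰富的视频描述,"
    "补充主体、环境、镜头运动和光线等信息,但不要改变原意:\n"
)

def expand_prompt(user_prompt: str, expand_with_llm: Callable[[str], str]) -> str:
    """把简短的用户提示扩写成详细描述,作为视频模型的文本条件。"""
    return expand_with_llm(INSTRUCTION + user_prompt)

# 用法示意:expand_prompt("一只在雪地里奔跑的柴犬", my_llm_call)
```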

an-adorable-kangaroo-wearing-a-green-dress-and-a-sun-hat-taking-a-pleasant-stroll-in-Johannesburg-South-Africa-during-a-beautiful-sunset

a-toy-robot-wearing-a-green-dress-and-a-sun-hat-taking-a-pleasant-stroll-in-Antarctica-during-a-beautiful-sunset

an-old-man-wearing-blue-jeans-and-a-white-t-shirt-taking-a-pleasant-stroll-in-Antarctica-during-a-winter-storm

Prompting with images and videos

使用图像和视频进行提示

All of the results above and in our landing page show text-to-video samples. But Sora can also be prompted with other inputs, such as pre-existing images or video. This capability enables Sora to perform a wide range of image and video editing tasks—creating perfectly looping video, animating static images, extending videos forwards or backwards in time, etc.

以上所有结果以及我们介绍页面中展示的都是文本到视频的样本。但 Sora 也可以用其他输入进行提示,例如已有的图像或视频。这一能力使 Sora 能够执行各种图像和视频编辑任务,例如创建完美循环的视频、让静态图像动起来、在时间上向前或向后扩展视频等。

Animating DALL·E images

动画 DALL·E 图像

Sora is capable of generating videos provided an image and prompt as input. Below we show example videos generated based on DALL·E 2 and DALL·E 3 images.

Sora能够根据图像和提示输入生成视频。下面我们展示了基于DALL·E 2和DALL·E 3图像生成的示例视频。

A Shiba Inu dog wearing a beret and black turtleneck.

一只戴着贝雷帽和黑色高领毛衣的柴犬。

Monster Illustration in flat design style of a diverse family of monsters. The group includes a furry brown monster, a sleek black monster with antennas, a spotted green monster, and a tiny polka-dotted monster, all interacting in a playful environment.

扁平化设计风格的怪物插图,呈现一个多样化的怪物家族。这个家族包括一只毛茸茸的棕色怪物、一只长着天线的光滑黑色怪物、一只带斑点的绿色怪物和一只小小的波点怪物,它们都在一个充满趣味的环境中互动。

An image of a realistic cloud that spells “SORA”.

拼成“SORA”的逼真云的图像。

In an ornate, historical hall, a massive tidal wave peaks and begins to crash. Two surfers, seizing the moment, skillfully navigate the face of the wave.

在一座华丽的历史大厅里,一道巨浪涌起并开始拍落。两名冲浪者抓住时机,熟练地驾驭浪面。

Extending generated videos

扩展生成的视频

Sora is also capable of extending videos, either forward or backward in time. Below are four videos that were all extended backward in time starting from a segment of a generated video. As a result, each of the four videos starts different from the others, yet all four videos lead to the same ending.

Sora 还能够在时间上向前或向后扩展视频。下面是四个视频,它们都是从同一段生成视频的片段出发、向过去扩展得到的。因此,这四个视频的开头各不相同,但都汇聚到同一个结尾。

We can use this method to extend a video both forward and backward to produce a seamless infinite loop.

我们可以使用这种方法向前和向后延长视频,以产生无缝的无限循环。
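
报告没有说明视频扩展的具体做法。下面是一个假设性的思路示意:把已有片段的潜在表示当作"已知"区域,在其前面或后面补上噪声,再让去噪过程保持已知区域不变(类似图像修补/外扩);extend_in_time 和 denoise 都是假想接口。

```python
import torch

def extend_in_time(known_latent, n_new_frames, denoise, direction="backward"):
    """known_latent: (C, T, H, W);denoise 是假想的去噪函数,约定保持已知区域不变。"""
    C, T, H, W = known_latent.shape
    noise = torch.randn(C, n_new_frames, H, W)
    if direction == "backward":                        # 向过去扩展:新帧补在前面
        full = torch.cat([noise, known_latent], dim=1)
        known = torch.cat([torch.zeros(n_new_frames), torch.ones(T)])
    else:                                              # 向未来扩展:新帧补在后面
        full = torch.cat([known_latent, noise], dim=1)
        known = torch.cat([torch.ones(T), torch.zeros(n_new_frames)])
    return denoise(full, known)                        # 去噪后即得到扩展的潜在视频

# 对同一段视频分别向前、向后各扩展一次并让首尾对齐,即可得到报告所说的无缝循环。
```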

Video-to-video editing

视频到视频编辑

Diffusion models have enabled a plethora of methods for editing images and videos from text prompts. Below we apply one of these methods, SDEdit, to Sora. This technique enables Sora to transform the styles and environments of input videos zero-shot.

扩散模型催生了大量根据文本提示编辑图像和视频的方法。下面我们将其中一种方法 SDEdit 应用于 Sora。这项技术使 Sora 能够零样本(zero-shot)地转换输入视频的风格和环境。
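
SDEdit 的核心思路可以用几行代码示意(假设性实现):不从纯噪声出发,而是给输入视频的潜在表示加入部分噪声,再在新的文本条件下去噪,从而在保留原视频整体结构的同时改变风格与环境。其中 denoiser 是假想的去噪接口,加噪方式和时间步的写法也只是示意。

```python
import torch

def sdedit(input_latent, text_emb, denoiser, strength=0.6, steps=50):
    """strength 越大,结果偏离原视频越多;denoiser(x, t, text_emb) 为假想接口。"""
    t0 = strength
    # 从中间噪声水平开始,而不是从纯噪声开始,因此保留了输入视频的大致结构
    x = (1 - t0) * input_latent + t0 * torch.randn_like(input_latent)
    for t in torch.linspace(t0, 0.0, steps):
        x = denoiser(x, t, text_emb)        # 在新的文本条件下逐步去噪
    return x
```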

Input video 输入视频

change the setting to be in a lush jungle

change the setting to the 1920s with an old school car. make sure to keep the red color

make it go underwater

change the video setting to be different than a mountain? perhaps joshua tree?

put the video in space with a rainbow road

make it in claymation animation style

recreate in the style of a charcoal drawing, making sure to be black and white

change the setting to be cyberpunk

change the video to a medieval theme

We can also use Sora to gradually interpolate between two input videos, creating seamless transitions between videos with entirely different subjects and scene compositions. In the examples below, the videos in the center interpolate between the corresponding videos on the left and right.

我们也可以使用 Sora 在两个输入视频之间逐渐插值,在主题和场景构成完全不同的视频之间创建无缝过渡。在下面的例子中,中间的视频在左右两侧对应的视频之间插值。
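
视频间插值的一种常见做法是在潜在表示(或初始噪声)之间做插值,再交给模型去噪或解码;报告未说明 Sora 的具体实现,下面的线性插值仅为示意:

```python
import torch

def interpolate_latents(latent_a, latent_b, num_steps=5):
    """返回从 latent_a 渐变到 latent_b 的一组中间潜在表示(线性插值,仅为示意)。"""
    alphas = torch.linspace(0.0, 1.0, num_steps)
    return [(1 - a) * latent_a + a * latent_b for a in alphas]
```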

Image generation capabilities 图像生成功能

Sora is also capable of generating images. We do this by arranging patches of Gaussian noise in a spatial grid with a temporal extent of one frame. The model can generate images of variable sizes—up to 2048x2048 resolution.

Sora 也能够生成图像。我们通过把高斯噪声分块排列在一个时间范围仅为一帧的空间网格中来实现这一点。该模型可以生成不同尺寸的图像,分辨率最高可达 2048x2048。
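
"图像即单帧视频"可以这样示意(假设性实现):把随机初始化的噪声 token 排成时间长度为 1 的空间网格,网格的大小决定输出分辨率;分块大小和 token 维度均为演示用的假设。

```python
import torch

def init_image_tokens(grid_h, grid_w, token_dim=128):
    """时间范围固定为一帧(grid_t = 1),空间网格大小决定输出图像的分辨率。"""
    grid_t = 1
    return torch.randn(grid_t * grid_h * grid_w, token_dim)   # 随机初始化的 token,交给扩散模型去噪

tokens = init_image_tokens(grid_h=64, grid_w=64)               # 例如 64x64 的潜在网格
print(tokens.shape)                                            # (4096, 128)
```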

和 MJ V6 的对比 博主:https://twitter.com/doganuraldesign/status/1758444092328194217

Emerging simulation capabilities 新兴的仿真功能

We find that video models exhibit a number of interesting emergent capabilities when trained at scale. These capabilities enable Sora to simulate some aspects of people, animals and environments from the physical world. These properties emerge without any explicit inductive biases for 3D, objects, etc.—they are purely phenomena of scale.

我们发现,视频模型在大规模训练时会表现出许多有趣的涌现能力。这些能力使 Sora 能够模拟物理世界中人、动物和环境的某些方面。这些特性是在没有任何针对 3D、物体等的显式归纳偏置的情况下涌现的,纯粹是规模带来的现象。

3D consistency. Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space.

三维一致性。Sora可以生成具有动态摄像机移动的视频。随着摄像机的移动和旋转,人物和场景元素在三维空间中一致地移动。

Long-range coherence and object permanence. A significant challenge for video generation systems has been maintaining temporal consistency when sampling long videos. We find that Sora is often, though not always, able to effectively model both short- and long-range dependencies. For example, our model can persist people, animals and objects even when they are occluded or leave the frame. Likewise, it can generate multiple shots of the same character in a single sample, maintaining their appearance throughout the video.

长距离连贯性和物体持久性。对于视频生成系统来说,一个重大挑战一直是在采样长视频时保持时间上的连续性。我们发现,Sora通常能够有效地模拟短距离和长距离依赖关系,尽管并非总是如此。例如,我们的模型可以在人物、动物和物体被遮挡或离开画面时仍然保持它们的存在。同样,它可以在单个样本中生成同一角色的多个镜头,贯穿视频始终保持他们的外观。

Interacting with the world. Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes along a canvas that persist over time, or a man can eat a burger and leave bite marks.

与世界互动。 Sora有时可以模拟以简单方式影响世界状态的行为。例如,画家可以在画布上留下新的笔触,这些笔触随着时间的推移而持续存在,或者一个人可以吃汉堡并留下咬痕。

Simulating digital worlds. Sora is also able to simulate artificial processes–one example is video games. Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity. These capabilities can be elicited zero-shot by prompting Sora with captions mentioning “Minecraft.”

模拟数字世界。 Sora 还能够模拟人工过程,视频游戏就是一个例子。Sora 可以用一个基本策略控制《Minecraft》中的玩家,同时以高保真度渲染游戏世界及其动态。只需在提示中提及"Minecraft",就能零样本地激发这些能力。

These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world, and the objects, animals and people that live within them.

这些能力表明,继续扩大视频模型的规模,是开发物理世界和数字世界(以及生活在其中的物体、动物和人)的高能力模拟器的一条有前途的道路。

Discussion 讨论

Sora currently exhibits numerous limitations as a simulator. For example, it does not accurately model the physics of many basic interactions, like glass shattering. Other interactions, like eating food, do not always yield correct changes in object state. We enumerate other common failure modes of the model—such as incoherencies that develop in long duration samples or spontaneous appearances of objects—in our landing page.

Sora 目前作为模拟器还存在许多局限。例如,它不能准确模拟许多基本交互的物理过程,比如玻璃破碎。其他交互(如吃东西)也并不总能产生正确的物体状态变化。我们在介绍页面中列举了模型的其他常见失败模式,例如长时间样本中出现的不连贯,或物体的凭空出现。

We believe the capabilities Sora has today demonstrate that continued scaling of video models is a promising path towards the development of capable simulators of the physical and digital world, and the objects, animals and people that live within them.

我们相信,Sora 今天展现出的能力表明,继续扩大视频模型的规模,是开发能够模拟物理世界和数字世界(以及生活在其中的物体、动物和人)的模拟器的一条有前途的道路。

文献

  1. Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhudinov. "Unsupervised learning of video representations using LSTMs." International Conference on Machine Learning. PMLR, 2015.
  2. Chiappa, Silvia, et al. "Recurrent environment simulators." arXiv preprint arXiv:1704.02254 (2017).
  3. Ha, David, and Jürgen Schmidhuber. "World models." arXiv preprint arXiv:1803.10122 (2018).
  4. Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." Advances in Neural Information Processing Systems 29 (2016).
  5. Tulyakov, Sergey, et al. "MoCoGAN: Decomposing motion and content for video generation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
  6. Clark, Aidan, Jeff Donahue, and Karen Simonyan. "Adversarial video generation on complex datasets." arXiv preprint arXiv:1907.06571 (2019).
  7. Brooks, Tim, et al. "Generating long videos of dynamic scenes." Advances in Neural Information Processing Systems 35 (2022): 31769-31781.
  8. Yan, Wilson, et al. "VideoGPT: Video generation using VQ-VAE and transformers." arXiv preprint arXiv:2104.10157 (2021).
  9. Wu, Chenfei, et al. "NÜWA: Visual synthesis pre-training for neural visual world creation." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022.
  10. Ho, Jonathan, et al. "Imagen Video: High definition video generation with diffusion models." arXiv preprint arXiv:2210.02303 (2022).
  11. Blattmann, Andreas, et al. "Align your latents: High-resolution video synthesis with latent diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
  12. Gupta, Agrim, et al. "Photorealistic video generation with diffusion models." arXiv preprint arXiv:2312.06662 (2023).
  13. Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).
  14. Brown, Tom, et al. "Language models are few-shot learners." Advances in Neural Information Processing Systems 33 (2020): 1877-1901.
  15. Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
  16. Arnab, Anurag, et al. "ViViT: A video vision transformer." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
  17. He, Kaiming, et al. "Masked autoencoders are scalable vision learners." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
  18. Dehghani, Mostafa, et al. "Patch n' Pack: NaViT, a Vision Transformer for any aspect ratio and resolution." arXiv preprint arXiv:2307.06304 (2023).
  19. Rombach, Robin, et al. "High-resolution image synthesis with latent diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
  20. Kingma, Diederik P., and Max Welling. "Auto-encoding variational Bayes." arXiv preprint arXiv:1312.6114 (2013).
  21. Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
  22. Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in Neural Information Processing Systems 33 (2020): 6840-6851.
  23. Nichol, Alexander Quinn, and Prafulla Dhariwal. "Improved denoising diffusion probabilistic models." International Conference on Machine Learning. PMLR, 2021.
  24. Dhariwal, Prafulla, and Alexander Quinn Nichol. "Diffusion models beat GANs on image synthesis." Advances in Neural Information Processing Systems. 2021.
  25. Karras, Tero, et al. "Elucidating the design space of diffusion-based generative models." Advances in Neural Information Processing Systems 35 (2022): 26565-26577.
  26. Peebles, William, and Saining Xie. "Scalable diffusion models with transformers." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
  27. Chen, Mark, et al. "Generative pretraining from pixels." International Conference on Machine Learning. PMLR, 2020.
  28. Ramesh, Aditya, et al. "Zero-shot text-to-image generation." International Conference on Machine Learning. PMLR, 2021.
  29. Yu, Jiahui, et al. "Scaling autoregressive models for content-rich text-to-image generation." arXiv preprint arXiv:2206.10789 2.3 (2022): 5.
  30. Betker, James, et al. "Improving image generation with better captions." Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf 2.3 (2023): 8.
  31. Ramesh, Aditya, et al. "Hierarchical text-conditional image generation with CLIP latents." arXiv preprint arXiv:2204.06125 1.2 (2022): 3.
  32. Meng, Chenlin, et al. "SDEdit: Guided image synthesis and editing with stochastic differential equations." arXiv preprint arXiv:2108.01073 (2021).

辅助工具

Garman把SORA 32篇索引文章放到GLMs,有啥不懂可以问它

https://open.bigmodel.cn/shareapp/v1/?share_code=3MdNDUTIrkZVesAR5hPoY

🤖

问题:从原始视频数据到模型训练再到模型生成视频全流程

在文档中提到的从原始视频数据到模型训练再到模型生成视频的全流程包括以下步骤:

  1. 数据准备:使用包括内部数据集和公开数据集在内的多种数据源,对图像和视频进行尺寸调整,包括空间上的抗锯齿双线性缩放和时间上的跳帧处理。
  2. 模型训练:
    • 基础训练阶段使用正方形图像比例进行训练。
    • 在特定数据子集上对基础模型进行微调,以生成9:16比例的视频。
    • 采用了位置嵌入的插值和窗口尺寸的缩放。
  3. 视频生成:
    • 首先使用图像LDM生成单帧图像。
    • 然后运行预测模型,以生成的单帧为条件,生成关键帧序列。
    • 在视频扩展时,再次调用预测模型,但以两帧(捕捉方向信息)为条件来生成连贯的运动。
  4. 微调与优化:
    • 对于长时间视频生成,模型在帧预测任务上进行联合训练。
    • 在推断阶段,优化目标是加快小批量图像生成的速度。
  5. 模型初始化:
    • 由于视频数据集规模相对较小,从预训练的图像模型(如ViT)初始化视频模型,以促进更有效的训练。

以上步骤涵盖了从数据准备到模型训练再到视频生成的全流程。文档中未提及的具体细节,如数据预处理的具体参数设置、模型训练的详细配置等,则不在上述内容中。

精简版

作者夙愿与 GPT 问答

🤖

问题:从原始视频数据到模型训练再到模型生成视频全流程

1.收集视频及其文本标注:

初始步骤是收集大量视频数据,并获取或创建这些视频对应的文本标注。这些文本简要描述了视频内容,是训练模型理解视频主题的关键。

2.预处理视频数据:

对视频进行预处理,包括调整分辨率、格式转换、裁剪长度等,以确保数据格式统一,适合模型处理。

3.生成高度描述性的文本标注:

借鉴 DALL·E 3 的技术,首先训练一个模型,专门用于为视频内容生成高度描述性的文本标注。这一步是为了提升文本标注的质量,让其更加详细和具体。

对训练集中的所有视频应用这个模型,产生新的、更加详细的文本标注。

4.视频压缩和时空潜在分块生成:

开发并训练一个视频压缩网络,将高维的视频数据压缩到一个低维的潜在空间,简化后的表示更容易被模型处理。再将压缩后的视频表示分解成时空潜在分块(spacetime latent patches),这些分块既包含空间信息,也包含随时间变化的信息。

5.利用Transformer架构:

基于 Transformer 架构处理这些时空潜在分块。由于 Transformer 在处理序列数据(如文本)方面能力强大,这里用它来捕获视频分块之间复杂的时空关系。

6.扩散模型训练:

Sora 作为一个扩散模型,通过学习从含噪分块预测原始清晰分块来进行训练。这个过程涉及大量迭代,逐步提高生成视频的质量。

7.GPT模型的应用:

"类似于DALLE3,Sora在处理用户提供的文本提示时,也可以利用GPT模型来扩展或优化这些提示。GPT模型可以将简短的用户提示转化成更详细、更富有描述性的文本,这有助于Sora更准确地理解并生成符合用户意图的视频。

8.生成视频:

用户提供一个文本提示,Sora 根据这个提示在潜在空间中初始化视频的生成过程。然后利用训练好的扩散模型,从这些初始化的时空潜在分块开始,逐步生成清晰的视频内容。

9.视频解码和后处理:

使用与视频压缩相对应的解码器将潜在空间中的视频转换回原始像素视频。

对生成的视频进行可能的后处理,如调整分辨率、裁剪等,以满足发布或展示的需求。

通过以上步骤:从视频数据的收集和高度描述性标注的生成(利用 DALL·E 3 的技术),到视频的压缩与处理(利用 Transformer 架构和时空潜在分块),再到基于用户文本提示的视频生成(结合 GPT 模型进行文本优化),每一个环节都经过精心设计,确保生成的视频不仅在视觉上高质量,而且能准确反映用户的意图。
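
把上述各环节串起来,整体流程大致可以写成下面这个示意函数(各组件均为前文示意中的假想接口,并非真实 API):

```python
def generate_video(user_prompt, llm_expand, text_encoder, sample_tokens, unpatchify, decode):
    """端到端流程示意:所有参数均为调用方提供的假想组件。"""
    detailed = llm_expand(user_prompt)        # 1. 用语言模型扩写用户提示
    text_emb = text_encoder(detailed)         # 2. 编码文本条件
    tokens = sample_tokens(text_emb)          # 3. 扩散 Transformer 从噪声 token 逐步去噪
    latent = unpatchify(tokens)               # 4. 把 token 还原成时空潜在表示
    return decode(latent)                     # 5. 解码器映射回像素空间,得到视频
```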