Generating audio for video

https://deepmind.google/discover/blog/generating-audio-for-video/

We recommend turning the sound on for the videos below.

Video-to-audio research uses video pixels and text prompts to generate rich soundtracks

Video generation models are advancing at an incredible pace, but many current systems can only generate silent output. One of the next major steps toward bringing generated movies to life is creating soundtracks for these silent videos.

Today, we're sharing progress on our video-to-audio (V2A) technology, which makes synchronized audiovisual generation possible. V2A combines video pixels with natural language text prompts to generate rich soundscapes for the on-screen action.

Our V2A technology is pairable with video generation models like Veo to create shots with a dramatic score, realistic sound effects or dialogue that matches the characters and tone of a video.

It can also generate soundtracks for a range of traditional footage, including archival material, silent films and more — opening a wider range of creative opportunities.

Prompt for audio: Cinematic, thriller, horror film, music, tension, ambience, footsteps on concrete

Prompt for audio: Cute baby dinosaur chirps, jungle ambience, egg cracking

Prompt for audio: jellyfish pulsating under water, marine life, ocean

Prompt for audio: A drummer on a stage at a concert surrounded by flashing lights and a cheering crowd

Prompt for audio: cars skidding, car engine throttling, angelic electronic music

Prompt for audio: a slow mellow harmonica plays as the sun goes down on the prairie

Prompt for audio: Wolf howling at the moon

Enhanced creative control

The examples below apply different prompts to the same video to steer the generated sound.

Importantly, V2A can generate an unlimited number of soundtracks for any video input. Optionally, a ‘positive prompt’ can be defined to guide the generated output toward desired sounds, or a ‘negative prompt’ to guide it away from undesired sounds.

This flexibility gives users more control over V2A’s audio output, making it possible to rapidly experiment with different audio outputs and choose the best match.

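The post doesn't expose an API, but the positive/negative prompt mechanic maps naturally onto the classifier-free guidance used by many diffusion samplers. Below is a minimal Python sketch of that idea; the `model`, its signature, and the `scale` value are illustrative assumptions, not anything published with V2A.

```python
# Minimal sketch of prompt-guided sampling, assuming a hypothetical
# noise-prediction model; nothing here is a published DeepMind API.

def guided_noise_estimate(model, noisy_audio, t, video_emb,
                          positive_emb, negative_emb, scale=4.0):
    """Classifier-free-guidance-style update: extrapolate away from
    the negative prompt and toward the positive one."""
    eps_pos = model(noisy_audio, t, video_emb, positive_emb)  # desired sounds
    eps_neg = model(noisy_audio, t, video_emb, negative_emb)  # undesired sounds
    return eps_neg + scale * (eps_pos - eps_neg)
```

With an empty negative prompt this reduces to ordinary classifier-free guidance; raising `scale` trades diversity for tighter prompt adherence.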

Prompt for audio: A spaceship hurtles through the vastness of space, stars streaking past it, high speed, Sci-fi

Prompt for audio: Ethereal cello atmosphere

Prompt for audio: A spaceship hurtles through the vastness of space, stars streaking past it, high speed, Sci-fi

How it works

We experimented with autoregressive and diffusion approaches to discover the most scalable AI architecture, and the diffusion-based approach for audio generation gave the most realistic and compelling results for synchronizing video and audio information.

Our V2A system starts by encoding video input into a compressed representation. Then, the diffusion model iteratively refines the audio from random noise. This process is guided by the visual input and natural language prompts given to generate synchronized, realistic audio that closely aligns with the prompt. Finally, the audio output is decoded, turned into an audio waveform and combined with the video data.

Diagram of our V2A system, taking video pixel and audio prompt input to generate an audio waveform synchronized to the underlying video. First, V2A encodes the video and audio prompt input and iteratively runs it through the diffusion model. Then it generates compressed audio, which is decoded into an audio waveform.

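Read as pseudocode, the pipeline has three stages: encode, denoise, decode. The sketch below restates them in Python; every component name (`video_encoder`, `text_encoder`, `diffusion_model`, `audio_decoder`), the denoising-step interface, and the latent shape are assumptions for illustration, not DeepMind's actual modules.

```python
import torch

LATENT_SHAPE = (1, 128, 256)  # assumed shape of the compressed audio

def v2a_generate(frames, prompt, video_encoder, text_encoder,
                 diffusion_model, audio_decoder, num_steps=50):
    # 1. Encode the video input into a compressed representation;
    #    the text prompt is optional conditioning.
    video_emb = video_encoder(frames)
    prompt_emb = text_encoder(prompt) if prompt else None

    # 2. Start from random noise and let the diffusion model
    #    iteratively refine it, guided by the video and the prompt.
    latent = torch.randn(LATENT_SHAPE)
    for t in reversed(range(num_steps)):
        latent = diffusion_model.denoise_step(latent, t, video_emb, prompt_emb)

    # 3. Decode the compressed audio into a waveform, ready to be
    #    combined with the video data.
    return audio_decoder(latent)
```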

To generate higher quality audio and add the ability to guide the model towards generating specific sounds, we added more information to the training process, including AI-generated annotations with detailed descriptions of sound and transcripts of spoken dialogue.

By training on video, audio and the additional annotations, our technology learns to associate specific audio events with various visual scenes, while responding to the information provided in the annotations or transcripts.

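Concretely, each training example would then bundle a clip's pixels and ground-truth audio with the two extra supervision signals. The record below is an illustrative schema under that assumption, not DeepMind's actual data format.

```python
import torch
from dataclasses import dataclass

@dataclass
class V2ATrainingExample:
    video_frames: torch.Tensor     # raw pixels for one clip
    audio_waveform: torch.Tensor   # time-aligned ground-truth audio
    sound_annotation: str          # AI-generated description of the sound
    dialogue_transcript: str       # transcript of any spoken dialogue

# Toy instance with random tensors standing in for real media:
example = V2ATrainingExample(
    video_frames=torch.rand(48, 3, 224, 224),   # 48 RGB frames
    audio_waveform=torch.rand(2 * 16000),       # 2 s at 16 kHz
    sound_annotation="footsteps on concrete, tense ambience",
    dialogue_transcript="this turkey looks amazing, I'm so hungry",
)
```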

Further research underway

Our research stands out from existing video-to-audio solutions because it can understand raw pixels and adding a text prompt is optional.

Also, the system doesn't need manual alignment of the generated sound with the video, which involves tediously adjusting different elements of sounds, visuals and timings.

Still, there are a number of other limitations we’re trying to address and further research is underway.

Since the quality of the audio output is dependent on the quality of the video input, artifacts or distortions in the video, which are outside the model’s training distribution, can lead to a noticeable drop in audio quality.

We’re also improving lip synchronization for videos that involve speech. V2A attempts to generate speech from the input transcripts and synchronize it with characters' lip movements. But the paired video generation model may not be conditioned on transcripts. This creates a mismatch, often resulting in uncanny lip-syncing, as the video model doesn’t generate mouth movements that match the transcript.

Prompt for audio: Music, Transcript: “this turkey looks amazing, I’m so hungry”

Our commitment to safety and transparency

We’re committed to developing and deploying AI technologies responsibly. To make sure our V2A technology can have a positive impact on the creative community, we’re gathering diverse perspectives and insights from leading creators and filmmakers, and using this valuable feedback to inform our ongoing research and development.

We’ve also incorporated our SynthID toolkit into our V2A research to watermark all AI-generated content to help safeguard against the potential for misuse of this technology.

Before we consider opening access to it to the wider public, our V2A technology will undergo rigorous safety assessments and testing. Initial results are showing this technology will become a promising approach for bringing generated movies to life.

Note: All examples are generated by our V2A technology, which is paired with Veo, our most capable generative video model.