ICLR 2025 submission
Exploring temporally aligned audio-video generation from text.
Video and audio are closely correlated modalities that humans naturally perceive together. While recent advancements have enabled the generation of audio or video from text, producing both modalities simultaneously still typically relies on either a cascaded process or multi-modal contrastive encoders. These approaches, however, often lead to suboptimal results due to inherent information losses during inference and conditioning. In this paper, we introduce SyncFlow, a system that simultaneously generates temporally synchronized audio and video from text. The core of SyncFlow is the proposed dual-diffusion-transformer (d-DiT) architecture, which enables joint video and audio modelling with proper information fusion. To manage the computational cost of joint audio and video modelling efficiently, SyncFlow uses a multi-stage training strategy that separates video and audio learning before joint fine-tuning. Our empirical evaluations demonstrate that SyncFlow produces audio and video outputs that are more correlated than those of baseline methods, with significantly enhanced audio quality and audio-visual correspondence. Moreover, we demonstrate strong zero-shot capabilities of SyncFlow, including zero-shot video-to-audio generation and adaptation to novel video resolutions without further training.
The video shows a pheasant in a chicken coop, surrounded by other birds.
The video shows a water pump pumping water into a stream. The pump is metallic.
The video shows a rocky shoreline with waves crashing against the rocks.
The video shows two cats standing on rocks, facing each other.
The video features a close-up shot of a rubber toy being squeezed.
The video shows two pigs in a white and red cage, surrounded by food.
The video begins with a close-up shot of a monkey sitting on a wooden branch.
The video shows fireworks exploding in the night sky.
A person is hitting plants and removing them from pots or the ground.
A person is hitting the piece of paper with a piece of candy on it.
The video depicts a person hitting bamboo plants, rocks, and other objects.
The person is hitting various plants, trees, and objects with a stick.
A person is hitting various objects such as a phone, a computer, and more.
A person is hitting the faucet and sink surfaces with a stick.
A person is hitting various kitchen utensils and appliances with a stick.
A person is hitting books and papers on a shelf with a wooden stick.
The video shows a grey and white pigeon with red eyes perched on a fence.
The video shows a close-up shot of a person driving a car at night.
The video shows two large ships in a choppy sea with seagulls flying overhead.
The video shows a large lizard sitting on the ground, surrounded by grass.
The video shows a small pond with a black tarp attached to two poles.
The video shows a close-up of a wooden kitchen cabinet with an open drawer.